Re: [PATCH] bsg referencing bus driver module

2018-04-22 Thread James Bottomley
On Fri, 2018-04-20 at 16:44 -0600, Anatoliy Glagolev wrote:
>  
> > This patch isn't applyable because your mailer has changed all the
> > tabs to spaces.
> > 
> > I also think there's no need to do it this way.  I think what we
> > need is for fc_bsg_remove() to wait until the bsg queue is
> > drained.  It does look like the author thought this happened
> > otherwise the code wouldn't have the note.  If we fix it that way
> > we can do the same thing in all the other transport classes that
> > use bsg (which all have a similar issue).
> > 
> > James
> > 
> 
> Thanks, James. Sorry about the tabs; re-sending.
> 
> On fc_bsg_remove()...: are you suggesting to implement the whole fix
> in scsi_transport_fc.c?

Yes, but it's not just scsi_transport_fc: scsi_transport_sas has the
same issue.  I think it's probably just the one-liner addition of
blk_drain_queue() that fixes this.  There should probably be a block
primitive that does the correct queue reference dance and calls
blk_cleanup_queue() and blk_drain_queue() in order.

>  That would be nice, but I do not see how that
> is possible. Even with the queue drained bsg still holds a reference
> to the Scsi_Host via bsg_class_device; bsg_class_device itself is
> referenced on bsg_open and kept around while a user-mode process
> keeps a handle to bsg.

Once you've called bsg_unregister_queue(), the queue will be destroyed
and the reference released once the last job is drained, meaning the
user can keep the bsg device open but it will just return errors
because the queue is gone.  This scenario allows removal to proceed
without being held hostage by open devices.

> Even if we somehow implement the waiting, the call may be stuck
> forever if the user-mode process keeps the handle.

No it won't: after blk_cleanup_queue(), the queue is in bypass mode: no
requests queued after this do anything other than complete with error,
so they never make it into SCSI.

> I think handling it via a reference to the module is more consistent
> with the way things are done in Linux. You suggested the approach
> yourself back in the "Waiting for scsi_host_template release" discussion.

That was before I analyzed the code paths.  Module release is tricky,
because module exit won't be called until the references drop to zero,
so you have to be careful not to create a situation where module exit
never gets called; the module exit code has to force everything to
detach and wait for that forcing to complete to break the reference
circularity.  If you do it purely by refcounting, the module may
actually never release (that's why scsi_remove_host works the way it
does, for instance).

James
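The ordering described above might be captured by a block-layer helper
along these lines.  This is a rough, non-compilable sketch against the
request_queue API of that era; the helper name
blk_cleanup_and_drain_queue() and the exact signatures are invented
here, not an existing interface:

```c
/* Hypothetical helper -- name and locking details are illustrative only.
 * blk_cleanup_queue() marks the queue dying/bypassed, so any request
 * submitted afterwards completes with an error instead of reaching SCSI;
 * draining then waits for the requests already in flight, after which
 * the transport class can safely drop its Scsi_Host reference. */
static void blk_cleanup_and_drain_queue(struct request_queue *q)
{
	blk_get_queue(q);	/* keep the queue alive across the drain */
	blk_cleanup_queue(q);	/* bypass mode: new requests fail fast   */
	blk_drain_queue(q);	/* wait for in-flight requests to finish */
	blk_put_queue(q);	/* final put releases the queue          */
}
```

A transport class's remove path (e.g. fc_bsg_remove()) would then call
this instead of the bare cleanup, so removal cannot race a job that is
still executing.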



Re: [PATCH] bsg referencing bus driver module

2018-04-20 Thread James Bottomley
On Thu, 2018-04-19 at 15:10 -0700, Anatoliy Glagolev wrote:
> Updated: rebased on recent Linux, cc-ed maintainers per instructions
> in MAINTAINERS file
> 
> From df939b80d02bf37b21efaaef8ede86cfd39b0cb8 Mon Sep 17 00:00:00
> 2001
> From: Anatoliy Glagolev 
> Date: Thu, 19 Apr 2018 15:06:06 -0600
> Subject: [PATCH] bsg referencing parent module
> 
> Fixing a bug when bsg layer holds the last reference to device
> when the device's module has been unloaded. Upon dropping the
> reference the device's release function may touch memory of
> the unloaded module.
> ---
>  block/bsg-lib.c  | 24 ++--
>  block/bsg.c  | 29 ++---
>  drivers/scsi/scsi_transport_fc.c |  8 ++--
>  include/linux/bsg-lib.h  |  4 
>  include/linux/bsg.h  |  5 +
>  5 files changed, 63 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bsg-lib.c b/block/bsg-lib.c
> index fc2e5ff..90c28fd 100644
> --- a/block/bsg-lib.c
> +++ b/block/bsg-lib.c
> @@ -309,6 +309,25 @@ struct request_queue *bsg_setup_queue(struct device *dev, const char *name,
>   bsg_job_fn *job_fn, int dd_job_size,
>   void (*release)(struct device *))
>  {
> + return bsg_setup_queue_ex(dev, name, job_fn, dd_job_size, release,
> + NULL);
> +}
> +EXPORT_SYMBOL_GPL(bsg_setup_queue);
> +
> +/**
> + * bsg_setup_queue_ex - Create and add the bsg hooks so we can receive requests
> + * @dev: device to attach bsg device to
> + * @name: device to give bsg device
> + * @job_fn: bsg job handler
> + * @dd_job_size: size of LLD data needed for each job
> + * @release: @dev release function
> + * @dev_module: @dev's module
> + */
> +struct request_queue *bsg_setup_queue_ex(struct device *dev, const char *name,
> + bsg_job_fn *job_fn, int dd_job_size,
> + void (*release)(struct device *),
> + struct module *dev_module)

This patch isn't applyable because your mailer has changed all the tabs
to spaces.

I also think there's no need to do it this way.  I think what we need
is for fc_bsg_remove() to wait until the bsg queue is drained.  It does
look like the author thought this happened otherwise the code wouldn't
have the note.  If we fix it that way we can do the same thing in all
the other transport classes that use bsg (which all have a similar
issue).

James



Re: usercopy whitelist woe in scsi_sense_cache

2018-04-17 Thread James Bottomley
On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote:
> I still haven't figured this out, though... anyone have a moment to
> look at this?

Just to let you know you're not alone ... but I can't make any sense of
this either.  The bfqd is the elevator_data, which is initialised when
the scheduler is attached, so it shouldn't change.  Is it possible to
set a data break point on elevator_data after it's initialised and see
if it got changed by something?
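For reference, a data break point of this kind can be set as a hardware
watchpoint from kgdb/gdb.  An illustrative session only: the address is
made up, and `q` stands for whichever request_queue is under suspicion:

```
(gdb) p &((struct request_queue *)q)->elevator->elevator_data
$1 = (void **) 0xffff88025e75a0b8
(gdb) watch -l *$1
Hardware watchpoint 1: -location *$1
(gdb) continue
```

The watchpoint fires on any write to that word, which would identify the
code path that corrupts elevator_data after initialisation.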

James



Re: [PATCH] scsi: resolve COMMAND_SIZE at compile time

2018-03-10 Thread James Bottomley
On Sat, 2018-03-10 at 14:29 +0100, Stephen Kitt wrote:
> Hi Bart,
> 
> On Fri, 9 Mar 2018 22:47:12 +, Bart Van Assche wrote:
> > 
> > On Fri, 2018-03-09 at 23:33 +0100, Stephen Kitt wrote:
> > > 
> > > +/*
> > > + * SCSI command sizes are as follows, in bytes, for fixed size
> > > + * commands, per group: 6, 10, 10, 12, 16, 12, 10, 10. The top
> > > + * three bits of an opcode determine its group.
> > > + * The size table is encoded into a 32-bit value by subtracting
> > > + * each value from 16, resulting in a value of 1715488362
> > > + * (6 << 28 + 6 << 24 + 4 << 20 + 0 << 16 + 4 << 12 + 6 << 8 + 6 << 4 + 10).
> > > + * Command group 3 is reserved and should never be used.
> > > + */
> > > +#define COMMAND_SIZE(opcode) \
> > > +	(16 - (15 & (1715488362 >> (4 * (((opcode) >> 5) & 7)))))
> > 
> > To me this seems hard to read and hard to verify. Could this have
> > been
> > written as a combination of ternary expressions, e.g. using a gcc
> > statement
> > expression to ensure that opcode is evaluated once?
> 
> That’s what I’d tried initially, e.g.
> 
> #define COMMAND_SIZE(opcode) ({ \
> 	int index = ((opcode) >> 5) & 7; \
> 	index == 0 ? 6 : (index == 4 ? 16 : index == 3 || index == 5 ? 12 : 10); \
> })
> 
> But gcc still reckons that results in a VLA, defeating the initial
> purpose of
> the exercise.
> 
> Does it help if I make the magic value construction clearer?
> 
> #define SCSI_COMMAND_SIZE_TBL (  \
> 	 (16 -  6)          \
> 	+ ((16 - 10) <<  4) \
> 	+ ((16 - 10) <<  8) \
> 	+ ((16 - 12) << 12) \
> 	+ ((16 - 16) << 16) \
> 	+ ((16 - 12) << 20) \
> 	+ ((16 - 10) << 24) \
> 	+ ((16 - 10) << 28))
> 
> #define COMMAND_SIZE(opcode) \
> 	(16 - (15 & (SCSI_COMMAND_SIZE_TBL >> (4 * (((opcode) >> 5) & 7)))))

Couldn't we do the less clever thing of making the array a static const
and moving it to a header?  That way the compiler should be able to
work it out at compile time.

James


signature.asc
Description: This is a digitally signed message part


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
> On 1/29/18 1:56 PM, James Bottomley wrote:
> > 
> > On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> > [...]
> > > 
> > > 2. When to enable SCSI_MQ at default again?
> > 
> > I'm not sure there's much to discuss ... I think the basic answer
> > is as soon as Christoph wants to try it again.
> 
> FWIW, internally I've been running various IO intensive workloads on
> what is essentially 4.12 upstream with scsi-mq the default (with
> mq-deadline as the scheduler) and comparing IO workloads with a
> previous 4.6 kernel (without scsi-mq), and things are looking
> great.
> 
> We're never going to iron out the last kinks with it being off
> by default; I think we should attempt to flip the switch again
> for 4.16.

Absolutely, I agree we turn it on ASAP.  I just don't want to be on the
receiving end of Linus' flamethrower because a bug we already had
reported against scsi-mq caused problems.  Get confirmation from the
original reporters (or as close to it as you can) that their problems
are fixed and we're good to go; he won't kick us nearly as hard for new
bugs that turn up.

James



Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
[...]
> 2. When to enable SCSI_MQ at default again?

I'm not sure there's much to discuss ... I think the basic answer is as
soon as Christoph wants to try it again.

> SCSI_MQ is enabled on V3.17 firstly, but disabled at default. In
> V4.13-rc1, it is enabled at default, but later the patch is reverted
> in V4.13-rc7, and becomes disabled at default too.
> 
> Now both the original reported PM issue(actually SCSI quiesce) and
> the sequential IO performance issue have been addressed.

Is the blocker bug just not closed because no-one thought to do it:

https://bugzilla.kernel.org/show_bug.cgi?id=178381

(we have confirmed that this issue is now fixed with the original
reporter?)

And did the Huawei guy (Jonathan Cameron) confirm his performance issue
was fixed (I don't think I saw email that he did)?

James




Re: [LSF/MM TOPIC] Springfield: smart and automated storage

2018-01-27 Thread James Bottomley
On Sat, 2018-01-27 at 09:37 +0100, Jan Tulak wrote:
> Springfield is a collection of projects unifying multiple levels of
> the storage stack and providing a general API for automation, health
> and status monitoring, as well as sane and easy configuration across
> multiple levels of the storage stack. It is a scalable solution,
> working from a single node to large deployments with a mix of bare
> metal, containers and VMs. It was first presented at Vault 2017 in a
> birds-of-a-feather session.
> 
> Springfield builds upon and enhances the existing work of those
> projects:
> * udisks
> * libblockdev
> * blivet
> * libstoragemgmt
> 
> It is a coordinated effort to overcome many of the shortcomings of
> the current situation and provide a unified approach to storage
> management, so there is nothing like a binary or a library with the
> name “Springfield.”
> 
> An example of things we are trying to tackle is a useful storage
> reporting, like notifications about filesystem being full, especially
> with thin provisioning and multiple levels of the storage stack,
> where
> the user-visible status is not a true representation of what is
> happening underneath.
> 
> We want to see how the community sees this effort. What shortcomings
> of the current situation in storage layering can Springfield address
> (is there something we don’t see, but a project like this could
> help)? How would you use it, or what would make it more useful to
> you?

Do you have a link to the project page and the source tree(s)?

James



Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread James Bottomley
On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > 
> > 
> > I see the improvements that Facebook have been making to the nbd
> > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > this topic is simply: "Shut up, Matthew, this is good enough".
> > 
> > It's clear that there's an appetite for userspace block devices;
> > not for swap devices or the root device, but for accessing data
> > that's stored in that silo over there, and I really don't want to
> > bring that entire mess of CORBA / Go / Rust / whatever into the
> > kernel to get to it, but it would be really handy to present it as
> > a block device.
> 
> ... and using iSCSI was too painful and heavyweight.

From what I've seen a reasonable number of storage over IP cloud
implementations are actually using AoE.  The argument goes that the
protocol is about ideal (at least as compared to iSCSI or FCoE) and the
company behind it doesn't seem to want to add any more features that
would bloat it.

James



Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-05 Thread James Bottomley
On Wed, 2017-12-06 at 00:38 +0800, Ming Lei wrote:
> On Tue, Dec 05, 2017 at 04:22:33PM +, Bart Van Assche wrote:
> > 
> > On Tue, 2017-12-05 at 13:00 +0800, Ming Lei wrote:
> > > 
> > > No, do not mix two different things in one patch, especially the
> > > fix part need to be backported to stable.
> > > 
> > > The fix part should aim at V4.15, and the other part can be a
> > > V4.16 stuff.
> > 
> > Does this mean that you do not plan to post a v5 of your patch and
> > that you want me to rework this patch series? I can do that.
> 
> I believe V4 has been OK for merge, actually the only concern from
> James is that 'set the cmnd to NULL *before* calling free so we
> narrow the race window.', but that isn't required as I explained,
> even though you don't do that in this patch too.
> 
>   https://marc.info/?t=15103046433=1=2
> 
> I will work on V5 if Martin/James thinks it is needed.

I don't buy that it isn't needed.  The point (and the pattern) is that
for a destructive action you set the signal *before* you execute the
action, not after.  The reason should be obvious: if you set it after,
you invite a race where the check says OK but the object has already
gone.  Even if the race is highly unlikely, the point of the pattern
still holds.

James



Re: [PATCH V4] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-15 Thread James Bottomley
On Wed, 2017-11-15 at 18:09 +0800, Ming Lei wrote:
> On Tue, Nov 14, 2017 at 10:14:52AM -0800, James Bottomley wrote:
> > 
> > On Tue, 2017-11-14 at 08:55 +0800, Ming Lei wrote:
> > > 
> > > Hi James,
> > > 
> > > On Mon, Nov 13, 2017 at 10:55:52AM -0800, James Bottomley wrote:
> > > > 
> > > > 
> > > > On Sat, 2017-11-11 at 10:43 +0800, Ming Lei wrote:
> > > > > 
> > > > > 
> > > > > So from CPU1's review, cmd->cmnd is in a remote NUMA node,
> > > > > __scsi_format_command() is executed much slower than
> > > > > mempool_free().
> > > > > So when mempool_free() returns, __scsi_format_command() may
> > > > > not fetched the buffer in L1 cache yet, then use-after-free
> > > > > is still triggered.
> > > > > 
> > > > > That is why I say this use-after-free is inevitable no matter
> > > > > 'setting SCpnt->cmnd to NULL before calling mempool_free()'
> > > > > or not.
> > > > 
> > > > The bottom line is that there are several creative ways around
> > > > this but the proposed code is currently broken and simply
> > > > putting a comment in saying so doesn't make it acceptable.
> > > 
> > > As I explained above, I didn't see one really workable way. Or
> > > please correct it if I am wrong.
> > 
> > I simply can't believe it's beyond the wit of man to solve a use
> > after free race.  About 40% of kernel techniques are devoted to
> > this.  All I really care about is not losing the PI information we
> > previously had.  I agree with Bart that NULL cmnd is a good
> > indicator, so it seems reasonable to use it.  If you have another
> > mechanism, feel free to propose it.
> 
> Hi James,
> 
> This patch is my proposal, no others thought of yet.
> 
> We can fix the use-after-free easily via a lock, RCU and so on, but
> some cost has to be paid. In this case, we can't wait too long in
> show_rq(), otherwise we may lose important debug info, so I do not
> have a better way.
> 
> IMO this use-after-free is actually harmless; I don't think we have to
> fix it, but it would be better not to let utilities warn on this case.

Fine, so lose the snide comment and set the cmnd to NULL *before*
calling free so we narrow the race window.

James



Re: [PATCH V4] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-14 Thread James Bottomley
On Tue, 2017-11-14 at 08:55 +0800, Ming Lei wrote:
> Hi James,
> 
> On Mon, Nov 13, 2017 at 10:55:52AM -0800, James Bottomley wrote:
> > 
> > On Sat, 2017-11-11 at 10:43 +0800, Ming Lei wrote:
> > > 
> > > So from CPU1's review, cmd->cmnd is in a remote NUMA node,
> > > __scsi_format_command() is executed much slower than
> > > mempool_free().
> > > So when mempool_free() returns, __scsi_format_command() may not
> > > fetched the buffer in L1 cache yet, then use-after-free
> > > is still triggered.
> > > 
> > > That is why I say this use-after-free is inevitable no matter
> > > 'setting SCpnt->cmnd to NULL before calling mempool_free()' or
> > > not.
> > 
> > The bottom line is that there are several creative ways around this
> > but the proposed code is currently broken and simply putting a
> > comment in saying so doesn't make it acceptable.
> 
> As I explained above, I didn't see one really workable way. Or please
> correct it if I am wrong.

I simply can't believe it's beyond the wit of man to solve a use after
free race.  About 40% of kernel techniques are devoted to this.  All I
really care about is not losing the PI information we previously had.
 I agree with Bart that NULL cmnd is a good indicator, so it seems
reasonable to use it.  If you have another mechanism, feel free to
propose it.

James



Re: [PATCH V4] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-13 Thread James Bottomley
On Sat, 2017-11-11 at 10:43 +0800, Ming Lei wrote:
> On Fri, Nov 10, 2017 at 08:51:58AM -0800, James Bottomley wrote:
> > 
> > On Fri, 2017-11-10 at 17:01 +0800, Ming Lei wrote:
> > > 
> > > cmd->cmnd can be allocated/freed dynamically in case of
> > > T10_PI_TYPE2_PROTECTION, so we should check it in scsi_show_rq()
> > > because this request may have been freed already here, and cmd-
> > > >cmnd has been set as null.
> > > 
> > > We choose to accept read-after-free and dump request data as far
> > > as possible.
> > > 
> > > This patch fixes the following kernel crash when dumping requests
> > > via block's debugfs interface:
> > > 
> > > [  252.962045] BUG: unable to handle kernel NULL pointer
> > > dereference
> > > at   (null)
> > > [  252.963007] IP: scsi_format_opcode_name+0x1a/0x1c0
> > > [  252.963007] PGD 25e75a067 P4D 25e75a067 PUD 25e75b067 PMD 0
> > > [  252.963007] Oops:  [#1] PREEMPT SMP
> > > [  252.963007] Dumping ftrace buffer:
> > > [  252.963007](ftrace buffer empty)
> > > [  252.963007] Modules linked in: scsi_debug ebtable_filter
> > > ebtables
> > > ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle
> > > ipt_MASQUERADE
> > > nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
> > > nf_defrag_ipv4
> > > nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc
> > > iptable_filter fuse ip_tables sd_mod sg mptsas mptscsih mptbase
> > > crc32c_intel ahci libahci nvme serio_raw scsi_transport_sas
> > > libata
> > > lpc_ich nvme_core virtio_scsi binfmt_misc dm_mod iscsi_tcp
> > > libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
> > > [  252.963007] CPU: 1 PID: 1881 Comm: cat Not tainted 4.14.0-
> > > rc2.blk_mq_io_hang+ #516
> > > [  252.963007] Hardware name: QEMU Standard PC (Q35 + ICH9,
> > > 2009),
> > > BIOS 1.9.3-1.fc25 04/01/2014
> > > [  252.963007] task: 88025e6f6000 task.stack:
> > > c90001bd
> > > [  252.963007] RIP: 0010:scsi_format_opcode_name+0x1a/0x1c0
> > > [  252.963007] RSP: 0018:c90001bd3c50 EFLAGS: 00010286
> > > [  252.963007] RAX: 4843 RBX: 0050 RCX:
> > > 
> > > [  252.963007] RDX:  RSI: 0050 RDI:
> > > c90001bd3cd8
> > > [  252.963007] RBP: c90001bd3c88 R08: 1000 R09:
> > > 
> > > [  252.963007] R10: 880275134000 R11: 88027513406c R12:
> > > 0050
> > > [  252.963007] R13: c90001bd3cd8 R14:  R15:
> > > 
> > > [  252.963007] FS:  7f4d11762700()
> > > GS:88027fc4()
> > > knlGS:
> > > [  252.963007] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  252.963007] CR2:  CR3: 00025e789003 CR4:
> > > 003606e0
> > > [  252.963007] DR0:  DR1:  DR2:
> > > 
> > > [  252.963007] DR3:  DR6: fffe0ff0 DR7:
> > > 0400
> > > [  252.963007] Call Trace:
> > > [  252.963007]  __scsi_format_command+0x27/0xc0
> > > [  252.963007]  scsi_show_rq+0x5c/0xc0
> > > [  252.963007]  ? seq_printf+0x4e/0x70
> > > [  252.963007]  ? blk_flags_show+0x5b/0xf0
> > > [  252.963007]  __blk_mq_debugfs_rq_show+0x116/0x130
> > > [  252.963007]  blk_mq_debugfs_rq_show+0xe/0x10
> > > [  252.963007]  seq_read+0xfe/0x3b0
> > > [  252.963007]  ? __handle_mm_fault+0x631/0x1150
> > > [  252.963007]  full_proxy_read+0x54/0x90
> > > [  252.963007]  __vfs_read+0x37/0x160
> > > [  252.963007]  ? security_file_permission+0x9b/0xc0
> > > [  252.963007]  vfs_read+0x96/0x130
> > > [  252.963007]  SyS_read+0x55/0xc0
> > > [  252.963007]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> > > [  252.963007] RIP: 0033:0x7f4d1127e9b0
> > > [  252.963007] RSP: 002b:7ffd27082568 EFLAGS: 0246
> > > ORIG_RAX:
> > > 
> > > [  252.963007] RAX: ffda RBX: 7f4d1154bb20 RCX:
> > > 7f4d1127e9b0
> > > [  252.963007] RDX: 0002 RSI: 7f4d115a7000 RDI:
> > > 0003
> > > [  252.963007] RBP: 00021010 R08:  R09:
> > > 
> > > [  252.963007] R10: 037b R11: 0246 R12:
> > > 

Re: [PATCH V4] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-10 Thread James Bottomley
On Fri, 2017-11-10 at 17:01 +0800, Ming Lei wrote:
> cmd->cmnd can be allocated/freed dynamically in case of
> T10_PI_TYPE2_PROTECTION,
> so we should check it in scsi_show_rq() because this request may have
> been freed already here, and cmd->cmnd has been set as null.
> 
> We choose to accept read-after-free and dump request data as far as
> possible.
> 
> This patch fixes the following kernel crash when dumping requests via
> block's debugfs interface:
> 
> [  252.962045] BUG: unable to handle kernel NULL pointer dereference
> at   (null)
> [  252.963007] IP: scsi_format_opcode_name+0x1a/0x1c0
> [  252.963007] PGD 25e75a067 P4D 25e75a067 PUD 25e75b067 PMD 0
> [  252.963007] Oops:  [#1] PREEMPT SMP
> [  252.963007] Dumping ftrace buffer:
> [  252.963007](ftrace buffer empty)
> [  252.963007] Modules linked in: scsi_debug ebtable_filter ebtables
> ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE
> nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc
> iptable_filter fuse ip_tables sd_mod sg mptsas mptscsih mptbase
> crc32c_intel ahci libahci nvme serio_raw scsi_transport_sas libata
> lpc_ich nvme_core virtio_scsi binfmt_misc dm_mod iscsi_tcp
> libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
> [  252.963007] CPU: 1 PID: 1881 Comm: cat Not tainted 4.14.0-
> rc2.blk_mq_io_hang+ #516
> [  252.963007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.9.3-1.fc25 04/01/2014
> [  252.963007] task: 88025e6f6000 task.stack: c90001bd
> [  252.963007] RIP: 0010:scsi_format_opcode_name+0x1a/0x1c0
> [  252.963007] RSP: 0018:c90001bd3c50 EFLAGS: 00010286
> [  252.963007] RAX: 4843 RBX: 0050 RCX:
> 
> [  252.963007] RDX:  RSI: 0050 RDI:
> c90001bd3cd8
> [  252.963007] RBP: c90001bd3c88 R08: 1000 R09:
> 
> [  252.963007] R10: 880275134000 R11: 88027513406c R12:
> 0050
> [  252.963007] R13: c90001bd3cd8 R14:  R15:
> 
> [  252.963007] FS:  7f4d11762700() GS:88027fc4()
> knlGS:
> [  252.963007] CS:  0010 DS:  ES:  CR0: 80050033
> [  252.963007] CR2:  CR3: 00025e789003 CR4:
> 003606e0
> [  252.963007] DR0:  DR1:  DR2:
> 
> [  252.963007] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  252.963007] Call Trace:
> [  252.963007]  __scsi_format_command+0x27/0xc0
> [  252.963007]  scsi_show_rq+0x5c/0xc0
> [  252.963007]  ? seq_printf+0x4e/0x70
> [  252.963007]  ? blk_flags_show+0x5b/0xf0
> [  252.963007]  __blk_mq_debugfs_rq_show+0x116/0x130
> [  252.963007]  blk_mq_debugfs_rq_show+0xe/0x10
> [  252.963007]  seq_read+0xfe/0x3b0
> [  252.963007]  ? __handle_mm_fault+0x631/0x1150
> [  252.963007]  full_proxy_read+0x54/0x90
> [  252.963007]  __vfs_read+0x37/0x160
> [  252.963007]  ? security_file_permission+0x9b/0xc0
> [  252.963007]  vfs_read+0x96/0x130
> [  252.963007]  SyS_read+0x55/0xc0
> [  252.963007]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  252.963007] RIP: 0033:0x7f4d1127e9b0
> [  252.963007] RSP: 002b:7ffd27082568 EFLAGS: 0246 ORIG_RAX:
> 
> [  252.963007] RAX: ffda RBX: 7f4d1154bb20 RCX:
> 7f4d1127e9b0
> [  252.963007] RDX: 0002 RSI: 7f4d115a7000 RDI:
> 0003
> [  252.963007] RBP: 00021010 R08:  R09:
> 
> [  252.963007] R10: 037b R11: 0246 R12:
> 00022000
> [  252.963007] R13: 7f4d1154bb78 R14: 1000 R15:
> 0002
> [  252.963007] Code: c6 e8 1b ca 24 00 eb 8c e8 74 2c ae ff 0f 1f 40
> 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 89 f4
> 48 83 ec 18 <44> 0f b6 32 48 c7 45 c8 00 00 00 00 65 48 8b 04 25 28
> 00 00 00
> [  252.963007] RIP: scsi_format_opcode_name+0x1a/0x1c0 RSP:
> c90001bd3c50
> [  252.963007] CR2: 
> [  252.963007] ---[ end trace 83c5bddfbaa6573c ]---
> [  252.963007] Kernel panic - not syncing: Fatal exception
> [  252.963007] Dumping ftrace buffer:
> [  252.963007](ftrace buffer empty)
> [  252.963007] Kernel Offset: disabled
> [  252.963007] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Fixes: 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")
> Cc: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Omar Sandoval <osan...@fb.com>
> Cc: Martin K. Petersen <martin.peter...@oracle.com>
> Cc: James Bottomley <

Re: [PATCH V3] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-08 Thread James Bottomley
On Wed, 2017-11-08 at 21:21 +0800, Ming Lei wrote:
> cmd->cmnd can be allocated/freed dynamically in case of
> T10_PI_TYPE2_PROTECTION,
> so we should check it in scsi_show_rq() because this request may have
> been freed by the time scsi_show_rq() runs.
> 
> This patch fixes the following kernel crash when dumping requests via
> block's debugfs interface:
> 
> [  252.962045] BUG: unable to handle kernel NULL pointer dereference
> at   (null)
> [  252.963007] IP: scsi_format_opcode_name+0x1a/0x1c0
> [  252.963007] PGD 25e75a067 P4D 25e75a067 PUD 25e75b067 PMD 0
> [  252.963007] Oops:  [#1] PREEMPT SMP
> [  252.963007] Dumping ftrace buffer:
> [  252.963007](ftrace buffer empty)
> [  252.963007] Modules linked in: scsi_debug ebtable_filter ebtables
> ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE
> nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc
> iptable_filter fuse ip_tables sd_mod sg mptsas mptscsih mptbase
> crc32c_intel ahci libahci nvme serio_raw scsi_transport_sas libata
> lpc_ich nvme_core virtio_scsi binfmt_misc dm_mod iscsi_tcp
> libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
> [  252.963007] CPU: 1 PID: 1881 Comm: cat Not tainted 4.14.0-
> rc2.blk_mq_io_hang+ #516
> [  252.963007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.9.3-1.fc25 04/01/2014
> [  252.963007] task: 88025e6f6000 task.stack: c90001bd
> [  252.963007] RIP: 0010:scsi_format_opcode_name+0x1a/0x1c0
> [  252.963007] RSP: 0018:c90001bd3c50 EFLAGS: 00010286
> [  252.963007] RAX: 4843 RBX: 0050 RCX:
> 
> [  252.963007] RDX:  RSI: 0050 RDI:
> c90001bd3cd8
> [  252.963007] RBP: c90001bd3c88 R08: 1000 R09:
> 
> [  252.963007] R10: 880275134000 R11: 88027513406c R12:
> 0050
> [  252.963007] R13: c90001bd3cd8 R14:  R15:
> 
> [  252.963007] FS:  7f4d11762700() GS:88027fc4()
> knlGS:
> [  252.963007] CS:  0010 DS:  ES:  CR0: 80050033
> [  252.963007] CR2:  CR3: 00025e789003 CR4:
> 003606e0
> [  252.963007] DR0:  DR1:  DR2:
> 
> [  252.963007] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  252.963007] Call Trace:
> [  252.963007]  __scsi_format_command+0x27/0xc0
> [  252.963007]  scsi_show_rq+0x5c/0xc0
> [  252.963007]  ? seq_printf+0x4e/0x70
> [  252.963007]  ? blk_flags_show+0x5b/0xf0
> [  252.963007]  __blk_mq_debugfs_rq_show+0x116/0x130
> [  252.963007]  blk_mq_debugfs_rq_show+0xe/0x10
> [  252.963007]  seq_read+0xfe/0x3b0
> [  252.963007]  ? __handle_mm_fault+0x631/0x1150
> [  252.963007]  full_proxy_read+0x54/0x90
> [  252.963007]  __vfs_read+0x37/0x160
> [  252.963007]  ? security_file_permission+0x9b/0xc0
> [  252.963007]  vfs_read+0x96/0x130
> [  252.963007]  SyS_read+0x55/0xc0
> [  252.963007]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  252.963007] RIP: 0033:0x7f4d1127e9b0
> [  252.963007] RSP: 002b:7ffd27082568 EFLAGS: 0246 ORIG_RAX:
> 
> [  252.963007] RAX: ffda RBX: 7f4d1154bb20 RCX:
> 7f4d1127e9b0
> [  252.963007] RDX: 0002 RSI: 7f4d115a7000 RDI:
> 0003
> [  252.963007] RBP: 00021010 R08:  R09:
> 
> [  252.963007] R10: 037b R11: 0246 R12:
> 00022000
> [  252.963007] R13: 7f4d1154bb78 R14: 1000 R15:
> 0002
> [  252.963007] Code: c6 e8 1b ca 24 00 eb 8c e8 74 2c ae ff 0f 1f 40
> 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 89 f4
> 48 83 ec 18 <44> 0f b6 32 48 c7 45 c8 00 00 00 00 65 48 8b 04 25 28
> 00 00 00
> [  252.963007] RIP: scsi_format_opcode_name+0x1a/0x1c0 RSP:
> c90001bd3c50
> [  252.963007] CR2: 
> [  252.963007] ---[ end trace 83c5bddfbaa6573c ]---
> [  252.963007] Kernel panic - not syncing: Fatal exception
> [  252.963007] Dumping ftrace buffer:
> [  252.963007](ftrace buffer empty)
> [  252.963007] Kernel Offset: disabled
> [  252.963007] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Fixes: 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")
> Cc: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Omar Sandoval <osan...@fb.com>
> Cc: Martin K. Petersen <martin.peter...@oracle.com>
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>
> Cc: Hannes Reinecke <h...@suse.com>
> Signed-off-by: Ming Lei <ming.

Re: [PATCH V2] scsi_debugfs: fix crash in scsi_show_rq()

2017-11-07 Thread James Bottomley
On Wed, 2017-11-08 at 09:15 +0800, Ming Lei wrote:
> On Wed, Nov 08, 2017 at 01:06:44AM +, Bart Van Assche wrote:
> > 
> > On Wed, 2017-11-08 at 08:59 +0800, Ming Lei wrote:
> > > 
> > > On Tue, Nov 07, 2017 at 04:13:48PM +, Bart Van Assche wrote:
> > > > 
> > > > On Tue, 2017-11-07 at 23:21 +0800, Ming Lei wrote:
> > > > > 
> > > > > diff --git a/drivers/scsi/scsi_debugfs.c
> > > > > b/drivers/scsi/scsi_debugfs.c
> > > > > index 5e9755008aed..7a50878446b4 100644
> > > > > --- a/drivers/scsi/scsi_debugfs.c
> > > > > +++ b/drivers/scsi/scsi_debugfs.c
> > > > > @@ -9,7 +9,10 @@ void scsi_show_rq(struct seq_file *m, struct
> > > > > request *rq)
> > > > >   int msecs = jiffies_to_msecs(jiffies - cmd-
> > > > > >jiffies_at_alloc);
> > > > >   char buf[80];
> > > > >  
> > > > > - __scsi_format_command(buf, sizeof(buf), cmd->cmnd,
> > > > > cmd->cmd_len);
> > > > > + if (cmd->cmnd == scsi_req(rq)->cmd)
> > > > > + __scsi_format_command(buf, sizeof(buf), cmd-
> > > > > >cmnd, cmd->cmd_len);
> > > > > + else
> > > > > + strcpy(buf, "unknown");
> > > > >   seq_printf(m, ", .cmd=%s, .retries=%d, allocated
> > > > > %d.%03d s ago", buf,
> > > > >      cmd->retries, msecs / 1000, msecs %
> > > > > 1000);
> > > > >  }
> > > > 
> > > > This change introduces a new bug, namely that "unknown" will
> > > > appear in the debugfs output if (cmd->cmnd != scsi_req(rq)-
> > > > >cmd). Have you considered to use
> > > 
> > > Because there isn't reliable way to get the command safely, and I
> > > don't think it is a new bug.
> > > 
> > > > 
> > > > the test (cmd->cmnd != NULL) instead?
> > > 
> > > No, that is worse, because you may cause use-after-free if you
> > > read a freed buffer.
> > 
> > It seems like your operating mode is still to contradict all
> > feedback you get instead of trying to address the feedback you get?
> > 
> > Anyway, this is a debugging facility so I'm not convinced that
> > avoiding a (very sporadic) use-after-free in this code is better
> > than omitting very useful output.
> 
> OK, if no one objects to the use-after-free, though this way may
> trigger a warning from some utility, such as KASAN.

Good grief, this list is supposed to be for technical discussions not
kindergarten playground squabbles; could you both please act your age?

The patch as proposed would lose us all information about PI commands,
hence the objection.  For the concern about use after free on the NULL
check, what about modifying sd_uninit_command() to NULL SCpnt->cmnd
before it calls mempool_free()?  That will likely eliminate the race
window for printing the command.

James



Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-11 Thread James Bottomley
On Fri, 2017-08-11 at 01:11 -0700, Christoph Hellwig wrote:
> [+ Martin and linux-scsi]
> 
> Given that we need this big pile and a few bfq fixes to avoid
> major regressions I'm tempted to revert the default to scsi-mq
> for 4.14, but bring it back a little later for 4.15.
> 
> What do you think?  Maybe for 4.15 we could also do it through the
> block tree where all the fixes will be queued.

Given the severe workload regressions Mel reported, I think that's
wise.

I also think we wouldn't have found all these problems if it hadn't
been the default, so the original patch was the best way of trying to
find out if we were ready for the switch and forcing all the issues
out.

Thanks,

James



Re: [PATCH] scsi: sanity check for timeout in sg_io()

2017-05-10 Thread James Bottomley
On Wed, 2017-05-10 at 15:24 +0200, Hannes Reinecke wrote:
> sg_io() is using msecs_to_jiffies() to convert a passed in timeout
> value (in milliseconds) to a jiffies value. However, if the value
> is too large msecs_to_jiffies() will return MAX_JIFFY_OFFSET, which
> will be truncated to -2 and cause the timeout to be set to 1.3
> _years_. Which is probably too long for most applications.

What makes you think that?  If a user specified timeout in ms is over
MAX_JIFFY_OFFSET on, say a 250HZ machine, then they're already asking
for something around this length of time in ms, which is probably their
attempt at machine infinity, meaning they want effectively no timeout. 
 Your patch would now give them the default timeout, which looks way
too short.

James

> Signed-off-by: Hannes Reinecke 
> ---
>  block/scsi_ioctl.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 4a294a5..53b95ea 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -231,6 +231,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
>      struct sg_io_hdr *hdr, fmode_t mode)
>  {
>   struct scsi_request *req = scsi_req(rq);
> + unsigned long timeout;
>  
>   if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len))
>   return -EFAULT;
> @@ -242,7 +243,11 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
>*/
>   req->cmd_len = hdr->cmd_len;
>  
> - rq->timeout = msecs_to_jiffies(hdr->timeout);
> + timeout = msecs_to_jiffies(hdr->timeout);
> + if (timeout == MAX_JIFFY_OFFSET)
> + rq->timeout = 0;
> + else
> + rq->timeout = timeout;
>   if (!rq->timeout)
>   rq->timeout = q->sg_timeout;
>   if (!rq->timeout)



Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread James Bottomley
On Fri, 2017-04-21 at 14:30 -0700, Kees Cook wrote:
> On Fri, Apr 21, 2017 at 2:27 PM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> > On Fri, 2017-04-21 at 13:22 -0700, Kees Cook wrote:
> > > On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers <ebigge...@gmail.com> wrote:
> > > > > > Of course, having extra checks behind a debug option is 
> > > > > > fine.  But they should not be part of the base feature; the 
> > > > > > base feature should just be mitigation of reference count
> > > > > > *overflows*.  It would be nice to do more, of course; but 
> > > > > > when the extra stuff prevents people from using refcount_t 
> > > > > > for performance reasons, it defeats the point of the
> > > > > > feature in the first place.
> > > > > 
> > > > > Sure, but as I said above, I think the smaller tricks and 
> > > > > fixes won't be convincing enough, so their value is
> > > > > questionable.
> > > > 
> > > > This makes no sense, as the main point of the feature is 
> > > > supposed to be the security improvement.  As-is, the extra 
> > > > debugging stuff is actually preventing the security improvement 
> > > > from being adopted, which is unfortunate.
> > > 
> > > We've been trying to handle the conflicting desires of those 
> > > wanting very precise refcounting implementation and gaining the 
> > > security protections. Ultimately, the best way forward seemed to 
> > > be to first land the precise refcounting implementation, and 
> > > start conversion until we ran into concerns over performance. 
> > > Now, since we're here, we can move forward with getting a fast 
> > > implementation that provides the desired security protections 
> > > without too greatly messing with the refcount API.
> > 
> > But that's not what it really looks like.  What it looks like is
> > someone came up with a new API and is now intent on forcing it 
> > through the kernel in about 500 patches using security as the
> > hammer.
> 
> The intent is to split refcounting and statistical counters away from
> atomics, since they have distinct APIs. This will let us audit the
> remaining atomic uses much more easily.

But the security problem is counter overflow, right?  That problem, as
far as I can see exists in the atomics as well.  I'm sure there might
be one or two corner cases depending on overflow actually working, but
I can't think of them.

The refcount vs atomic distinction turns on the precise meaning of
decrement to zero.  I'm not saying there's no benefit to detecting the
condition, but the security problem looks much more pressing, which is
why I think this can be argued on the merits later.

> > If we were really concerned about security first, we'd have fixed 
> > the atomic overflow problem in the atomics and then built the 
> > precise refcounting on that and tried to persuade people of the
> > merits.
> 
> I agree, but this approach was NAKed by multiple atomics maintainers.

Overriding that decision by trying to convince all the consumers to
move to a new API doesn't seem to be going so well either.  Perhaps we
could assist you in changing the minds of the atomics maintainers ...
what was the primary problem?  The additional couple of cycles or the
fact that some use cases want overflow (or something else)?

James

> > Why can't we still do this?  It looks like the overflow protection 
> > will add only a couple of cycles to the atomics.  We can move the
> > current version to an unchecked variant which can be used either in 
> > truly performance critical regions with no security implications or 
> > if someone really needs the atomic to overflow.  From my point of 
> > view it would give us the 90% case (security) and stop this endless
> > argument over the fast paths.  Subsystems which have already moved 
> > to refcount would stay there and the rest get to evaluate a 
> > migration on the merits of the operational utility.
> 
> -Kees
> 



Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread James Bottomley
On Fri, 2017-04-21 at 13:22 -0700, Kees Cook wrote:
> On Fri, Apr 21, 2017 at 12:55 PM, Eric Biggers  
> wrote:
> > > > Of course, having extra checks behind a debug option is fine. 
> > > >  But they should not be part of the base feature; the base 
> > > > feature should just be mitigation of reference count
> > > > *overflows*.  It would be nice to do more, of
> > > > course; but when the extra stuff prevents people from using 
> > > > refcount_t for performance reasons, it defeats the point of the
> > > > feature in the first place.
> > > 
> > > Sure, but as I said above, I think the smaller tricks and fixes 
> > > won't be convincing enough, so their value is questionable.
> > 
> > This makes no sense, as the main point of the feature is supposed 
> > to be the security improvement.  As-is, the extra debugging stuff 
> > is actually preventing the security improvement from being adopted,
> > which is unfortunate.
> 
> We've been trying to handle the conflicting desires of those wanting
> very precise refcounting implementation and gaining the security
> protections. Ultimately, the best way forward seemed to be to first
> land the precise refcounting implementation, and start conversion
> until we ran into concerns over performance. Now, since we're here, 
> we can move forward with getting a fast implementation that provides 
> the desired security protections without too greatly messing with the
> refcount API.

But that's not what it really looks like.  What it looks like is
someone came up with a new API and is now intent on forcing it through
the kernel in about 500 patches using security as the hammer.

If we were really concerned about security first, we'd have fixed the
atomic overflow problem in the atomics and then built the precise
refcounting on that and tried to persuade people of the merits.

Why can't we still do this?  It looks like the overflow protection will
add only a couple of cycles to the atomics.  We can move the current
version to an unchecked variant which can be used either in truly
performance critical regions with no security implications or if
someone really needs the atomic to overflow.  From my point of view it
would give us the 90% case (security) and stop this endless argument
over the fast paths.  Subsystems which have already moved to refcount
would stay there and the rest get to evaluate a migration on the merits
of the operational utility.

James



Re: support ranges TRIM for libata

2017-03-23 Thread James Bottomley
On Thu, 2017-03-23 at 14:55 +0100, Christoph Hellwig wrote:
> On Thu, Mar 23, 2017 at 09:47:41AM -0400, James Bottomley wrote:
> > > The current implementation already has the issue that it reliably
> > > corrupts user data when using SG_IO for WRITE SAME
> > > commands.
> > 
> > That does need fixing.
> 
> I don't think it's fixable as long as we translate the data payload.
> 
> > Why can't we do what the t10 sat document recommends: if the ATA 
> > device doesn't support the XL version (32 bit ranges) then 
> > translate unmap to multiple non-XL commands?
> 
> Because libata sits underneath the tag allocator we'd be getting into
> a giant set of problems.  Where do you expect the new commands to
> magically come from?  And adhere to the queueing limits and not 
> actually deadlock in one way or another?

I'm certainly not saying we blindly follow t10, but I believe their
intent is to issue the next command from the completion of the first
(we can do this using qc->complete_fn, like atapi_request_sense).  That
way we don't get any tag problems because there's only one command
outstanding at once; reusing the qc means no allocation issues either.

The t10 approach does mean the SG_IO problem is actually fixable rather
than simply erroring out.

> > I don't necessarily object to the vendor-specific 1<->1 approach;
> > it's just that it won't fix the problem you cited above (SG_IO WRITE
> > SAME): now we error the command instead, which may cause some
> > surprise.
> 
> We now error WRITE SAME for passthrough consistently.  Before we only
> accepted it only with the unmap bit set, and even did so incorrectly
> (not checking that the payload was all zeros).

If it never worked for anyone, I'm OK with changing it to error out.

> > I also wonder if we couldn't simply do an ATA_16 TRIM if we're
> > already going to all the trouble of recognising ATA devices in the 
> > sd discard path?
> 
> We probably could do that as well.  But that would drag a lot more
> ATA-specific code into sd.c than just formatting the ranges.

That's up to you ... from the point of view of code documenting itself,
forming the ATA_16 TRIM in sd and not doing any satl transformation is
easier for others to follow, but if it's going to cause more code, I'm
only marginal on the advantages of easier to follow code.

James



Re: support ranges TRIM for libata

2017-03-23 Thread James Bottomley
On Wed, 2017-03-22 at 19:19 +0100, Christoph Hellwig wrote:
> On Tue, Mar 21, 2017 at 02:59:01PM -0400, Tejun Heo wrote:
> > I do like the fact that this is a lot simpler than the previous
> > implementation but am not quite sure we want to deviate 
> > significantly from what we do for other commands (command 
> > translation).  Is it because fixing the existing implementation 
> > would involve invasive changes including memory allocations?
> 
> The current implementation already has the issue that it reliably
> corrupts user data when using SG_IO for WRITE SAME
> commands.

That does need fixing.

> Doing ranges using translation would turn into a nightmare because
> ATA TRIM ranges are 16 bits long while SCSI UNMAP ranges are 32-bit,
> so we effectively can't translated them without introducing a
> non-standard hook between libata and scsi to communicate that
> limit.

Why can't we do what the t10 sat document recommends: if the ATA device
doesn't support the XL version (32 bit ranges) then translate unmap to
multiple non-XL commands?

I don't necessarily object to the vendor-specific 1<->1 approach; it's
just that it won't fix the problem you cited above (SG_IO WRITE SAME):
now we error the command instead, which may cause some surprise.  I
also wonder if we couldn't simply do an ATA_16 TRIM if we're already
going to all the trouble of recognising ATA devices in the sd discard
path?

James

>   And once we're down that path we might as well just do the
> right thing directly.



Re: [PATCH 1/4] block: Allow bdi re-registration

2017-03-08 Thread James Bottomley
On Wed, 2017-03-08 at 17:55 -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 08, 2017 at 05:48:31PM +0100, Jan Kara wrote:
> > @@ -710,6 +710,11 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
> >  */
> > atomic_dec(&bdi->usage_cnt);
> > wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
> > +   /*
> > +    * Grab back our reference so that we hold it when @bdi gets
> > +    * re-registered.
> > +    */
> > +   atomic_inc(&bdi->usage_cnt);
> 
> So, this is more re-initializing the ref to the initial state so that
> it can be re-used, right?  Maybe ATOMIC_INIT() is a better choice 
> here just to clarify what's going on?

Seconded.  Eventually this is going to get converted to a refcount_t
and it will dump a spurious warning on the 0->1 transition.  We can
avoid that by making this a proper initialization.

James




Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-07 Thread James Bottomley
On Tue, 2017-03-07 at 15:41 +0100, Jan Kara wrote:
> On Mon 06-03-17 09:25:42, James Bottomley wrote:
> > On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 07:44:55, James Bottomley wrote:
> ...
> > > > > Sure. The call trace is:
> > > > > 
> > > > > [   41.919244] [ cut here ]
> > > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > > drivers/scsi/sd.c:3332
> > > > > sd_shutdown+0x2f/0x100
> > > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole 
> > > > > loop btrfs raid6_pq zlib_deflate lzo_compress xor
> > > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 
> > > > > 4.11.0-rc1-xen+ #49
> > > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs
> > > > > 01/01/2011
> > > > > [   41.919331] Call Trace:
> > > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > > [   41.919354]  __warn+0x116/0x120
> > > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > > 
> > > > > *** Here is the unexpected step I guess...
> > > > > 
> > > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > > 
> > > > Exactly.  It's this, I think
> > > > 
> > > > bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
> > > >    !drv->suppress_bind_attrs;
> > > > 
> > > > You have that config option set.
> > > 
> > > Yes - or better said 0-day testing has it set. Maybe that is not 
> > > a sane default for 0-day tests? The option is explicitly marked 
> > > as "unstable"... Fengguang?
> > > 
> > > > So the drivers base layer is calling ->remove after probe and
> > > > triggering the destruction of the queue.
> > > > 
> > > > What to do about this (apart from nuke such a stupid option) is
> > > > somewhat more problematic.
> > > 
> > > I guess this is between you and Greg :).
> > 
> > Nice try, sport ... you qualify just behind Dan in the "not my 
> > problem" olympic hurdles event.  I'm annoyed because there's no 
> > indication in the log that the driver add behaviour is radically 
> > altered, so I've been staring at the wrong code for weeks. 
> >  However, the unbind/rebind should work, so the real issue is that 
> > your bdi changes (or perhaps something else in block) have induced 
> > a regression in the unbinding of upper layer drivers.  If you're 
> > going to release the bdi in del_gendisk, you have to have some 
> > mechanism for re-acquiring it on re-probe (likely with the same 
> > name) otherwise rebind is broken for every block driver.
> 
> So my patch does not release bdi in del_gendisk(). Bdi has two
> initialization / destruction phases (similarly to request queue). You
> allocate and initialize bdi through bdi_init(), then you call
> bdi_register() to register it (which happens in device_add_disk()). 
> On shutdown you have to first call bdi_unregister() (used to be 
> called from blk_cleanup_queue(), my patch moved it to del_gendisk()). 
> After that the last reference to bdi may be dropped which does final 
> bdi destruction.
> 
> So do I understand correctly that SCSI may call device_add_disk(),
> del_gendisk() repeatedly for the same request queue?

Yes.  The upper drivers (sd, sr, st and sg) follow a device model. 
 They can thus be bound and unbound many times during the lifetime of a
SCSI device.

>  If yes, then indeed I have a bug to fix... But gendisk seems to get 
> allocated from scratch on each probe so we don't call 
> device_add_disk(), del_gendisk() more times on the same disk, right?

Right, gendisk, being a generic representation of a disk is a property
of the upper layer drivers.  We actually cheat and use it in all of
them (including the apparent character ones, they use alloc_disk,
put_disk but not add_disk).  So it has to be freed when the driver is
unbound and reallocated when it is bound.  It's the fundamental entity
which embeds the SCSI upper layer driver lifetime.

> > The fact that the second rebind tried with a different name 
> > indicates that sd_devt_release wasn't called, so some vestige of 
> > the devt remains on the subsequent rebind.
> 
> Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now)

Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-06 Thread James Bottomley
On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > Hi,
> > > > 
> > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > error.
> > > > > The attached reproduce-* may help reproduce the issue.
> > > > 
> > > > Thanks for report! So from the stacktrace we are in the path 
> > > > testing removal of a device immediately after it has been
> > > > probed 
> > > > and for some reason bdi_unregister() hangs - likely waiting for
> > > > cgroup-writeback references to drop. Given how early this
> > > > happens 
> > > > my guess is we fail to initialize something but for now I don't
> > > > see 
> > > > how my patch could make a difference. I'm trying to reproduce
> > > > this 
> > > > to be able to debug more...
> > > 
> > > OK, so after some debugging I think this is yet another problem
> > > in 
> > > SCSI initialization / destruction code which my patch only makes 
> > > visible (added relevant maintainers).
> > > 
> > > I can reproduce the problem reliably with enabling:
> > > 
> > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > CONFIG_SCSI_DEBUG=m
> > > CONFIG_BLK_CGROUP=y
> > > CONFIG_MEMCG=y
> > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > > 
> > > then 'modprobe scsi_debug' is all it takes to reproduce hang. 
> > > Relevant kernel messages with some of my debugging added
> > > (attached is 
> > > a patch that adds those debug messages):
> > 
> > This looks to be precisely the same problem Dan Williams was
> > debugging
> > for us.
> > 
> > > [   58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> > > [   58.721765]   dev_size_mb=8, opts=0x0, submit_queues=1,
> > > statistics=0
> > > [   58.728946] CGWB init 88007fbb2000
> > > [   58.730095] Created sdev 880078e1a000
> > > [   58.731611] scsi 0:0:0:0: Direct-Access Linux   
> > >  scsi_debug
> > > 0186 PQ : 0 ANSI: 7
> > > [   58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > > (8.39
> > > MB/8.00 MiB)
> > > [   58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > [   58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > [   58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > cache:
> > > enabled, supports DPO and FUA
> > > [   58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > [   58.896808] Unreg1
> > > [   58.897960] Unreg2
> > > [   58.898637] Unreg3
> > > [   58.899100] CGWB 88007fbb2000 usage_cnt: 0
> > > [   58.94] Unreg4
> > > [   58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > 
> > OK, can you put a WARN_ON trace in sd_shutdown and tell us where
> > this
> > is coming from.  For the device to be reused after this we have to
> > be
> > calling sd_shutdown() without going into SDEV_DEL.
> 
> Sure. The call trace is:
> 
> [   41.919244] [ cut here ]
> [   41.919263] WARNING: CPU: 4 PID: 2335 at drivers/scsi/sd.c:3332
> sd_shutdown+0x2f/0x100
> [   41.919268] Modules linked in: scsi_debug(+) netconsole loop btrfs
> raid6_pq zlib_deflate lzo_compress xor
> [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0-rc1
> -xen+ #49
> [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   41.919331] Call Trace:
> [   41.919343]  dump_stack+0x8e/0xf0
> [   41.919354]  __warn+0x116/0x120
> [   41.919361]  warn_slowpath_null+0x1d/0x20
> [   41.919368]  sd_shutdown+0x2f/0x100
> [   41.919374]  sd_remove+0x70/0xd0
> 
> *** Here is the unexpected step I guess...
> 
> [   41.919383]  driver_probe_device+0xe0/0x4c0

Exactly.  It's this, I think

bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
   !drv->suppress_bind_attrs;

You have that config option set.

So the drivers base layer is calling ->remove after probe and
triggering the destruction of the queue.

What to do about this (apart from nuke such a stupid option) is
somewhat more problematic.

James



Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-06 Thread James Bottomley
On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> On Mon 06-03-17 07:44:55, James Bottomley wrote:
> > On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > > > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > > > error.
> > > > > > > The attached reproduce-* may help reproduce the issue.
> > > > > > 
> > > > > > Thanks for report! So from the stacktrace we are in the
> > > > > > path 
> > > > > > testing removal of a device immediately after it has been
> > > > > > probed 
> > > > > > and for some reason bdi_unregister() hangs - likely waiting
> > > > > > for
> > > > > > cgroup-writeback references to drop. Given how early this
> > > > > > happens 
> > > > > > my guess is we fail to initialize something but for now I
> > > > > > don't
> > > > > > see 
> > > > > > how my patch could make a difference. I'm trying to
> > > > > > reproduce
> > > > > > this 
> > > > > > to be able to debug more...
> > > > > 
> > > > > OK, so after some debugging I think this is yet another
> > > > > problem
> > > > > in 
> > > > > SCSI initialization / destruction code which my patch only
> > > > > makes 
> > > > > visible (added relevant maintainers).
> > > > > 
> > > > > I can reproduce the problem reliably with enabling:
> > > > > 
> > > > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > > > CONFIG_SCSI_DEBUG=m
> > > > > CONFIG_BLK_CGROUP=y
> > > > > CONFIG_MEMCG=y
> > > > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > > > > 
> > > > > then 'modprobe scsi_debug' is all it takes to reproduce hang.
> > > > > Relevant kernel messages with some of my debugging added
> > > > > (attached is 
> > > > > a patch that adds those debug messages):
> > > > 
> > > > This looks to be precisely the same problem Dan Williams was
> > > > debugging
> > > > for us.
> > > > 
> > > > > [   58.721765] scsi host0: scsi_debug: version 1.86
> > > > > [20160430]
> > > > > [   58.721765]   dev_size_mb=8, opts=0x0, submit_queues=1,
> > > > > statistics=0
> > > > > [   58.728946] CGWB init 88007fbb2000
> > > > > [   58.730095] Created sdev 880078e1a000
> > > > > [   58.731611] scsi 0:0:0:0: Direct-Access Linux   
> > > > >  scsi_debug
> > > > > 0186 PQ : 0 ANSI: 7
> > > > > [   58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical
> > > > > blocks:
> > > > > (8.39
> > > > > MB/8.00 MiB)
> > > > > [   58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > > > [   58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > > > [   58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > > > cache:
> > > > > enabled, supports DPO and FUA
> > > > > [   58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > > > [   58.896808] Unreg1
> > > > > [   58.897960] Unreg2
> > > > > [   58.898637] Unreg3
> > > > > [   58.899100] CGWB 88007fbb2000 usage_cnt: 0
> > > > > [   58.94] Unreg4
> > > > > [   58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > > > 
> > > > OK, can you put a WARN_ON trace in sd_shutdown and tell us
> > > > where
> > > > this
> > > > is coming from.  For the device to be reused after this we have
> > > > to
> > > > be
> > > > calling sd_shutdown() without going into SDEV_DEL.
> > > 
> > > Sure. The call trace is:
> > > 
> > > [   41.919244] [ cut here ]
> > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > drivers/scsi/sd.c:3332
> > > sd_shutdown+0x2f/0x100
> > > [   41.919268] Modules linked

Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.

2017-03-06 Thread James Bottomley
On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > Hi,
> > 
> > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > FYI next-20170303 is good while mainline is bad with this error.
> > > The attached reproduce-* may help reproduce the issue.
> > 
> > Thanks for report! So from the stacktrace we are in the path 
> > testing removal of a device immediately after it has been probed 
> > and for some reason bdi_unregister() hangs - likely waiting for 
> > cgroup-writeback references to drop. Given how early this happens 
> > my guess is we fail to initialize something but for now I don't see 
> > how my patch could make a difference. I'm trying to reproduce this 
> > to be able to debug more...
> 
> OK, so after some debugging I think this is yet another problem in 
> SCSI initialization / destruction code which my patch only makes 
> visible (added relevant maintainers).
> 
> I can reproduce the problem reliably with enabling:
> 
> CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> CONFIG_SCSI_DEBUG=m
> CONFIG_BLK_CGROUP=y
> CONFIG_MEMCG=y
> (and thus CONFIG_CGROUP_WRITEBACK=y)
> 
> then 'modprobe scsi_debug' is all it takes to reproduce hang. 
> Relevant kernel messages with some of my debugging added (attached is 
> a patch that adds those debug messages):

This looks to be precisely the same problem Dan Williams was debugging
for us.

> [   58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> [   58.721765]   dev_size_mb=8, opts=0x0, submit_queues=1,
> statistics=0
> [   58.728946] CGWB init 88007fbb2000
> [   58.730095] Created sdev 880078e1a000
> [   58.731611] scsi 0:0:0:0: Direct-Access Linux scsi_debug
> 0186 PQ: 0 ANSI: 7
> [   58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39
> MB/8.00 MiB)
> [   58.789687] sd 0:0:0:0: [sda] Write Protect is off
> [   58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> [   58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> enabled, supports DPO and FUA
> [   58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> [   58.896808] Unreg1
> [   58.897960] Unreg2
> [   58.898637] Unreg3
> [   58.899100] CGWB 88007fbb2000 usage_cnt: 0
> [   58.94] Unreg4
> [   58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache

OK, can you put a WARN_ON trace in sd_shutdown and tell us where this
is coming from.  For the device to be reused after this we have to be
calling sd_shutdown() without going into SDEV_DEL.

Thanks,

James



Re: [LSF/MM TOPIC] do we really need PG_error at all?

2017-02-26 Thread James Bottomley
On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
> On Sun, Feb 26 2017, James Bottomley wrote:
> 
> > [added linux-scsi and linux-block because this is part of our error
> > handling as well]
> > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me 
> > > just not understanding the semantics here.
> > > 
> > > As I was looking into -ENOSPC handling in cephfs, I noticed that
> > > PG_error is only ever tested in one place [1] 
> > > __filemap_fdatawait_range, which does this:
> > > 
> > >   if (TestClearPageError(page))
> > >   ret = -EIO;
> > > 
> > > This error code will override any AS_* error that was set in the
> > > mapping. Which makes me wonder...why don't we just set this error 
> > > in the mapping and not bother with a per-page flag? Could we
> > > potentially free up a page flag by eliminating this?
> > 
> > Note that currently the AS_* codes are only set for write errors 
> > not for reads and we have no mapping error handling at all for swap
> > pages, but I'm sure this is fixable.
> 
> How is a read error different from a failure to set PG_uptodate?
> Does PG_error suppress retries?

We don't do any retries in the code above the block layer (or at least
we shouldn't).  

> > 
> > From the I/O layer point of view we take great pains to try to 
> > pinpoint the error exactly to the sector.  We reflect this up by 
> > setting the PG_error flag on the page where the error occurred.  If 
> > we only set the error on the mapping, we lose that granularity, 
> > because the mapping is mostly at the file level (or VMA level for
> > anon pages).
> 
> Are you saying that the IO layer finds the page in the bi_io_vec and
> explicitly sets PG_error,

I didn't say anything about the mechanism.  I think the function you're
looking for is fs/mpage.c:mpage_end_io().  layers below block indicate
the position in the request.  Block maps the position to bio and the
bio completion maps to page.  So the actual granularity seen in the
upper layer depends on how the page to bio mapping is done.

>  rather than just passing an error indication to bi_end_io ??  That
> would seem to be wrong as the page may not be in the page cache.

Usually pages in the mpage_end_io path are pinned, I think.

>  So I guess I misunderstand you.
> 
> > 
> > So I think the question for filesystem people from us would be do 
> > you care about this accuracy?  If it's OK just to know an error
> > occurred somewhere in this file, then perhaps we don't need it.
> 
> I had always assumed that a bio would either succeed or fail, and 
> that no finer granularity could be available.

It does ... but a bio can be as small as a single page.

> I think the question here is: Do filesystems need the pagecache to
> record which pages have seen an IO error?

It's not just filesystems.  The partition code uses PageError() ... the
metadata code might as well (those are things with no mapping).  I'm
not saying we can't remove PG_error; I am saying it's not going to be
quite as simple as using the AS_ flags.

James

> I think that for write errors, there is no value in recording
> block-oriented error status - only file-oriented status.
> For read errors, it might if help to avoid indefinite read retries, 
> but I don't know the code well enough to be sure if this is an issue.
> 
> NeilBrown
> 
> 
> > 
> > James
> > 
> > > The main argument I could see for keeping it is that removing it 
> > > might subtly change the behavior of sync_file_range if you have 
> > > tasks syncing different ranges in a file concurrently. I'm not 
> > > sure if that would break any guarantees though.
> > > 
> > > Even if we do need it, I think we might need some cleanup here 
> > > anyway. A lot of readpage operations end up setting that flag 
> > > when they hit an error. Isn't it wrong to return an error on 
> > > fsync, just because we had a read error somewhere in the file in 
> > > a range that was never dirtied?
> > > 
> > > --
> > > [1]: there is another place in f2fs, but it's more or less 
> > > equivalent to the call site in __filemap_fdatawait_range.
> > > 




Re: [LSF/MM TOPIC] do we really need PG_error at all?

2017-02-26 Thread James Bottomley
[added linux-scsi and linux-block because this is part of our error
handling as well]
On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
> Proposing this as a LSF/MM TOPIC, but it may turn out to be me just 
> not understanding the semantics here.
> 
> As I was looking into -ENOSPC handling in cephfs, I noticed that
> PG_error is only ever tested in one place [1] 
> __filemap_fdatawait_range, which does this:
> 
>   if (TestClearPageError(page))
>   ret = -EIO;
> 
> This error code will override any AS_* error that was set in the
> mapping. Which makes me wonder...why don't we just set this error in 
> the mapping and not bother with a per-page flag? Could we potentially
> free up a page flag by eliminating this?
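The override being asked about can be sketched in a few lines (a userspace model, not kernel code; the error numbers and the clear-then-override flow follow the quoted snippet):

```python
# Sketch of the quoted behaviour: any page with PG_error set turns the
# whole wait into -EIO, clobbering whatever AS_* error the mapping had
# recorded (e.g. -ENOSPC), even if PG_error came from a read error in
# a range that was never dirtied.

EIO, ENOSPC = 5, 28

def filemap_fdatawait_range(pages, mapping_error):
    ret = -mapping_error if mapping_error else 0
    for page in pages:
        if page["pg_error"]:
            page["pg_error"] = False   # TestClearPageError(page)
            ret = -EIO                 # overrides the AS_* code
    return ret

# Mapping recorded -ENOSPC on writeback; one page still carries
# PG_error from an earlier read failure:
pages = [{"pg_error": False}, {"pg_error": True}]
print(filemap_fdatawait_range(pages, ENOSPC))   # -5 (-EIO), not -28
```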

Note that currently the AS_* codes are only set for write errors not
for reads and we have no mapping error handling at all for swap pages,
but I'm sure this is fixable.

From the I/O layer point of view we take great pains to try to pinpoint
the error exactly to the sector.  We reflect this up by setting the
PG_error flag on the page where the error occurred.  If we only set the
error on the mapping, we lose that granularity, because the mapping is
mostly at the file level (or VMA level for anon pages).

So I think the question for filesystem people from us would be do you
care about this accuracy?  If it's OK just to know an error occurred
somewhere in this file, then perhaps we don't need it.

James

> The main argument I could see for keeping it is that removing it 
> might subtly change the behavior of sync_file_range if you have tasks
> syncing different ranges in a file concurrently. I'm not sure if that 
> would break any guarantees though.
> 
> Even if we do need it, I think we might need some cleanup here 
> anyway. A lot of readpage operations end up setting that flag when 
> they hit an error. Isn't it wrong to return an error on fsync, just 
> because we had a read error somewhere in the file in a range that was
> never dirtied?
> 
> --
> [1]: there is another place in f2fs, but it's more or less equivalent 
> to the call site in __filemap_fdatawait_range.
> 



Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 17:56 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > > Now when new refcount_t type and API are finally merged
> > > > (see include/linux/refcount.h), the following
> > > > patches convert various refcounters in the block subsystem from 
> > > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > > accidental underflows or overflows that can lead to use-after
> > > > -free vulnerabilities.
> > 
> > This description isn't right ... nothing is prevented; we get 
> > warnings on saturation and use after free with this.
> 
> The thing that is prevented is overflow and then a use-after-free by
> making it a leak.
> 
> Modular stuff, you put and free at: (n+1) mod n, by saturating at n-1
> we'll never get there.
> 
> So you lose use-after-free, you gain a resource leak. The general 
> idea being that use-after-free is a nice trampoline for exploits, 
> leaks are 'only' a DoS.

OK, I see the intention: it's protection from outside influence.  It
still doesn't prevent *us* from screwing up in the kernel and inducing
a use after free by doing too many puts (or too few gets) ... that's
what the message suggests to me (me coding wrongly is accidental
underflows or overflows as I read it).
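Peter's (n+1) mod n point can be made concrete with a toy counter (a userspace model with a deliberately tiny ceiling; the real refcount_t saturates near 2^31 and emits a warning):

```python
# Toy model of refcount saturation: once the counter hits its ceiling
# it is pinned there, so it can never wrap through zero and be freed
# while references remain.  Overflow becomes a leak (DoS at worst)
# instead of a use-after-free.  Ceiling shrunk to 7 for demonstration.

SATURATED = 7

def refcount_inc(count):
    return SATURATED if count >= SATURATED else count + 1

def refcount_dec(count):
    # A saturated object is leaked on purpose: it never drops to zero.
    return SATURATED if count >= SATURATED else count - 1

count = 1
for _ in range(100):        # attacker-driven increment storm
    count = refcount_inc(count)
print(count)                # 7: pinned at the ceiling, no wrap to 0
print(refcount_dec(count))  # 7: puts no longer free it -> a leak
```

Note this only defends against wrap-around from outside influence; too many puts (the in-kernel bug James describes) still reach zero early.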

James



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-19 Thread James Bottomley
On Sun, 2017-02-19 at 18:15 -0700, Jens Axboe wrote:
> On 02/19/2017 06:09 PM, Bart Van Assche wrote:
> > On 02/19/2017 04:11 PM, Jens Axboe wrote:
> > > - Removal of the BLOCK_PC support in struct request, and
> > > refactoring of
> > >   carrying SCSI payloads in the block layer. This cleans up the
> > > code
> > >   nicely, and enables us to kill the SCSI specific parts of
> > > struct
> > >   request, shrinking it down nicely. From Christoph mainly, with
> > > help
> > >   from Hannes.
> > 
> > Hello Jens, Christoph and Mike,
> > 
> > Is anyone working on a fix for the regression introduced by the 
> > BLOCK_PC removal changes (general protection fault) that I had 
> > reported three weeks ago? See also
> > https://www.spinics.net/lists/raid/msg55494.html
> 
> I don't think that's a regression in this series, it just triggers 
> more easily with this series. The BLOCK_PC removal fixes aren't 
> touching device life times at all.
> 
> That said, we will look into this again, of course. Christoph, any
> idea?

We could do with tracing the bdi removal failure issue fingered both by
the block xfstests and Omar.  It's something to do with this set of
commits

> - Fixes for duplicate bdi registrations and bdi/queue life time
>   problems from Jan and Dan.

But no-one has actually root caused it yet.

James



Re: Manual driver binding and unbinding broken for SCSI

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 16:30 -0800, Omar Sandoval wrote:
> Hi, everyone,
> 
> As per $SUBJECT, I can cause a crash on v4.10-rc8, Jens' block/for
> -next,
> and Jan's bdi branch [1] by doing this:
> 
> # lsscsi
> [0:0:0:0]   disk    QEMU     QEMU HARDDISK    2.5+  /dev/sda
> # echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/unbind
> # echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/bind
> 
> The resulting trace looks like this:
> 
> [   19.347924] kobject (8800791ea0b8): tried to init an
> initialized object, something is seriously wrong.
> [   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0
> -rc7-00210-g53f39eeaa263 #34
> [   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.1-20161122_114906-anatol 04/01/2014
> [   19.350920] Workqueue: events_unbound async_run_entry_fn
> [   19.350920] Call Trace:
> [   19.350920]  dump_stack+0x63/0x83
> [   19.350920]  kobject_init+0x77/0x90
> [   19.350920]  blk_mq_register_dev+0x40/0x130
> [   19.350920]  blk_register_queue+0xb6/0x190
> [   19.350920]  device_add_disk+0x1ec/0x4b0
> [   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
> [   19.350920]  async_run_entry_fn+0x48/0x150
> [   19.350920]  process_one_work+0x1d0/0x480
> [   19.350920]  worker_thread+0x48/0x4e0
> [   19.350920]  kthread+0x101/0x140
> [   19.350920]  ? process_one_work+0x480/0x480
> [   19.350920]  ? kthread_create_on_node+0x60/0x60
> [   19.350920]  ret_from_fork+0x2c/0x40
> 
> Additionally, on v4.10-rc8, but not on block/for-next or Jan's
> branch,
> doing this:
> 
> # echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/unbind
> # modprobe scsi_debug
> 
> Causes this trace:
> 
> [   18.876096] [ cut here ]
> [   18.877057] WARNING: CPU: 1 PID: 90 at fs/sysfs/dir.c:31
> sysfs_warn_dup+0x62/0x80
> [   18.878270] sysfs: cannot create duplicate filename
> '/devices/virtual/bdi/8:0'
> [   18.879435] Modules linked in: scsi_debug btrfs xor raid6_pq
> sd_mod virtio_scsi scsi_mod nvme nvme_core virtio_net
> [   18.881118] CPU: 1 PID: 90 Comm: kworker/u8:2 Not tainted 4.10.0
> -rc8 #34
> [   18.882114] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.1-20161122_114906-anatol 04/01/2014
> [   18.883872] Workqueue: events_unbound async_run_entry_fn
> [   18.884408] Call Trace:
> [   18.884408]  dump_stack+0x63/0x83
> [   18.884408]  __warn+0xcb/0xf0
> [   18.884408]  warn_slowpath_fmt+0x5f/0x80
> [   18.884408]  ? kernfs_path_from_node+0x4f/0x60
> [   18.884408]  sysfs_warn_dup+0x62/0x80
> [   18.884408]  sysfs_create_dir_ns+0x77/0x90
> [   18.884408]  kobject_add_internal+0xbe/0x350
> [   18.884408]  kobject_add+0x75/0xd0
> [   18.884408]  device_add+0x121/0x680
> [   18.884408]  device_create_groups_vargs+0xe0/0xf0
> [   18.884408]  device_create_vargs+0x1c/0x20
> [   18.884408]  bdi_register+0x90/0x1b0
> [   18.884408]  ? sd_revalidate_disk+0x34a/0x1d00 [sd_mod]
> [   18.884408]  bdi_register_owner+0x36/0x60
> [   18.884408]  device_add_disk+0x165/0x4a0
> [   18.884408]  ? update_autosuspend+0x51/0x60
> [   18.884408]  ? __pm_runtime_use_autosuspend+0x5c/0x70
> [   18.884408]  sd_probe_async+0x10d/0x1c0 [sd_mod]
> [   18.884408]  async_run_entry_fn+0x4a/0x170
> [   18.884408]  process_one_work+0x165/0x430
> [   18.884408]  worker_thread+0x4e/0x490
> [   18.884408]  kthread+0x101/0x140
> [   18.884408]  ? process_one_work+0x430/0x430
> [   18.884408]  ? kthread_create_on_node+0x60/0x60
> [   18.884408]  ret_from_fork+0x2c/0x40
> [   18.913090] ---[ end trace f43b051485c2a749 ]---
> 
> On all three kernels, it looks like the bdi sysfs entry hangs around
> after the block device has already been removed:

This seems to be related to a 0day test we got on the block tree,
details here:

http://marc.info/?t=14862406881

I root caused the above to something not being released when it should
be, so it looks like you have the same problem.  It seems to be a
recent commit in the block tree, so could you bisect it since you have
a nice reproducer?

Thanks,

James




Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup

2017-02-11 Thread James Bottomley
On Mon, 2017-02-06 at 21:09 -0700, Jens Axboe wrote:
> On 02/06/2017 05:14 PM, James Bottomley wrote:
> > On Sun, 2017-02-05 at 21:13 -0800, Dan Williams wrote:
> > > On Sun, Feb 5, 2017 at 1:13 AM, Christoph Hellwig <h...@lst.de>
> > > wrote:
> > > > Dan,
> > > > 
> > > > can you please quote your emails?  I can't find any content 
> > > > inbetween all these quotes.
> > > 
> > > Sorry, I'm using gmail, but I'll switch to attaching the logs.
> > > 
> > > So with help from Xiaolong I was able to reproduce this, and it
> > > does
> > > not appear to be a regression. We simply change the failure
> > > output of
> > > an existing bug. Attached is a log of the same test on v4.10-rc7 
> > > (i.e. without the recent block/scsi fixes), and it shows sda
> > > being
> > > registered twice.
> > > 
> > > "[6.647077] kobject (d5078ca4): tried to init an initialized
> > > object, something is seriously wrong."
> > > 
> > > The change that "scsi, block: fix duplicate bdi name registration
> > > crashes" makes is to properly try to register sdb since the sda
> > > devt
> > > is still alive. However that's not a fix because we've managed to
> > > call blk_register_queue() twice on the same queue.
> > 
> > OK, time to involve others: linux-scsi and linux-block cc'd and
> > I've
> > inserted the log below.
> > 
> > James
> > 
> > ---
> > 
> > [5.969672] scsi host0: scsi_debug: version 1.86 [20160430]
> > [5.969672]   dev_size_mb=8, opts=0x0, submit_queues=1,
> > statistics=0
> > [5.971895] scsi 0:0:0:0: Direct-Access Linuxscsi_debug 
> >   0186 PQ: 0 ANSI: 7
> > [6.006983] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > (8.39 MB/8.00 MiB)
> > [6.026965] sd 0:0:0:0: [sda] Write Protect is off
> > [6.027870] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > [6.066962] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> > enabled, supports DPO and FUA
> > [6.486962] sd 0:0:0:0: [sda] Attached SCSI disk
> > [6.488377] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > [6.489455] sd 0:0:0:0: Attached scsi generic sg0 type 0
> > [6.526982] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > (8.39 MB/8.00 MiB)
> > [6.546964] sd 0:0:0:0: [sda] Write Protect is off
> > [6.547873] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > [6.586963] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> > enabled, supports DPO and FUA
> > [6.647077] kobject (d5078ca4): tried to init an initialized
> > object, something is seriously wrong.
> 
> So sda is probed twice, and hilarity ensues when we try to register 
> it twice.  I can't reproduce this, using scsi_debug and with 
> scsi_async enabled.

Actually, when you look closely, it's not a double add; it's an
add/remove/add.  You can see this from

[6.488377] sd 0:0:0:0: [sda] Synchronizing SCSI cache

That's from sd_shutdown() as the driver is removing.  It looks like
something with the config caused the built-in scsi_debug to do this
(not sure why), but since the stack trace is in block, I think the bug
is in the remove path: something didn't release the mq object
correctly.
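The "tried to init an initialized object" warning in the trace can be modeled in a few lines (a toy model, not the real kobject API): if the remove path never releases the object, the next bind re-initializes it and trips the check.

```python
# Toy model of the kobject double-init check: kobject_init() refuses
# an object whose state flag is still set, which is what happens when
# unbind leaves the queue's mq kobject unreleased and bind calls
# device_add_disk() -> blk_register_queue() on it again.

class Kobject:
    def __init__(self):
        self.state_initialized = False

def kobject_init(kobj):
    if kobj.state_initialized:
        return "tried to init an initialized object"
    kobj.state_initialized = True
    return "ok"

def kobject_release(kobj):
    # What a correct remove path does via the final kobject_put().
    kobj.state_initialized = False

q = Kobject()
print(kobject_init(q))   # "ok": first bind
print(kobject_init(q))   # warning: rebind without a release in between
kobject_release(q)
print(kobject_init(q))   # "ok" again once the object was released
```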

James




Re: [Lsf-pc] LSF/MM Question

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 16:09 +, Jim Mostek via Lsf-pc wrote:
> Wondering about the upcoming Linux Storage Filesystem & MM summit in
> March.
> 
> What presentations are there so far?

LSF/MM is not really a conference, it's a summit.  That means it's
going to be discussion driven rather than presentation driven.  The
discussion agenda will be fluid (because issues come up as other things
are discussed) but the outline should begin to take shape a couple of
weeks beforehand.

James



Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup

2017-02-06 Thread James Bottomley
On Sun, 2017-02-05 at 21:13 -0800, Dan Williams wrote:
> On Sun, Feb 5, 2017 at 1:13 AM, Christoph Hellwig  wrote:
> > Dan,
> > 
> > can you please quote your emails?  I can't find any content 
> > inbetween all these quotes.
> 
> Sorry, I'm using gmail, but I'll switch to attaching the logs.
> 
> So with help from Xiaolong I was able to reproduce this, and it does
> not appear to be a regression. We simply change the failure output of
> an existing bug. Attached is a log of the same test on v4.10-rc7 
> (i.e. without the recent block/scsi fixes), and it shows sda being
> registered twice.
> 
> "[6.647077] kobject (d5078ca4): tried to init an initialized
> object, something is seriously wrong."
> 
> The change that "scsi, block: fix duplicate bdi name registration
> crashes" makes is to properly try to register sdb since the sda devt
> is still alive. However that's not a fix because we've managed to 
> call blk_register_queue() twice on the same queue.

OK, time to involve others: linux-scsi and linux-block cc'd and I've
inserted the log below.

James

---

[5.969672] scsi host0: scsi_debug: version 1.86 [20160430]
[5.969672]   dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[5.971895] scsi 0:0:0:0: Direct-Access Linuxscsi_debug   0186 
PQ: 0 ANSI: 7
[6.006983] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[6.026965] sd 0:0:0:0: [sda] Write Protect is off
[6.027870] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[6.066962] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[6.486962] sd 0:0:0:0: [sda] Attached SCSI disk
[6.488377] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[6.489455] sd 0:0:0:0: Attached scsi generic sg0 type 0
[6.526982] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 MB/8.00 
MiB)
[6.546964] sd 0:0:0:0: [sda] Write Protect is off
[6.547873] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
[6.586963] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
supports DPO and FUA
[6.647077] kobject (d5078ca4): tried to init an initialized object, 
something is seriously wrong.
[6.648723] CPU: 0 PID: 99 Comm: kworker/u2:1 Not tainted 4.10.0-rc7 #932
[6.649811] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.1-1.fc24 04/01/2014
[6.651418] Workqueue: events_unbound async_run_entry_fn
[6.652347] Call Trace:
[6.652987]  dump_stack+0x79/0xa4
[6.653716]  kobject_init+0x75/0x90
[6.654452]  blk_mq_register_dev+0x2a/0x110
[6.655269]  blk_register_queue+0x7b/0x130
[6.656080]  device_add_disk+0x1c6/0x460
[6.656866]  sd_probe_async+0xf1/0x1c0
[6.657634]  ? __lock_acquire.isra.14+0x43b/0x940
[6.658501]  async_run_entry_fn+0x30/0x190
[6.659311]  ? process_one_work+0x12f/0x430
[6.660113]  process_one_work+0x1aa/0x430
[6.660901]  ? process_one_work+0x12f/0x430
[6.661716]  worker_thread+0x1dd/0x470
[6.662479]  kthread+0xd4/0x100
[6.663175]  ? process_one_work+0x430/0x430
[6.663984]  ? __kthread_create_on_node+0x180/0x180
[6.664869]  ret_from_fork+0x21/0x2c
[6.665638] kobject (ffab51ec): tried to init an initialized object, 
something is seriously wrong.
[6.667290] CPU: 0 PID: 99 Comm: kworker/u2:1 Not tainted 4.10.0-rc7 #932
[6.668372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.1-1.fc24 04/01/2014
[6.669984] Workqueue: events_unbound async_run_entry_fn
[6.670909] Call Trace:
[6.671540]  dump_stack+0x79/0xa4
[6.672266]  kobject_init+0x75/0x90
[6.673011]  blk_mq_register_dev+0x4c/0x110
[6.673832]  blk_register_queue+0x7b/0x130
[6.674633]  device_add_disk+0x1c6/0x460
[6.675413]  sd_probe_async+0xf1/0x1c0
[6.676191]  ? __lock_acquire.isra.14+0x43b/0x940
[6.677057]  async_run_entry_fn+0x30/0x190
[6.677860]  ? process_one_work+0x12f/0x430
[6.678667]  process_one_work+0x1aa/0x430
[6.679455]  ? process_one_work+0x12f/0x430
[6.680269]  worker_thread+0x1dd/0x470
[6.681036]  kthread+0xd4/0x100
[6.681737]  ? process_one_work+0x430/0x430
[6.682540]  ? __kthread_create_on_node+0x180/0x180
[6.683420]  ret_from_fork+0x21/0x2c
[6.684207] [ cut here ]
[6.685067] WARNING: CPU: 0 PID: 99 at ./include/linux/kref.h:46 
kobject_get+0x7f/0x90
[6.686592] CPU: 0 PID: 99 Comm: kworker/u2:1 Not tainted 4.10.0-rc7 #932
[6.687680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.1-1.fc24 04/01/2014
[6.689280] Workqueue: events_unbound async_run_entry_fn
[6.690208] Call Trace:
[6.690843]  dump_stack+0x79/0xa4
[6.691563]  __warn+0xd2/0xf0
[6.692246]  ? kobject_get+0x7f/0x90
[6.692992]  warn_slowpath_null+0x25/0x30
[6.693787]  kobject_get+0x7f/0x90
[6.694505]  kobject_add_internal+0x2e/0x360
[6.695322]  ? kfree_const+0x18/0x20
[6.696071]  ? kobject_set_name_vargs+0x62/0x80
[6.696914]  

Re: remove the cmd_type field from struct request

2017-01-31 Thread James Bottomley
On Tue, 2017-01-31 at 10:02 -0800, Jens Axboe wrote:
> On 01/31/2017 07:57 AM, Christoph Hellwig wrote:
> > [1] which were a pain in the ass to untangle and debug during 
> > development, it's really time for it to die..
> 
> Outside of the patch series in question, how to we expedite the
> euthanasia of IDE? What explicit features/support are we missing 
> through libata that would need to be added, before we can git rm 
> drivers/ide/?

I thought the primary objection, from the embedded side, was that libata
with its reliance on SCSI was just too large a dependency, so they have
to keep using drivers/ide.  Perhaps nvme and flash are obviating this
problem and we can ask them again, though?

James

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC] [LSF/MM ATTEND] md raid general discussion

2017-01-15 Thread James Bottomley
On Mon, 2017-01-16 at 11:33 +0800, Guoqing Jiang wrote:
> 
> On 01/10/2017 12:38 AM, Coly Li wrote:
> > Hi Folks,
> > 
> > I'd like to propose a general md raid discussion, it is quite 
> > necessary for most of active md raid developers sit together to 
> > discuss current challenge of Linux software raid and development
> > trends.
> > 
> > In the last years, we have many development activities in md raid, 
> > e.g. raid5 cache, raid1 clustering, partial parity log, fast fail
> > upstreaming, and some effort for raid1 & raid0 performance
> > improvement.
> > 
> > I see there are some kind of functionality overlap between r5cache
> > (raid5 cache) and PPL (partial parity log), currently I have no 
> > idea where we will go for these two development activities.
> > Also I receive reports from users that raid1 performance is desired 
> > when it is built on NVMe SSDs as a cache (maybe bcache or dm
> > -cache). I am working on some raid1 performance improvement (e.g. 
> > new raid1 I/O barrier and lockless raid1 I/O submit), and have some 
> > more ideas to discuss.
> > 
> > Therefore, if md raid developers may have a chance to sit together,
> > discuss how to efficiently collaborate in next year, it will be 
> > much more productive than communicating on the mailing list.
> 
> I would like to attend raid discussion, besides above topics I think 
> we can talk about improve the test suite of mdadm to make it more 
> robust (I can share related test suite which is used for clustered
> raid).

Just so you know ... and just in case others are watching.  You're not
going to be getting an invite to LSF/MM unless you send an attend or
topic request in as the CFP asks:

http://marc.info/?l=linux-fsdevel&m=148285919408577

The rationale is simple: it's too difficult to track all the "me too"
reply emails and even if we could, it's not actually clear what the
intention of the sender is.  So you taking the time to compose an
official email as the CFP requests allows the programme committee to
distinguish.

James



Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

2017-01-11 Thread James Bottomley
On Thu, 2017-01-12 at 11:35 +0900, Damien Le Moal wrote:
> > Just a note for the poor admin looking after the lists: to find all 
> > the ATTEND and TOPIC requests for the lists I fold up the threads 
> > to the top.  If you frame your attend request as a reply, it's 
> > possible it won't get counted because I didn't find it.
> > 
> > so please *start a new thread* for ATTEND and TOPIC requests.
> 
> My apologies for the overhead. I will resend.
> Thank you.

You don't need to resend ... I've got you on the list.  I replied
publicly just in case there were any other people who did this that I
didn't notice.

James




Re: [Lsf-pc] [LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

2017-01-11 Thread James Bottomley
On Thu, 2017-01-12 at 10:33 +0900, Damien Le Moal wrote:
> Hello,
> 
> A long discussion on the list followed this initial topic proposal 
> from Matias. I think this is a worthy topic to discuss at LSF in 
> order to steer development of the zoned block device interface in the 
> right direction. Considering the relation and implication to ZBC/ZAC
> support,I would like to attend LSF/MM to participate in this
> discussion.

Just a note for the poor admin looking after the lists: to find all the
ATTEND and TOPIC requests for the lists I fold up the threads to the
top.  If you frame your attend request as a reply, it's possible it
won't get counted because I didn't find it.

so please *start a new thread* for ATTEND and TOPIC requests.

Thanks,

James

PS If you think you sent a TOPIC/ATTEND request in reply to something,
then I really haven't seen it because this is the first one I noticed,
and you should resend.




Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread James Bottomley
On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> > On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > > I would like to discuss un-addressable device memory in the
> > > context 
> > > of filesystem and block device. Specifically, how to handle
> > > write-back, read, ... when a filesystem page is migrated to device
> > > memory that the CPU cannot access.
> > > 
> > > I intend to post a patchset leveraging the same idea as the
> > > existing
> > > block bounce helper (block/bounce.c) to handle this. I believe
> > > this 
> > > is worth discussing during the summit to see how people feel about
> > > such a plan and if they have better ideas.
> > 
> > Isn't this pretty much what the transcendent memory interfaces we
> > currently have are for?  Its current use cases seem to be
> > compressed
> > swap and distributed memory, but there doesn't seem to be any
> > reason in
> > principle why you can't use the interface as well.
> > 
> 
> I am not a specialist of tmem or cleancache

Well, that makes two of us; I just got to sit through Dan Magenheimer's
talks and some stuff stuck.

>  but my understanding is that there is no way to allow a file-backed
> page to be dirtied while it is in this special memory.

Unless you have some other definition of dirtied, I believe that's what
an exclusive tmem get in frontswap actually does.  It marks the page
dirty when it comes back because it may have been modified.

> In my case, when you migrate a page to the device it might very well
> be so that the device can write something in it (results of some sort
> of computation). So a page might migrate to device memory as clean but
> return from it in a dirty state.
> 
> The second aspect is that even if the memory I am dealing with is
> un-addressable, I still have a struct page for it and I want to be
> able to use regular page migration.

Tmem keeps a struct page ... so what's the problem with page migration?
Is it that tmem locks the page when it's not addressable and you want
to be able to migrate the page even when it's not addressable?

> So given my requirements, I didn't think that cleancache was the way
> to address them. Maybe I am wrong.

I'm not saying it is, I just asked if you'd considered it, since the
requirements look similar.

James




Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-13 Thread James Bottomley
On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> I would like to discuss un-addressable device memory in the context 
> of filesystem and block device. Specifically, how to handle write-back,
> read, ... when a filesystem page is migrated to device memory that the
> CPU cannot access.
> 
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this 
> is worth discussing during the summit to see how people feel about
> such a plan and if they have better ideas.

Isn't this pretty much what the transcendent memory interfaces we
currently have are for?  Its current use cases seem to be compressed
swap and distributed memory, but there doesn't seem to be any reason in
principle why you can't use the interface as well.

James


> I also like to join discussions on:
>   - Peer-to-Peer DMAs between PCIe devices
>   - CDM coherent device memory
>   - PMEM
>   - overall mm discussions
> 
> Cheers,
> Jérôme
> 



Re: LSF/MM 2017: Call for Proposals

2016-12-08 Thread James Bottomley
On Thu, 2016-12-08 at 13:26 +0100, Michal Hocko wrote:
> On Wed 07-12-16 06:57:06, James Bottomley wrote:
> [...]
> > Just on this point, since there seems to be a lot of confusion: lsf
> > -pc
> > is the list for contacting the programme committee, so you cannot
> > subscribe to it.
> > 
> > There is no -discuss equivalent, like kernel summit has, because we
> > expect you to cc the relevant existing mailing list and have the
> > discussion there instead rather than expecting people to subscribe
> > to a
> > new list.
> 
> There used to be l...@lists.linux-foundation.org. Is it not active
> anymore?

Not until we get the list of invitees sorted out.  And then it's only
for stuff actually to do with lsf (like logistics, bof or session
requests or other meetups).  Any technical stuff should still be cc'd
to the relevant mailing list.

James



Re: LSF/MM 2017: Call for Proposals

2016-12-07 Thread James Bottomley
On Thu, 2016-12-01 at 09:11 -0500, Jeff Layton wrote:
> 1) Proposals for agenda topics should be sent before January 15th, 
> 2016 to:
> 
> lsf...@lists.linux-foundation.org
> 
> and cc the Linux list or lists that are relevant for the topic in
> question:
> 
> ATA:   linux-...@vger.kernel.org
> Block: linux-block@vger.kernel.org
> FS:linux-fsde...@vger.kernel.org
> MM:linux...@kvack.org
> SCSI:  linux-s...@vger.kernel.org
> NVMe:  linux-n...@lists.infradead.org
> 
> Please tag your proposal with [LSF/MM TOPIC] to make it easier to 
> track.

Just on this point, since there seems to be a lot of confusion: lsf-pc
is the list for contacting the programme committee, so you cannot
subscribe to it.

There is no -discuss equivalent, like kernel summit has, because we
expect you to cc the relevant existing mailing list and have the
discussion there instead rather than expecting people to subscribe to a
new list.

James



Re: [PATCH 9/9] [RFC] nvme: Fix a race condition

2016-09-27 Thread James Bottomley
On Tue, 2016-09-27 at 09:43 -0700, Bart Van Assche wrote:
> On 09/27/2016 09:31 AM, Steve Wise wrote:
> > > @@ -2079,11 +2075,15 @@ EXPORT_SYMBOL_GPL(nvme_kill_queues);
> > >  void nvme_stop_queues(struct nvme_ctrl *ctrl)
> > >  {
> > >   struct nvme_ns *ns;
> > > + struct request_queue *q;
> > > 
> > >   mutex_lock(&ctrl->namespaces_mutex);
> > >   list_for_each_entry(ns, &ctrl->namespaces, list) {
> > > - blk_mq_cancel_requeue_work(ns->queue);
> > > - blk_mq_stop_hw_queues(ns->queue);
> > > + q = ns->queue;
> > > + blk_quiesce_queue(q);
> > > + blk_mq_cancel_requeue_work(q);
> > > + blk_mq_stop_hw_queues(q);
> > > + blk_resume_queue(q);
> > >   }
> > >   mutex_unlock(&ctrl->namespaces_mutex);
> > 
> > Hey Bart, should nvme_stop_queues() really be resuming the blk
> > queue?
> 
> Hello Steve,
> 
> Would you perhaps prefer that blk_resume_queue(q) is called from 
> nvme_start_queues()? I think that would make the NVMe code harder to 
> review. The above code won't cause any unexpected side effects if an 
> NVMe namespace is removed after nvme_stop_queues() has been called 
> and before nvme_start_queues() is called. Moving the 
> blk_resume_queue(q) call into nvme_start_queues() will only work as 
> expected if no namespaces are added nor removed between the 
> nvme_stop_queues() and nvme_start_queues() calls. I'm not familiar 
> enough with the NVMe code to know whether or not this change is safe
> ...

It's something that looks obviously wrong, so explain why you need to
do it, preferably in a comment above the function.

James




Re: Time to make dynamically allocated devt the default for scsi disks?

2016-08-14 Thread James Bottomley
On Sat, 2016-08-13 at 11:27 -0700, Dan Williams wrote:
> On Sat, Aug 13, 2016 at 10:43 AM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> > On Sat, 2016-08-13 at 09:29 -0700, Dan Williams wrote:
> > > On Sat, Aug 13, 2016 at 8:23 AM, James Bottomley
> > > <james.bottom...@hansenpartnership.com> wrote:
> > > > It does?  The race is the fact that the parent can be removed
> > > > before the child meaning if the parent name is re-registered 
> > > > before the child dies we get a duplicate name in bdi space.
> > > 
> > > No, the race is that the *name* of the parent isn't released 
> > > until the child is both unregistered and put.  The device core is
> > > already ensuring that the parent is not released until all 
> > > descendants have been removed.
> > 
> > We're both saying the same thing: the problem is that, with
> > df08c32ce3be the bdi name lifetime is tied to the lifetime of the
> > gendisk.  However, the parent of the gendisk currently is only tied 
> > to the visibility lifetime of the gendisk, not the final put 
> > lifetime, so it doesn't see this.
> > 
> > > > 
> > > > > So I tried the attached and it makes the libnvdimm unit tests
> > > > > start crashing.
> > > > 
> > > > Well, the attached is clearly buggy, isn't it?  You're trying 
> > > > to do a get on the parent before the parent is actually set.
> > > 
> > > Ah, yes, thank you.  Fixed up v2 attached that passes my tests.
> > > 
> > > > Why don't you just try the incremental patch I sent instead of
> > > > trying to rework it?
> > > 
> > > I reworked it because it is the bdi that holds this extra 
> > > dependency on the disk's parent, not the disk itself.
> > 
> > Philosophically I don't like this approach.  The dependency goes
> > 
> > bdi->gendisk->parent
> 
> I'm arguing that there is no bdi->gendisk dependency.

You created one with your bdi->owner field.  Just because you didn't
call it a parent doesn't mean it wasn't one.  Arguably the whole bdi
thing is strangely done because gendisk treats it like a class and
that's how it behaves; it just doesn't have a proper class structure
(which is why gendisk creates the link that would otherwise be done by
the class infrastructure).

> The dependency is:
> 
> bdi->devt

devt isn't a device (in the struct device sense).  It exists as
effectively an embedded component of the bdi.  As far as I can tell
there's no reason for it to be separately allocated; it could be
properly embedded, as is the normal pattern.

> It just so happens that block-dynamic devt is released in
> disk_release().
> 
> > Making the bdi manage the parent lifetime is an unusual pattern.
> >  Making the parent stay around until the last reference to gendisk 
> > is put is the usual one.
> 
> What's unusual is the bdi's dependency on the allocated name, not the
> gendisk itself.

A name is just a resource belonging to an object.  The object it
belongs to is the bdi and the bdi is parented by the owner field (and a
hokey link) to the gendisk.

> > > > >   A couple crash logs attached.  Not yet sure what assumption
> > > > > is getting violated, but how about that conversion of scsi to 
> > > > > use dynamic devt? ;-)
> > > > 
> > > > It's completely orthogonal.  The problem is in hierarchy 
> > > > lifetimes: switching from static to dynamic allocation won't 
> > > > change that at all.  You don't see this problem in nvme because 
> > > > the parent control device's lifetime belongs to the controller 
> > > > not the disk.  In SCSI the parent is our representation of the 
> > > > SCSI device whose lifetime is governed at the SCSI level and 
> > > > effectively represents the disk.
> > > > 
> > > 
> > > No, it's only the name.  We could achieve the same by teaching 
> > > the block core to manage the "sd_index_ida" instead of the sd 
> > > driver itself, but the v2-patch attached works and does not 
> > > introduce that layering violation.
> > 
> > Um, so this patch doesn't fix the problem. It merely makes the 
> > lifetime rules correct so the problem can then be fixed at the scsi
> > level.
> 
> You're right that this patch does not fix the problem, I missed that
> the scsi_disk is not the parent of the gendisk, so this patch does
> nothing to delay scsi_disk_release.  What I think is the real fix is
> to make the devt properly reference counted and prevent
> id

Re: Time to make dynamically allocated devt the default for scsi disks?

2016-08-13 Thread James Bottomley
On Sat, 2016-08-13 at 09:29 -0700, Dan Williams wrote:
> On Sat, Aug 13, 2016 at 8:23 AM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> > It does?  The race is the fact that the parent can be removed 
> > before the child meaning if the parent name is re-registered before 
> > the child dies we get a duplicate name in bdi space.
> 
> No, the race is that the *name* of the parent isn't released until 
> the child is both unregistered and put.  The device core is already
> ensuring that the parent is not released until all descendants have
> been removed.

We're both saying the same thing: the problem is that, with
df08c32ce3be the bdi name lifetime is tied to the lifetime of the
gendisk.  However, the parent of the gendisk currently is only tied to
the visibility lifetime of the gendisk, not the final put lifetime, so
it doesn't see this.

> > 
> > > So I tried the attached and it makes the libnvdimm unit tests 
> > > start crashing.
> > 
> > Well, the attached is clearly buggy, isn't it?  You're trying to do 
> > a get on the parent before the parent is actually set.
> 
> Ah, yes, thank you.  Fixed up v2 attached that passes my tests.
> 
> > Why don't you just try the incremental patch I sent instead of 
> > trying to rework it?
> 
> I reworked it because it is the bdi that holds this extra dependency
> on the disk's parent, not the disk itself.

Philosophically I don't like this approach.  The dependency goes 

bdi->gendisk->parent

Making the bdi manage the parent lifetime is an unusual pattern. 
 Making the parent stay around until the last reference to gendisk is
put is the usual one.

> > >   A couple crash logs attached.  Not yet sure what assumption
> > > is getting violated, but how about that conversion of scsi to use
> > > dynamic devt? ;-)
> > 
> > It's completely orthogonal.  The problem is in hierarchy lifetimes:
> > switching from static to dynamic allocation won't change that at 
> > all.  You don't see this problem in nvme because the parent control
> > device's lifetime belongs to the controller not the disk.  In SCSI 
> > the parent is our representation of the SCSI device whose lifetime 
> > is governed at the SCSI level and effectively represents the disk.
> > 
> 
> No, it's only the name.  We could achieve the same by teaching the
> block core to manage the "sd_index_ida" instead of the sd driver
> itself, but the v2-patch attached works and does not introduce that
> layering violation.

Um, so this patch doesn't fix the problem. It merely makes the lifetime
rules correct so the problem can then be fixed at the scsi level.

James




Re: Time to make dynamically allocated devt the default for scsi disks?

2016-08-13 Thread James Bottomley
On Fri, 2016-08-12 at 21:57 -0700, Dan Williams wrote:
> On Fri, Aug 12, 2016 at 5:29 PM, Dan Williams <
> dan.j.willi...@intel.com> wrote:
> > On Fri, Aug 12, 2016 at 5:17 PM, James Bottomley
> > <james.bottom...@hansenpartnership.com> wrote:
> > > On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
> > > > Before spending effort trying to flush the destruction of old
> > > > bdi
> > > > instances before new ones are registered, is it rather time to
> > > > complete the conversion of sd to only use dynamically allocated
> > > > devt?
> > > 
> > > Do we have to go that far?  Surely your fix is extensible: the
> > > only
> > > reason it doesn't work for us is that the gendisk holds the
> > > parent
> > > without a reference, so we can free the SCSI device before its
> > > child
> > > gendisk (good job no-one actually uses gendisk->parent after
> > > we've
> > > released it ...).  If we fix that it would mean SCSI can't
> > > release the
> > > sdev until after the queue is dead and the bdi namespace
> > > released, so
> > > isn't something like this the easy fix?
> > > 
> > > James
> > > 
> > > ---
> > > 
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index fcd6d4f..54ae4ae 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -514,7 +514,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
> > > struct hd_struct *part;
> > > int err;
> > > 
> > > -   ddev->parent = parent;
> > > +   ddev->parent = get_device(parent);
> > > 
> > > dev_set_name(ddev, "%s", disk->disk_name);
> > > 
> > > @@ -1144,6 +1144,7 @@ static void disk_release(struct device *dev)
> > > hd_free_part(&disk->part0);
> > > if (disk->queue)
> > > blk_put_queue(disk->queue);
> > > +   put_device(dev->parent);
> > > kfree(disk);
> > >  }
> > >  struct class block_class = {
> > 
> > Looks ok at first glance to me.
> > 
> > We do hold a reference on the parent device, but it gets dropped at
> > device_unregister() time and this moves it out to the final put.

We do?  Where?

> > However, this does leave static devt block-device-drivers that
> > register a disk without a parent device susceptible to the race... 
> > I think those exist given all the drivers still using add_disk()
> > after commit 52c44d93c26f "block: remove ->driverfs_dev".

It does?  The race is the fact that the parent can be removed before
the child meaning if the parent name is re-registered before the child
dies we get a duplicate name in bdi space.

> So I tried the attached and it makes the libnvdimm unit tests start
> crashing.

Well, the attached is clearly buggy, isn't it?  You're trying to do a
get on the parent before the parent is actually set.  Why don't you
just try the incremental patch I sent instead of trying to rework it?

>   A couple crash logs attached.  Not yet sure what assumption
> is getting violated, but how about that conversion of scsi to use
> dynamic devt? ;-)

It's completely orthogonal.  The problem is in hierarchy lifetimes:
switching from static to dynamic allocation won't change that at all. 
 You don't see this problem in nvme because the parent control device's
lifetime belongs to the controller not the disk.  In SCSI the parent is
our representation of the SCSI device whose lifetime is governed at the
SCSI level and effectively represents the disk.

James




Re: Time to make dynamically allocated devt the default for scsi disks?

2016-08-12 Thread James Bottomley
On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
> Before spending effort trying to flush the destruction of old bdi
> instances before new ones are registered, is it rather time to
> complete the conversion of sd to only use dynamically allocated devt?

Do we have to go that far?  Surely your fix is extensible: the only
reason it doesn't work for us is that the gendisk holds the parent
without a reference, so we can free the SCSI device before its child
gendisk (good job no-one actually uses gendisk->parent after we've
released it ...).  If we fix that it would mean SCSI can't release the
sdev until after the queue is dead and the bdi namespace released, so
isn't something like this the easy fix?

James

---

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4f..54ae4ae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -514,7 +514,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
struct hd_struct *part;
int err;
 
-   ddev->parent = parent;
+   ddev->parent = get_device(parent);
 
dev_set_name(ddev, "%s", disk->disk_name);
 
@@ -1144,6 +1144,7 @@ static void disk_release(struct device *dev)
hd_free_part(&disk->part0);
if (disk->queue)
blk_put_queue(disk->queue);
+   put_device(dev->parent);
kfree(disk);
 }
 struct class block_class = {