Re: [PATCH v5 5/8] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag

2017-10-03 Thread Christoph Hellwig
> +EXPORT_SYMBOL(blk_set_preempt_only);

EXPORT_SYMBOL_GPL please.

Except for that this looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v5 4/8] block: Convert RQF_PREEMPT into REQ_PREEMPT

2017-10-03 Thread Christoph Hellwig
So as pointed out in the last run (after changing my mind deeper into
the series) I think we should instead use a BLK_MQ_REQ_PREEMPT flag.

The preempt flag only makes sense at the request level, not for file system
requests.  For the legacy case we can add blk_get_request_flags that
takes the BLK_MQ_REQ_* flags, which is a much saner interface than the
current blk_get_request anyway.


Re: [PATCH v5 0/8] block, scsi, md: Improve suspend and resume

2017-10-03 Thread Christoph Hellwig
Bart, Ming:

can you guys please work a little better together?  We've now got two
patchsets that are getting very similar.

Bart, please at least CC Ming when you send out the patches.

Ming - instead of sending a separate series right after Bart's, a
differential series would be nice.  This also applies the other way
around if Ming's series is the first after a while.


Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> The current BSG design tries to shoe-horn the transport-specific passthrough
> commands into the overall framework for SCSI passthrough requests.  This
> has a couple problems:
> 
>  - each passthrough queue has to set the QUEUE_FLAG_SCSI_PASSTHROUGH flag
>despite not dealing with SCSI commands at all.  Because of that these
>queues could also incorrectly accept SCSI commands from in-kernel
>users or through the legacy SCSI_IOCTL_SEND_COMMAND ioctl.
>  - the real SCSI bsg queues also incorrectly accept bsg requests of the
>BSG_SUB_PROTOCOL_SCSI_TRANSPORT type
>  - the bsg transport code is almost unreadable because it tries to reuse
>different SCSI concepts for its own purpose.
> 
> This patch instead adds a new bsg_ops structure to handle the two cases
> differently, and thus solves all of the above problems.  Another side
> effect is that the bsg-lib queues also don't need to embed a
> struct scsi_request anymore.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bsg-lib.c   | 158 +++
>  block/bsg.c   | 257 +-
>  drivers/scsi/scsi_lib.c   |   4 +-
>  drivers/scsi/scsi_sysfs.c |   3 +-
>  drivers/scsi/scsi_transport_sas.c |   1 -
>  include/linux/bsg-lib.h   |   4 +-
>  include/linux/bsg.h   |  35 --
>  7 files changed, 251 insertions(+), 211 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH] block: drop "sending ioctl to a partition" message

2017-10-03 Thread Christoph Hellwig
On Tue, Oct 03, 2017 at 06:22:23PM +0200, Paolo Bonzini wrote:
> On 21/09/2017 16:49, Paolo Bonzini wrote:
> > After the first few months, the message has not led to many bug reports.
> > It's been almost five years now, and in practice the main source of
> > it seems to be MTIOCGET that someone is using to detect tape devices.
> > While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
> > just removes the message altogether.
> > 
> > Signed-off-by: Paolo Bonzini 
> 
> Ping (with fixed email address for Jens)...

How about implementing the revised version I suggested?


Re: [PATCH 8/9] block: pass full fmode_t to blk_verify_command

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> Use the obvious calling convention.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bsg.c| 18 --
>  block/scsi_ioctl.c |  8 
>  drivers/scsi/sg.c  |  2 +-
>  include/linux/blkdev.h |  2 +-
>  4 files changed, 14 insertions(+), 16 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 7/9] bsg-lib: remove bsg_job.req

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> Users of the bsg-lib interface should only use the bsg_job data structure
> and not know about implementation details of it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bsg-lib.c | 14 ++
>  include/linux/bsg-lib.h |  1 -
>  2 files changed, 6 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 6/9] bsg-lib: introduce a timeout field in struct bsg_job

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> The zfcp driver wants to know the timeout for a bsg job, so add a field
> to struct bsg_job for it in preparation of not exposing the request
> to the bsg-lib users.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bsg-lib.c | 1 +
>  drivers/s390/scsi/zfcp_fc.c | 4 ++--
>  include/linux/bsg-lib.h | 2 ++
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 5/9] scsi_transport_sas: check reply payload length instead of bidi request

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> As a user of bsg-lib the SAS transport should not poke into request
> internals but use the bsg_job fields instead.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi_transport_sas.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 4/9] qla2xxx: don't break the bsg-lib abstractions

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> Always use bsg_job->reply instead of scsi_req(bsg_job->req)->sense), as
> they always point to the same memory.
> 
> Never set scsi_req(bsg_job->req)->result and we'll set that value through
> bsg_job_done.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/qla2xxx/qla_bsg.c | 10 --
>  drivers/scsi/qla2xxx/qla_isr.c | 12 +++-
>  drivers/scsi/qla2xxx/qla_mr.c  |  3 +--
>  3 files changed, 8 insertions(+), 17 deletions(-)
> 
[ .. ]
> @@ -2571,7 +2569,7 @@ qla24xx_bsg_timeout(struct bsg_job *bsg_job)
>   }
>   spin_unlock_irqrestore(&ha->hardware_lock, flags);
>   ql_log(ql_log_info, vha, 0x708b, "SRB not found to abort.\n");
> - scsi_req(bsg_job->req)->result = bsg_reply->result = -ENXIO;
> +  bsg_reply->result = -ENXIO;
>   return 0;
>  
>  done:
Whitespace issue ...

Otherwise:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 3/9] libfc: don't assign resid_len in fc_lport_bsg_request

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> bsg_job_done takes care of updating the scsi_request structure fields.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/libfc/fc_lport.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 2/9] bfa: don't reset max_segments for every bsg request

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> We already support 256 or more segments as long as the architecture
> supports SG chaining (all the ones that matter do), so remove the
> weird playing with limits from the job handler.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/bfa/bfad_bsg.c | 7 ---
>  1 file changed, 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 1/9] bsg-lib: fix use-after-free under memory-pressure

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 12:48 PM, Christoph Hellwig wrote:
> From: Benjamin Block 
> 
> When under memory-pressure it is possible that the mempool which backs
> the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count
> emergency buffers - in case it can't get a regular allocation. These
> buffers are preallocated and once they are also used, they are
> re-supplied with old finished requests from the same request_queue (see
> mempool_free()).
> 
> The bug is, when re-supplying the emergency pool, the old requests are
> not again ran through the callback mempool_t->alloc(), and thus also not
> through the callback bsg_init_rq(). Thus we skip initialization, and
> while the sense-buffer still should be good, scsi_request->cmd might
> have become an invalid pointer in the meantime. When the request
> is initialized in bsg.c, and the user's CDB is larger than BLK_MAX_CDB,
> bsg will replace it with a custom allocated buffer, which is freed when
> the user's command is finished, thus it dangles afterwards. When next a
> command is sent by the user that has a smaller/similar CDB as
> BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by
> scsi_request->__cmd, will not make a custom allocation, and write into
> undefined memory.
> 
> Fix this by splitting bsg_init_rq() into two functions:
>  - bsg_init_rq() is changed to only do the allocation of the
>sense-buffer, which is used to back the bsg job's reply buffer. This
>pointer should never change during the lifetime of a scsi_request, so
>it doesn't need re-initialization.
>  - bsg_initialize_rq() is a new function that makes use of
>'struct request_queue's initialize_rq_fn callback (which was
>introduced in v4.12). This is always called before the request is
>given out via blk_get_request(). This function does the remaining
>initialization that was previously done in bsg_init_rq(), and will
>also do it when the request is taken from the emergency-pool of the
>backing mempool.
> 
> Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
> Cc:  # 4.11+
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Benjamin Block 
> ---
>  block/bsg-lib.c | 27 +--
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Hannes Reinecke
On 10/03/2017 08:53 PM, Luis R. Rodriguez wrote:
> Now that all filesystems which used to rely on kthread
> freezing have been converted to filesystem freeze/thawing
> we can remove the kernel kthread freezer.
> 
> Signed-off-by: Luis R. Rodriguez 
> ---
>  Documentation/power/freezing-of-tasks.txt |  9 --
>  drivers/xen/manage.c  |  6 
>  include/linux/freezer.h   |  4 ---
>  kernel/power/hibernate.c  | 10 ++-
>  kernel/power/power.h  | 20 +
>  kernel/power/process.c| 47 
> ---
>  kernel/power/user.c   |  9 --
>  tools/power/pm-graph/analyze_suspend.py   |  1 -
>  8 files changed, 3 insertions(+), 103 deletions(-)
> 
Weelll ... have you checked MD?
It's using kernel threads, which probably should be stopped, too, when
going into suspend. After all, MD might be doing one of its internal
operations (e.g. resync) at that time, which will generate quite some I/O.

Cheers,

Hannes


Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Theodore Ts'o
On Tue, Oct 03, 2017 at 10:13:21PM +0200, Luis R. Rodriguez wrote:
> > After we add the NEEDS_RECOVERY flag, we need to make sure
> > recovery flag is pushed out to disk before any other changes are
> > allowed to be pushed out to disk.  That's why we originally did the
> > update synchronously.
> 
> I see. I had to try to dig through linux-history to see why this was done, and
> I actually could not find an exact commit which explained what you just did.
> Thanks!
> 
> Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed on thaw?

So let me explain what's going on.  When we do a freeze, we make sure
all of the blocks that are written to the journal are written to the
final location on disk, and the journal is truncated.  (This is
called a "journal checkpoint".)  Then we clear the NEEDS RECOVERY
feature flag and set the EXT4_VALID_FS flags in the superblock.  In
the no journal case, we flush out all metadata writes, and we set the
EXT4_VALID_FS flag.  In both cases, we call ext4_commit_super(sb, 1)
to make sure the flags update is pushed out to disk.  This puts the
file system into a "quiescent" state; in fact, it looks like the file
system has been unmounted, so that it becomes safe to run
dump/restore, run fsck -n on the file system, etc.

The problem on the thaw side is that we need to make sure that
EXT4_VALID_FS flag is removed (and if journaling is enabled, the NEEDS
RECOVERY feature flag is set) and the superblock is flushed out to the
storage device before any other writes are persisted on the disk.  In
the case where we have journalling enabled, we could wait until the
first journal commit to write out the superblock, since in journal
mode all metadata updates to the disk are suppressed until the journal
commit.  We don't do that today, but it is a change we could make.

However, in the no-journal mode we need to make sure the EXT4_VALID_FS
flag is cleared on disk before any other metadata operations can take
place, and calling ext4_commit_super(sb, 1) is the only real way to do
that.

> No, it was an empirical evaluation done at testing; I observed bio_submit()
> stalls, we never get completion from the device. Even if we call thaw at the
> very end of resume, after the devices should be up, we still end up in the
> same situation. Given how I order freezing filesystems after freezing
> userspace I do believe we should thaw filesystems prior to unfreezing
> userspace; that's why I placed the call where it is now.

So when we call ext4_commit_super(sb, 1), we issue the bio, and then
we block waiting for the I/O to complete.  That's a blocking call.  Is
the thaw context one which allows blocking?  If userspace is still
frozen, does that imply that the scheduler isn't allow to run?  If
that's the case, then that's probably the problem.

More generally, if the thawing code needs to allocate memory, or do
any number of things that could potentially block, this could
potentially be an issue.  Or if we have a network block device, or
something else in the storage stack that needs to run in a kernel thread
context (or a workqueue, etc.) --- does the fact that userspace is
frozen mean the scheduler is going to refuse to schedule()?

I know at one point we made a distinction between freezing userspace
threads and kernel threads, but there were people who didn't like this
because of unnecessary complexity.  It sounds to me like on the thaw
side, we might also need to unfreeze kernel threads first, and then
allow userspace threads to run.  Do we do that today, or in the new
freeze/thaw code?  It's been quite a while since I've looked at that
part of the kernel.

- Ted



Re: [PATCH v5 2/8] md: Neither resync nor reshape while the system is frozen

2017-10-03 Thread Luis R. Rodriguez
On Mon, Oct 02, 2017 at 03:52:12PM -0700, Bart Van Assche wrote:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3f7426120a3b..a2cf2a93b0cb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8961,6 +8963,37 @@ static void md_stop_all_writes(void)
>   mdelay(1000*1);
>  }
>  
> +/*
> + * Ensure that neither resyncing nor reshaping occurs while the system is
> + * frozen.
> + */
> +static int md_notify_pm(struct notifier_block *bl, unsigned long state,
> + void *unused)
> +{
> + struct mddev *mddev;
> + struct list_head *tmp;
> +
> + pr_debug("%s: state = %ld; system_freezing_cnt = %d\n", __func__, state,
> +  atomic_read(&system_freezing_cnt));
> +
> + switch (state) {
> + case PM_HIBERNATION_PREPARE:

Hm, why not also cover PM_SUSPEND_PREPARE and/or PM_RESTORE_PREPARE
here:

case PM_HIBERNATION_PREPARE:
case PM_SUSPEND_PREPARE:
case PM_RESTORE_PREPARE:

> + md_stop_all_writes();
> + break;
> + case PM_POST_HIBERNATION:

Likewise here:

case PM_POST_SUSPEND:
case PM_POST_HIBERNATION:
case PM_POST_RESTORE:

I have reviewed the kernel suspend ordering, and indeed things issued
from the pm notifier should suffice to be processed, given we actually
call the pm ops (dpm_prepare()) for device drivers *after* the notifier
is called and after userspace is frozen. That is:

   pm_suspend() --> enter_state() -->
 sys_sync()
 suspend_prepare() -->
 __pm_notifier_call_chain(PM_SUSPEND_PREPARE, ...);
 suspend_freeze_processes() -->
 freeze_processes() -->
 __usermodehelper_set_disable_depth(UMH_DISABLED);
 freeze all tasks ...
 freeze_kernel_threads()
 suspend_devices_and_enter() -->
 dpm_suspend_start() -->
 dpm_prepare()
 dpm_suspend()
 suspend_enter()  -->
 platform_suspend_prepare()
 dpm_suspend_late()
 freeze_enter()
 syscore_suspend()

On our way back up:

 enter_state() -->
 suspend_devices_and_enter() --> (bail from loop)
 dpm_resume_end() -->
 dpm_resume()
 dpm_complete()
 suspend_finish() -->
 suspend_thaw_processes() -->
 thaw_processes() -->
 __usermodehelper_set_disable_depth(UMH_FREEZING);
 thaw_workqueues();
 thaw all processes ...
 usermodehelper_enable();
 pm_notifier_call_chain(PM_POST_SUSPEND);

So notifier would indeed be the right tooling to use here.

  Luis


Re: [PATCH v5 1/8] md: Introduce md_stop_all_writes()

2017-10-03 Thread Luis R. Rodriguez
On Mon, Oct 02, 2017 at 03:52:11PM -0700, Bart Van Assche wrote:
> Introduce md_stop_all_writes() because the next patch will add
> a second caller for this function. Rename md_notifier into
> md_reboot_notifier to avoid that the name of this notifier will
> become confusing due to the next patch. This patch does not
> change any functionality.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 

Reviewed-by: Luis R. Rodriguez 

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Bart Van Assche
On Wed, 2017-10-04 at 02:47 +0200, Luis R. Rodriguez wrote:
>   3) Lookup for kthreads which seem to generate IO -- address / review if
>  removal of the freezer API can be done somehow with a quiescing. This
>  is currently for example being done on SCSI / md.
>  4) Only after all the above is done should we consider this patch or some
> form of it.

After having given this more thought, I think we should omit these last two
steps. Modifying the md driver such that it does not submit I/O requests while
processes are frozen requires either using the freezer API or open-coding it.
I think there is general agreement in the kernel community that open-coding a
single mechanism in multiple drivers is wrong. Does this mean that instead of
removing the freezer API we should keep it and review all its users instead?

Bart.


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 11:15:07PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, October 3, 2017 8:59:00 PM CEST Rafael J. Wysocki wrote:
> > On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote:
> > > Now that all filesystems which used to rely on kthread
> > > freezing have been converted to filesystem freeze/thawing
> > > we can remove the kernel kthread freezer.
> > > 
> > > Signed-off-by: Luis R. Rodriguez 
> > 
> > I like this one. :-)
> 
> However, suspend_freeze/thaw_processes() require some more work.
> 
> In particular, the freezing of workqueues is being removed here
> without a replacement.

Hrm, where do you see that? freeze_workqueues_busy() is still called from
try_to_freeze_tasks(). Likewise thaw_processes() also calls thaw_workqueues().
I did forget to nuke pm_nosig_freezing though.

Also as I have noted a few times now we have yet to determine if we can remove
all freezer calls on kthreads without causing a regression. Granted I'm being
overly careful here, I still would not be surprised to learn about a stupid use
case where this will be hard to remove now.

Only once this is done should this patch be considered. This will take time. So
I'd like more general consensus on long term approach:

  1) first address each fs to use its freezer calls on suspend/hibernate, and
 thaw on resume. While at it, remove freezer API calls on their respective
 kthreads.
  2) In parallel address removing freezer API calls on non-IO kthreads, these
 should be trivial, but who knows what surprises lurk here
  3) Lookup for kthreads which seem to generate IO -- address / review if
 removal of the freezer API can be done somehow with a quiescing. This
 is currently for example being done on SCSI / md.
  4) Only after all the above is done should we consider this patch or some
 form of it.

  Luis


Re: [PATCH V6 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices

2017-10-03 Thread Damien Le Moal
Bart,

On 10/4/17 05:56, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 09:19 +0900, Damien Le Moal wrote:
>> On 10/3/17 08:44, Bart Van Assche wrote:
>>> On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
  static void deadline_wunlock_zone(struct deadline_data *dd,
  struct request *rq)
  {
 +  unsigned long flags;
 +
 +  spin_lock_irqsave(&dd->zone_lock, flags);
 +
WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock));
deadline_clear_request_zone_wlock(rq);
 +
 +  spin_unlock_irqrestore(&dd->zone_lock, flags);
  }
>>>
>>> Is a request removed from the sort and fifo lists before being dispatched?
>>> If so, does that mean that obtaining zone_lock in the above function is not
>>> necessary?
>>
>> Yes, a request is removed from the sort tree and fifo list before
>> dispatching. But the dd->zone_lock spinlock is not there to protect
>> that, dd->lock protects the sort tree and fifo list. dd->zone_lock was
>> added to prevent the completed_request() method from changing a zone
>> lock state while deadline_fifo_requests() or deadline_next_request() are
>> running. Ex:
>>
>> Imagine this case: write request A for a zone Z is being executed (it
>> was dispatched) so Z is locked. Dispatch runs and inspects the next
>> requests in sort order. Let say we have the sequential writes B, C, D, E
>> queued for the same zone Z. First B is inspected and cannot be
>> dispatched (zone locked). Inspection moves on to C, but before that,
>> A completes and Z is unlocked. Then C will be OK to go as the zone is
>> now unlocked. But it is the wrong choice as it will result in out of
>> order write. B must be the next request dispatched after A.
>>
>> dd->zone_lock prevents this from happening. Without this spinlock, the
>> bad example case above happens very easily.
> 
> Hello Damien,
> 
> Thanks for the detailed and really clear reply. I hope you do not mind that I
> have a few further questions about this patch?
> - Does the zone_lock spinlock have to be acquired by both
>   deadline_wunlock_zone() callers or only by the call from the request
>   completion path?

Not really. Only the completion path is strongly needed as the insert
path unlock is under dd->lock, which prevents concurrent execution of
the sort or fifo request search. So the zone_lock lock/unlock could be
moved out of deadline_wunlock_zone() and done directly in
dd_completed_request().

> - Why do both the mq-deadline and the sd driver each have their own instance
>   of the zones_wlock bitmap? Has it been considered to convert both bitmaps
>   into a single bitmap that is shared by both kernel components and that
>   exists e.g. at the request queue level?

The sd driver level zone locking handles only the legacy path. Hence the
zone lock bitmap attached to the scsi disk struct. For scsi-mq/blk-mq,
mq-deadline does the zone locking. The two bitmaps never exist at the
same time.

Indeed we could move the zone lock bitmap in the request queue so that
it is common between legacy and mq cases. Christoph has a series doing
that, and going further by doing the zone locking directly within the
block layer dispatch code for both legacy and mq path. So the scsi level
locking and mq-deadline locking become unnecessary. Working on that new
series right now.

Best regards.

-- 
Damien Le Moal,
Western Digital


Re: [PATCH] blk-mq-debugfs: fix device sched directory for default scheduler

2017-10-03 Thread Jens Axboe
On 10/03/2017 03:57 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
> directories if a default scheduler was already configured by
> blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
> the same for the device-wide sched directory. Fix it.

Good catch, applied for 4.14.

-- 
Jens Axboe



[PATCH] blk-mq-debugfs: fix device sched directory for default scheduler

2017-10-03 Thread Omar Sandoval
From: Omar Sandoval 

In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
directories if a default scheduler was already configured by
blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
the same for the device-wide sched directory. Fix it.

Fixes: d332ce091813 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
Signed-off-by: Omar Sandoval 
---
Based on v4.14-rc3.

 block/blk-mq-debugfs.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..de294d775acf 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -815,10 +815,14 @@ int blk_mq_debugfs_register(struct request_queue *q)
goto err;
 
/*
-* blk_mq_init_hctx() attempted to do this already, but q->debugfs_dir
+* blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
 * didn't exist yet (because we don't know what to name the directory
 * until the queue is registered to a gendisk).
 */
+   if (q->elevator && !q->sched_debugfs_dir)
+   blk_mq_debugfs_register_sched(q);
+
+   /* Similarly, blk_mq_init_hctx() couldn't do this previously. */
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->debugfs_dir && blk_mq_debugfs_register_hctx(q, hctx))
goto err;
-- 
2.14.2



Re: [PATCH] null_blk: change configfs dependency to select

2017-10-03 Thread Shaohua Li
On Tue, Oct 03, 2017 at 03:50:16PM -0600, Jens Axboe wrote:
> A recent commit made null_blk depend on configfs, which is kind of
> annoying since you now have to find this dependency and enable that
> as well. Discovered this since I no longer had null_blk available
> on a box I needed to debug, since it got killed when the config
> updated after the configfs change was merged.
> 
> Fixes: 3bf2bd20734e ("nullb: add configfs interface")
> Signed-off-by: Jens Axboe 

Reviewed-by: Shaohua Li  
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 4a438b8abe27..2dfe99b328f8 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -17,7 +17,7 @@ if BLK_DEV
>  
>  config BLK_DEV_NULL_BLK
>   tristate "Null test block driver"
> - depends on CONFIGFS_FS
> + select CONFIGFS_FS
>  
>  config BLK_DEV_FD
>   tristate "Normal floppy disk support"
> 


[PATCH] null_blk: change configfs dependency to select

2017-10-03 Thread Jens Axboe
A recent commit made null_blk depend on configfs, which is kind of
annoying since you now have to find this dependency and enable that
as well. Discovered this since I no longer had null_blk available
on a box I needed to debug, since it got killed when the config
updated after the configfs change was merged.

Fixes: 3bf2bd20734e ("nullb: add configfs interface")
Signed-off-by: Jens Axboe 

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 4a438b8abe27..2dfe99b328f8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -17,7 +17,7 @@ if BLK_DEV
 
 config BLK_DEV_NULL_BLK
tristate "Null test block driver"
-   depends on CONFIGFS_FS
+   select CONFIGFS_FS
 
 config BLK_DEV_FD
tristate "Normal floppy disk support"



Re: [PATCH v2] blk-throttle: fix possible io stall when upgrade to max

2017-10-03 Thread Jens Axboe
On 09/30/2017 12:38 AM, Joseph Qi wrote:
> From: Joseph Qi 
> 
> There is a case which will lead to io stall. The case is described as
> follows. 
> /test1
>   |-subtest1
> /test2
>   |-subtest2
> And subtest1 and subtest2 each has 32 queued bios already.
> 
> Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
> bios as follows:
> 1) tg=subtest1, do nothing;
> 2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
> left, no need to schedule next dispatch;
> 3) tg=subtest2, do nothing;
> 4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
> left, no need to schedule next dispatch;
> 5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
> test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
> to /; note that test1 and test2 each still has 16 queued bios left;
> 6) tg=/, try to schedule next dispatch, but since disptime is now
> (updated in tg_update_disptime, wait=0), the pending timer is not scheduled
> in fact;
> 7) In throtl_upgrade_state it totally dispatches 32 queued bios and with
> 32 left. test1 and test2 each has 16 queued bios;
> 8) throtl_pending_timer_fn sees the left-over bios, but can do
> nothing, because throtl_select_dispatch returns 0, and test1/test2 have
> no pending tg.
> 
> The blktrace shows the following:
> 8,32   00 2.539007641 0  m   N throtl upgrade to max
> 8,32   00 2.539072267 0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
> 8,32   70 2.539077142 0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
> 
> So force schedule dispatch if there are pending children.

Applied for 4.14, thanks.

-- 
Jens Axboe



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Rafael J. Wysocki
On Tuesday, October 3, 2017 8:59:00 PM CEST Rafael J. Wysocki wrote:
> On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote:
> > Now that all filesystems which used to rely on kthread
> > freezing have been converted to filesystem freeze/thawing
> > we can remove the kernel kthread freezer.
> > 
> > Signed-off-by: Luis R. Rodriguez 
> 
> I like this one. :-)

However, suspend_freeze/thaw_processes() require some more work.

In particular, the freezing of workqueues is being removed here
without a replacement.

Thanks,
Rafael



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 03:09:53PM -0600, Shuah Khan wrote:
> On Tue, Oct 3, 2017 at 3:00 PM, Jiri Kosina  wrote:
> > On Tue, 3 Oct 2017, Pavel Machek wrote:
> >
> >> > Again, I agree that the (rare) kthreads that are actually "creating" new
> >> > I/O have to be somehow frozen and require special care.
> >>
> >> Agreed. Was any effort made to identify those special kernel threads?
> >
> > I don't think there is any other way than just inspecting all the
> > try_to_freeze() instances in the kernel, and understanding what that
> > particular kthread is doing.
> >
> > I've cleaned up most of the low-hanging fruit already, where the
> > try_to_freeze() was obviously completely pointless, but a lot more time
> > needs to be invested into this.
> >
> 
> There are about 36 drivers that call try_to_freeze() and half (18) of
> those are media drivers. Maybe it is easier to handle this on a
> sub-system by sub-system basis for a review of which of these usages
> could be removed. cc'ing Mauro and linux-media

Yes :)

I guess no one reads cover letters, but indeed. To be clear, this last
patch should only go in a few kernels from now, once all kthreads
have been vetted piecemeal.

This patch would be the final nail in the kthread freezer's coffin. It
should go in last, who knows how many years from now, if ever.

  Luis


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 07:58:41AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 11:53:10AM -0700, Luis R. Rodriguez wrote:
> > diff --git a/fs/super.c b/fs/super.c
> > index d45e92d9a38f..ce8da8b187b1 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1572,3 +1572,82 @@ int thaw_super(struct super_block *sb)
> > return 0;
> >  }
> >  EXPORT_SYMBOL(thaw_super);
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static bool super_allows_freeze(struct super_block *sb)
> > +{
> > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > +}
> 
> That's a completely misleading function name. All superblocks can be
> frozen - freeze_super() is filesystem independent. And given that, I
> don't see why these super_should_freeze() hoops need to be jumped
> through...

I added this given that ext4's current thaw implementation stalls on resume,
as it involves a submit_bio() that never completes. So I refactored ext4's
thaw to skip the sync. This requires more eyeballs; there may be an
underlying issue elsewhere. If it's not a bug elsewhere or in the ordering,
then there may be some restrictions on thaw when used on resume. Refer to my
notes to Ted on patch #4 for ext4.

> > +int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> > +{
> > +   int error = 0;
> > +
> > +   spin_lock(&sb_lock);
> > +   if (!super_should_freeze(sb))
> > +   goto out;
> > +
> > +   up_read(&sb->s_umount);
> > +   pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
> > +   error = freeze_super(sb);
> > +   down_read(&sb->s_umount);
> > +out:
> > +   if (error && error != -EBUSY)
> > +   pr_notice("%s (%s): Unable to freeze, error=%d",
> > + sb->s_type->name, sb->s_id, error);
> > +   spin_unlock(&sb_lock);
> > +   return error;
> > +}
> 
> I don't think this was ever tested.  Calling freeze_super() with a
> spinlock held will throw "sleeping in atomic" errors all over the
> place.

No, I did runtime-test it with an ext4 and an xfs rootfs on my
development tree, hammering on both while looping on suspend
and resume.

> Also, the s_umount lock juggling is nasty. Your new copy+pasted
> iterate_supers_reverse() takes the lock in read mode, yet all the
> freeze/thaw callers here want to take it in write mode.

Yeap, yuk!

> So, really,
> iterate_supers_reverse() needs to be iterate_supers_reverse_excl()
> and take the write lock, and freeze_super/thaw_super need to be
> factored into locked and unlocked versions.
> 
> In which case, we end up with:
> 
> int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> {
>   return freeze_locked_super(sb);
> }
> 
> int fs_suspend_thaw_sb(struct super_block *sb, void *priv)
> {
>   return thaw_locked_super(sb);
> }

Groovy, thanks, I suspected that locking was convoluted and
we could come up with something better. Will do it this way.

It's really what I hoped we could do :) I just needed to get
it in writing.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Shuah Khan
On Tue, Oct 3, 2017 at 3:00 PM, Jiri Kosina  wrote:
> On Tue, 3 Oct 2017, Pavel Machek wrote:
>
>> > Again, I agree that the (rare) kthreads that are actually "creating" new
>> > I/O have to be somehow frozen and require special care.
>>
>> Agreed. Was any effort made to identify those special kernel threads?
>
> I don't think there is any other way than just inspecting all the
> try_to_freeze() instances in the kernel, and understanding what that
> particular kthread is doing.
>
> I've cleaned up most of the low-hanging fruit already, where the
> try_to_freeze() was obviously completely pointless, but a lot more time
> needs to be invested into this.
>

There are about 36 drivers that call try_to_freeze() and half (18) of
those are media drivers. Maybe it is easier to handle this on a
sub-system by sub-system basis for a review of which of these usages
could be removed. cc'ing Mauro and linux-media

thanks,
-- Shuah


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 08:04:49AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 11:53:13AM -0700, Luis R. Rodriguez wrote:
> > Now that all filesystems which used to rely on kthread
> > freezing have been converted to filesystem freeze/thawing
> > we can remove the kernel kthread freezer.
> 
> Really? There's no other subsystem that relies on kernel thread
> and workqueue freezing to function correctly on suspend?

In my cover letter I noted this patch should not be considered
until *all* kthreads are vetted, so I agree vehemently with you.
The patch set only addressed two filesystems, so two kthreads. Much
other work would be needed before this last patch could *ever*
be considered.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 11:53:13AM -0700, Luis R. Rodriguez wrote:
> Now that all filesystems which used to rely on kthread
> freezing have been converted to filesystem freeze/thawing
> we can remove the kernel kthread freezer.

Really? There's no other subsystem that relies on kernel thread
and workqueue freezing to function correctly on suspend?

> -/**
> - * freeze_kernel_threads - Make freezable kernel threads go to the 
> refrigerator.
> - *
> - * On success, returns 0.  On failure, -errno and only the kernel threads are
> - * thawed, so as to give a chance to the caller to do additional cleanups
> - * (if any) before thawing the userspace tasks. So, it is the responsibility
> - * of the caller to thaw the userspace tasks, when the time is right.
> - */
> -int freeze_kernel_threads(void)
> -{
> - int error;
> -
> - pr_info("Freezing remaining freezable tasks ... ");
> -
> - pm_nosig_freezing = true;
> - error = try_to_freeze_tasks(false);

This freezes workqueues as well as kernel threads, so this affects
any subsystem that uses WQ_FREEZABLE. A quick glance tells me this
includes graphics drivers, spi devices, usb hubs, power management,
and a few filesystems, too.

> - if (!error)
> - pr_cont("done.");
> -
> - pr_cont("\n");
> - BUG_ON(in_atomic());
> -
> - if (error)
> - thaw_kernel_threads();
> - return error;
> -}
> -
>  void thaw_processes(void)
>  {
>   struct task_struct *g, *p;
> @@ -234,23 +207,3 @@ void thaw_processes(void)
>   pr_cont("done.\n");
>   trace_suspend_resume(TPS("thaw_processes"), 0, false);
>  }
> -
> -void thaw_kernel_threads(void)
> -{
> - struct task_struct *g, *p;
> -
> - pm_nosig_freezing = false;
> - pr_info("Restarting kernel threads ... ");
> -
> - thaw_workqueues();

And this is where the workqueues are thawed.

So I doubt we can safely remove all this code like this...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 13:47 -0700, Matthew Wilcox wrote:
> The SCSI layer can send SCSI commands; they aren't I/Os in the sense that
> they do reads and writes to media, but they are block requests.  Maybe those
> should be allowed even to frozen devices?

Hello Matthew,

I think only RQF_PM requests should be processed for frozen SCSI devices
and no other requests.

Bart.

Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Pavel Machek wrote:

> > Again, I agree that the (rare) kthreads that are actually "creating" new 
> > I/O have to be somehow frozen and require special care.
> 
> Agreed. Was any effort made to identify those special kernel threads?

I don't think there is any other way than just inspecting all the 
try_to_freeze() instances in the kernel, and understanding what that 
particular kthread is doing.

I've cleaned up most of the low-hanging fruit already, where the 
try_to_freeze() was obviously completely pointless, but a lot more time 
needs to be invested into this.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 11:53:10AM -0700, Luis R. Rodriguez wrote:
> This uses the existing filesystem freeze and thaw callbacks to
> freeze each filesystem on suspend/hibernation and thaw upon resume.
> 
> This is needed so that we properly really stop IO in flight without
> races after userspace has been frozen. Without this we rely on
> kthread freezing and its semantics are loose and error prone.
> For instance, even though a kthread may use try_to_freeze() and end
> up being frozen we have no way of being sure that everything that
> has been spawned asynchronously from it (such as timers) have also
> been stopped as well.
> 
> A long term advantage of also adding filesystem freeze / thawing
> supporting durign suspend / hibernation is that long term we may
> be able to eventually drop the kernel's thread freezing completely
> as it was originally added to stop disk IO in flight as we hibernate
> or suspend.
> 
> This also implies that many kthread users exist which have been
> adding freezer semantics onto its kthreads without need. These also
> will need to be reviewed later.
> 
> This is based on prior work originally by Rafael Wysocki and later by
> Jiri Kosina.
> 
> Signed-off-by: Luis R. Rodriguez 
> ---
>  fs/super.c | 79 ++
>  include/linux/fs.h | 13 +
>  kernel/power/process.c | 14 -
>  3 files changed, 105 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d45e92d9a38f..ce8da8b187b1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1572,3 +1572,82 @@ int thaw_super(struct super_block *sb)
>   return 0;
>  }
>  EXPORT_SYMBOL(thaw_super);
> +
> +#ifdef CONFIG_PM_SLEEP
> +static bool super_allows_freeze(struct super_block *sb)
> +{
> + return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> +}

That's a completely misleading function name. All superblocks can be
frozen - freeze_super() is filesystem independent. And given that, I
don't see why these super_should_freeze() hoops need to be jumped
through...

> +
> +static bool super_should_freeze(struct super_block *sb)
> +{
> + if (!sb->s_root)
> + return false;
> + if (!(sb->s_flags & MS_BORN))
> + return false;
> + /*
> +  * We don't freeze virtual filesystems, we skip those filesystems with
> +  * no backing device.
> +  */
> + if (sb->s_bdi == &noop_backing_dev_info)
> + return false;
> + /* No need to freeze read-only filesystems */
> + if (sb->s_flags & MS_RDONLY)
> + return false;
> + if (!super_allows_freeze(sb))
> + return false;
> +
> + return true;
> +}

> +
> +int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> +{
> + int error = 0;
> +
> + spin_lock(&sb_lock);
> + if (!super_should_freeze(sb))
> + goto out;
> +
> + up_read(&sb->s_umount);
> + pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
> + error = freeze_super(sb);
> + down_read(&sb->s_umount);
> +out:
> + if (error && error != -EBUSY)
> + pr_notice("%s (%s): Unable to freeze, error=%d",
> +   sb->s_type->name, sb->s_id, error);
> + spin_unlock(&sb_lock);
> + return error;
> +}

I don't think this was ever tested.  Calling freeze_super() with a
spinlock held will throw "sleeping in atomic" errors all over the
place.

Also, the s_umount lock juggling is nasty. Your new copy+pasted
iterate_supers_reverse() takes the lock in read mode, yet all the
freeze/thaw callers here want to take it in write mode. So, really,
iterate_supers_reverse() needs to be iterate_supers_reverse_excl()
and take the write lock, and freeze_super/thaw_super need to be
factored into locked and unlocked versions.

In which case, we end up with:

int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
{
return freeze_locked_super(sb);
}

int fs_suspend_thaw_sb(struct super_block *sb, void *priv)
{
return thaw_locked_super(sb);
}

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Pavel Machek
On Tue 2017-10-03 22:38:33, Jiri Kosina wrote:
> On Tue, 3 Oct 2017, Pavel Machek wrote:
> 
> > > This point I don't understand. What exactly do you mean?
> > 
> > Hibernation:
> > 
> > We freeze, do system snapshot, unfreeze, and write image to
> > disk. 
> 
> Where/why do you unfreeze between creating the snapshot and writing it 
> out?

Yep, sorry, we don't unfreeze.

> Again, I agree that the (rare) kthreads that are actually "creating" new 
> I/O have to be somehow frozen and require special care.

Agreed. Was any effort made to identify those special kernel threads?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH V6 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 09:19 +0900, Damien Le Moal wrote:
> On 10/3/17 08:44, Bart Van Assche wrote:
> > On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> > >  static void deadline_wunlock_zone(struct deadline_data *dd,
> > > struct request *rq)
> > >  {
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&dd->zone_lock, flags);
> > > +
> > >   WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock));
> > >   deadline_clear_request_zone_wlock(rq);
> > > +
> > > + spin_unlock_irqrestore(&dd->zone_lock, flags);
> > >  }
> > 
> > Is a request removed from the sort and fifo lists before being dispatched? 
> > If so,
> > does that mean that obtaining zone_lock in the above function is not 
> > necessary?
> 
> Yes, a request is removed from the sort tree and fifo list before
> dispatching. But the dd->zone_lock spinlock is not there to protect
> that, dd->lock protects the sort tree and fifo list. dd->zone_lock was
> added to prevent the completed_request() method from changing a zone
> lock state while deadline_fifo_requests() or deadline_next_request() are
> running. Ex:
> 
> Imagine this case: write request A for a zone Z is being executed (it
> was dispatched) so Z is locked. Dispatch runs and inspects the next
> requests in sort order. Let say we have the sequential writes B, C, D, E
> queued for the same zone Z. First B is inspected and cannot be
> dispatched (zone locked). Inspection move on to C, but before the that,
> A completes and Z is unlocked. Then C will be OK to go as the zone is
> now unlocked. But it is the wrong choice as it will result in out of
> order write. B must be the next request dispatched after A.
> 
> dd->zone_lock prevents this from happening. Without this spinlock, the
> bad example case above happens very easily.

Hello Damien,

Thanks for the detailed and really clear reply. I hope you do not mind that I
have a few further questions about this patch?
- Does the zone_lock spinlock have to be acquired by both 
deadline_wunlock_zone()
  callers or only by the call from the request completion path?
- Why do both the mq-deadline and the sd driver each have their own instance of
  the zones_wlock bitmap? Has it been considered to convert both bitmaps into a
  single bitmap that is shared by both kernel components and that exists e.g. at
  the request queue level?

Thanks,

Bart.

Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 01:47:55PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 03, 2017 at 10:05:11PM +0200, Luis R. Rodriguez wrote:
> > On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> > > Even though this patch can make kthread to not do I/O during
> > > suspend/resume, the SCSI quiesce still can cause similar issue
> > > in other case, like when sending SCSI domain validation
> > > to transport_spi, which happens in revalidate path, nothing
> > > to do with suspend/resume.
> > 
> > Are you saying that the SCSI layer can generate IO even without the 
> > filesystem
> > triggering it?
> 
> The SCSI layer can send SCSI commands; they aren't I/Os in the sense that
> they do reads and writes to media,

Then there should be no issue waiting for the device to respond?

> but they are block requests.  Maybe those should be allowed even to frozen
> devices?

I'm afraid *maybe* won't suffice here, we need a clear explanation as
to what happens with frozen devices on the block layer *if* the device
driver is already suspended. Does the block layer just queue these?
If the goal is to just get queued but not processed, then why even
send these quiesce commands? Wouldn't they only be processed *after*
resume?

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Jiri Kosina wrote:

> > What about the many drivers outside filesystems that use the 
> > set_freezable() / try_to_freeze() / wait_event_freezable() API?
> 
> Many/most of them are just completely bogus and pointless. 

More specifically -- they don't really care at all whether they get 
scheduled out exactly at the try_to_freeze() point; they are perfectly 
happy being scheduled out at any other scheduling point, and landing on 
the runqueue after the resume has been completed.

Sure, certain drivers need to take action when system is undergoing 
hibernation/suspend. But that's what PM callbacks are for, not kthread 
hibernation.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 10:12:04PM +0200, Pavel Machek wrote:
> On Tue 2017-10-03 11:53:13, Luis R. Rodriguez wrote:
> > Now that all filesystems which used to rely on kthread
> > freezing have been converted to filesystem freeze/thawing
> > we can remove the kernel kthread freezer.
> 
> Are you surely-sure? You mentioned other in kernel sources of writes;
> what about those?

You perhaps did not read the cover letter; it describes how this patch will
very likely take time to *ever* apply. We must cover tons of ground first, to
address precisely what you say.

In fact, other than kthreads that generate IO, we may now even have crazy
kthreads using the freezer API which *do not generate IO*, which are totally
bogus but now act as "features". We'll need to carefully carve out all freezer
API uses on all kthreads. This should be done atomically to avoid regressions
and ensure bisectability.

  Luis


Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Matthew Wilcox
On Tue, Oct 03, 2017 at 10:05:11PM +0200, Luis R. Rodriguez wrote:
> On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> > On Tue, Oct 03, 2017 at 11:53:08AM -0700, Luis R. Rodriguez wrote:
> > > INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
> > >   Tainted: GE   4.13.0-next-20170907+ #88
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > kworker/u8:8D0  1320  2 0x8000
> > > Workqueue: events_unbound async_run_entry_fn
> > > Call Trace:
> > >  __schedule+0x2ec/0x7a0
> > >  schedule+0x36/0x80
> > >  io_schedule+0x16/0x40
> > >  get_request+0x278/0x780
> > >  ? remove_wait_queue+0x70/0x70
> > >  blk_get_request+0x9c/0x110
> > >  scsi_execute+0x7a/0x310 [scsi_mod]
> > >  sd_sync_cache+0xa3/0x190 [sd_mod]
> > >  ? blk_run_queue+0x3f/0x50
> > >  sd_suspend_common+0x7b/0x130 [sd_mod]
> > >  ? scsi_print_result+0x270/0x270 [scsi_mod]
> > >  sd_suspend_system+0x13/0x20 [sd_mod]
> > >  do_scsi_suspend+0x1b/0x30 [scsi_mod]
> > >  scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
> > >  ? device_for_each_child+0x69/0x90
> > >  scsi_bus_suspend+0x15/0x20 [scsi_mod]
> > >  dpm_run_callback+0x56/0x140
> > >  ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
> > >  __device_suspend+0xf1/0x340
> > >  async_suspend+0x1f/0xa0
> > >  async_run_entry_fn+0x38/0x160
> > >  process_one_work+0x191/0x380
> > >  worker_thread+0x4e/0x3c0
> > >  kthread+0x109/0x140
> > >  ? process_one_work+0x380/0x380
> > >  ? kthread_create_on_node+0x70/0x70
> > >  ret_from_fork+0x25/0x30
> > 
> > Actually we are trying to fix this issue inside block layer/SCSI, please
> > see the following link:
> > 
> > https://marc.info/?l=linux-scsi&m=150703947029304&w=2
> > 
> > Even though this patch can make kthread to not do I/O during
> > suspend/resume, the SCSI quiesce still can cause similar issue
> > in other case, like when sending SCSI domain validation
> > to transport_spi, which happens in revalidate path, nothing
> > to do with suspend/resume.
> 
> Are you saying that the SCSI layer can generate IO even without the filesystem
> triggering it?

The SCSI layer can send SCSI commands; they aren't I/Os in the sense that
they do reads and writes to media, but they are block requests.  Maybe those
should be allowed even to frozen devices?



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Rafael J. Wysocki
On Tue, Oct 3, 2017 at 10:38 PM, Jiri Kosina  wrote:
> On Tue, 3 Oct 2017, Pavel Machek wrote:
>
>> > This point I don't understand. What exactly do you mean?
>>
>> Hibernation:
>>
>> We freeze, do system snapshot, unfreeze, and write image to
>> disk.
>
> Where/why do you unfreeze between creating the snapshot and writing it
> out?

We do not unfreeze then.


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:32:39PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 22:23 +0200, Luis R. Rodriguez wrote:
> > On Tue, Oct 03, 2017 at 08:02:22PM +, Bart Van Assche wrote:
> > > On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> > > > +static bool super_allows_freeze(struct super_block *sb)
> > > > +{
> > > > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > > > +}
> > > 
> > > A minor comment: if "!!" would be left out the compiler will perform the
> > > conversion from int to bool implicitly
> > 
> > For all compilers?
> 
> Let's have a look at the output of the following commands:
> 
> $ PAGER= git grep 'typedef.*[[:blank:]]bool;' include
> include/linux/types.h:typedef _Bool bool;
> $ PAGER= git grep std= Makefile
> Makefile:   -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
> Makefile:  -std=gnu89 $(call cc-option,-fno-PIE)
> 
> From 
> https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/C-Dialect-Options.html#C-Dialect-Options:
> ‘gnu89’
> GNU dialect of ISO C90 (including some C99 features).
> 
> I think this means that the Linux kernel tree can only be compiled correctly
> by compilers that support the C11 type _Bool.

:*) beautiful, thanks.

> > > Anyway, I agree with the approach of this patch and I think
> > > that freezing filesystems before processes are frozen would be a big step
> > > forward.
> > 
> > Great! But please note, the current implementation calls fs_suspend_freeze()
> > *after* try_to_freeze_tasks(), ie: this implementation freezes userspace and
> > only after then filesystems.
> 
> What will the impact be of that choice on filesystems implemented in 
> userspace?

Depends on what kernel hooks those use? Also, now is a good time for those
working on userspace filesystems to chime in :) It's why I am stating this --
I am not saying I have found the right order; I have found an order that works
for me, and we need consensus on what the right order should be.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Pavel Machek wrote:

> > This point I don't understand. What exactly do you mean?
> 
> Hibernation:
> 
> We freeze, do system snapshot, unfreeze, and write image to
> disk. 

Where/why do you unfreeze between creating the snapshot and writing it 
out?

Again, I agree that the (rare) kthreads that are actually "creating" new 
I/O have to be somehow frozen and require special care.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 22:23 +0200, Luis R. Rodriguez wrote:
> On Tue, Oct 03, 2017 at 08:02:22PM +, Bart Van Assche wrote:
> > On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> > > +static bool super_allows_freeze(struct super_block *sb)
> > > +{
> > > + return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > > +}
> > 
> > A minor comment: if "!!" would be left out the compiler will perform the
> > conversion from int to bool implicitly
> 
> For all compilers?

Let's have a look at the output of the following commands:

$ PAGER= git grep 'typedef.*[[:blank:]]bool;' include
include/linux/types.h:typedef _Bool bool;
$ PAGER= git grep std= Makefile
Makefile:   -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
Makefile:  -std=gnu89 $(call cc-option,-fno-PIE)

From 
https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/C-Dialect-Options.html#C-Dialect-Options:
‘gnu89’
GNU dialect of ISO C90 (including some C99 features).

I think this means that the Linux kernel tree can only be compiled correctly
by compilers that support the C11 type _Bool.

> > Anyway, I agree with the approach of this patch and I think
> > that freezing filesystems before processes are frozen would be a big step
> > forward.
> 
> Great! But please note, the current implementation calls fs_suspend_freeze()
> *after* try_to_freeze_tasks(), ie: this implementation freezes userspace and
> only after then filesystems.

What will the impact be of that choice on filesystems implemented in userspace?

Thanks,

Bart.

Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:21:42PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 22:17 +0200, Jiri Kosina wrote:
> > On Tue, 3 Oct 2017, Bart Van Assche wrote:
> > > What about the many drivers outside filesystems that use the 
> > > set_freezable() / try_to_freeze() / wait_event_freezable() API?
> > 
> > Many/most of them are just completely bogus and pointless. I've killed a 
> > lot of those in the past, but the copy/paste programming is just too 
> > strong enemy to fight against.
> 
> If just a single driver would use that API to prevent that I/O occurs while
> processes are frozen then this patch will break that driver.

Yes! Although, as Jiri points out, it's debatable whether these uses are
valid, as you suggest there may be *valid* reasons for them *now* even though
originally they *may* have been bogus...

It's why I believe this can now only be done piecemeal, slowly but steadily,
to avoid regressions and make this effort bisectable.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Bart Van Assche wrote:

> If just a single driver would use that API to prevent that I/O occurs while
> processes are frozen then this patch will break that driver. I would like to
> know your opinion about the following patch in particular: "[PATCH v4 1/7] md:
> Make md resync and reshape threads freezable"
> (https://www.spinics.net/lists/linux-block/msg17854.html).

Alright, if there are kthreads which are actually *creating* new I/O 
(contrary to kthreads being I/O completion helpers) -- which apparently is 
the md case here, then those definitely have to be frozen in some form.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:02:22PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> > +static bool super_allows_freeze(struct super_block *sb)
> > +{
> > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > +}
> 
> A minor comment: if "!!" would be left out the compiler will perform the
> conversion from int to bool implicitly

For all compilers?

> so I propose to leave out the "!!" and parentheses.

OK!

> Anyway, I agree with the approach of this patch and I think
> that freezing filesystems before processes are frozen would be a big step
> forward.

Great! But please note, the current implementation calls fs_suspend_freeze()
*after* try_to_freeze_tasks(), ie: this implementation freezes userspace
first and only then filesystems.

Order will be *critical* here to get right, so we should definitely figure
out if this is definitely the right place (TM) to call fs_suspend_freeze().

Lastly, a final minor implementation note:

I think using a PM notifier would have been much cleaner; in fact that was the
way I originally implemented this, orthogonally to Jiri's work. However, to get
this right the semantics of __pm_notifier_call_chain() would need to be
expanded with another state, perhaps PM_USERSPACE_FROZEN, for example. I
decided in the end a new state was not worth it given we would just have one
user: fs freezing. So to be clear, wrapping this code inside the fs code via a
notifier, without touching kernel/power/process.c, is not possible with
today's semantics, nor do I think it is worth expanding those semantics.

This approach is explicit about order and requirements for those that should
care: those that will maintain kernel/power/process.c and friends. Having
this in a notifier would shift this implicitly.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 22:17 +0200, Jiri Kosina wrote:
> On Tue, 3 Oct 2017, Bart Van Assche wrote:
> > What about the many drivers outside filesystems that use the 
> > set_freezable() / try_to_freeze() / wait_event_freezable() API?
> 
> Many/most of them are just completely bogus and pointless. I've killed a 
> lot of those in the past, but the copy/paste programming is just too 
> strong enemy to fight against.

If just a single driver would use that API to prevent that I/O occurs while
processes are frozen then this patch will break that driver. I would like to
know your opinion about the following patch in particular: "[PATCH v4 1/7] md:
Make md resync and reshape threads freezable"
(https://www.spinics.net/lists/linux-block/msg17854.html).

Thanks,

Bart.

Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Pavel Machek
Hi!

> > Can memory management doing background writes be a problem?
> 
> This point I don't understand. What exactly do you mean?

Hibernation:

We freeze, do a system snapshot, unfreeze, and write the image to
disk. Memory management wakes up (this used to be called bdflush),
decides it's time to write some dirty data back to disk. Powerdown.

But the state on the disk now does not correspond to what the image
expects. Problem?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Bart Van Assche wrote:

> What about the many drivers outside filesystems that use the 
> set_freezable() / try_to_freeze() / wait_event_freezable() API?

Many/most of them are just completely bogus and pointless. I've killed a 
lot of those in the past, but the copy/paste programming is just too 
strong an enemy to fight against.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Pavel Machek wrote:

> Can knfsd be a problem?

It'd actually be useful to walk through all the 'try_to_freeze()' 
instances in kthreads, and analyze those. I've started this effort, and as 
a result I removed most of them; but there are still quite a few 
remaining. And knfsd is one of those.

> Can memory management doing background writes be a problem?

This point I don't understand. What exactly do you mean?

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 03:59:30PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 03, 2017 at 11:53:12AM -0700, Luis R. Rodriguez wrote:
> > @@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
> > ext4_set_feature_journal_needs_recovery(sb);
> > }
> >  
> > -   ext4_commit_super(sb, 1);
> > +   ext4_commit_super(sb, 0);
> > return 0;
> >  }
> >
> 
> After we add the NEEDS_RECOVERY flag, we need to make sure the
> recovery flag is pushed out to disk before any other changes are
> allowed to be pushed out to disk.  That's why we originally did the
> update synchronously.

I see. I had to dig through linux-history to see why this was done, and
I actually could not find an exact commit that explains what you just did.
Thanks!

Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed
on thaw?

> There are other ways we could fulfill this requirement, but doing a
> synchronous update is the simplest way to handle it.  Was it
> necessary to change this given the other changes you are making to the fs
> freeze implementation?

No, it was an empirical evaluation done during testing: I observed bio_submit()
stalls, where we never get a completion from the device. Even if we call thaw at
the very end of resume, after the devices should be up, we still end up in the
same situation. Given how I order freezing filesystems after freezing userspace,
I do believe we should thaw filesystems prior to unfreezing userspace; that is
why I placed the call where it is now.

So this is something I was hoping others could help grok/iron out. I'd prefer
if we could live with thawing unchanged, but that requires figuring this
issue out: why we can't invoke bio_submit() on fs thaw as-is in this
implementation.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> Now that all filesystems which used to rely on kthread
> freezing have been converted to filesystem freeze/thawing
> we can remove the kernel kthread freezer.

Many subsystems use system_freezable_wq and/or
system_freezable_power_efficient_wq to queue work that should not run while
processes are frozen. Should it be mentioned in the patch description that
this patch does not affect workqueues for which WQ_FREEZABLE has been set?

What about the many drivers outside filesystems that use the set_freezable() /
try_to_freeze() / wait_event_freezable() API?

Thanks,

Bart.

Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Pavel Machek
On Tue 2017-10-03 11:53:13, Luis R. Rodriguez wrote:
> Now that all filesystems which used to rely on kthread
> freezing have been converted to filesystem freeze/thawing
> we can remove the kernel kthread freezer.

Are you surely-sure? You mentioned other in-kernel sources of writes;
what about those?

Can knfsd be a problem?

Can memory management doing background writes be a problem?

Did you do some hibernation testing?

> Signed-off-by: Luis R. Rodriguez 

Please don't apply this just yet.

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Jiri Kosina
On Tue, 3 Oct 2017, Luis R. Rodriguez wrote:

> This uses the existing filesystem freeze and thaw callbacks to
> freeze each filesystem on suspend/hibernation and thaw upon resume.
> 
> This is needed so that we properly really stop IO in flight without
> races after userspace has been frozen. Without this we rely on
> kthread freezing and its semantics are loose and error prone.
> For instance, even though a kthread may use try_to_freeze() and end
> up being frozen we have no way of being sure that everything that
> has been spawned asynchronously from it (such as timers) have also
> been stopped as well.
> 
> A long term advantage of also adding filesystem freeze / thaw
> support during suspend / hibernation is that long term we may
> be able to eventually drop the kernel's thread freezing completely,
> as it was originally added to stop disk IO in flight as we hibernate
> or suspend.
> 
> This also implies that many kthread users exist which have been
> adding freezer semantics onto their kthreads without need. These also
> will need to be reviewed later.
> 
> This is based on prior work originally by Rafael Wysocki and later by
> Jiri Kosina.
> 
> Signed-off-by: Luis R. Rodriguez 

Thanks a lot for picking this up; I never found time to actually finalize 
it.

Acked-by: Jiri Kosina 

for patches 2 and 5 (the fs agnostic code), which were in the core of my 
original work.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> On Tue, Oct 03, 2017 at 11:53:08AM -0700, Luis R. Rodriguez wrote:
> > INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
> >   Tainted: GE   4.13.0-next-20170907+ #88
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > kworker/u8:8D0  1320  2 0x8000
> > Workqueue: events_unbound async_run_entry_fn
> > Call Trace:
> >  __schedule+0x2ec/0x7a0
> >  schedule+0x36/0x80
> >  io_schedule+0x16/0x40
> >  get_request+0x278/0x780
> >  ? remove_wait_queue+0x70/0x70
> >  blk_get_request+0x9c/0x110
> >  scsi_execute+0x7a/0x310 [scsi_mod]
> >  sd_sync_cache+0xa3/0x190 [sd_mod]
> >  ? blk_run_queue+0x3f/0x50
> >  sd_suspend_common+0x7b/0x130 [sd_mod]
> >  ? scsi_print_result+0x270/0x270 [scsi_mod]
> >  sd_suspend_system+0x13/0x20 [sd_mod]
> >  do_scsi_suspend+0x1b/0x30 [scsi_mod]
> >  scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
> >  ? device_for_each_child+0x69/0x90
> >  scsi_bus_suspend+0x15/0x20 [scsi_mod]
> >  dpm_run_callback+0x56/0x140
> >  ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
> >  __device_suspend+0xf1/0x340
> >  async_suspend+0x1f/0xa0
> >  async_run_entry_fn+0x38/0x160
> >  process_one_work+0x191/0x380
> >  worker_thread+0x4e/0x3c0
> >  kthread+0x109/0x140
> >  ? process_one_work+0x380/0x380
> >  ? kthread_create_on_node+0x70/0x70
> >  ret_from_fork+0x25/0x30
> 
> Actually we are trying to fix this issue inside block layer/SCSI, please
> see the following link:
> 
> https://marc.info/?l=linux-scsi&m=150703947029304&w=2
> 
> Even though this patch can make kthreads not do I/O during
> suspend/resume, SCSI quiesce can still cause a similar issue
> in other cases, like when sending SCSI domain validation
> to transport_spi, which happens in the revalidate path and has
> nothing to do with suspend/resume.

Are you saying that the SCSI layer can generate IO even without the filesystem
triggering it?

If so then by all means these are certainly other areas where we should
address quiescing, as I noted in my email.

Also, *iff* the generated IO is triggered from the SCSI suspend callback, then
clearly the next question is whether this is truly needed. If so then yes, it
should be quiesced and all restrictions should be considered.

Note that device pm ops get called first, then later the notifiers are
processed, and only later is userspace frozen. It's this gap this patch
set addresses, and it's also where I saw the issue creep in. Depending on
the questions above we may or may not need more work in other layers.

So I am not saying this patch set is sufficient to address all IO quiescing,
quite the contrary: I acknowledged that each subsystem should vet whether it
has non-FS-generated IO (it seems you and Bart are doing a great job at this
analysis on SCSI). This patchset however should help with odd corner cases
which *are* triggered by the FS, for which the spaghetti-code requirements of
kthread freezing clearly do not suffice.

> So IMO the root cause is in SCSI's quiesce.
> 
> You can find a similar description in the above link:
> 
>   Once a SCSI device is put into QUIESCE, no new request except for
>   RQF_PREEMPT ones can be dispatched to SCSI successfully, and
>   scsi_device_quiesce() simply waits for completion of the I/Os already
>   dispatched to the SCSI stack. It isn't enough at all.

I see, so the race here is that *in* the pm ops of SCSI we generate IO
against a device already in QUIESCE.

> 
>   Because new requests can still be coming in, but all the allocated
>   requests can't be dispatched successfully, the request pool can be
>   consumed up easily. Then an RQF_PREEMPT request can't be allocated,
>   and we hang forever, just like the stack trace you posted.
> 

I see. Makes sense. So SCSI quiesce has restrictions and they're being
violated.

Anyway, don't think of this as a replacement for yours or Bart's work then, but
rather supplemental.

Are you saying we should not move forward with this patch set, or simply that
the above splat is rather properly fixed with SCSI quiescing? Given your
explanation I'd have to agree. But even with this considered and accepted, from
a theoretical perspective -- why would this patch set actually seem to fix the
same issue? Is it that it just *seems* to fix it?

  Luis


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> +static bool super_allows_freeze(struct super_block *sb)
> +{
> + return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> +}

A minor comment: if the "!!" were left out the compiler would perform the
conversion from int to bool implicitly, so I propose leaving out the "!!"
and the parentheses. Anyway, I agree with the approach of this patch and I think
that freezing filesystems before processes are frozen would be a big step
forward.

Bart.

Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Theodore Ts'o
On Tue, Oct 03, 2017 at 11:53:12AM -0700, Luis R. Rodriguez wrote:
> @@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
>   ext4_set_feature_journal_needs_recovery(sb);
>   }
>  
> - ext4_commit_super(sb, 1);
> + ext4_commit_super(sb, 0);
>   return 0;
>  }
>

After we add the NEEDS_RECOVERY flag, we need to make sure the
recovery flag is pushed out to disk before any other changes are
allowed to be pushed out to disk.  That's why we originally did the
update synchronously.

There are other ways we could fulfill this requirement, but doing a
synchronous update is the simplest way to handle it.  Was it
necessary to change this given the other changes you are making to the fs
freeze implementation?

- Ted


Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Ming Lei
On Tue, Oct 03, 2017 at 11:53:08AM -0700, Luis R. Rodriguez wrote:
> At the 2015 South Korea kernel summit Jiri Kosina had pointed out the issue of
> the sloppy semantics of the kthread freezer, lwn covered this pretty well [0].
> In short, he explained how the original purpose of the freezer was to aid
> in going to suspend to ensure no unwanted IO activity would cause filesystem
> corruption. Kernel kthreads require special freezer handling though, the call
> try_to_freeze() is often sprinkled at strategic places, but sometimes this is
> done without set_freezable(), making try_to_freeze() useless. Other helpers
> such as freezable_schedule_timeout() exist, and very likely they are not used
> in any consistent and proper way either all over the kernel. Dealing with
> these helpers alone also does not and cannot ensure that everything that has
> been spawned asynchronously from a kthread (such as timers) is stopped as
> well; these are left to the kthread user implementation, and chances are
> pretty high there are many bugs lurking here. There are even non-IO-bound
> kthreads now using the freezer APIs, where this is not even needed!
> 
> Jiri suggested we can easily replace the complexities of the kthread freezer
> by just using the existing filesystem freeze/thaw callbacks on hibernation
> and suspend.
> 
> I've picked up Jiri's work given there are bugs which after inspection don't
> seem like real bugs, but just random IO loosely waiting to be completed, and
> the kernel not really being able to tell me who the culprit is. In fact even
> if one plugs a fix, one ends up in another random place and it's really
> unclear who is to blame next.
> 
> For instance, to reproduce a simple suspend bug on what may at first seem to
> be an XFS bug, one can issue a dd onto disk prior to suspend, and we'll get a
> stall on our way to suspend, claiming the issue was the SCSI layer not being
> able to quiesce the disk. This was reported on OpenSUSE and reproduced on
> linux-next easily [1]. The following script can be run while we loop on
> systemctl suspend and qemu system_wakeup calls to resume:
> 
>   while true; do
>   dd if=/dev/zero of=crap bs=1M count=1024 &> /dev/null
>   done
> 
> You end up with a hung suspend attempt, and eventually a splat
> as follows with a hung task notification:
> 
> INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
>   Tainted: GE   4.13.0-next-20170907+ #88
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u8:8D0  1320  2 0x8000
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  __schedule+0x2ec/0x7a0
>  schedule+0x36/0x80
>  io_schedule+0x16/0x40
>  get_request+0x278/0x780
>  ? remove_wait_queue+0x70/0x70
>  blk_get_request+0x9c/0x110
>  scsi_execute+0x7a/0x310 [scsi_mod]
>  sd_sync_cache+0xa3/0x190 [sd_mod]
>  ? blk_run_queue+0x3f/0x50
>  sd_suspend_common+0x7b/0x130 [sd_mod]
>  ? scsi_print_result+0x270/0x270 [scsi_mod]
>  sd_suspend_system+0x13/0x20 [sd_mod]
>  do_scsi_suspend+0x1b/0x30 [scsi_mod]
>  scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
>  ? device_for_each_child+0x69/0x90
>  scsi_bus_suspend+0x15/0x20 [scsi_mod]
>  dpm_run_callback+0x56/0x140
>  ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
>  __device_suspend+0xf1/0x340
>  async_suspend+0x1f/0xa0
>  async_run_entry_fn+0x38/0x160
>  process_one_work+0x191/0x380
>  worker_thread+0x4e/0x3c0
>  kthread+0x109/0x140
>  ? process_one_work+0x380/0x380
>  ? kthread_create_on_node+0x70/0x70
>  ret_from_fork+0x25/0x30

Actually we are trying to fix this issue inside block layer/SCSI, please
see the following link:

https://marc.info/?l=linux-scsi&m=150703947029304&w=2

Even though this patch can make kthread to not do I/O during
suspend/resume, the SCSI quiesce still can cause similar issue
in other case, like when sending SCSI domain validation
to transport_spi, which happens in revalidate path, nothing
to do with suspend/resume.

So IMO the root cause is in SCSI's quiesce.

You can find a similar description in the above link:

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT ones can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os already
dispatched to the SCSI stack. It isn't enough at all.

Because new requests can still be coming in, but all the allocated
requests can't be dispatched successfully, the request pool can be
consumed up easily. Then an RQF_PREEMPT request can't be allocated,
and we hang forever, just like the stack trace you posted.

-- 
Ming


Re: [PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Tejun Heo
Hello,

On Wed, Oct 04, 2017 at 03:20:40AM +0800, Ming Lei wrote:
> On Tue, Oct 03, 2017 at 07:14:59AM -0700, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Oct 03, 2017 at 10:04:03PM +0800, Ming Lei wrote:
> > > Block layer need to call this function after holding
> > > rcu lock in a real hot path, so introduce this helper.
> > 
> > The patch description is too anemic.  It doesn't even describe what
> > changes are being made, let alone justifying them.
> 
> Sorry for Cc'ing you; this change is needed by the following
> patch:
> 
>   https://marc.info/?l=linux-scsi&m=150703956929333&w=2
> 
> The block layer needs to call percpu_ref_tryget_live()
> while holding the RCU read lock; given it is a hot I/O
> path, that is why I made this patch, so that we can
> avoid a nested RCU read lock and save a bit of cost.

I'm not necessarily against it but please name it sth like
percpu_ref_tryget_live_locked(), add lockdep assertions in the inner
function and document properly.

Thanks.

-- 
tejun


Re: [PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Ming Lei
On Tue, Oct 03, 2017 at 06:40:24PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 22:04 +0800, Ming Lei wrote:
> > Block layer need to call this function after holding
> > rcu lock in a real hot path, so introduce this helper.
> 
> Since it is allowed to nest rcu_read_lock_sched() calls I don't think
> that this patch is necessary.

Yeah, I know that; with this patch, we can avoid the nested RCU lock.
As I mentioned, it is a real hot path, so it may be better not to
introduce the extra RCU read lock/unlock.

If you guys don't care about the little cost, I can remove that.

-- 
Ming


Re: [PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Ming Lei
On Tue, Oct 03, 2017 at 07:14:59AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Oct 03, 2017 at 10:04:03PM +0800, Ming Lei wrote:
> > Block layer need to call this function after holding
> > rcu lock in a real hot path, so introduce this helper.
> 
> The patch description is too anemic.  It doesn't even describe what
> changes are being made, let alone justifying them.

Sorry for Cc'ing you; this change is needed by the following
patch:

https://marc.info/?l=linux-scsi&m=150703956929333&w=2

The block layer needs to call percpu_ref_tryget_live()
while holding the RCU read lock; given it is a hot I/O
path, that is why I made this patch, so that we can
avoid a nested RCU read lock and save a bit of cost.

--
Ming


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Rafael J. Wysocki
On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote:
> Now that all filesystems which used to rely on kthread
> freezing have been converted to filesystem freeze/thawing
> we can remove the kernel kthread freezer.
> 
> Signed-off-by: Luis R. Rodriguez 

I like this one. :-)

Thanks,
Rafael



[RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
At the 2015 South Korea kernel summit Jiri Kosina had pointed out the issue of
the sloppy semantics of the kthread freezer, lwn covered this pretty well [0].
In short, he explained how the original purpose of the freezer was to aid
in going to suspend to ensure no unwanted IO activity would cause filesystem
corruption. Kernel kthreads require special freezer handling though, the call
try_to_freeze() is often sprinkled at strategic places, but sometimes this is
done without set_freezable(), making try_to_freeze() useless. Other helpers such
as freezable_schedule_timeout() exist, and very likely they are not used in any
consistent and proper way either all over the kernel. Dealing with these
helpers alone also does not and cannot ensure that everything that has been
spawned asynchronously from a kthread (such as timers) is stopped as well;
these are left to the kthread user implementation, and chances are pretty high
there are many bugs lurking here. There are even non-IO-bound kthreads now
using the freezer APIs, where this is not even needed!

Jiri suggested we can easily replace the complexities of the kthread freezer
by just using the existing filesystem freeze/thaw callbacks on hibernation and
suspend.

I've picked up Jiri's work given there are bugs which after inspection don't
seem like real bugs, but just random IO loosely waiting to be completed, and the
kernel not really being able to tell me who the culprit is. In fact even if one
plugs a fix, one ends up in another random place and it's really unclear who is
to blame next.

For instance, to reproduce a simple suspend bug on what may at first seem to be
an XFS bug, one can issue a dd onto disk prior to suspend, and we'll get a
stall on our way to suspend, claiming the issue was the SCSI layer not being
able to quiesce the disk. This was reported on OpenSUSE and reproduced on
linux-next easily [1]. The following script can be run while we loop on
systemctl suspend and qemu system_wakeup calls to resume:

while true; do
dd if=/dev/zero of=crap bs=1M count=1024 &> /dev/null
done

You end up with a hung suspend attempt, and eventually a splat
as follows with a hung task notification:

INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
  Tainted: GE   4.13.0-next-20170907+ #88
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u8:8D0  1320  2 0x8000
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 __schedule+0x2ec/0x7a0
 schedule+0x36/0x80
 io_schedule+0x16/0x40
 get_request+0x278/0x780
 ? remove_wait_queue+0x70/0x70
 blk_get_request+0x9c/0x110
 scsi_execute+0x7a/0x310 [scsi_mod]
 sd_sync_cache+0xa3/0x190 [sd_mod]
 ? blk_run_queue+0x3f/0x50
 sd_suspend_common+0x7b/0x130 [sd_mod]
 ? scsi_print_result+0x270/0x270 [scsi_mod]
 sd_suspend_system+0x13/0x20 [sd_mod]
 do_scsi_suspend+0x1b/0x30 [scsi_mod]
 scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
 ? device_for_each_child+0x69/0x90
 scsi_bus_suspend+0x15/0x20 [scsi_mod]
 dpm_run_callback+0x56/0x140
 ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
 __device_suspend+0xf1/0x340
 async_suspend+0x1f/0xa0
 async_run_entry_fn+0x38/0x160
 process_one_work+0x191/0x380
 worker_thread+0x4e/0x3c0
 kthread+0x109/0x140
 ? process_one_work+0x380/0x380
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x25/0x30

Sprinkling freezer_do_not_count() on scsi_execute() *iff* we're going to
suspend is one way to fix this splat, however this just pushes the bug
elsewhere, and we keep doing that forever. The right thing to do here is ensure
that after userspace is frozen we also freeze all filesystem behaviour.

It's possible other symptoms have been observed elsewhere, and similar
workarounds are being devised. Perhaps this work can help put a stop to such
workarounds and complexities.

While this all seems trivial and well discussed, and so should be well accepted
for us to go full swing with a replacement of the kthread freezer with
filesystem freeze / thaw -- not all filesystems have a respective freeze/thaw
call. Also in practice it would seem some thawing calls need some revision.
For instance thawing with a sync call on ext4 currently issues a bio_submit(),
which in turn stalls on resume. Each call/filesystem should then be vetted
manually, as I've done for ext4 in this series.

Likewise, while kthread freezing may have been designed to help avoid IO
on disk, its sloppy use on kthreads may now be providing buggy functionality
which somehow *helps* today. The removal of freezer helpers on kthreads which
are not IO bound should also be carefully vetted to avoid potential
regressions.

Then we have the few areas in the kernel which may generate disk IO that does
not come from filesystems -- this was mentioned recently at the Alpine Linux
Persistence and Storage Summit [2] -- such drivers would need to get proper
suspend calls eventually.

Given all the above then I'd like to move *carefully* with the 

[RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
This also removes the superfluous freezer calls as they are no longer
needed. We need to avoid a sync call on thaw, as otherwise we end up with
a stall on bio_submit().

Signed-off-by: Luis R. Rodriguez 
---
 fs/ext4/super.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cc5635d00cc..03d3bbd27dae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -119,7 +119,8 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -134,7 +135,8 @@ static struct file_system_type ext3_fs_type = {
.name   = "ext3",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext3");
 MODULE_ALIAS("ext3");
@@ -2979,8 +2981,6 @@ static int ext4_lazyinit_thread(void *arg)
}
mutex_unlock(&eli->li_list_mtx);
 
-   try_to_freeze();
-
cur = jiffies;
if ((time_after_eq(cur, next_wakeup)) ||
(MAX_JIFFY_OFFSET == next_wakeup)) {
@@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
ext4_set_feature_journal_needs_recovery(sb);
}
 
-   ext4_commit_super(sb, 1);
+   ext4_commit_super(sb, 0);
return 0;
 }
 
@@ -5771,7 +5771,8 @@ static struct file_system_type ext4_fs_type = {
.name   = "ext4",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext4");
 
-- 
2.14.0



[RFC 3/5] xfs: allow fs freeze on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
This also removes the freezer calls on the XFS kthread as they are
no longer needed.

Signed-off-by: Luis R. Rodriguez 
---
 fs/xfs/xfs_super.c | 3 ++-
 fs/xfs/xfs_trans_ail.c | 7 ++-
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 64013b2db8e0..75c3415657eb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1775,7 +1775,8 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("xfs");
 
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 354368a906e5..8cebda88e91a 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -512,7 +512,6 @@ xfsaild(
longtout = 0;   /* milliseconds */
 
current->flags |= PF_MEMALLOC;
-   set_freezable();
 
while (!kthread_should_stop()) {
if (tout && tout <= 20)
@@ -535,19 +534,17 @@ xfsaild(
if (!xfs_ail_min(ailp) &&
ailp->xa_target == ailp->xa_target_prev) {
spin_unlock(&ailp->xa_lock);
-   freezable_schedule();
+   schedule();
tout = 0;
continue;
}
spin_unlock(&ailp->xa_lock);
 
if (tout)
-   freezable_schedule_timeout(msecs_to_jiffies(tout));
+   schedule_timeout(msecs_to_jiffies(tout));
 
__set_current_state(TASK_RUNNING);
 
-   try_to_freeze();
-
tout = xfsaild_push(ailp);
}
 
-- 
2.14.0



[RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
This uses the existing filesystem freeze and thaw callbacks to
freeze each filesystem on suspend/hibernation and thaw upon resume.

This is needed so that we really stop IO in flight properly, without
races, after userspace has been frozen. Without this we rely on
kthread freezing, whose semantics are loose and error prone.
For instance, even though a kthread may use try_to_freeze() and end
up being frozen, we have no way of being sure that everything that
has been spawned asynchronously from it (such as timers) has also
been stopped as well.

A long term advantage of also adding filesystem freeze / thaw
support during suspend / hibernation is that long term we may
be able to eventually drop the kernel's thread freezing completely,
as it was originally added to stop disk IO in flight as we hibernate
or suspend.

This also implies that many kthread users exist which have been
adding freezer semantics onto their kthreads without need. These also
will need to be reviewed later.

This is based on prior work originally by Rafael Wysocki and later by
Jiri Kosina.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 79 ++
 include/linux/fs.h | 13 +
 kernel/power/process.c | 14 -
 3 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index d45e92d9a38f..ce8da8b187b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1572,3 +1572,82 @@ int thaw_super(struct super_block *sb)
return 0;
 }
 EXPORT_SYMBOL(thaw_super);
+
+#ifdef CONFIG_PM_SLEEP
+static bool super_allows_freeze(struct super_block *sb)
+{
+   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
+}
+
+static bool super_should_freeze(struct super_block *sb)
+{
+   if (!sb->s_root)
+   return false;
+   if (!(sb->s_flags & MS_BORN))
+   return false;
+   /*
+* We don't freeze virtual filesystems, we skip those filesystems with
+* no backing device.
+*/
+   if (sb->s_bdi == &noop_backing_dev_info)
+   return false;
+   /* No need to freeze read-only filesystems */
+   if (sb->s_flags & MS_RDONLY)
+   return false;
+
+   if (!super_allows_freeze(sb))
+   return false;
+
+   return true;
+}
+
+int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   up_read(&sb->s_umount);
+   pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
+   error = freeze_super(sb);
+   down_read(&sb->s_umount);
+out:
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to freeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+   spin_unlock(&sb_lock);
+   return error;
+}
+
+int fs_suspend_freeze(void)
+{
+   return iterate_supers_reverse(fs_suspend_freeze_sb, NULL);
+}
+
+void fs_resume_unfreeze_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   up_read(&sb->s_umount);
+   pr_info("%s (%s): thawing\n", sb->s_type->name, sb->s_id);
+   error = thaw_super(sb);
+   down_read(&sb->s_umount);
+out:
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to unfreeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+   spin_unlock(&sb_lock);
+}
+
+void fs_resume_unfreeze(void)
+{
+   iterate_supers(fs_resume_unfreeze_sb, NULL);
+}
+
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cd084792cf39..7754230d6afc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2071,6 +2071,7 @@ struct file_system_type {
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
+#define FS_FREEZE_ON_SUSPEND	16	/* should filesystem freeze on suspend? */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
@@ -2172,6 +2173,18 @@ extern int fd_statfs(int, struct kstatfs *);
 extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
+#ifdef CONFIG_PM_SLEEP
+int fs_suspend_freeze(void);
+void fs_resume_unfreeze(void);
+#else
+static inline int fs_suspend_freeze(void)
+{
+   return 0;
+}
+static inline void fs_resume_unfreeze(void)
+{
+}
+#endif
 extern bool our_mnt(struct vfsmount *mnt);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 50f25cb370c6..9d1277768312 100644
--- a/kernel/power/proces

[RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
Now that all filesystems which used to rely on kthread
freezing have been converted to filesystem freeze/thawing,
we can remove the kthread freezer.

Signed-off-by: Luis R. Rodriguez 
---
 Documentation/power/freezing-of-tasks.txt |  9 --
 drivers/xen/manage.c  |  6 
 include/linux/freezer.h   |  4 ---
 kernel/power/hibernate.c  | 10 ++-
 kernel/power/power.h  | 20 +
 kernel/power/process.c| 47 ---
 kernel/power/user.c   |  9 --
 tools/power/pm-graph/analyze_suspend.py   |  1 -
 8 files changed, 3 insertions(+), 103 deletions(-)

diff --git a/Documentation/power/freezing-of-tasks.txt 
b/Documentation/power/freezing-of-tasks.txt
index af005770e767..477559f57c95 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -71,15 +71,6 @@ Rationale behind the functions dealing with freezing and 
thawing of tasks:
 freeze_processes():
   - freezes only userspace tasks
 
-freeze_kernel_threads():
-  - freezes all tasks (including kernel threads) because we can't freeze
-kernel threads without freezing userspace tasks
-
-thaw_kernel_threads():
-  - thaws only kernel threads; this is particularly useful if we need to do
-anything special in between thawing of kernel threads and thawing of
-userspace tasks, or if we want to postpone the thawing of userspace tasks
-
 thaw_processes():
   - thaws all tasks (including kernel threads) because we can't thaw userspace
 tasks without thawing kernel threads
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03d37d2..8ca0e0c9a7d5 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -109,12 +109,6 @@ static void do_suspend(void)
goto out;
}
 
-   err = freeze_kernel_threads();
-   if (err) {
-   pr_err("%s: freeze kernel threads failed %d\n", __func__, err);
-   goto out_thaw;
-   }
-
err = dpm_suspend_start(PMSG_FREEZE);
if (err) {
pr_err("%s: dpm_suspend_start %d\n", __func__, err);
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index dd03e837ebb7..037ef3f16173 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -43,9 +43,7 @@ extern void __thaw_task(struct task_struct *t);
 
 extern bool __refrigerator(bool check_kthr_stop);
 extern int freeze_processes(void);
-extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
-extern void thaw_kernel_threads(void);
 
 /*
  * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
@@ -263,9 +261,7 @@ static inline void __thaw_task(struct task_struct *t) {}
 
 static inline bool __refrigerator(bool check_kthr_stop) { return false; }
 static inline int freeze_processes(void) { return -ENOSYS; }
-static inline int freeze_kernel_threads(void) { return -ENOSYS; }
 static inline void thaw_processes(void) {}
-static inline void thaw_kernel_threads(void) {}
 
 static inline bool try_to_freeze_nowarn(void) { return false; }
 static inline bool try_to_freeze(void) { return false; }
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index a5c36e9c56a6..7c3af084b10a 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -352,10 +352,6 @@ int hibernation_snapshot(int platform_mode)
if (error)
goto Close;
 
-   error = freeze_kernel_threads();
-   if (error)
-   goto Cleanup;
-
if (hibernation_test(TEST_FREEZER)) {
 
/*
@@ -363,13 +359,13 @@ int hibernation_snapshot(int platform_mode)
 * successful freezer test.
 */
freezer_test_done = true;
-   goto Thaw;
+   goto Cleanup;
}
 
error = dpm_prepare(PMSG_FREEZE);
if (error) {
dpm_complete(PMSG_RECOVER);
-   goto Thaw;
+   goto Cleanup;
}
 
suspend_console();
@@ -405,8 +401,6 @@ int hibernation_snapshot(int platform_mode)
platform_end(platform_mode);
return error;
 
- Thaw:
-   thaw_kernel_threads();
  Cleanup:
swsusp_free();
goto Close;
diff --git a/kernel/power/power.h b/kernel/power/power.h
index 1d2d761e3c25..333bde062e42 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -253,25 +253,7 @@ extern int pm_test_level;
 #ifdef CONFIG_SUSPEND_FREEZER
 static inline int suspend_freeze_processes(void)
 {
-   int error;
-
-   error = freeze_processes();
-   /*
-* freeze_processes() automatically thaws every task if freezing
-* fails. So we need not do anything extra upon error.
-*/
-   if (error)
-   return error;
-
-   error = freeze_kernel_threads();
-   /*
-* freeze_kernel_threads() thaws only kernel threads upon freezing
-* failure. So we ha

[RFC 1/5] fs: add iterate_supers_reverse()

2017-10-03 Thread Luis R. Rodriguez
There are use cases where we wish to traverse the superblock list
in reverse order. This particular implementation also makes it
possible to capture errors.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 43 +++
 include/linux/fs.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 02da00410de8..d45e92d9a38f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -609,6 +609,49 @@ void iterate_supers(void (*f)(struct super_block *, void 
*), void *arg)
spin_unlock(&sb_lock);
 }
 
+/**
+ * iterate_supers_reverse - call function for active superblocks in reverse
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument, in reverse order. Returns the
+ * first error returned by the function, or zero.
+ */
+int iterate_supers_reverse(int (*f)(struct super_block *, void *), void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_read(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   spin_lock(&sb_lock);
+   break;
+   }
+   }
+   up_read(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
 /**
  * iterate_supers_type - call function for superblocks of given type
  * @type: fs type
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 46796b2f956b..cd084792cf39 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3097,6 +3097,7 @@ extern struct super_block *get_active_super(struct 
block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern int iterate_supers_reverse(int (*)(struct super_block *, void *), void 
*);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
 
-- 
2.14.0



Re: [PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 22:04 +0800, Ming Lei wrote:
> Block layer need to call this function after holding
> rcu lock in a real hot path, so introduce this helper.

Since it is allowed to nest rcu_read_lock_sched() calls I don't think
that this patch is necessary.

Bart.

Re: [PATCH V8 0/8] block/scsi: safe SCSI quiescing

2017-10-03 Thread Oleksandr Natalenko
Also

Tested-by: Oleksandr Natalenko 

for whole v8.

On úterý 3. října 2017 16:03:58 CEST Ming Lei wrote:
> Hi Jens,
> 
> Please consider this patchset for V4.15, and it fixes one
> kind of long-term I/O hang issue in either block legacy path
> or blk-mq.
> 
> The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.
> 
> Once SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() just simply waits for completion of I/Os
> dispatched to SCSI stack. It isn't enough at all.
> 
> Because new requests can still be coming in, but none of the allocated
> requests can be dispatched successfully, the request pool can be
> consumed easily.
> 
> Then a request with RQF_PREEMPT can't be allocated and waits forever,
> so the system hangs forever, for example during system suspend or
> while sending SCSI domain validation in the case of transport_spi.
> 
> Both the I/O hang inside system suspend[1] and the one during SCSI
> domain validation were reported before.
> 
> This patch introduces a preempt-only mode, and solves the issue
> by allowing RQF_PREEMPT only during SCSI quiesce.
> 
> Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
> them all.
> 
> V8:
>   - fix one race as pointed out by Bart
>   - pass 'op' to blk_queue_enter() as suggested by Christoph
> 
> V7:
>   - add Reviewed-by & Tested-by
>   - one line change in patch 5 for checking preempt request
> 
> V6:
>   - borrow Bart's idea of preempt only, with clean
> implementation(patch 5/patch 6)
>   - needn't any external driver's dependency, such as MD's
>   change
> 
> V5:
>   - fix one tiny race by introducing blk_queue_enter_preempt_freeze()
>   given this change is small enough compared with V4, I added
>   tested-by directly
> 
> V4:
>   - reorganize patch order to make it more reasonable
>   - support nested preempt freeze, as required by SCSI transport spi
>   - check preempt freezing in slow path of of blk_queue_enter()
>   - add "SCSI: transport_spi: resume a quiesced device"
>   - wake up freeze queue in setting dying for both blk-mq and legacy
>   - rename blk_mq_[freeze|unfreeze]_queue() in one patch
>   - rename .mq_freeze_wq and .mq_freeze_depth
>   - improve comment
> 
> V3:
>   - introduce q->preempt_unfreezing to fix one bug of preempt freeze
>   - call blk_queue_enter_live() only when queue is preempt frozen
>   - cleanup a bit on the implementation of preempt freeze
>   - only patch 6 and 7 are changed
> 
> V2:
>   - drop the 1st patch in V1 because percpu_ref_is_dying() is
>   enough as pointed by Tejun
>   - introduce preempt version of blk_[freeze|unfreeze]_queue
>   - sync between preempt freeze and normal freeze
>   - fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013&r=3&w=2
> 
> 
> Thanks,
> Ming
> 
> Bart Van Assche (1):
>   block: Convert RQF_PREEMPT into REQ_PREEMPT
> 
> Ming Lei (7):
>   blk-mq: only run hw queues for blk-mq
>   block: tracking request allocation with q_usage_counter
>   block: pass 'op' to blk_queue_enter()
>   percpu-refcount: introduce __percpu_ref_tryget_live
>   blk-mq: return if queue is frozen via current blk_freeze_queue_start
>   block: support PREEMPT_ONLY
>   SCSI: set block queue at preempt only when SCSI device is put into
> quiesce
> 
>  block/blk-core.c| 66 +
>  block/blk-mq-debugfs.c  |  2 +-
>  block/blk-mq.c  | 26 
>  block/blk-mq.h  |  1 -
>  block/blk-timeout.c |  2 +-
>  block/blk.h |  2 +-
>  drivers/ide/ide-atapi.c |  3 +-
>  drivers/ide/ide-io.c|  2 +-
>  drivers/ide/ide-pm.c|  4 +--
>  drivers/scsi/scsi_lib.c | 31 +++
>  fs/block_dev.c  |  4 +--
>  include/linux/blk-mq.h  |  4 +--
>  include/linux/blk_types.h   |  6 
>  include/linux/blkdev.h  | 10 ---
>  include/linux/percpu-refcount.h | 27 ++---
>  15 files changed, 137 insertions(+), 53 deletions(-)




Re: [Nbd] [PATCH] MAINTAINERS: update list for NBD

2017-10-03 Thread Jens Axboe
On 09/22/2017 04:09 AM, Wouter Verhelst wrote:
> nbd-gene...@sourceforge.net becomes n...@other.debian.org, because
> sourceforge is just a spamtrap these days.

Applied for 4.14, thanks.

-- 
Jens Axboe



Re: [Nbd] [PATCH] MAINTAINERS: update list for NBD

2017-10-03 Thread Wouter Verhelst
ping?

On Fri, Sep 22, 2017 at 12:09:54PM +0200, Wouter Verhelst wrote:
> nbd-gene...@sourceforge.net becomes n...@other.debian.org, because
> sourceforge is just a spamtrap these days.
> 
> Signed-off-by: Wouter Verhelst 
> Reviewed-by: Josef Bacik 
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fbb269415f06..2f8824995a93 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9326,7 +9326,7 @@ NETWORK BLOCK DEVICE (NBD)
>  M:   Josef Bacik 
>  S:   Maintained
>  L:   linux-block@vger.kernel.org
> -L:   nbd-gene...@lists.sourceforge.net
> +L:   n...@other.debian.org
>  F:   Documentation/blockdev/nbd.txt
>  F:   drivers/block/nbd.c
>  F:   include/uapi/linux/nbd.h
> -- 
> 2.14.1
> 
> 

-- 
Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008
 Hacklab


Re: [PATCH] block: drop "sending ioctl to a partition" message

2017-10-03 Thread Paolo Bonzini
On 21/09/2017 16:49, Paolo Bonzini wrote:
> After the first few months, the message has not led to many bug reports.
> It's been almost five years now, and in practice the main source of
> it seems to be MTIOCGET that someone is using to detect tape devices.
> While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
> just removes the message altogether.
> 
> Signed-off-by: Paolo Bonzini 

Ping (with fixed email address for Jens)...

Paolo


Re: [PATCH] block,bfq: Disable writeback throttling

2017-10-03 Thread lee . tibbert

Luca,

Please feel free to add:

  Tested-by: Lee Tibbert 

The short story is that, with one small commenting issue, the patch is
fine on my system.  The small issue is that block/blk-wbt.c line 657 says:

Disable wbt, if enabled by default. Only called from CFQ.

As of this patch, that comment is no longer true.  Adding "& BFQ" is
the obvious edit, but simply deleting the "Only" sentence is probably
more robust. IMHO, that should not hold up acceptance of this patch.

Longer story:

  In addition to running this for almost three weeks now, I explicitly
  tested a number of paths with CONFIG_BLK_WBT enabled and
  disabled, etc. Details upon request.

  In particular, I tested that when CONFIG_BLK_WBT is enabled
  sys/block/sda/queue/wbt_lat_usec is cleared when the elevator
  is set to BFQ(BFQ-MQ). Patch does what it purports.

  I also tested that wbt_lat_usec gets set back to its default when
  an elevator other than BFQ is selected after BFQ is active. (Yeah,
  the new elevator probably does this. There is no code in BFQ to do it).

  This patch makes it easier to get writeback settings right when using
  cgroups.  With cgroups, one wants CONFIG_BLK_DEV_THROTTLING on
  but CONFIG_BLK_WBT off. This patch painlessly assures the latter
  and reduces, at least my, confusion.

Thank you for this patch. I hope that it is accepted for linux-block
as time allows.

Lee


Re: [PATCH v5 0/8] block, scsi, md: Improve suspend and resume

2017-10-03 Thread Oleksandr Natalenko
Hello.

I can confirm that this patchset applied on top of v4.14-rc3 also (like Ming's 
one) fixes the issue with suspend/resume+RAID10, reported by me previously.

Tested-by: Oleksandr Natalenko 

On úterý 3. října 2017 0:52:10 CEST Bart Van Assche wrote:
> Hello Jens,
> 
> It is known that during the resume following a hibernate, especially when
> using an md RAID1 array created on top of SCSI devices, sometimes the
> system hangs instead of coming up properly. This patch series fixes this
> problem. This patch series is an alternative for Ming Lei's "block/scsi:
> safe SCSI quiescing" patch series. The advantages of this patch series are
> that a fix for the md driver is included, that no third argument has been
> added to blk_get_request() and that there is no race between
> blk_get_request() and SCSI quiescing.
> 
> These patches have been tested on top of a merge of the block layer
> for-next branch and Linus' master tree. Linus' master tree includes
> patch "KVM: x86: Fix the NULL pointer parameter in check_cr_write()"
> but the block layer for-next branch not yet.
> 
> Please consider these changes for kernel v4.15.
> 
> Thanks,
> 
> Bart.
> 
> Changes between v4 and v5:
> - Split blk_set_preempt_only() into two functions as requested by Christoph.
> - Made blk_get_request() trigger WARN_ONCE() if it is attempted to allocate
> a request while the system is frozen. This is a big help to identify
> drivers that submit I/O while the system is frozen.
> - Since kernel thread freezing is on its way out, reworked the approach for
>   avoiding that the md driver submits I/O while the system is frozen such
> that the approach no longer depends on the kernel thread freeze mechanism.
> - Fixed the (theoretical) deadlock in scsi_device_quiesce() that was
> identified by Ming.
> 
> Changes between v3 and v4:
> - Made sure that this patch series not only works for scsi-mq but also for
>   the legacy SCSI stack.
> - Removed an smp_rmb()/smp_wmb() pair from the hot path and added a comment
>   that explains why that is safe.
> - Reordered the patches in this patch series.
> 
> Changes between v2 and v3:
> - Made md kernel threads freezable.
> - Changed the approach for quiescing SCSI devices again.
> - Addressed Ming's review comments.
> 
> Changes compared to v1 of this patch series:
> - Changed the approach and rewrote the patch series.
> 
> Bart Van Assche (7):
>   md: Introduce md_stop_all_writes()
>   md: Neither resync nor reshape while the system is frozen
>   block: Convert RQF_PREEMPT into REQ_PREEMPT
>   block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
>   scsi: Reduce suspend latency
>   scsi: Set QUEUE_FLAG_PREEMPT_ONLY while quiesced
>   block: Make SCSI device suspend and resume work reliably
> 
> Ming Lei (1):
>   block: Make q_usage_counter also track legacy requests
> 
>  block/blk-core.c  | 67 ++-
>  block/blk-mq-debugfs.c|  2 +-
>  block/blk-mq.c| 14 +++---
>  block/blk-timeout.c   |  2 +-
>  drivers/ide/ide-atapi.c   |  3 +--
>  drivers/ide/ide-io.c  |  2 +-
>  drivers/ide/ide-pm.c  |  4 +--
>  drivers/md/md.c   | 52 +++-
>  drivers/scsi/scsi_lib.c   | 43 --
>  fs/block_dev.c|  4 +--
>  include/linux/blk_types.h |  6 +
>  include/linux/blkdev.h| 11 +---
>  12 files changed, 160 insertions(+), 50 deletions(-)




Re: [PATCH 2/2] block: fix peeking requests during PM

2017-10-03 Thread h...@lst.de
On Tue, Oct 03, 2017 at 03:13:43PM +, Bart Van Assche wrote:
> > +   switch (rq->q->rpm_status) {
> > +   case RPM_SUSPENDED:
> > +   return false;
> > +   case RPM_ACTIVE:
> > +   return rq->rq_flags & RQF_PM;
> > +   default:
> > +   return true;
> > +   }
> >  }
> 
> Hello Christoph,
> 
> The old code does not process non-PM requests in the RPM_SUSPENDING mode nor
> in the RPM_RESUMING code. I think the new code processes all requests in
> these two modes. Was that behavior change intended?

Yeah, looks like this version has the RPM_ACTIVE and default cases
switched.  Back to the drawing board.


Re: [PATCH 2/2] block: fix peeking requests during PM

2017-10-03 Thread Bart Van Assche
On Tue, 2017-10-03 at 10:47 +0200, Christoph Hellwig wrote:
> -static struct request *blk_pm_peek_request(struct request_queue *q,
> -struct request *rq)
> +static bool blk_pm_allow_request(struct request *rq)
>  {
> - if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
> - (q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM
> - return NULL;
> - else
> - return rq;
> + if (!rq->q->dev)
> + return true;
> +
> + switch (rq->q->rpm_status) {
> + case RPM_SUSPENDED:
> + return false;
> + case RPM_ACTIVE:
> + return rq->rq_flags & RQF_PM;
> + default:
> + return true;
> + }
>  }

Hello Christoph,

The old code does not process non-PM requests in the RPM_SUSPENDING mode nor
in the RPM_RESUMING code. I think the new code processes all requests in
these two modes. Was that behavior change intended?

Thanks,

Bart.

Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Hans Holmberg
Thanks for the review Andrey, i'll send a new patch removing the lock.


Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Javier González
> On 3 Oct 2017, at 16.43, Javier González  wrote:
> 
>> On 3 Oct 2017, at 16.20, Andrey Ryabinin  wrote:
>> 
>> 
>> 
>> On 10/03/2017 05:11 PM, Javier González wrote:
 On 3 Oct 2017, at 16.07, Andrey Ryabinin  wrote:
 
 
 
 On 10/03/2017 04:48 PM, Hans Holmberg wrote:
> From: Hans Holmberg 
> 
> The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
> call in atomic context (unless the call came from an interrupt) and
> introduced vfree_atomic that is safe to call in atomic context.
> 
> So, since we're holding locks when freeing line metadata, we need to
> use the atomic version of vfree.
> 
> Fix this by introducing an atomic variant of pblk_mfree and
> switching to that in pblk_line_meta_free.
> 
> Signed-off-by: Hans Holmberg 
> ---
> 
> The patch is for:
> https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
> 
> drivers/lightnvm/pblk-init.c | 3 ++-
> drivers/lightnvm/pblk.h  | 8 
> 2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index c452478..3a191a6 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
>   spin_lock(&l_mg->free_lock);
 
 What's the point in holding ->free_lock here? It seems like it could be 
 just dropped.
>>> 
>>> This lock can indeed be dropped,
>> 
>> So, let's do this. This would be the best way to fix this.
>> 
>>> but the general pblk semaphore, which
>>> serializes initialization and tear down cannot. This is taken on
>>> pblk_exit().
>> 
>> But semaphore is not the problem here. We can sleep under semaphore, so it's 
>> fine.
> 
> It seems to me like a false positive, but lockdep complains on the
> mentioned rw_semaphore held by pblk, and on the mutex held by the
> lightnvm subsystem when removing a target (dev->mlock).
> 
> [ 6037.778889] BUG: sleeping function called from invalid context at 
> mm/vmalloc.c:1492
> [ 6037.786579] in_atomic(): 1, irqs_disabled(): 0, pid: 1282, name: nvme
> [ 6037.793050] 3 locks held by nvme/1282:
> [ 6037.793053]  #0:  (&dev->mlock){+.+.+.}, at: [] 
> nvm_ctl_ioctl+0x3c5/0x6a0
> [ 6037.793075]  #1:  (pblk_lock){+.+.+.}, at: [] 
> pblk_exit+0x1b/0x100
> [ 6037.793092]  #2:  (&(&l_mg->free_lock)->rlock){+.+...}, at: 
> [] pblk_line_meta_free+0x8a/0x130
> 
> Any ideas?
> 

Ok. When dropping ->free_lock, lockdep does not complain. It's just a
misleading notification from lockdep, which reports semaphores as "held
locks" when a real non-sleeping lock is being taken.

We will just remove ->free_lock then.

Thanks Andrey.

Javier





Re: RQF_PM fix/cleanup

2017-10-03 Thread Jens Axboe
On 10/03/2017 02:46 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series fixed the blocking/passing of requests during PM IFF there were
> more than one.  That currently isn't the case, but relying on that not only
> is fragile, but also leads to more obscure code than handling it properly.

Looks good to me, added for 4.15.

-- 
Jens Axboe



Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Javier González
> On 3 Oct 2017, at 16.20, Andrey Ryabinin  wrote:
> 
> 
> 
> On 10/03/2017 05:11 PM, Javier González wrote:
>>> On 3 Oct 2017, at 16.07, Andrey Ryabinin  wrote:
>>> 
>>> 
>>> 
>>> On 10/03/2017 04:48 PM, Hans Holmberg wrote:
 From: Hans Holmberg 
 
 The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
 call in atomic context (unless the call came from an interrupt) and
 introduced vfree_atomic that is safe to call in atomic context.
 
 So, since we're holding locks when freeing line metadata, we need to
 use the atomic version of vfree.
 
 Fix this by introducing an atomic variant of pblk_mfree and
 switching to that in pblk_line_meta_free.
 
 Signed-off-by: Hans Holmberg 
 ---
 
 The patch is for:
 https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
 
 drivers/lightnvm/pblk-init.c | 3 ++-
 drivers/lightnvm/pblk.h  | 8 
 2 files changed, 10 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
 index c452478..3a191a6 100644
 --- a/drivers/lightnvm/pblk-init.c
 +++ b/drivers/lightnvm/pblk-init.c
 @@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
spin_lock(&l_mg->free_lock);
>>> 
>>> What's the point in holding ->free_lock here? It seems like it could be 
>>> just dropped.
>> 
>> This lock can indeed be dropped,
> 
> So, let's do this. This would be the best way to fix this.
> 
>> but the general pblk semaphore, which
>> serializes initialization and tear down cannot. This is taken on
>> pblk_exit().
> 
> But semaphore is not the problem here. We can sleep under semaphore, so it's 
> fine.
> 

It seems to me like a false positive, but lockdep complains on the
mentioned rw_semaphore held by pblk, and on the mutex held by the
lightnvm subsystem when removing a target (dev->mlock).

[ 6037.778889] BUG: sleeping function called from invalid context at 
mm/vmalloc.c:1492
[ 6037.786579] in_atomic(): 1, irqs_disabled(): 0, pid: 1282, name: nvme
[ 6037.793050] 3 locks held by nvme/1282:
[ 6037.793053]  #0:  (&dev->mlock){+.+.+.}, at: [] 
nvm_ctl_ioctl+0x3c5/0x6a0
[ 6037.793075]  #1:  (pblk_lock){+.+.+.}, at: [] 
pblk_exit+0x1b/0x100
[ 6037.793092]  #2:  (&(&l_mg->free_lock)->rlock){+.+...}, at: 
[] pblk_line_meta_free+0x8a/0x130

Any ideas?

Thanks,
Javier







Re: [PATCH BUGFIX/IMPROVEMENT 0/4] block, bfq: series of fixes of bugs affecting service guarantees

2017-10-03 Thread Jens Axboe
On 09/21/2017 03:03 AM, Paolo Valente wrote:
> Hi,
> the first patch in this series fixes a bug that causes bfq to fail to
> guarantee a high responsiveness on some drives, if there is heavy
> random read+write I/O in the background. More precisely, such a
> failure allowed this bug to be found [1], but the bug may well cause
> other yet unreported anomalies.
> 
> This fix uncovered other bugs that were concealed by the fixed bug,
> for rather subtle reasons. These further bugs caused similar
> responsiveness failures, but with sequential read+write workloads in
> the background. The remaining three patches fix these further bugs.
> 
> The sum of these fixes makes responsiveness much stabler with BFQ. In
> the presence of write hogs, it is however still impossible for an I/O
> scheduler to guarantee perfect responsiveness in any circumstance,
> because of throttling issues in the virtual-memory management
> subsystem, and in other higher-level components.

Added for 4.15, thanks.

-- 
Jens Axboe



Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Andrey Ryabinin


On 10/03/2017 05:11 PM, Javier González wrote:
>> On 3 Oct 2017, at 16.07, Andrey Ryabinin  wrote:
>>
>>
>>
>> On 10/03/2017 04:48 PM, Hans Holmberg wrote:
>>> From: Hans Holmberg 
>>>
>>> The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
>>> call in atomic context (unless the call came from an interrupt) and
>>> introduced vfree_atomic that is safe to call in atomic context.
>>>
>>> So, since we're holding locks when freeing line metadata, we need to
>>> use the atomic version of vfree.
>>>
>>> Fix this by introducing an atomic variant of pblk_mfree and
>>> switching to that in pblk_line_meta_free.
>>>
>>> Signed-off-by: Hans Holmberg 
>>> ---
>>>
>>> The patch is for:
>>> https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
>>>
>>> drivers/lightnvm/pblk-init.c | 3 ++-
>>> drivers/lightnvm/pblk.h  | 8 
>>> 2 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>>> index c452478..3a191a6 100644
>>> --- a/drivers/lightnvm/pblk-init.c
>>> +++ b/drivers/lightnvm/pblk-init.c
>>> @@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
>>> spin_lock(&l_mg->free_lock);
>>
>> What's the point in holding ->free_lock here? It seems like it could be just 
>> dropped.
>>
> 
> This lock can indeed be dropped,

So, let's do this. This would be the best way to fix this.

> but the general pblk semaphore, which
> serializes initialization and tear down cannot. This is taken on
> pblk_exit().
> 

But semaphore is not the problem here. We can sleep under semaphore, so it's 
fine.

> Javier
> 


Re: [PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Tejun Heo
Hello,

On Tue, Oct 03, 2017 at 10:04:03PM +0800, Ming Lei wrote:
> Block layer need to call this function after holding
> rcu lock in a real hot path, so introduce this helper.

The patch description is too anemic.  It doesn't even describe what
changes are being made, let alone justifying them.

Thanks.

-- 
tejun


Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Javier González
> On 3 Oct 2017, at 16.07, Andrey Ryabinin  wrote:
> 
> 
> 
> On 10/03/2017 04:48 PM, Hans Holmberg wrote:
>> From: Hans Holmberg 
>> 
>> The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
>> call in atomic context (unless the call came from an interrupt) and
>> introduced vfree_atomic that is safe to call in atomic context.
>> 
>> So, since we're holding locks when freeing line metadata, we need to
>> use the atomic version of vfree.
>> 
>> Fix this by introducing an atomic variant of pblk_mfree and
>> switching to that in pblk_line_meta_free.
>> 
>> Signed-off-by: Hans Holmberg 
>> ---
>> 
>> The patch is for:
>> https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
>> 
>> drivers/lightnvm/pblk-init.c | 3 ++-
>> drivers/lightnvm/pblk.h  | 8 
>> 2 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>> index c452478..3a191a6 100644
>> --- a/drivers/lightnvm/pblk-init.c
>> +++ b/drivers/lightnvm/pblk-init.c
>> @@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
>>  spin_lock(&l_mg->free_lock);
> 
> What's the point in holding ->free_lock here? It seems like it could be just 
> dropped.
> 

This lock can indeed be dropped, but the general pblk semaphore, which
serializes initialization and tear down cannot. This is taken on
pblk_exit().

Javier




Re: [PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Andrey Ryabinin


On 10/03/2017 04:48 PM, Hans Holmberg wrote:
> From: Hans Holmberg 
> 
> The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
> call in atomic context (unless the call came from an interrupt) and
> introduced vfree_atomic that is safe to call in atomic context.
> 
> So, since we're holding locks when freeing line metadata, we need to
> use the atomic version of vfree.
> 
> Fix this by introducing an atomic variant of pblk_mfree and
> switching to that in pblk_line_meta_free.
> 
> Signed-off-by: Hans Holmberg 
> ---
> 
> The patch is for:
> https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
> 
>  drivers/lightnvm/pblk-init.c | 3 ++-
>  drivers/lightnvm/pblk.h  | 8 
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index c452478..3a191a6 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
>   spin_lock(&l_mg->free_lock);

What's the point in holding ->free_lock here? It seems like it could be just 
dropped.

>   for (i = 0; i < PBLK_DATA_LINES; i++) {
>   kfree(l_mg->sline_meta[i]);
> - pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
> + pblk_mfree_atomic(l_mg->eline_meta[i]->buf,
> +   l_mg->emeta_alloc_type);
>   kfree(l_mg->eline_meta[i]);
>   }
>   spin_unlock(&l_mg->free_lock);
> diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> index 03965da..93f98e3 100644
> --- a/drivers/lightnvm/pblk.h
> +++ b/drivers/lightnvm/pblk.h
> @@ -881,6 +881,14 @@ static inline void pblk_mfree(void *ptr, int type)
>   vfree(ptr);
>  }
>  
> +static inline void pblk_mfree_atomic(void *ptr, int type)
> +{
> + if (type == PBLK_KMALLOC_META)
> + kfree(ptr);
> + else
> + vfree_atomic(ptr);
> +}
> +
>  static inline struct nvm_rq *nvm_rq_from_c_ctx(void *c_ctx)
>  {
>   return c_ctx - sizeof(struct nvm_rq);
> 


[PATCH V8 1/8] blk-mq: only run hw queues for blk-mq

2017-10-03 Thread Ming Lei
blk_mq_run_hw_queues() only makes sense for blk-mq, so this patch just
makes that explicit by checking q->mq_ops first.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Reviewed-by: Johannes Thumshirn 
Cc: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..6fd9f86fc86d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



[PATCH V8 3/8] block: Convert RQF_PREEMPT into REQ_PREEMPT

2017-10-03 Thread Ming Lei
From: Bart Van Assche 

This patch does not change any functionality but makes the
REQ_PREEMPT flag available to blk_get_request(). A later patch
will add code to blk_get_request() that checks the REQ_PREEMPT
flag. Note: the IDE sense_rq request is allocated statically so
there is no blk_get_request() call that corresponds to this
request.

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq-debugfs.c|  2 +-
 drivers/ide/ide-atapi.c   |  3 +--
 drivers/ide/ide-io.c  |  2 +-
 drivers/ide/ide-pm.c  |  4 ++--
 drivers/scsi/scsi_lib.c   | 11 ++-
 include/linux/blk_types.h |  6 ++
 include/linux/blkdev.h|  3 ---
 7 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..62ac248a4984 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -266,6 +266,7 @@ static const char *const cmd_flag_name[] = {
CMD_FLAG_NAME(BACKGROUND),
CMD_FLAG_NAME(NOUNMAP),
CMD_FLAG_NAME(NOWAIT),
+   CMD_FLAG_NAME(PREEMPT),
 };
 #undef CMD_FLAG_NAME
 
@@ -279,7 +280,6 @@ static const char *const rqf_name[] = {
RQF_NAME(MIXED_MERGE),
RQF_NAME(MQ_INFLIGHT),
RQF_NAME(DONTPREP),
-   RQF_NAME(PREEMPT),
RQF_NAME(COPY_USER),
RQF_NAME(FAILED),
RQF_NAME(QUIET),
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 14d1e7d9a1d6..1258739d5fa1 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -211,9 +211,8 @@ void ide_prep_sense(ide_drive_t *drive, struct request *rq)
}
 
sense_rq->rq_disk = rq->rq_disk;
-   sense_rq->cmd_flags = REQ_OP_DRV_IN;
+   sense_rq->cmd_flags = REQ_OP_DRV_IN | REQ_PREEMPT;
ide_req(sense_rq)->type = ATA_PRIV_SENSE;
-   sense_rq->rq_flags |= RQF_PREEMPT;
 
req->cmd[0] = GPCMD_REQUEST_SENSE;
req->cmd[4] = cmd_len;
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 3a234701d92c..06ffccd0fb10 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -539,7 +539,7 @@ void do_ide_request(struct request_queue *q)
 */
if ((drive->dev_flags & IDE_DFLAG_BLOCKED) &&
ata_pm_request(rq) == 0 &&
-   (rq->rq_flags & RQF_PREEMPT) == 0) {
+   (rq->cmd_flags & REQ_PREEMPT) == 0) {
/* there should be no pending command at this point */
ide_unlock_port(hwif);
goto plug_device;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f8d2709dcd56 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
}
 
memset(&rqpm, 0, sizeof(rqpm));
-   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN | REQ_PREEMPT,
+__GFP_RECLAIM);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-   rq->rq_flags |= RQF_PREEMPT;
rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..62f905b22821 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -253,8 +253,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
int ret = DRIVER_ERROR << 24;
 
req = blk_get_request(sdev->request_queue,
-   data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+ (data_direction == DMA_TO_DEVICE ?
+  REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN) | REQ_PREEMPT,
+ __GFP_RECLAIM);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -268,7 +269,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
-   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+   req->rq_flags |= rq_flags | RQF_QUIET;
 
/*
 * head injection *required* here otherwise quiesce won't work
@@ -1301,7 +1302,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct 
request *req)
/*
 * If the devices is blocked we defer normal commands.
 */
-   if (!(req->rq_flags & RQF_PREEMPT))
+   if (!(req->cmd_flags & REQ_PREEMPT))
ret = BLKPREP_DEFER;
break;
default:
@@ -1310,7 +1311,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct 
request *req)
 * special commands.  In particular any

[PATCH V8 2/8] block: tracking request allocation with q_usage_counter

2017-10-03 Thread Ming Lei
This usage is basically the same as in blk-mq, so that we can
easily support freezing the legacy queue.

Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved
into blk_set_queue_dying() since both legacy and blk-mq
may wait on the wait queue of .mq_freeze_wq.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Reviewed-by: Hannes Reinecke 
Cc: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 15 +++
 block/blk-mq.c   |  7 ---
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4aa6024..900eaa8eefa6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /*
+* We need to ensure that processes currently waiting on
+* the queue are notified as well.
+*/
+   wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1395,16 +1401,22 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+   (op & REQ_NOWAIT));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1576,6 +1588,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1857,8 +1870,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fd9f86fc86d..10c1f49f663d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-   wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5
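The pairing the patch above establishes (every legacy request allocation takes a reference on q->q_usage_counter via blk_queue_enter(), dropped again when the request is freed, with a freeze holding off new entries) can be sketched as a small user-space model. This is an illustrative toy, not kernel code; the `toy_*` names are hypothetical stand-ins, and the real kernel uses a percpu_ref plus a waitqueue rather than a plain failing check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: every request allocation takes a reference on a queue
 * usage counter (blk_queue_enter()), the reference is dropped when
 * the request is freed (blk_queue_exit() in __blk_put_request()),
 * and a frozen queue refuses new entries. */
struct toy_queue {
    atomic_int usage;   /* in-flight request allocations */
    atomic_bool frozen; /* models mq_freeze_depth > 0 */
};

static bool toy_queue_enter(struct toy_queue *q)
{
    if (atomic_load(&q->frozen))
        return false; /* the kernel would wait on mq_freeze_wq */
    atomic_fetch_add(&q->usage, 1);
    return true;
}

static void toy_queue_exit(struct toy_queue *q)
{
    atomic_fetch_sub(&q->usage, 1);
}
```

Freezing the legacy queue then amounts to setting `frozen` and waiting for `usage` to drop back to zero, exactly the wait that blk_mq_freeze_queue_wait() performs.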



[PATCH V8 5/8] percpu-refcount: introduce __percpu_ref_tryget_live

2017-10-03 Thread Ming Lei
The block layer needs to call this function while already holding
the RCU lock in a real hot path, so introduce this helper.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Signed-off-by: Ming Lei 
---
 include/linux/percpu-refcount.h | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index c13dceb87b60..a0f22586a28d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -221,6 +221,21 @@ static inline bool percpu_ref_tryget(struct percpu_ref 
*ref)
return ret;
 }
 
+static inline bool __percpu_ref_tryget_live(struct percpu_ref *ref)
+{
+   unsigned long __percpu *percpu_count;
+   bool ret = false;
+
+   if (__ref_is_percpu(ref, &percpu_count)) {
+   this_cpu_inc(*percpu_count);
+   ret = true;
+   } else if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) {
+   ret = atomic_long_inc_not_zero(&ref->count);
+   }
+
+   return ret;
+}
+
 /**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
@@ -238,18 +253,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref 
*ref)
  */
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
-   unsigned long __percpu *percpu_count;
-   bool ret = false;
+   bool ret;
 
rcu_read_lock_sched();
-
-   if (__ref_is_percpu(ref, &percpu_count)) {
-   this_cpu_inc(*percpu_count);
-   ret = true;
-   } else if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) {
-   ret = atomic_long_inc_not_zero(&ref->count);
-   }
-
+   ret = __percpu_ref_tryget_live(ref);
rcu_read_unlock_sched();
 
return ret;
-- 
2.9.5
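The refactoring above can be modelled in user space: the lock-less body is split out so a caller that already holds rcu_read_lock_sched() avoids a redundant lock/unlock pair. The sketch below is a toy under the assumption that one shared counter stands in for the real per-CPU counters; all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_REF_DEAD 1u /* models __PERCPU_REF_DEAD */

struct toy_ref {
    bool percpu_mode;         /* models __ref_is_percpu() being true */
    unsigned int flags;       /* TOY_REF_DEAD once the ref is killed */
    atomic_long percpu_count; /* one counter stands in for per-CPU */
    atomic_long count;        /* the shared atomic fallback */
};

/* models atomic_long_inc_not_zero() */
static bool toy_inc_not_zero(atomic_long *v)
{
    long old = atomic_load(v);

    while (old != 0)
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    return false;
}

/* The helper introduced by the patch: the caller must already be
 * inside the RCU read-side critical section. */
static bool toy___ref_tryget_live(struct toy_ref *ref)
{
    if (ref->percpu_mode) {
        atomic_fetch_add(&ref->percpu_count, 1); /* fast path */
        return true;
    }
    if (!(ref->flags & TOY_REF_DEAD))
        return toy_inc_not_zero(&ref->count);    /* slow path */
    return false;
}

/* The original entry point, now a thin RCU wrapper. */
static bool toy_ref_tryget_live(struct toy_ref *ref)
{
    bool ret;

    /* rcu_read_lock_sched() would go here */
    ret = toy___ref_tryget_live(ref);
    /* rcu_read_unlock_sched() would go here */
    return ret;
}
```

With this split, blk_queue_enter() in a later patch can take the RCU lock once, call the `__` variant, and perform its preempt-only check inside the same critical section.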



[PATCH V8 8/8] SCSI: set block queue at preempt only when SCSI device is put into quiesce

2017-10-03 Thread Ming Lei
Simply quiescing a SCSI device and waiting for completion of the I/O
dispatched to the SCSI queue isn't safe: it is easy to use up the
request pool, because none of the previously allocated requests can
be dispatched once the device is put in QUIESCE. Then no request
can be allocated for RQF_PREEMPT, and the system may hang somewhere,
such as when sending sync_cache or start_stop commands during the
system suspend path.

Before quiescing SCSI, this patch sets the block queue to preempt-only
mode first, so no new normal request can enter the queue any more,
and all pending requests are drained too once blk_set_preempt_only(true)
returns. Then RQF_PREEMPT requests can be allocated successfully during
SCSI quiescing.

This patch fixes a long-term I/O hang issue, in both the block legacy
path and blk-mq.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: sta...@vger.kernel.org
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 62f905b22821..f7ffd33a283c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2929,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
int err;
 
+   /*
+* Simply quiescing the SCSI device isn't safe: it is easy
+* to use up requests because none of the allocated requests
+* can be dispatched when the device is put in QUIESCE.
+* Then no request can be allocated and we may hang
+* somewhere, such as during system suspend/resume.
+*
+* So we set the block queue to preempt-only first: no new
+* normal request can enter the queue any more, and all pending
+* requests are drained once blk_set_preempt_only()
+* returns. Only RQF_PREEMPT is allowed in preempt-only mode.
+*/
+   blk_set_preempt_only(sdev->request_queue, true);
+
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
mutex_unlock(&sdev->state_mutex);
 
-   if (err)
+   if (err) {
+   blk_set_preempt_only(sdev->request_queue, false);
return err;
+   }
 
scsi_run_queue(sdev->request_queue);
while (atomic_read(&sdev->device_busy)) {
@@ -2965,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
mutex_unlock(&sdev->state_mutex);
+
+   blk_set_preempt_only(sdev->request_queue, false);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5
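The ordering this patch establishes (preempt-only set before the device state change, rolled back on failure, and cleared again on resume) can be sketched as a toy state machine. The `toy_*` names and the `state_change_fails` knob are hypothetical, added only to exercise the error path; real code calls scsi_device_set_state() under state_mutex.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sdev {
    bool preempt_only;       /* models QUEUE_FLAG_PREEMPT_ONLY */
    bool quiesced;           /* models SDEV_QUIESCE */
    bool state_change_fails; /* knob to exercise the error path */
};

/* Models scsi_device_quiesce(): gate the queue first, then change
 * the device state, and undo the gate if the state change fails. */
static int toy_sdev_quiesce(struct toy_sdev *d)
{
    d->preempt_only = true;      /* blk_set_preempt_only(q, true) */
    if (d->state_change_fails) {
        d->preempt_only = false; /* roll back on error */
        return -1;
    }
    d->quiesced = true;
    return 0;
}

/* Models scsi_device_resume(): clear the gate last. */
static void toy_sdev_resume(struct toy_sdev *d)
{
    d->quiesced = false;
    d->preempt_only = false;     /* blk_set_preempt_only(q, false) */
}
```

Because the gate goes up before the state change and comes down only on error or resume, normal requests can never pile up against a quiesced device.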



[PATCH V8 7/8] block: support PREEMPT_ONLY

2017-10-03 Thread Ming Lei
When the queue is in PREEMPT_ONLY mode, only REQ_PREEMPT requests
can be allocated and dispatched; other requests won't be allowed
to enter the I/O path.

This is useful for supporting safe SCSI quiesce.

Part of this patch is from Bart's '[PATCH v4 4/7] block: Add the
QUEUE_FLAG_PREEMPT_ONLY request queue flag'.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 44 +---
 include/linux/blkdev.h |  5 +
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1bb566245d37..7849cc1687bc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,34 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (preempt_only)
+   queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   else
+   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   /*
+* The synchronize_rcu() implied in blk_mq_freeze_queue()
+* or the explicit one will make sure the above write on
+* PREEMPT_ONLY is observed in blk_queue_enter() before
+* running blk_mq_unfreeze_queue().
+*
+* blk_mq_freeze_queue() also drains up any request in queue,
+* so blk_queue_enter() will see the above updated value of
+* PREEMPT flag before any new allocation.
+*/
+   if (!blk_mq_freeze_queue(q))
+   synchronize_rcu();
+
+   blk_mq_unfreeze_queue(q);
+}
+EXPORT_SYMBOL(blk_set_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
@@ -771,8 +799,16 @@ int blk_queue_enter(struct request_queue *q, unsigned int 
op)
while (true) {
int ret;
 
-   if (percpu_ref_tryget_live(&q->q_usage_counter))
-   return 0;
+   rcu_read_lock_sched();
+   if (__percpu_ref_tryget_live(&q->q_usage_counter)) {
+   if (likely((op & REQ_PREEMPT) ||
+   !blk_queue_preempt_only(q))) {
+   rcu_read_unlock_sched();
+   return 0;
+   } else
+   percpu_ref_put(&q->q_usage_counter);
+   }
+   rcu_read_unlock_sched();
 
if (op & REQ_NOWAIT)
return -EBUSY;
@@ -787,7 +823,9 @@ int blk_queue_enter(struct request_queue *q, unsigned int 
op)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   (!atomic_read(&q->mq_freeze_depth) &&
+   ((op & REQ_PREEMPT) ||
+!blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4c688385d866..66d46d9eac29 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -628,6 +628,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk 
*/
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY29  /* only process REQ_PREEMPT 
requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -732,6 +733,10 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.9.5
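Logically, the gate this patch adds to blk_queue_enter() reduces to one predicate: a request gets through if it carries REQ_PREEMPT or the queue is not in preempt-only mode. A minimal user-space sketch, with `toy_*` names and the flag value as hypothetical stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_REQ_PREEMPT (1u << 0) /* stands in for the real REQ_PREEMPT bit */

struct toy_queue {
    bool preempt_only; /* models QUEUE_FLAG_PREEMPT_ONLY */
};

/* Models the check added to blk_queue_enter(): while the queue is
 * preempt-only, non-preempt requests are turned away (the kernel
 * would instead put the caller to sleep on mq_freeze_wq). */
static bool toy_queue_enter(const struct toy_queue *q, unsigned int op)
{
    return (op & TOY_REQ_PREEMPT) || !q->preempt_only;
}
```

The real implementation evaluates this predicate twice: once inside the RCU section after a successful ref tryget, and again in the wait_event condition, so a waiter wakes up as soon as either the freeze ends or preempt-only is cleared.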



[PATCH V8 6/8] blk-mq: return if queue is frozen via current blk_freeze_queue_start

2017-10-03 Thread Ming Lei
We need this return value in the following patch to decide
if an explicit synchronize_rcu() is needed.

Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 13 -
 block/blk-mq.h |  1 -
 block/blk.h|  2 +-
 include/linux/blk-mq.h |  4 ++--
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf89549074bb..80d6dd99d7f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,7 +118,7 @@ void blk_mq_in_flight(struct request_queue *q, struct 
hd_struct *part,
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
+bool blk_freeze_queue_start(struct request_queue *q)
 {
int freeze_depth;
 
@@ -127,7 +127,9 @@ void blk_freeze_queue_start(struct request_queue *q)
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
+   return true;
}
+   return false;
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
@@ -150,7 +152,7 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-void blk_freeze_queue(struct request_queue *q)
+bool blk_freeze_queue(struct request_queue *q)
 {
/*
 * In the !blk_mq case we are only calling this to kill the
@@ -159,17 +161,18 @@ void blk_freeze_queue(struct request_queue *q)
 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
 * exported to drivers as the only user for unfreeze is blk_mq.
 */
-   blk_freeze_queue_start(q);
+   bool ret = blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q);
+   return ret;
 }
 
-void blk_mq_freeze_queue(struct request_queue *q)
+bool blk_mq_freeze_queue(struct request_queue *q)
 {
/*
 * ...just an alias to keep freeze and unfreeze actions balanced
 * in the blk_mq_* namespace
 */
-   blk_freeze_queue(q);
+   return blk_freeze_queue(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..41044a8662ca 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -26,7 +26,6 @@ struct blk_mq_ctx {
 } cacheline_aligned_in_smp;
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..fa1b33d3fcbd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -65,7 +65,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request 
*rq,
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
 void __blk_queue_free_tags(struct request_queue *q);
-void blk_freeze_queue(struct request_queue *q);
+bool blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..d6195925e6c3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -253,9 +253,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool 
async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv);
-void blk_mq_freeze_queue(struct request_queue *q);
+bool blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
-void blk_freeze_queue_start(struct request_queue *q);
+bool blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
-- 
2.9.5



[PATCH V8 4/8] block: pass 'op' to blk_queue_enter()

2017-10-03 Thread Ming Lei
We need to check if the request to be allocated is allowed in
PREEMPT_ONLY mode, and have to pass the REQ_PREEMPT flag to
blk_queue_enter(), so pass 'op' to blk_queue_enter() directly.

Cc: Bart Van Assche 
Suggested-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 11 ++-
 block/blk-mq.c |  3 ++-
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  2 +-
 5 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 900eaa8eefa6..1bb566245d37 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -766,7 +766,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+int blk_queue_enter(struct request_queue *q, unsigned int op)
 {
while (true) {
int ret;
@@ -774,7 +774,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
-   if (nowait)
+   if (op & REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -1408,8 +1408,8 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
-   (op & REQ_NOWAIT));
+   ret = blk_queue_enter(q, (gfp_mask & __GFP_DIRECT_RECLAIM) ? op :
+   op | REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -1436,6 +1436,7 @@ struct request *blk_get_request(struct request_queue *q, 
unsigned int op,
req = blk_mq_alloc_request(q, op,
(gfp_mask & __GFP_DIRECT_RECLAIM) ?
0 : BLK_MQ_REQ_NOWAIT);
+
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
@@ -2216,7 +2217,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bio->bi_disk->queue;
 
-   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 10c1f49f663d..bf89549074bb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,7 +384,8 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+   ret = blk_queue_enter(q, !(flags & BLK_MQ_REQ_NOWAIT) ? op :
+   op | REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..071c6de9154d 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
 
-   if (blk_queue_enter(q, true))
+   if (blk_queue_enter(q, REQ_NOWAIT))
return;
spin_lock_irqsave(q->queue_lock, flags);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..98cf2d7ee9d3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t 
sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
 
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t 
sector,
 
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab6b6c0392c2..4c688385d866 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -960,7 +960,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct 
gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 struct scsi_ioctl_command __user *);
 
-extern int blk_queue_enter(struct request_queue *q, bool nowait);
+extern int blk_queue_enter(struct request_queue *q, unsigned int op);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 

[PATCH V8 0/8] block/scsi: safe SCSI quiescing

2017-10-03 Thread Ming Lei
Hi Jens,

Please consider this patchset for v4.15; it fixes one
kind of long-term I/O hang issue in both the block legacy path
and blk-mq.

The current SCSI quiesce isn't safe and easily triggers I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os
already dispatched to the SCSI stack. That isn't enough at all.

New requests can still be coming in, but none of the allocated
requests can be dispatched successfully, so the request pool can
easily be consumed up.

Then a request with RQF_PREEMPT can't be allocated and waits forever,
so the system hangs forever, such as during system suspend or
SCSI domain validation in the case of transport_spi.

I/O hangs both during system suspend [1] and during SCSI domain
validation have been reported before.

This patchset introduces a preempt-only mode, and solves the issue
by allowing only RQF_PREEMPT requests during SCSI quiesce.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset
fixes them both.

V8:
- fix one race as pointed out by Bart
- pass 'op' to blk_queue_enter() as suggested by Christoph

V7:
- add Reviewed-by & Tested-by
- one line change in patch 5 for checking preempt request

V6:
- borrow Bart's idea of preempt only, with clean
  implementation(patch 5/patch 6)
- needn't any external driver's dependency, such as MD's
change

V5:
- fix one tiny race by introducing blk_queue_enter_preempt_freeze()
given this change is small enough compared with V4, I added
tested-by directly

V4:
- reorganize patch order to make it more reasonable
- support nested preempt freeze, as required by SCSI transport spi
- check preempt freezing in the slow path of blk_queue_enter()
- add "SCSI: transport_spi: resume a quiesced device"
- wake up freeze queue in setting dying for both blk-mq and legacy
- rename blk_mq_[freeze|unfreeze]_queue() in one patch
- rename .mq_freeze_wq and .mq_freeze_depth
- improve comment

V3:
- introduce q->preempt_unfreezing to fix one bug of preempt freeze
- call blk_queue_enter_live() only when queue is preempt frozen
- cleanup a bit on the implementation of preempt freeze
- only patch 6 and 7 are changed

V2:
- drop the 1st patch in V1 because percpu_ref_is_dying() is
enough, as pointed out by Tejun
- introduce preempt version of blk_[freeze|unfreeze]_queue
- sync between preempt freeze and normal freeze
- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming

Bart Van Assche (1):
  block: Convert RQF_PREEMPT into REQ_PREEMPT

Ming Lei (7):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  block: pass 'op' to blk_queue_enter()
  percpu-refcount: introduce __percpu_ref_tryget_live
  blk-mq: return if queue is frozen via current blk_freeze_queue_start
  block: support PREEMPT_ONLY
  SCSI: set block queue at preempt only when SCSI device is put into
quiesce

 block/blk-core.c| 66 +
 block/blk-mq-debugfs.c  |  2 +-
 block/blk-mq.c  | 26 
 block/blk-mq.h  |  1 -
 block/blk-timeout.c |  2 +-
 block/blk.h |  2 +-
 drivers/ide/ide-atapi.c |  3 +-
 drivers/ide/ide-io.c|  2 +-
 drivers/ide/ide-pm.c|  4 +--
 drivers/scsi/scsi_lib.c | 31 +++
 fs/block_dev.c  |  4 +--
 include/linux/blk-mq.h  |  4 +--
 include/linux/blk_types.h   |  6 
 include/linux/blkdev.h  | 10 ---
 include/linux/percpu-refcount.h | 27 ++---
 15 files changed, 137 insertions(+), 53 deletions(-)

-- 
2.9.5



Re: [PATCH BUGFIX/IMPROVEMENT 0/4] block, bfq: series of fixes of bugs affecting service guarantees

2017-10-03 Thread lee . tibbert
On Thu, Sep 21, 2017 at 5:03 AM, Paolo Valente 
wrote:

> Hi,
> the first patch in this series fixes a bug that causes bfq to fail to
> guarantee a high responsiveness on some drives, if there is heavy
> random read+write I/O in the background. More precisely, such a
> failure allowed this bug to be found [1], but the bug may well cause
> other yet unreported anomalies.
>
> This fix uncovered other bugs that were concealed by the fixed bug,
> for rather subtle reasons. These further bugs caused similar
> responsiveness failures, but with sequential read+write workloads in
> the background. The remaining three patches fix these further bugs.
>
> The sum of these fixes makes responsiveness much stabler with BFQ. In
> the presence of write hogs, it is however still impossible for an I/O
> scheduler to guarantee perfect responsiveness in any circumstance,
> because of throttling issues in the virtual-memory management
> subsystem, and in other higher-level components.
>
> Thanks,
> Paolo
>
> [1] Background I/O Type: Random - Background I/O mix: Reads and writes
> - Application to start: LibreOffice Writer in
> http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop
>
>
> Paolo Valente (4):
>   block, bfq: fix wrong init of saved start time for weight raising
>   block, bfq: check and switch back to interactive wr also on queue
> split
>   block, bfq: let early-merged queues be weight-raised on split too
>   block, bfq: decrease burst size when queues in burst exit
>
>  block/bfq-iosched.c | 169 ++
> +-
>  1 file changed, 102 insertions(+), 67 deletions(-)
>
> --
> 2.10.0
>

Paolo,

Please feel free to add:

  Tested-by: Lee Tibbert 

I thought that you and linux-block readers might be interested
in a data point. I have been running the algodev variant of these
patches for about two weeks. Now I am running these exact patches.

At one point, I had Firefox running in my main window and, in other
windows, a full Linux build, a full scala-native build, and a git gc. I
barely noticed the builds going on.  The latency variance in
Firefox was well within my normal network variance. Usually any
one of the build jobs would have made Firefox useless.

Pretty amazing!  In fact, it was so amazing that I did not trust my
memory several days later, so I repeated the full test.
Same result: greatly reduced latency.

There being no such thing as a free lunch, the build jobs did
take longer than if executed standalone. However, I did not mind
because I was able to continue getting work done on the tasks
I needed in interactive time.

I hope these patches are accepted into the Linux mainline, and
I will continue to exercise them.

Lee


Re: [PATCH 0/9] recovery robustness improvements

2017-10-03 Thread Hans Holmberg
Cheers!

Thanks,
Hans

On Tue, Oct 3, 2017 at 12:51 PM, Javier González  wrote:
>> On 3 Oct 2017, at 12.05, Hans Holmberg  
>> wrote:
>>
>> From: Hans Holmberg 
>>
>> This patchset improves the robustness of recovery - fixing a bunch of
>> issues that occurs when removing and re-creating a pblk instance.
>> It also adds a couple of debug-only prints to facilitate detection
>> of L2P table corruptions.
>>
>> The patches apply on top of:
>> https://github.com/OpenChannelSSD/linux branch for-4.15/pblk
>>
>> Hans Holmberg (9):
>>  lightnvm: pblk: prevent gc kicks when gc is not operational
>>  lightnvm: pblk: recover partially written lines correctly
>>  lightnvm: pblk: free full lines during recovery
>>  lightnvm: pblk: start gc if needed during init
>>  lightnvm: pblk: consider bad sectors in emeta during recovery
>>  lightnvm: pblk: shut down gc gracefully during exit
>>  lightnvm: pblk: add l2p crc debug printouts
>>  lightnvm: pblk: gc all lines in the pipeline before exit
>>  lightnvm: pblk: correct valid lba count calculation
>>
>> drivers/lightnvm/pblk-core.c |  3 ++
>> drivers/lightnvm/pblk-gc.c   | 88 
>> +++-
>> drivers/lightnvm/pblk-init.c | 47 ++---
>> drivers/lightnvm/pblk-map.c  |  7 ++--
>> drivers/lightnvm/pblk-recovery.c | 46 ++---
>> drivers/lightnvm/pblk-sysfs.c|  2 +-
>> drivers/lightnvm/pblk.h  |  6 ++-
>> 7 files changed, 145 insertions(+), 54 deletions(-)
>>
>> --
>> 2.7.4
>
> Thanks Hans.
>
> Matias: I picked the patches on for-4.15/pblk on top of the rest of the
> series and fixed some of the commit messages. Also, note that Rakesh's
> patches needed some re-work, so please pick them from this tree instead.
>
> They should apply clean on your for-4.15/core.
>
> Thanks,
> Javier


[PATCH] lightnvm: pblk: use vfree_atomic when freeing line metadata

2017-10-03 Thread Hans Holmberg
From: Hans Holmberg 

The commit bf22e37a6413 ("mm: add vfree_atomic()") made vfree unsafe to
call in atomic context (unless the call came from an interrupt) and
introduced vfree_atomic that is safe to call in atomic context.

So, since we're holding locks when freeing line metadata, we need to
use the atomic version of vfree.

Fix this by introducing an atomic variant of pblk_mfree and
switching to that in pblk_line_meta_free.

Signed-off-by: Hans Holmberg 
---

The patch is for:
https://github.com/OpenChannelSSD/linux branch for-4.15/pblk

 drivers/lightnvm/pblk-init.c | 3 ++-
 drivers/lightnvm/pblk.h  | 8 
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index c452478..3a191a6 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -396,7 +396,8 @@ static void pblk_line_meta_free(struct pblk *pblk)
spin_lock(&l_mg->free_lock);
for (i = 0; i < PBLK_DATA_LINES; i++) {
kfree(l_mg->sline_meta[i]);
-   pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
+   pblk_mfree_atomic(l_mg->eline_meta[i]->buf,
+ l_mg->emeta_alloc_type);
kfree(l_mg->eline_meta[i]);
}
spin_unlock(&l_mg->free_lock);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 03965da..93f98e3 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -881,6 +881,14 @@ static inline void pblk_mfree(void *ptr, int type)
vfree(ptr);
 }
 
+static inline void pblk_mfree_atomic(void *ptr, int type)
+{
+   if (type == PBLK_KMALLOC_META)
+   kfree(ptr);
+   else
+   vfree_atomic(ptr);
+}
+
 static inline struct nvm_rq *nvm_rq_from_c_ctx(void *c_ctx)
 {
return c_ctx - sizeof(struct nvm_rq);
-- 
2.7.4



Re: [PATCH BUGFIX/IMPROVEMENT 0/4] block, bfq: series of fixes of bugs affecting service guarantees

2017-10-03 Thread Lee Tibbert
On Thu, Sep 21, 2017 at 5:03 AM, Paolo Valente 
wrote:

> Hi,
> the first patch in this series fixes a bug that causes bfq to fail to
> guarantee a high responsiveness on some drives, if there is heavy
> random read+write I/O in the background. More precisely, such a
> failure allowed this bug to be found [1], but the bug may well cause
> other yet unreported anomalies.
>
> This fix uncovered other bugs that were concealed by the fixed bug,
> for rather subtle reasons. These further bugs caused similar
> responsiveness failures, but with sequential read+write workloads in
> the background. The remaining three patches fix these further bugs.
>
> The sum of these fixes makes responsiveness much stabler with BFQ. In
> the presence of write hogs, it is however still impossible for an I/O
> scheduler to guarantee perfect responsiveness in any circumstance,
> because of throttling issues in the virtual-memory management
> subsystem, and in other higher-level components.
>
> Thanks,
> Paolo
>
> [1] Background I/O Type: Random - Background I/O mix: Reads and writes
> - Application to start: LibreOffice Writer in
> http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop
>
>
> Paolo Valente (4):
>   block, bfq: fix wrong init of saved start time for weight raising
>   block, bfq: check and switch back to interactive wr also on queue
> split
>   block, bfq: let early-merged queues be weight-raised on split too
>   block, bfq: decrease burst size when queues in burst exit
>
>  block/bfq-iosched.c | 169 ++
> +-
>  1 file changed, 102 insertions(+), 67 deletions(-)
>
> --
> 2.10.0
>

Paolo,

Please feel free to add:

  Tested-by: Lee Tibbert 

I thought that you and linux-block readers might be interested
in a data point. I have been running the algodev variant of these
patches for about two weeks. Now I am running these exact patches.

At one point, I had firefox running in my main windows, and, in other
windows, a full linux build, a full scala-native build, and a git gc. I
barely noticed the builds going on.  The variance of latency in
firefox was well within my normal network variance. Usually any
one of the build jobs would have made firefox useless.

Pretty amazing!  In fact, it was so amazing that I did not trust my
memory several days later, so I repeated the full test.
Same result:  greatly reduced latency.

There being no such thing as a free lunch, the build jobs did
take longer than if executed standalone. However, I did not mind
because I was able to continue getting work done on the tasks
I needed in interactive time.

I hope these patches are accepted into Linux mainline, and I will
continue to exercise them.

Lee

