Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list

2018-12-07 Thread Bart Van Assche
On Fri, 2018-12-07 at 09:35 -0700, Jens Axboe wrote:
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 29bfe8017a2d..9e5bda8800f8 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -377,6 +377,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  
>   WARN_ON(e && (rq->tag != -1));
>  
> + /*
> +  * It's illegal to insert a request into the scheduler that has
> +  * been through ->queue_rq(). Warn for that case, and use a bypass
> +  * insert to be safe.
> +  */

Shouldn't this refer to requests that have been prepared instead of requests
that have been through ->queue_rq()? I think this function is called for
requests that are requeued. Requeued requests have been through ->queue_rq()
but are unprepared before being requeued.

Thanks,

Bart.


Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list

2018-12-07 Thread Bart Van Assche
On Thu, 2018-12-06 at 22:17 -0700, Jens Axboe wrote:
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to use
> as no merging can take place there. Additionally, if requests do have
> prepped data from drivers, we aren't dependent on them not sharing space
> in the request structure to safely add them to the IO scheduler lists.

How about making blk_mq_sched_insert_request() complain if a request is passed
to it with the RQF_DONTPREP flag set, to prevent this problem from being
reintroduced in the future? Otherwise this patch looks fine to me.
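
Something like the following untested check is what I have in mind:

	/*
	 * Requests that have already been prepared by the driver
	 * (RQF_DONTPREP set) must not be inserted into the I/O scheduler.
	 */
	WARN_ON_ONCE(rq->rq_flags & RQF_DONTPREP);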

Bart.


Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote:
> On 12/6/18 5:20 PM, Bart Van Assche wrote:
> > Which branch does that tag correspond to? Even after having run git fetch
> > --tags I can't find that tag in your repository.
> 
> I pushed it before I sent the email, where are you looking?
> 
> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206

On my development system git has been configured to fetch from the following 
repository:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/

Should I fetch your branches from the git.kernel.dk repository instead?

Thanks,

Bart.


Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote:
> Just a single followup fix to the corruption fix from yesterday. We have
> an exported interface that does direct dispatch, with DM being the sole
> user of it. Change that to do bypass insert always instead of attempting
> direct dispatch. This fixes a deadlock that can happen on DM.
> Additionally, it gets rid of any exported user of direct dispatch, and
> enables DM to drop any BUSY handling from using the interface.
> 
> Please pull!
> 
> 
>   git://git.kernel.dk/linux-block.git tags/for-linus-20181206

Hi Jens,

Which branch does that tag correspond to? Even after having run git fetch
--tags I can't find that tag in your repository. Additionally, the patch on
your for-linus branch is missing the "Cc: stable" and "Reported-by:" tags
that you had promised to add. Did I look at the wrong branch?

Thanks,

Bart.


Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 15:21 -0700, Jens Axboe wrote:
> On 12/6/18 3:20 PM, Jens Axboe wrote:
> > After the direct dispatch corruption fix, we permanently disallow direct
> > dispatch of non read/write requests. This works fine off the normal IO
> > path, as they will be retried like any other failed direct dispatch
> > request. But for the blk_insert_cloned_request() that only DM uses to
> > bypass the bottom level scheduler, we always first attempt direct
> > dispatch. For some types of requests, that's now a permanent failure,
> > and no amount of retrying will make that succeed.
> > 
> > Don't use direct dispatch off the cloned insert path, always just use
> > bypass inserts. This still bypasses the bottom level scheduler, which is
> > what DM wants.
> > 
> > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> > Signed-off-by: Jens Axboe 
> 
> Bart, I'll add your reported-by here of course, and also a stable CC
> since the original patch went into stable.

Feel free to add the following:

Tested-by: Bart Van Assche 



Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
> If I merge Jens' for-next branch with Linus' master branch, boot the
> resulting kernel in a VM and run blktests/tests/srp/002 then that test
> never finishes. The same test passes against Linus' master branch. I
> think this is a regression. The following appears in the system log if
> I run that test:
> 
> Call Trace:
> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
> Call Trace:
> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2151 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2154 blocked for more than 120 seconds.

Hi Jens,

My test results so far are as follows:
* With kernel v4.20-rc5 test srp/002 passes.
* With your for-next branch test srp/002 reports the symptoms reported in my
  e-mail.
* With Linus' master branch from this morning test srp/002 fails in the same
  way as your for-next branch.
* Also with Linus' master branch, test srp/002 passes if I revert the following
  commit: ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems
  like that commit fixed one regression but introduced another regression.

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 11:12 -0700, Jens Axboe wrote:
> On 12/6/18 11:10 AM, Bart Van Assche wrote:
> > On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> > > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > > > which would result in a non-zero exit, which should be expected for this
> > > > test?
> > > 
> > > Test srp/002 simulates network failures while running fio on top of 
> > > dm-mpath.
> > > Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
> > > retrying
> > > the block layer requests it receives if these requests fail due to path 
> > > removal.
> > > If fio reports "io_u error" for this test then that means that the test 
> > > failed.
> > > I haven't seen fio reporting any I/O errors for this test with any 
> > > upstream
> > > kernel that I tested in the past year or so.
> > 
> > Hi Jens,
> > 
> > Please also verify that the version of multipath-tools that you are running
> > includes the following patch:
> 
> I installed from apt... Says this:
> 
> # apt show multipath-tools
> Package: multipath-tools
> Version: 0.6.4-5+deb9u1
> 
> I can run a self-compiled one, if this one doesn't work.

Please do. I use the following script to build and install multipath-tools
over the Debian binaries:

export LIB=/lib
apt-get install -y libaio-dev libdevmapper-dev libjson-c-dev librados-dev libreadline-dev libsystemd-dev liburcu-dev
make -s
make -s install

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > which would result in a non-zero exit, which should be expected for this
> > test?
> 
> Test srp/002 simulates network failures while running fio on top of dm-mpath.
> Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
> retrying
> the block layer requests it receives if these requests fail due to path 
> removal.
> If fio reports "io_u error" for this test then that means that the test 
> failed.
> I haven't seen fio reporting any I/O errors for this test with any upstream
> kernel that I tested in the past year or so.

Hi Jens,

Please also verify that the version of multipath-tools that you are running
includes the following patch:

From a2675025ae9f652b005345b9082f5042b32c992c Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Fri, 18 Nov 2016 13:33:44 -0800
Subject: [PATCH] Avoid that reloading a map sporadically triggers I/O errors

Avoid that reloading a map while there are no paths triggers a flush
and hence unwanted I/O errors if 'queue_if_no_path' is enabled.

Fixes: commit d569988e7528 ("libmultipath: Fixup 'DM_DEVICE_RELOAD' handling")
Signed-off-by: Bart Van Assche 
---
 libmultipath/devmapper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
index 9b6b0537ba6d..1576dd017d6a 100644
--- a/libmultipath/devmapper.c
+++ b/libmultipath/devmapper.c
@@ -392,7 +392,7 @@ int dm_addmap_reload(struct multipath *mpp, char *params, int flush)
 			      params, ADDMAP_RO, SKIP_KPARTX_OFF);
 	}
 	if (r)
-		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, flush,
+		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
 				 1, udev_flags, 0);
 	return r;
 }



Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> which would result in a non-zero exit, which should be expected for this
> test?

Hi Jens,

Test srp/002 simulates network failures while running fio on top of dm-mpath.
Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep retrying
the block layer requests it receives if these requests fail due to path removal.
If fio reports "io_u error" for this test then that means that the test failed.
I haven't seen fio reporting any I/O errors for this test with any upstream
kernel that I tested in the past year or so.

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote:
> On 12/6/18 9:47 AM, Bart Van Assche wrote:
> > If I merge Jens' for-next branch with Linus' master branch, boot the
> > resulting kernel in a VM and run blktests/tests/srp/002 then that test
> > never finishes. The same test passes against Linus' master branch. I
> > think this is a regression. The following appears in the system log if
> > I run that test:
> 
> You are running that test on a dm device? Can you shed some light on
> how that dm device is setup?

Hi Jens,

The dm device referred to in my e-mail is a dm-mpath device created by test
srp/002. All parameters of that device are under control of that test script.
From the srp/002 test script:

DESCRIPTION="File I/O on top of multipath concurrently with logout and login 
(mq)"

From srp/multipath.conf:

defaults {
        find_multipaths         no
        user_friendly_names     yes
        queue_without_daemon    no
}

devices {
        device {
                vendor          "LIO-ORG|SCST_BIO|FUSIONIO"
                product         ".*"
                features        "1 queue_if_no_path"
                path_checker    tur
        }
}

Bart.


for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
Hello,

If I merge Jens' for-next branch with Linus' master branch, boot the
resulting kernel in a VM and run blktests/tests/srp/002 then that test
never finishes. The same test passes against Linus' master branch. I
think this is a regression. The following appears in the system log if
I run that test:

Call Trace:
INFO: task kworker/0:1:12 blocked for more than 120 seconds.
Call Trace:
INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2151 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2154 blocked for more than 120 seconds.

My list-pending-block-requests script produces the following output:

dm-0
dm-0/hctx0/active:0
dm-0/hctx0/dispatch:b856a515 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1064, .internal_tag=-1}
dm-0/hctx0/dispatch:6a2dca0a {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1065, .internal_tag=-1}
dm-0/hctx0/dispatch:1baa5734 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1085, .internal_tag=-1}
dm-0/hctx0/dispatch:2d618841 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1025, .internal_tag=-1}
dm-0/hctx0/dispatch:c8e6e436 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1250, .internal_tag=-1}
dm-0/hctx0/dispatch:7af12ca3 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1057, .internal_tag=-1}
dm-0/hctx0/dispatch:2ad59d1e {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1060, .internal_tag=-1}
dm-0/hctx0/dispatch:51c35166 {.op=READ, .cmd_flags=, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1188, .internal_tag=-1}
dm-0/hctx0/dispatch:05da9610 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1253, .internal_tag=-1}
dm-0/hctx0/dispatch:a25483c5 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1254, .internal_tag=-1}
dm-0/hctx0/dispatch:7ef87035 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1255, .internal_tag=-1}
dm-0/hctx0/dispatch:8a8f15c0 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1261, .internal_tag=-1}
dm-0/hctx0/dispatch:d65ae959 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1263, .internal_tag=-1}
dm-0/hctx0/dispatch:88aac981 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1264, .internal_tag=-1}
dm-0/hctx0/dispatch:89489e70 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1265, .internal_tag=-1}
dm-0/hctx0/dispatch:7c69e2d2 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1154, .internal_tag=-1}
dm-0/hctx0/dispatch:e32aa9b5 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1155, .internal_tag=-1}
dm-0/hctx0/dispatch:5b88a332 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1157, .internal_tag=-1}
dm-0/hctx0/dispatch:1de14ef0 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1159, .internal_tag=-1}
dm-0/hctx0/dispatch:a380d0f3 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1161, .internal_tag=-1}
dm-0/hctx0/dispatch:c48cfa16 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1162, .internal_tag=-1}
dm-0/hctx0/dispatch:11d96382 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1163, .internal_tag=-1}
dm-0/hctx0/dispatch:b5efd5d7 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1167, .internal_tag=-1}
dm-0/hctx0/dispatch:80f5c476 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1153, .internal_tag=-1}
dm-0/hctx0/dispatch:c73c244d {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1266, .internal_tag=-1}
dm-0/hctx0/dispatch:69c43800 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1268, .internal_tag=-1}
dm-0/hctx0/dispatch:d2ec3c96 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1269, .internal_tag=-1}
dm-0/hctx0/dispatch:9b2099db {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1270, .internal_tag=-1}
dm-0/hctx0/dispatch:7d433090 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1272, .internal_tag=-1}
dm-0/hctx0/dispatch:afa7d2ea {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1274, .internal_tag=-1}
dm-0/hctx0/dispatch:c0f9d890 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1276, .internal_tag=-1}

for-next branch and throttling

2018-12-06 Thread Bart Van Assche
Hello,

If I merge Jens' for-next branch with Linus' master branch and boot the
resulting kernel in a VM then the call trace shown below appears. This call
trace does not appear when building and booting the code from Linus' master
branch. Is this perhaps a regression?

WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 blk_throtl_bio+0xc00/0x1120
Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover 
i2c_piix4 pata_acpi
CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:blk_throtl_bio+0xc00/0x1120
Call Trace:
 ? create_task_io_context+0x1a2/0x1e0
 ? lock_downgrade+0x2e0/0x2e0
 ? kasan_check_read+0x11/0x20
 ? do_raw_spin_unlock+0xa8/0x140
 ? preempt_count_sub+0x18/0xd0
 generic_make_request_checks+0x432/0xcc0
 ? trace_event_raw_event_block_rq_requeue+0x290/0x290
 ? __lock_acquire+0x5b0/0x17e0
 ? __bio_associate_blkg.isra.33+0x1d0/0x3c0
 generic_make_request+0x151/0x950
 ? blk_queue_enter+0x7e0/0x7e0
 submit_bio+0x9b/0x250
 ? submit_bio+0x9b/0x250
 ? generic_make_request+0x950/0x950
 ? guard_bio_eod+0x120/0x2d0
 submit_bh_wbc+0x2df/0x310
 block_read_full_page+0x34c/0x4c0
 ? I_BDEV+0x20/0x20
 ? __bread_gfp+0x170/0x170
 ? add_to_page_cache_locked+0x20/0x20
 ? blkdev_writepages+0x10/0x10
 ? blkdev_writepages+0x10/0x10
 blkdev_readpage+0x18/0x20
 do_read_cache_page+0x5ed/0xbf0
 ? blkdev_writepages+0x10/0x10
 ? find_valid_gpt+0x1a8/0x8b0
 ? efi_partition+0x147/0x6c1
 ? check_partition+0x1b2/0x314
 ? blkdev_get+0x1f2/0x5a0
 ? __device_add_disk+0x6c1/0x7e0
 ? device_add_disk+0x13/0x20
 ? pagecache_get_page+0x420/0x420
 ? driver_probe_device+0x118/0x180
 ? __driver_attach+0x167/0x190
 ? bus_for_each_dev+0x105/0x160
 ? driver_attach+0x2b/0x30
 ? bus_add_driver+0x23d/0x350
 ? driver_register+0xdc/0x160
 ? register_virtio_driver+0x3c/0x60
 ? init+0x51/0x1000 [virtio_blk]
 ? do_one_initcall+0xb4/0x3a1
 ? do_init_module+0x103/0x360
 ? load_module+0x443e/0x4630
 ? __do_sys_finit_module+0x12d/0x1b0
 ? __x64_sys_finit_module+0x43/0x50
 ? do_syscall_64+0x71/0x210
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
 ? match_held_lock+0x20/0x250
 ? match_held_lock+0x35/0x250
 ? blkdev_writepages+0x10/0x10
 read_cache_page+0x3a/0x50
 read_dev_sector+0x64/0x170
 read_lba+0x1f9/0x3a0
 ? msdos_partition+0x1370/0x1370
 ? find_valid_gpt+0x1a8/0x8b0
 ? kmem_cache_alloc_trace+0x151/0x340
 ? find_valid_gpt+0x1a8/0x8b0
 find_valid_gpt+0x1c6/0x8b0
 ? is_gpt_valid.part.5+0x640/0x640
 ? __alloc_pages_nodemask+0x1c4/0x4d0
 ? lockdep_hardirqs_on+0x182/0x260
 ? get_page_from_freelist+0x648/0x2070
 ? mark_lock+0x69/0x8d0
 ? trace_hardirqs_on+0x24/0x130
 ? kasan_unpoison_shadow+0x35/0x50
 ? get_page_from_freelist+0x648/0x2070
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? get_page_from_freelist+0x1eea/0x2070
 ? set_precision+0x90/0x90
 ? string+0x135/0x180
 efi_partition+0x147/0x6c1
 ? format_decode+0xce/0x550
 ? enable_ptr_key_workfn+0x30/0x30
 ? find_valid_gpt+0x8b0/0x8b0
 ? vsnprintf+0x11d/0x820
 ? pointer+0x4d0/0x4d0
 ? snprintf+0x88/0xa0
 ? snprintf+0x88/0xa0
 ? vsprintf+0x20/0x20
 ? find_valid_gpt+0x8b0/0x8b0
 check_partition+0x1b2/0x314
 rescan_partitions+0x13b/0x440
 __blkdev_get+0x61f/0x930
 ? check_disk_change+0xa0/0xa0
 ? lock_downgrade+0x2e0/0x2e0
 ? var_wake_function+0x90/0x90
 blkdev_get+0x1f2/0x5a0
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 ? bd_may_claim+0x80/0x80
 ? bdget+0x26b/0x280
 ? blkdev_writepage+0x20/0x20
 ? kasan_check_read+0x11/0x20
 ? kobject_put+0x23/0x220
 __device_add_disk+0x6c1/0x7e0
 ? blk_alloc_devt+0x150/0x150
 ? __asan_storeN+0x12/0x20
 device_add_disk+0x13/0x20
 virtblk_probe+0x839/0xace [virtio_blk]
 ? virtblk_restore+0xd0/0xd0 [virtio_blk]
 ? mutex_unlock+0x12/0x20
 ? kernfs_activate+0xd0/0xe0
 ? kasan_check_write+0x14/0x20
 ? kernfs_put+0x2c/0x280
 ? vring_transport_features+0x19/0x70
 ? vp_finalize_features+0xe6/0x160
 ? vp_get_status+0x29/0x30
 ? virtio_finalize_features+0xbe/0xe0
 virtio_dev_probe+0x27d/0x390
 really_probe+0x191/0x560
 ? driver_probe_device+0x180/0x180
 driver_probe_device+0x118/0x180
 ? driver_probe_device+0x180/0x180
 __driver_attach+0x167/0x190
 bus_for_each_dev+0x105/0x160
 ? subsys_dev_iter_exit+0x10/0x10
 ? kasan_check_read+0x11/0x20
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 driver_attach+0x2b/0x30
 bus_add_driver+0x23d/0x350
 driver_register+0xdc/0x160
 ? 0xa007
 register_virtio_driver+0x3c/0x60
 init+0x51/0x1000 [virtio_blk]
 do_one_initcall+0xb4/0x3a1
 ? trace_event_raw_event_initcall_finish+0x120/0x120
 ? kasan_unpoison_shadow+0x35/0x50
 ? __asan_register_globals+0x7b/0xa0
 do_init_module+0x103/0x360
 load_module+0x443e/0x4630
 ? module_frob_arch_sections+0x20/0x20
 ? vfs_read+0xe6/0x1e0
 ? kernel_read+0x79/0xa0
 ? kasan_check_write+0x14/0x20
 ? kernel_read_file+0x259/0x310
 ? do_mmap+0x431/0x6c0
 __do_sys_finit_module+0x12d/0x1b0
 ? 

Re: [PATCH v3] block: add documentation for io_timeout

2018-12-06 Thread Bart Van Assche
On Wed, 2018-12-05 at 22:17 +0800, Weiping Zhang wrote:
> +Description:
> + io_timeout is a request’s timeouts at block layer in
> + milliseconds. When the underlying driver starts processing
> + a request, the generic block layer will start a timer, if
> + this request cannot be completed in io_timeout milliseconds,
> + a timeout event will occur.

Sorry but I think this description is still somewhat confusing. How about
changing that description into the following?

io_timeout is the request timeout in milliseconds. If a request does not
complete in this time then the block driver timeout handler is invoked.
That timeout handler can decide to retry the request, to fail it or to
start a device recovery strategy.

Bart.


Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 22:18 +0800, Weiping Zhang wrote:
> Before this patch, even we set io_timeout to 30*HZ(default), but
> blk_rq_timeout always return jiffies +5*HZ,
>   [1]. if there no pending request in timeout list, the timer callback
> blk_rq_timed_out_timer will be called after 5*HZ, and then
> blk_mq_check_expired will check is there exist some request
> was delayed by compare jiffies and request->deadline, obvious
> request is not timeout because we set request's timeouts is 30*HZ.
> So for this case timer callback should be called at jiffies + 30 instead
> of jiffies + 5*HZ.
> 
>   [2]. if there are pending request in timeout list, we compare request's
> expiry and queue's expiry. If time_after(request->expire, queue->expire) 
> modify
> queue->timeout.expire to request->expire, otherwise do nothing.
> 
> So I think this patch just solve problem in [1], no other regression, or 
> what's
> I missing here ?

The blk_rq_timeout() function was introduced by commit 0d2602ca30e4 ("blk-mq:
improve support for shared tags maps"). I think the purpose of that function
is to make sure that the nr_active counter in struct blk_mq_hw_ctx gets updated
at least once every five seconds. So there are two problems with this patch:
- It reduces the frequency of 'nr_active' updates. I think that is wrong and
  also that it will negatively affect drivers that rely on this functionality,
  e.g. the SCSI core.
- The patch description does not match the code changes in this patch.
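
For reference, the blk_rq_timeout() implementation that this patch removes looks
roughly like this (quoting from memory, so please double-check against
block/blk-timeout.c):

static unsigned long blk_rq_timeout(unsigned long timeout)
{
	unsigned long maxt;

	/* Cap the expiry so the timer fires at least once per BLK_MAX_TIMEOUT. */
	maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
	if (time_after(timeout, maxt))
		timeout = maxt;

	return timeout;
}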

Bart.



Re: [PATCH v3] block: add documentation for io_timeout

2018-12-06 Thread Bart Van Assche
On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote:
> Weiping Zhang  wrote on Wed, Dec 5, 2018 at 10:49 PM:
> > Christoph Hellwig  wrote on Wed, Dec 5, 2018 at 10:40 PM:
> > > Can you please also send a patch to not show this attribute for
> > > drivers without a timeout handler?  Thanks!
> 
> Is there a simple way do that ?

How about checking the timeout member of struct blk_mq_ops for blk-mq and
checking the rq_timed_out_fn member in struct request_queue for the legacy
block layer?

> Shall we return -ENOTSUPP when user read/write this attribute when
> driver has no timeout handler ?

A much more elegant solution is to introduce a sysfs attribute group for the
io_timeout attribute and to make that group visible only if a timeout handler
has been defined. See e.g. disk_attr_group in block/genhd.c for an example.
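
An untested sketch of what I have in mind follows; the names below are made up
for illustration and the group would have to be registered on the queue kobject
from blk_register_queue():

static umode_t queue_io_timeout_visible(struct kobject *kobj,
					struct attribute *attr, int n)
{
	struct request_queue *q =
		container_of(kobj, struct request_queue, kobj);

	/* Only expose io_timeout if a timeout handler has been defined. */
	if (q->mq_ops ? !q->mq_ops->timeout : !q->rq_timed_out_fn)
		return 0;
	return attr->mode;
}

static struct attribute *queue_io_timeout_attrs[] = {
	&queue_io_timeout_entry.attr,	/* the attribute added by your patch */
	NULL,
};

static const struct attribute_group queue_io_timeout_group = {
	.attrs		= queue_io_timeout_attrs,
	.is_visible	= queue_io_timeout_visible,
};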

Bart.


Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-05 Thread Bart Van Assche
On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote:
> @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
>* than an existing one, modify the timer. Round up to next nearest
>* second.
>*/
> - expiry = blk_rq_timeout(round_jiffies_up(expiry));
> + expiry = round_jiffies_up(expiry);

If you would have read the comment above this code, you would have known
that this patch does not do what you think it does and additionally that
it introduces a regression.

Bart.


Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Bart Van Assche

On 12/4/18 2:00 AM, Kashyap Desai wrote:

Problem statement :
Whenever try to get outstanding request via scsi_host_find_tag,
block layer will return stale entries instead of actual outstanding
request. Kernel panic if stale entry is inaccessible or memory is reused.
Fix :
Undo the request mapping in blk_mq_put_driver_tag() once the request is returned.

More detail :
Whenever an SDEV entry is created, the block layer allocates separate tags
and static requests. Those requests are not valid after the SDEV is deleted
from the system. On the fly, the block layer maps static rqs to rqs as below
in blk_mq_get_driver_tag():

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping covers active in-use requests and it is the same mapping
that is referred to by scsi_host_find_tag().
After running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some
entries which will never be reset in the block layer.

There would be a kernel panic if a request pointed to by
"data.hctx->tags->rqs[rq->tag]" belongs to an sdev that has been removed,
since the memory allocated for the requests associated with that sdev might
have been reused or become inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops:  [#1] SMP
Call Trace:
  [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
  [] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: 
Signed-off-by: Kashyap Desai 
Signed-off-by: Sreekanth Reddy 


---
  block/blk-mq.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool
blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 struct request *rq)
  {
+hctx->tags->rqs[rq->tag] = NULL;
  blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
  rq->tag = -1;


No SCSI driver should call scsi_host_find_tag() after a request has 
finished. The above patch introduces yet another race and hence can't be 
a proper fix.


Bart.



Re: block: sbitmap related lockdep warning

2018-12-03 Thread Bart Van Assche
On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote:
> On 12/3/18 3:02 AM, Ming Lei wrote:
> > Hi,
> > 
> > Just found there is an sbitmap related lockdep warning; I have not taken a
> > close look yet, maybe it is caused by a recent sbitmap change.
> > 
> > [1] test
> > - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1
> > submit_queues=1 hw_queue_depth=1
> > - then run fio on the 4 null_blk devices
> 
> This is a false positive - lockdep thinks that ->swap_lock needs to be
> IRQ safe since it's called with IRQs disabled from the
> blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ
> context. I wonder how to teach lockdep about that...

There is probably a better solution, but one possible solution is to disable
lockdep checking for swap_lock by using lockdep_set_novalidate_class().
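
E.g. something like this in sbitmap_init_node() (untested and written from
memory, so the field names may be slightly off):

	for (i = 0; i < sb->map_nr; i++) {
		spin_lock_init(&sb->map[i].swap_lock);
		/* swap_lock is never taken from IRQ context. */
		lockdep_set_novalidate_class(&sb->map[i].swap_lock);
	}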

Bart.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-01 Thread Bart Van Assche

On 12/1/18 9:11 AM, Hannes Reinecke wrote:

On 12/1/18 5:48 PM, Christoph Hellwig wrote:

On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:

On 11/30/18 1:26 PM, Keith Busch wrote:

A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.


How about we just move the started check into the handler passed in for
those that care about it? Much saner to make the interface iterate
everything, and leave whatever state check to the callback.


So we used to do that, and I changed it back in May to test for
MQ_RQ_IN_FLIGHT, and then Ming changed it to check
blk_mq_request_started.  So this is clearly a minefield of sorts..

Note that at least mtip32xx, nbd, skd and the various nvme transports
want to use the function to terminate all requests in the error
path, and it would be great to have one single understood, documented
and debugged helper for that in the core, so this is a vote for moving
more of the logic in your second helper into the core code.  skd
will need actually use ->complete to release resources for that, though
and mtip plays some odd abort bits.  If it weren't for the interesting
abort behavior in nvme-fc that means we could even unexport the
low-level interface.


Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, set 
some error flags in the driver, then _restarting_ the queue, just so 
that the driver then sees the error flag and terminates the requests.

Which I always found quite counter-intuitive.
So having a common helper for terminating requests for queue errors 
would be very welcomed here.


But when we have that we really should audit all drivers to ensure they
do the right thing (tm).


Would calling blk_abort_request() for all outstanding requests be 
sufficient to avoid that the queue has to be stopped and restarted in 
the nvme-fc driver?


Bart.


Re: [PATCH] block: fix single range discard merge

2018-11-30 Thread Bart Van Assche
On Fri, 2018-11-30 at 10:20 -0700, Jens Axboe wrote:
> On 11/30/18 10:18 AM, Bart Van Assche wrote:
> > On Sat, 2018-12-01 at 00:38 +0800, Ming Lei wrote:
> > > Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler 
> > > attached")
> > 
> > Since this patch fixes a bug introduced in kernel v4.16, does it need
> > a "Cc: stable" tag?
> 
> Like the other one, isn't stable implied with Fixes in there? You'd want
> a stable backport for any kernel that has that patchset. I think that's
> a stronger hint than stable cc.

(+Greg KH)

Hi Greg,

Would it be possible to clarify what your preferences are for adding a
"Cc: stable" tag?

Thanks,

Bart.


Re: [PATCH 01/27] aio: fix failure to put the file pointer

2018-11-30 Thread Bart Van Assche
On Fri, 2018-11-30 at 10:08 -0700, Jens Axboe wrote:
> On 11/30/18 10:07 AM, Bart Van Assche wrote:
> > On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> > > If the ioprio capability check fails, we return without putting
> > > the file pointer.
> > > 
> > > Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
> > > Reviewed-by: Johannes Thumshirn 
> > > Reviewed-by: Christoph Hellwig 
> > > Signed-off-by: Jens Axboe 
> > > ---
> > >  fs/aio.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/aio.c b/fs/aio.c
> > > index b984918be4b7..205390c0c1bb 100644
> > > --- a/fs/aio.c
> > > +++ b/fs/aio.c
> > > @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct 
> > > iocb *iocb)
> > >   ret = ioprio_check_cap(iocb->aio_reqprio);
> > >   if (ret) {
> > >   pr_debug("aio ioprio check cap error: %d\n", ret);
> > > + fput(req->ki_filp);
> > >   return ret;
> > >   }
> > 
> > Since this patch fixes a bug that was introduced in kernel v4.18, does this
> > patch need a "Cc: stable" tag?
> 
> The fixes should take care of that by itself, I hope.

Hi Jens,

My understanding is that patches that have a "Cc: stable" tag are guaranteed to
be integrated in a stable kernel sooner or later. Without that tag it depends on
the stable kernel maintainer whether or not these patches get picked up. I think
the "AUTOSEL" tag Sasha Levin uses indicates that a patch was picked up for one
of his stable kernels and that it did not have a "Cc: stable" tag. I'm not sure
Greg KH picks up patches that only have a "Fixes:" tag but no "Cc: stable" tag.

Bart.


Re: [PATCH] block: fix single range discard merge

2018-11-30 Thread Bart Van Assche
On Sat, 2018-12-01 at 00:38 +0800, Ming Lei wrote:
> Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")

Since this patch fixes a bug introduced in kernel v4.16, does it need
a "Cc: stable" tag?

Thanks,

Bart.


Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT

2018-11-30 Thread Bart Van Assche
On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> We can't wait for polled events to complete, as they may require active
> polling from whoever submitted it. If that is the same task that is
> submitting new IO, we could deadlock waiting for IO to complete that
> this task is supposed to be completing itself.
> 
> Signed-off-by: Jens Axboe 
> ---
>  fs/block_dev.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 6de8d35f6e41..ebc3d5a0f424 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -402,8 +402,16 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  
>   nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>   if (!nr_pages) {
> - if (iocb->ki_flags & IOCB_HIPRI)
> + if (iocb->ki_flags & IOCB_HIPRI) {
>   bio->bi_opf |= REQ_HIPRI;
> + /*
> +  * For async polled IO, we can't wait for
> +  * requests to complete, as they may also be
> +  * polled and require active reaping.
> +  */
> + if (!is_sync)
> + bio->bi_opf |= REQ_NOWAIT;
> + }
>  
>   qc = submit_bio(bio);
>   WRITE_ONCE(iocb->ki_cookie, qc);

Setting REQ_NOWAIT from inside the block layer will make the code that
submits requests harder to review. Have you considered to make this code
fail I/O if REQ_NOWAIT has not been set and to require that the context
that submits I/O sets REQ_NOWAIT?

Thanks,

Bart.


Re: [PATCH 01/27] aio: fix failure to put the file pointer

2018-11-30 Thread Bart Van Assche
On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> If the ioprio capability check fails, we return without putting
> the file pointer.
> 
> Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
> Reviewed-by: Johannes Thumshirn 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> ---
>  fs/aio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index b984918be4b7..205390c0c1bb 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
>   ret = ioprio_check_cap(iocb->aio_reqprio);
>   if (ret) {
>   pr_debug("aio ioprio check cap error: %d\n", ret);
> + fput(req->ki_filp);
>   return ret;
>   }

Since this patch fixes a bug that was introduced in kernel v4.18, does this
patch need a "Cc: stable" tag?

Thanks,

Bart.


Re: [PATCH v2] block: add documentation for io_timeout

2018-11-29 Thread Bart Van Assche
On Thu, 2018-11-29 at 18:22 +0800, Weiping Zhang wrote:
> add documentation for /sys/block/<disk>/queue/io_timeout

Patch descriptions should consist of full sentences. That means that these
should start with a capital letter and end with a period.

> +
> +What:		/sys/block/<disk>/queue/io_timeout
> +Date:		November 2018
> +Contact: Weiping Zhang 
> +Description:
> + io_timeout is the timeout in milliseconds of a request in
> + block layer. The block layer will start a timer when low
> + level device driver start the request, and cancel timer
> + when request completes.

The grammar of that explanation needs to be improved ... I can spot three
grammatical issues.

Thanks,

Bart.


Re: [PATCH] block: add documentation for io_timeout

2018-11-28 Thread Bart Van Assche
On Thu, 2018-11-29 at 00:54 +0800, Weiping Zhang wrote:
> add documentation for /sys/block/<disk>/queue/io_timeout
> 
> Signed-off-by: Weiping Zhang 
> ---
>  Documentation/ABI/testing/sysfs-block | 9 +
>  Documentation/block/queue-sysfs.txt   | 6 ++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
> index dea212db9df3..f37cca16 100644
> --- a/Documentation/ABI/testing/sysfs-block
> +++ b/Documentation/ABI/testing/sysfs-block
> @@ -271,3 +271,12 @@ Description:
>   size of 512B sectors of the zones of the device, with
>   the eventual exception of the last zone of the device
>   which may be smaller.
> +
> +What:		/sys/block/<disk>/queue/io_timeout
> +Date:		November 2018
> +Contact: Weiping Zhang 
> +Description:
> + io_timeout is the timeout in millisecods of a request in
> + block layer. The block layer will start a timer when low
> + level device driver start the request, and cancle timer
> + when request completes.
> diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
> index 2c1e67058fd3..cbd44fe056fa 100644
> --- a/Documentation/block/queue-sysfs.txt
> +++ b/Documentation/block/queue-sysfs.txt
> @@ -67,6 +67,12 @@ If set to a value larger than 0, the kernel will put the process issuing
>  IO to sleep for this amont of microseconds before entering classic
>  polling.
>  
> +io_timeout (RW)
> +---
> +This is the timeout in millisecods of a request in block layer.
> +The block layer will start a timer when low level device driver start
> +the request, and cancle timer when request completes.
> +
>  iostats (RW)
>  -
>  This file is used to control (on/off) the iostats accounting of the

Please run this patch through a spelling checker. I think "millisecods" should
be changed into "milliseconds" and also that "cancle" should be changed into
"cancel".

Thanks,

Bart.


Re: [RFC PATCH v2] block: add io timeout to sysfs

2018-11-28 Thread Bart Van Assche
On Mon, 2018-11-19 at 22:11 +0800, Weiping Zhang wrote:
> Give a interface to adjust io timeout by device.
> 
> Signed-off-by: Weiping Zhang 
> ---
> 
> Changes since v1:
> * make sure timeout > 0
> 
>  block/blk-sysfs.c | 27 +++
>  1 file changed, 27 insertions(+)

Documentation for new block layer sysfs attributes should be added in
Documentation/ABI/testing/sysfs-block and also in 
Documentation/block/queue-sysfs.txt.
Please add such documentation for this new attribute.

> +static ssize_t queue_io_timeout_store(struct request_queue *q, const char 
> *page,
> +   size_t count)
> +{
> + unsigned int val;
> + int err;
> +
> + err = kstrtou32(page, 10, &val);
> + if (err || val == 0)
> + return -EINVAL;
> +
> + blk_queue_rq_timeout(q, val);
> +
> + return count;
> +}

Setting the block layer timeout to a very high value (e.g. hours) may make it
look like a request got stuck without users having an easy way of figuring out
what is going on. I'm wondering whether this function should restrict the upper
bound for block layer timeouts. How about limiting timeout values to ten minutes?
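
In other words, something like this on top of your patch (BLK_MAX_USER_TIMEOUT
is a made-up name and ten minutes is only a suggestion):

	err = kstrtou32(page, 10, &val);
	if (err || val == 0 || val > BLK_MAX_USER_TIMEOUT)
		return -EINVAL;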

Thanks,

Bart.


Re: [PATCH] kyber: fix wrong strlcpy() size in trace_kyber_latency()

2018-11-11 Thread Bart Van Assche

On 11/11/18 9:25 AM, Omar Sandoval wrote:

From: Omar Sandoval 

When copying to the latency type, we should be passing LATENCY_TYPE_LEN,
not DOMAIN_LEN. This isn't a problem in practice because we only pass
"total" or "I/O", but let's fix it.

Reported-by: Jordan Glover 
Signed-off-by: Omar Sandoval 
---
  include/trace/events/kyber.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index a9834c37ac40..7aaa298375ad 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -32,7 +32,7 @@ TRACE_EVENT(kyber_latency,
TP_fast_assign(
__entry->dev = disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
strlcpy(__entry->domain, domain, DOMAIN_LEN);
-   strlcpy(__entry->type, type, DOMAIN_LEN);
+   strlcpy(__entry->type, type, LATENCY_TYPE_LEN);
__entry->percentile  = percentile;
__entry->numerator   = numerator;
__entry->denominator = denominator;


Can both strlcpy() invocations be changed such that the third argument 
is a sizeof() expression instead of an explicit constant? I think that 
would make the Kyber tracing code easier to verify for humans.
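
I.e. something like:

	strlcpy(__entry->domain, domain, sizeof(__entry->domain));
	strlcpy(__entry->type, type, sizeof(__entry->type));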


Thanks,

Bart.



Re: [PATCH 3/6] skd_main: don't use req->special

2018-11-09 Thread Bart Van Assche

On 11/9/18 10:32 AM, Christoph Hellwig wrote:

Add a retries field to the internal request structure instead, which gets
set to zero on the first submission.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/skd_main.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a0196477165f..88f29b06d90e 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -181,6 +181,7 @@ struct skd_request_context {
struct fit_completion_entry_v1 completion;
  
  	struct fit_comp_error_info err_info;

+   int retries;
  
  	blk_status_t status;

  };
@@ -495,6 +496,11 @@ static blk_status_t skd_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(skdev->state != SKD_DRVR_STATE_ONLINE))
return skd_fail_all(q) ? BLK_STS_IOERR : BLK_STS_RESOURCE;
  
+	if (!(req->rq_flags & RQF_DONTPREP)) {

+   skreq->retries = 0;
+   req->rq_flags |= RQF_DONTPREP;
+   }
+
blk_mq_start_request(req);
  
  	WARN_ONCE(tag >= skd_max_queue_depth, "%#x > %#x (nr_requests = %lu)\n",

@@ -1426,7 +1432,7 @@ static void skd_resolve_req_exception(struct skd_device *skdev,
break;
  
  	case SKD_CHECK_STATUS_REQUEUE_REQUEST:

-   if ((unsigned long) ++req->special < SKD_MAX_RETRIES) {
+   if ((unsigned long) skreq->retries < SKD_MAX_RETRIES) {
skd_log_skreq(skdev, skreq, "retry");
blk_mq_requeue_request(req, true);
break;


Hi Christoph,

Where has the code been moved to that increments the new skreq->retries 
counter?


Thanks,

Bart.




Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not

2018-11-08 Thread Bart Van Assche
On Thu, 2018-11-08 at 10:50 -0700, Jens Axboe wrote:
> How about this incremental?
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 097e9a67d5f5..87bc5df72d48 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -248,7 +248,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   * @fn:  Pointer to the function that will be called for each 
> request
>   *   associated with @hctx that has been assigned a driver tag.
>   *   @fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
> - *   where rq is a pointer to a request.
> + *   where rq is a pointer to a request. Return true to continue
> + *   iterating tags, false to stop.
>   * @data:Will be passed as third argument to @fn.
>   * @reserved:Indicates whether @bt is the breserved_tags member or 
> the
>   *   bitmap_tags member of struct blk_mq_tags.
> @@ -301,7 +302,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   *   or the bitmap_tags member of struct blk_mq_tags.
>   * @fn:  Pointer to the function that will be called for each 
> started
>   *   request. @fn will be called as follows: @fn(rq, @data,
> - *   @reserved) where rq is a pointer to a request.
> + *   @reserved) where rq is a pointer to a request. Return true
> + *   to continue iterating tags, false to stop.
>   * @data:Will be passed as second argument to @fn.
>   * @reserved:Indicates whether @bt is the breserved_tags member or 
> the
>   *   bitmap_tags member of struct blk_mq_tags.
> @@ -326,7 +328,8 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>   * @fn:  Pointer to the function that will be called for each 
> started
>   *   request. @fn will be called as follows: @fn(rq, @priv,
>   *   reserved) where rq is a pointer to a request. 'reserved'
> - *   indicates whether or not @rq is a reserved request.
> + *   indicates whether or not @rq is a reserved request. Return
> + *   true to continue iterating tags, false to stop.
>   * @priv:Will be passed as second argument to @fn.
>   */
>  static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> @@ -343,7 +346,8 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>   * @fn:  Pointer to the function that will be called for each 
> started
>   *   request. @fn will be called as follows: @fn(rq, @priv,
>   *   reserved) where rq is a pointer to a request. 'reserved'
> - *   indicates whether or not @rq is a reserved request.
> + *   indicates whether or not @rq is a reserved request. Return
> + *   true to continue iterating tags, false to stop.
>   * @priv:Will be passed as second argument to @fn.
>   */
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,

This looks fine to me. Thanks!

Bart.


Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not

2018-11-08 Thread Bart Van Assche
On Thu, 2018-11-08 at 09:31 -0700, Jens Axboe wrote:
> On 11/8/18 9:28 AM, Bart Van Assche wrote:
> > On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote:
> > > --- a/block/blk-mq-debugfs.c
> > > +++ b/block/blk-mq-debugfs.c
> > > @@ -424,13 +424,15 @@ struct show_busy_params {
> > >   * Note: the state of a request may change while this function is in 
> > > progress,
> > >   * e.g. due to a concurrent blk_mq_finish_request() call.
> > >   */
> > > -static void hctx_show_busy_rq(struct request *rq, void *data, bool 
> > > reserved)
> > > +static bool hctx_show_busy_rq(struct request *rq, void *data, bool 
> > > reserved)
> > >  {
> > 
> > Please update the kdoc header above hctx_show_busy_rq() such that it 
> > reflects
> > the new behavior. I'm referring to the "will be called for each request" 
> > part.
> > Otherwise this patch looks fine to me.
> 
> Took a look at the comment, and what do you want changed? There's no change
> in behavior for hctx_show_busy_rq(), it loops all requests just like before.
> We just return true to ensure we continue iterating.

Oops, I added my reply below the wrong function. I wanted to refer to the
following comment above blk_mq_queue_tag_busy_iter():

 * @fn: Pointer to the function that will be called for each request

Additionally, how about similar comments above bt_for_each(), bt_tags_for_each()
blk_mq_all_tag_busy_iter() and blk_mq_tagset_busy_iter()? I think this patch
affects all these functions.

Thanks,

Bart.


Re: [PATCH 2/2] blk-mq: provide a helper to check if a queue is busy

2018-11-08 Thread Bart Van Assche
On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote:
> +static bool blk_mq_check_busy(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +   void *priv, bool reserved)
> +{
> + /*
> +  * If we find a request, we know the queue is busy. Return false
> +  * to stop the iteration.
> +  */
> + if (rq->q == hctx->queue) {
> + bool *busy = (bool *) priv;

I think the "(bool *)" cast can be left out. Anyway:

Reviewed-by: Bart Van Assche 



Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not

2018-11-08 Thread Bart Van Assche
On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote:
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -424,13 +424,15 @@ struct show_busy_params {
>   * Note: the state of a request may change while this function is in 
> progress,
>   * e.g. due to a concurrent blk_mq_finish_request() call.
>   */
> -static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
> +static bool hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
>  {

Please update the kdoc header above hctx_show_busy_rq() such that it reflects
the new behavior. I'm referring to the "will be called for each request" part.
Otherwise this patch looks fine to me.

Bart.


Re: [PATCH blktests] fix discontiguous-io compile error on 32 bit systems

2018-11-05 Thread Bart Van Assche

On 11/5/18 6:19 PM, yuyufen wrote:

I am sorry that I did not see the discussion before this.
And you are right. Please ignore this patch.


Don't worry - this can happen :-)

Bart.


Re: [PATCH blktests] fix discontiguous-io compile error on 32 bit systems

2018-11-01 Thread Bart Van Assche
On Thu, 2018-11-01 at 14:35 +0800, Yufen Yu wrote:
> When building discontiguous-io.cpp with -m32, the g++ compiler reports an
> error because std::min(long unsigned int, size_t) has different argument
> types.
> 
> fixes: fd21728886e7 ("Add the discontiguous-io test program")
> Signed-off-by: Yufen Yu 
> ---
>  src/discontiguous-io.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp
> index 5e0ee0f..855aba9 100644
> --- a/src/discontiguous-io.cpp
> +++ b/src/discontiguous-io.cpp
> @@ -291,7 +291,7 @@ int main(int argc, char **argv)
>   unsigned char *p = &*buf.begin();
>   for (int i = 0; i < len / 4; i++)
>   iov.append(p + 4 + i * 8,
> -std::min(4ul, len - i * 4));
> +std::min((size_t)4, len - i * 4));
>   } else {
>   iov.append(&*buf.begin(), buf.size());
>   }

Are you reading the messages posted on linux-block? An alternative that I like
better has been discussed in this e-mail thread:
https://www.spinics.net/lists/linux-block/msg32181.html

Bart.


Re: [PATCH blktests -v2] Fix build failure for discontiguous-io on 32-bit platforms

2018-10-30 Thread Bart Van Assche
On Tue, 2018-10-30 at 12:02 -0400, Theodore Y. Ts'o wrote:
> On Tue, Oct 30, 2018 at 08:02:55AM -0700, Bart Van Assche wrote:
> > Details about how the build fails on 32-bit systems would have been welcome
> > in the commit message. Anyway:
> > 
> > Reviewed-by: Bart Van Assche 
> 
> For the record, here's the failure.  It's the usual incomprehensible
> C++ error reporting.  :-)
> 
> [ ... ]
> discontiguous-io.cpp: In function 'int main(int, char**)':
> discontiguous-io.cpp:294:34: error: no matching function for call to 
> 'min(long unsigned int, size_t)'
>  std::min(4ul, len - i * 4));
>   ^
> [ ... ]

Thanks Ted for having provided the full compiler output. However, I don't
think that the entire error message should go in the commit message. To me,
the part of the compiler error message I cited tells the whole story :-)

Bart.


Re: [PATCH blktests -v2] Fix build failure for discontiguous-io on 32-bit platforms

2018-10-30 Thread Bart Van Assche
On Tue, 2018-10-30 at 10:36 -0400, Theodore Ts'o wrote:
> Signed-off-by: Theodore Ts'o 
> ---
>  src/discontiguous-io.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp
> index 5e0ee0f..f51a305 100644
> --- a/src/discontiguous-io.cpp
> +++ b/src/discontiguous-io.cpp
> @@ -251,7 +251,7 @@ int main(int argc, char **argv)
>   const char *dev;
>   int c;
>   std::vector<unsigned char> buf;
> - size_t len = 512;
> + unsigned long len = 512;
>  
>   while ((c = getopt(argc, argv, "hl:o:sw")) != EOF) {
>   switch (c) {

Details about how the build fails on 32-bit systems would have been welcome
in the commit message. Anyway:

Reviewed-by: Bart Van Assche 


Re: [PATCH blktests] Fix build failure for discontiguous-io on 32-bit platforms

2018-10-29 Thread Bart Van Assche
On Mon, 2018-10-29 at 17:08 -0400, Theodore Y. Ts'o wrote:
> On Mon, Oct 29, 2018 at 09:26:43AM -0700, Bart Van Assche wrote:
> > 
> > Have you considered to change the data type of 'len' from size_t into 
> > unsigned long
> > instead of inserting this cast? That would make it clear that no integer 
> > truncation
> > happens in the iov.append() call.
> 
> Well the potential integer truncation that could happen is here:
> 
> case 'l': len = strtoul(optarg, NULL, 0); break;

If the data type of 'len' would be changed into unsigned long, how could that
assignment cause integer truncation since strtoul() returns an unsigned long
value? 

> I have a sneaking suspicion that ib_srp has zero chance of
> working on 32-bit systems (which is what uses this function) so this
> was more about making sure blktests would build when I'm building the
> 32-bit version of my test appliance VM.

What makes you think that ib_srp won't work on 32-bit systems? I don't think
that anyone uses this driver on a 32-bit system. I'm not aware of any aspect
of that driver that restricts it to 64-bit systems only either.

> (For that matter, I've been banging my head against a brick wall
> trying to make the srp tests work in either a KVM or GCE environment,
> even with a 64-bit VM, but that's a different issue.  It's not high
> priority for me, at the moment, but eventually I would like the
> {kvm,gce,android}-xfstest test appliance VM's generated by the
> xfstests-bld repo to be able to run all of blktests.)

Can you be more specific about what didn't work? Were you unable to start the
tests or did a particular test fail?

> What offends me more is that later on there are scopes when len gets
> shadowed with a "ssize_t len;" definition --- I whould have thought
> gcc or clang would have complained bitterly about that, but I guess
> not.

I agree that it would be an improvement to get rid of variable shadowing.
Once that has been done the -Wshadow compiler flag can be added. I will look
into this if nobody else starts looking into this soon.

Bart.


Re: [PATCH blktests] Fix build failure for discontiguous-io on 32-bit platforms

2018-10-29 Thread Bart Van Assche
On Mon, 2018-10-29 at 12:15 -0400, Theodore Ts'o wrote:
> Signed-off-by: Theodore Ts'o 
> ---
>  src/discontiguous-io.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp
> index 5e0ee0f..a59c18d 100644
> --- a/src/discontiguous-io.cpp
> +++ b/src/discontiguous-io.cpp
> @@ -291,7 +291,7 @@ int main(int argc, char **argv)
>   unsigned char *p = &*buf.begin();
>   for (int i = 0; i < len / 4; i++)
>   iov.append(p + 4 + i * 8,
> -std::min(4ul, len - i * 4));
> +std::min(4ul, (unsigned long) len - i * 4));
>   } else {
>   iov.append(&*buf.begin(), buf.size());
>   }

Hi Ted,

Have you considered to change the data type of 'len' from size_t into unsigned 
long
instead of inserting this cast? That would make it clear that no integer 
truncation
happens in the iov.append() call.

Thanks,

Bart.


Re: [PATCH 2/5] block: move .dma_alignment into q->limits

2018-10-18 Thread Bart Van Assche
On Thu, 2018-10-18 at 21:18 +0800, Ming Lei wrote:
> Turns out q->dma_alignement should be stack limit because now bvec table
   ^^
   dma_alignment?
> is immutalbe, the underlying queue's dma alignment has to be perceptible
 ^ ^^^
 immutable?observable?
> by stack driver, so IO buffer can be allocated as dma aligned before
 ^
 stacked?

Anyway:

Reviewed-by: Bart Van Assche 



Re: [PATCH 0/5] block: introduce helpers for allocating io buffer from slab

2018-10-18 Thread Bart Van Assche
On Thu, 2018-10-18 at 07:03 -0700, Matthew Wilcox wrote:
> On Thu, Oct 18, 2018 at 09:18:12PM +0800, Ming Lei wrote:
> > Filesystems may allocate io buffer from slab, and use this buffer to
> > submit bio. This way may break storage drivers if they have special
> > requirement on DMA alignment.
> 
> Before we go down this road, could we have a discussion about what
> hardware actually requires this?  Storage has this weird assumption that
> I/Os must be at least 512 byte aligned in memory, and I don't know where
> this idea comes from.  Network devices can do arbitrary byte alignment.
> Even USB controllers can do arbitrary byte alignment.  Sure, performance
> is going to suck and there are definite risks on some architectures
> with doing IOs that are sub-cacheline aligned, but why is storage such a
> special snowflake that we assume that host controllers are only capable
> of doing 512-byte aligned DMAs?
> 
> I just dragged out the NCR53c810 data sheet from 1993, and it's capable of
> doing arbitrary alignment of DMA.  NVMe is capable of 4-byte aligned DMA.
> What hardware needs this 512 byte alignment?

How about starting with modifying the queue_dma_alignment() function? The
current implementation of that function is as follows:

static inline int queue_dma_alignment(struct request_queue *q)
{
return q ? q->dma_alignment : 511;
}

In other words, for block drivers that do not set the DMA alignment
explicitly it is assumed that these drivers need 512 byte alignment. I think
the "512 byte alignment as default" was introduced in 2002. From Thomas
Gleixner's history tree, commit ad519c6902fb:

+static inline int queue_dma_alignment(request_queue_t *q)
+{
+   int retval = 511;
+
+   if (q && q->dma_alignment)
+   retval = q->dma_alignment;
+
+   return retval;
+}
+
+static inline int bdev_dma_aligment(struct block_device *bdev)
+{
+   return queue_dma_alignment(bdev_get_queue(bdev));
+}
+

Bart.


Re: [PATCH v2] block: BFQ default for single queue devices

2018-10-17 Thread Bart Van Assche

On 10/17/18 3:05 AM, Jan Kara wrote:

Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...


What's missing in this discussion is a definition of "embedded system". 
Is that a system like a streaming player for TV channels that neither 
has a keyboard nor a display or a system that can run multiple apps 
simultaneously like a smartphone? I think the difference matters because 
some embedded devices hardly do any background I/O nor load any 
executable code from storage after boot. So at least for some embedded 
devices the problem discussed in this e-mail thread does not exist.


Bart.


Re: [PATCH v2] block: BFQ default for single queue devices

2018-10-15 Thread Bart Van Assche
On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort

For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch. 

Thanks,

Bart.



[PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-10-04 Thread Bart Van Assche
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Martin K. Petersen 
---
 block/blk-mq-debugfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a5ea86835fcb..41b86f50d126 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -431,8 +431,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
*data, bool reserved)
 {
const struct show_busy_params *params = data;
 
-   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
__blk_mq_debugfs_rq_show(params->m,
				 list_entry_rq(&rq->queuelist));
 }
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-10-03 Thread Bart Van Assche
On Wed, 2018-10-03 at 16:12 -0600, Jens Axboe wrote:
> On 10/3/18 3:42 PM, Bart Van Assche wrote:
> > On Fri, 2018-01-12 at 22:11 +0000, Bart Van Assche wrote:
> > >  /*
> > > + * Show "busy" requests - these are the requests owned by the block 
> > > driver.
> > > + * The test list_empty(&rq->queuelist) is used to figure out whether or 
> > > not
> > > + * a request is owned by the block driver. That test works because the 
> > > block
> > > + * layer core uses list_del_init() consistently to remove a request from 
> > > one
> > > + * of the request lists.
> > > + *
> > >   * Note: the state of a request may change while this function is in 
> > > progress,
> > >   * e.g. due to a concurrent blk_mq_finish_request() call.
> > >   */
> > > @@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, 
> > > void *data, bool reserved)
> > >   const struct show_busy_params *params = data;
> > >  
> > >   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> > > - blk_mq_rq_state(rq) != MQ_RQ_IDLE)
> > > + list_empty(&rq->queuelist))
> > >   __blk_mq_debugfs_rq_show(params->m,
> > >    list_entry_rq(&rq->queuelist));
> > >  }
> > 
> > Hello Jens,
> > 
> > Can you share your opinion about the above patch?
> 
> I just convince myself that the list check is super useful. The request
> could be on any number of lists, either not yet seen by the driver, or
> maybe sitting in dispatch. You're only going to be finding these
> requests if the tag is allocated, which means that it's already in some
> sort of non-idle state.
> 
> So enlighten me why we need the list check at all? Wouldn't it be better
> to simply remove it, and ensure that the debug print includes things
> like list state, rq state, etc?

Hello Jens,

I have tried to leave the list_empty() check out but if I do that then
requests that have the state "idle" (allocated but not yet started) also
show up. I think these should be left out from the output produced by
reading the "busy" attribute because such requests are not interesting
when analyzing an I/O lockup:

nullb0/hctx1/busy:abe67123 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT|STATS, .state=idle, .tag=63, .internal_tag=-1}

Thanks,

Bart.


Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-10-03 Thread Bart Van Assche
On Fri, 2018-01-12 at 22:11 +, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 15:05 -0700, Jens Axboe wrote:
> > On 1/12/18 3:00 PM, Bart Van Assche wrote:
> > > On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote:
> > > > On 1/12/18 2:52 PM, Bart Van Assche wrote:
> > > > > When debugging e.g. the SCSI timeout handler it is important that
> > > > > requests that have not yet been started or that already have
> > > > > completed are also reported through debugfs.
> > > > > 
> > > > > This patch depends on a patch that went upstream recently, namely
> > > > > commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
> > > > > dereference").
> > > > 
> > > > Why don't we just kill the check, and dump any request that has a
> > > > matching hctx? We already know the bit was set, so just print
> > > > all of them.
> > > 
> > > It is very helpful during debugging that requests owned by a block driver 
> > > and
> > > requests owned by the block layer core show up in different debugfs files.
> > > Removing the check completely would cause all requests to show up in the 
> > > same
> > > debugfs file and would make interpreting the contents of these debugfs 
> > > files
> > > much harder.
> > 
> > Yeah, we'd have to make it just one file at that point. I'm not hugely
> > against the queuelist check, but probably warrants a comment as it's not
> > immediately clear (as opposed to the idle check, or the previous START
> > bit check).
> 
> How about the below?
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 19db3f583bf1..da859dac442b 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -394,6 +394,12 @@ struct show_busy_params {
>  };
>  
>  /*
> + * Show "busy" requests - these are the requests owned by the block driver.
> + * The test list_empty(&rq->queuelist) is used to figure out whether or not
> + * a request is owned by the block driver. That test works because the block
> + * layer core uses list_del_init() consistently to remove a request from one
> + * of the request lists.
> + *
>   * Note: the state of a request may change while this function is in 
> progress,
>   * e.g. due to a concurrent blk_mq_finish_request() call.
>   */
> @@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
> *data, bool reserved)
>   const struct show_busy_params *params = data;
>  
>   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> - blk_mq_rq_state(rq) != MQ_RQ_IDLE)
> + list_empty(&rq->queuelist))
>   __blk_mq_debugfs_rq_show(params->m,
>    list_entry_rq(&rq->queuelist));
>  }

Hello Jens,

Can you share your opinion about the above patch?

Thanks,

Bart.



[PATCH] block: Finish renaming REQ_DISCARD into REQ_OP_DISCARD

2018-10-03 Thread Bart Van Assche
From: Bart Van Assche 

Some time ago REQ_DISCARD was renamed into REQ_OP_DISCARD. Some comments
and documentation files were not updated however. Update these comments
and documentation files. See also commit 4e1b2d52a80d ("block, fs,
drivers: remove REQ_OP compat defs and related code").

Signed-off-by: Bart Van Assche 
Cc: Mike Christie 
Cc: Martin K. Petersen 
Cc: Philipp Reisner 
Cc: Lars Ellenberg 
---
 Documentation/blockdev/zram.txt| 2 +-
 Documentation/device-mapper/log-writes.txt | 2 +-
 drivers/block/drbd/drbd_int.h  | 2 +-
 drivers/block/drbd/drbd_main.c | 2 +-
 drivers/block/drbd/drbd_protocol.h | 4 ++--
 drivers/block/drbd/drbd_req.c  | 2 +-
 drivers/block/drbd/drbd_worker.c   | 2 +-
 drivers/target/target_core_spc.c   | 6 +++---
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 875b2b56b87f..3c1b5ab54bc0 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -190,7 +190,7 @@ whitespace:
  notify_free  Depending on device usage scenario it may account
   a) the number of pages freed because of swap slot free
   notifications or b) the number of pages freed because of
-  REQ_DISCARD requests sent by bio. The former ones are
+  REQ_OP_DISCARD requests sent by bio. The former ones are
   sent to a swap block device when a swap slot is freed,
   which implies that this disk is being used as a swap disk.
   The latter ones are sent by filesystem mounted with
diff --git a/Documentation/device-mapper/log-writes.txt 
b/Documentation/device-mapper/log-writes.txt
index f4ebcbaf50f3..b638d124be6a 100644
--- a/Documentation/device-mapper/log-writes.txt
+++ b/Documentation/device-mapper/log-writes.txt
@@ -38,7 +38,7 @@ inconsistent file system.
 Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
 they complete as those requests will obviously bypass the device cache.
 
-Any REQ_DISCARD requests are treated like WRITE requests.  Otherwise we would
+Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise we 
would
 have all the DISCARD requests, and then the WRITE requests and then the FLUSH
 request.  Consider the following example:
 
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 8a4c1328e6f9..1e47db57b9d2 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -429,7 +429,7 @@ enum {
__EE_CALL_AL_COMPLETE_IO,
__EE_MAY_SET_IN_SYNC,
 
-   /* is this a TRIM aka REQ_DISCARD? */
+   /* is this a TRIM aka REQ_OP_DISCARD? */
__EE_IS_TRIM,
 
/* In case a barrier failed,
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 15a9ffce9012..55fd104f1ed4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1673,7 +1673,7 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
 
-/* Used to send write or TRIM aka REQ_DISCARD requests
+/* Used to send write or TRIM aka REQ_OP_DISCARD requests
  * R_PRIMARY -> Peer   (P_DATA, P_TRIM)
  */
 int drbd_send_dblock(struct drbd_peer_device *peer_device, struct drbd_request 
*req)
diff --git a/drivers/block/drbd/drbd_protocol.h 
b/drivers/block/drbd/drbd_protocol.h
index c3081f93051c..48dabbb21e11 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -57,7 +57,7 @@ enum drbd_packet {
P_PROTOCOL_UPDATE = 0x2d, /* data sock: is used in established 
connections */
 /* 0x2e to 0x30 reserved, used in drbd 9 */
 
-   /* REQ_DISCARD. We used "discard" in different contexts before,
+   /* REQ_OP_DISCARD. We used "discard" in different contexts before,
 * which is why I chose TRIM here, to disambiguate. */
P_TRIM= 0x31,
 
@@ -126,7 +126,7 @@ struct p_header100 {
 #define DP_UNPLUG 8 /* not used anymore   */
 #define DP_FUA   16 /* equals REQ_FUA */
 #define DP_FLUSH 32 /* equals REQ_PREFLUSH   */
-#define DP_DISCARD   64 /* equals REQ_DISCARD */
+#define DP_DISCARD   64 /* equals REQ_OP_DISCARD */
 #define DP_SEND_RECEIVE_ACK 128 /* This is a proto B write request */
 #define DP_SEND_WRITE_ACK   256 /* This is a proto C write request */
 #define DP_WSAME512 /* equiv. REQ_WRITE_SAME */
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 19cac36e9737..1c4da17e902e 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -650,7 +650,7 @@ int __req_mod(struct drbd_request *req, enum drbd_req_event 
what,
case DISCARD_COMPLETED_NOTSUPP:
 

Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-27 Thread Bart Van Assche
On Tue, 2018-09-18 at 17:18 -0700, Omar Sandoval wrote:
> On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote:
> > On 9/18/18 4:24 PM, Omar Sandoval wrote:
> > > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> > > > Can you have a look at the updated master branch of
> > > > https://github.com/bvanassche/blktests? That code should no longer fail 
> > > > if
> > > > unloading the nvme kernel module fails. Please note that you will need
> > > > kernel v4.18 to test these scripts - a KASAN complaint appears if I run
> > > > these tests against kernel v4.19-rc4.
> > > 
> > > Thanks, these pass now. Is it expected that my nvme device gets a
> > > multipath device configured after running these tests?
> > > 
> > > $ lsblk
> > > NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> > > vda  254:00  16G  0 disk
> > > └─vda1   254:10  16G  0 part  /
> > > vdb  254:16   0   8G  0 disk
> > > vdc  254:32   0   8G  0 disk
> > > vdd  254:48   0   8G  0 disk
> > > nvme0n1  259:00   8G  0 disk
> > > └─mpatha 253:00   8G  0 mpath
> > 
> > No, all multipath devices that were created during a test should be removed
> > before that test finishes. I will look into this.
> > 
> > > Also, can you please fix:
> > > 
> > >   _have_kernel_option NVME_MULTIPATH && exit 1
> > > 
> > > to not exit on failure? It should use SKIP_REASON and return 1. You
> > > might need to add something like _dont_have_kernel_option to properly
> > > handle the case where the config is not found.
> > 
> > OK, I will change this.
> > 
> > > Side note which isn't a blocker for merging is that there's a lot of
> > > duplicated code between these helpers and the srp helpers. How hard
> > > would it be to refactor that?
> > 
> > Are you perhaps referring to the code that is shared between the
> > tests/srp/rc and tests/nvmeof-mp/rc shell scripts?
> 
> Yes, those.
> 
> > The hardest part is probably
> > to chose a location where to store these functions. Should I create a file
> > with common code under common/, under tests/srp/, under tests/nvmeof-mp/ or
> > perhaps somewhere else?
> 
> Just put it under common.

Hi Omar,

All feedback mentioned above has been addressed. The following pull request has
been updated: https://github.com/osandov/blktests/pull/33. Please let me know
if you want me to post these patches on the linux-block mailing list.

Note: neither the upstream kernel v4.18 nor v4.19-rc4 is stable enough to pass
all nvmeof-mp tests if kernel debugging options like KASAN are enabled.
Additionally, the NVMe device_add_disk() race condition often causes multipathd
to refuse to consider /dev/nvme... devices. The output on my test setup is as
follows (all tests pass):

# ./check -q nvmeof-mp
nvmeof-mp/001 (Log in and log out)   [passed]
runtime  1.528s  ...  1.909s
nvmeof-mp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
runtime  38.968s  ...  38.571s
nvmeof-mp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [passed]
runtime  38.632s  ...  37.529s
nvmeof-mp/005 (Direct I/O with large transfer sizes and bs=4M) [passed]
runtime  13.382s  ...  13.684s
nvmeof-mp/006 (Direct I/O with large transfer sizes and bs=8M) [passed]
runtime  13.511s  ...  13.480s
nvmeof-mp/009 (Buffered I/O with large transfer sizes and bs=4M) [passed]
runtime  13.665s  ...  13.763s
nvmeof-mp/010 (Buffered I/O with large transfer sizes and bs=8M) [passed]
runtime  13.442s  ...  13.900s
nvmeof-mp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
runtime  37.988s  ...  37.945s
nvmeof-mp/012 (dm-mpath on top of multiple I/O schedulers)   [passed]
runtime  21.659s  ...  21.733s

Thanks,

Bart.



Re: [PATCHv3 0/5] genhd: register default groups with device_add_disk()

2018-09-27 Thread Bart Van Assche
On Fri, 2018-09-21 at 07:48 +0200, Christoph Hellwig wrote:
> Can you resend this with the one easy fixup pointed out?  It would
> be good to finally get the race fix merged.

Seconded. I also would like to see these patches being merged upstream.

Bart.



[PATCH v11 5/8] percpu-refcount: Introduce percpu_ref_resurrect()

2018-09-26 Thread Bart Van Assche
This function will be used in a later patch to switch the struct
request_queue q_usage_counter from killed back to live. In contrast
to percpu_ref_reinit(), this new function does not require that the
refcount is zero.

Signed-off-by: Bart Van Assche 
Acked-by: Tejun Heo 
Reviewed-by: Ming Lei 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c   | 28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 009cdf3d65b6..b297cd1cd4f1 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -108,6 +108,7 @@ void percpu_ref_switch_to_atomic_sync(struct percpu_ref 
*ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
+void percpu_ref_resurrect(struct percpu_ref *ref);
 void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..de10b8c0bff6 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -356,11 +356,35 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  */
 void percpu_ref_reinit(struct percpu_ref *ref)
 {
+   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+
+   percpu_ref_resurrect(ref);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+
+/**
+ * percpu_ref_resurrect - modify a percpu refcount from dead to live
+ * @ref: perpcu_ref to resurrect
+ *
+ * Modify @ref so that it's in the same state as before percpu_ref_kill() was
+ * called. @ref must be dead but must not yet have exited.
+ *
+ * If @ref->release() frees @ref then the caller is responsible for
+ * guaranteeing that @ref->release() does not get called while this
+ * function is in progress.
+ *
+ * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
+ * this function is in progress.
+ */
+void percpu_ref_resurrect(struct percpu_ref *ref)
+{
+   unsigned long __percpu *percpu_count;
unsigned long flags;
 
spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
-   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+   WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD));
+   WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
 
ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
percpu_ref_get(ref);
@@ -368,4 +392,4 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-- 
2.19.0.605.g01d371f741-goog
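
For reference, a rough sketch of the intended kill/resurrect lifecycle
(illustrative only; the blk-mq q_usage_counter wiring added by the later
patches in this series is omitted):

	#include <linux/percpu-refcount.h>

	/* Illustrative only. Unlike percpu_ref_reinit(), percpu_ref_resurrect()
	 * does not require that the refcount has dropped to zero, so references
	 * obtained before the kill may still be held at this point.
	 */
	static void example_kill_then_resurrect(struct percpu_ref *ref)
	{
		percpu_ref_kill(ref);		/* no new references can be obtained */
		/* ... existing references may drain or persist here ... */
		percpu_ref_resurrect(ref);	/* references can be obtained again */
	}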



[PATCH v11 4/8] block: Schedule runtime resume earlier

2018-09-26 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 3 ++-
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..fec135ae52cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -956,7 +956,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (pm || !blk_queue_pm_only(q))) ||
+   (pm || (blk_pm_request_resume(q),
+   !blk_queue_pm_only(q)))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.19.0.605.g01d371f741-goog



[PATCH v11 8/8] blk-mq: Enable support for runtime power management

2018-09-26 Thread Bart Van Assche
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96d501e8663c..d384ab700afd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
 
+   blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 972fbc656846..f8fdae01bea2 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.19.0.605.g01d371f741-goog



[PATCH v11 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-26 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 -
 block/blk-pm.c   | 44 +++-
 2 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fec135ae52cf..16dd3a989753 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2746,30 +2746,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2815,11 +2791,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, &q->queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..972fbc656846 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -68,14 +71,40 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   /*
+* Increase the pm_only counter before checking whether any
+* non-PM blk_queue_enter() calls are in progress to avoid that any
+* new non-PM blk_queue_enter() calls succeed before the pm_only
+* counter is decreased again.
+*/
+   blk_set_pm_only(q);
+   ret = -EBUSY;
+   /* Switch q_usage_counter from per-cpu to atomic mode. */
+   blk_freeze_queue_start(q);
+   /*
+* Wait until atomic mode has been reached. Since that
+* involves calling call_rcu(), it is guaranteed that later
+* blk_queue_enter() calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
+   if (percpu_ref_is_zero(&q->q_usage_counter))
+   ret = 0;
+   /* Switch q_usage_counter back to per-cpu mode. */
+   blk_mq_unfreeze_queue(q);
+
spin_lock_irq(q->queue_lock);
-   if (q->nr_pending) {
-   ret = -EBUSY;
+   if (ret < 0)
pm_runtime_mark_last_busy(q->dev);
-   } else {
+   else
q->rpm_status = RPM_SUSPENDING;
-   }
spin_unlock_irq(q->queue_lock);
+
+   if (ret)
+   blk_clear_pm_only(q);
+
return ret;
 }
 EXPORT_SYMBOL(blk_pre_runtime_suspend);
@@ -106,6 +135,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int 
err)
pm_runtime_mark_last_busy(q->dev);
}
spin_unlock_irq(q->queue_lock);
+
+   if (err)
+   blk_clear_pm_only(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_suspend);
 
@@ -153,13 +185,15 @@ void blk_post_runtime_resume(struct request_queue *q, int 
err)
spin_lock_irq(q->queue_lock);
if (!err) {
q->rpm_status = RPM_ACTIVE;
-   __blk_run_queue(q);
pm_

[PATCH v11 2/8] block, scsi: Change the preempt-only flag into a counter

2018-09-26 Thread Bart Van Assche
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.
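
The counter semantics can be pictured with a short usage sketch (illustrative
only, not part of the patch):

	blk_set_pm_only(q);	/* e.g. from scsi_device_quiesce() */
	blk_set_pm_only(q);	/* e.g. again while the queue is runtime suspended */
	/* ... */
	blk_clear_pm_only(q);	/* counter drops to 1: queue stays in pm-only mode */
	blk_clear_pm_only(q);	/* counter drops to 0: non-PM blk_queue_enter() unblocks */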

Signed-off-by: Bart Van Assche 
Acked-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c| 35 ++-
 block/blk-mq-debugfs.c  | 10 +-
 drivers/scsi/scsi_lib.c | 11 +++
 include/linux/blkdev.h  | 14 +-
 4 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d4dd176bd9d..1a691f5269bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-   return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   atomic_inc(&q->pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-   blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(&q->mq_freeze_wq);
+   int pm_only;
+
+   pm_only = atomic_dec_return(&q->pm_only);
+   WARN_ON_ONCE(pm_only < 0);
+   if (pm_only == 0)
+   wake_up_all(&q->mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+   const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
while (true) {
bool success = false;
@@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
-* The code that sets the PREEMPT_ONLY flag is
-* responsible for ensuring that that flag is globally
-* visible before the queue is unfrozen.
+* The code that increments the pm_only counter is
+* responsible for ensuring that that counter is
+* globally visible before the queue is unfrozen.
 */
-   if (preempt || !blk_queue_preempt_only(q)) {
+   if (pm || !blk_queue_pm_only(q)) {
success = true;
} else {
percpu_ref_put(&q->q_usage_counter);
@@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
+   (pm || !blk_queue_pm_only(q))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const 
unsigned long flags,
return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+
+   seq_printf(m, "%d\n", atomic_read(&q->pm_only));
+   return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *con

[PATCH v11 0/8] blk-mq: Implement runtime power management

2018-09-26 Thread Bart Van Assche
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default
is implementing runtime power management support for blk-mq.  This patch
series not only implements runtime power management for blk-mq but also
fixes a starvation issue in the power management code for the legacy
block layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v10:
- Added a comment in the percpu-refcount patch in this series as Tejun
  asked.
- Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
- Left out the patches that document the functions that iterate over
  requests and also the patch that introduces blk_mq_queue_rq_iter().
- Simplified blk_pre_runtime_suspend(): left out the check whether no
  requests are in progress.
- Fixed the race between blk_queue_enter(), queue freezing and runtime
  power management that Ming had identified.
- Added a new patch that introduces percpu_ref_resurrect().

Changes compared to v8:
- Fixed the race that was reported by Jianchao.
- Fixed another spelling issue in a source code comment.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block
  for non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate
  over requests and one that introduces a new function that iterates over
  all requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_QUIESCE.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to
  determine whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set
  for ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter
  code.
- Added a patch that makes it easier to detect missing pm_runtime_get*()
  calls.
- Addressed Jianchao's feedback including the comment about runtime
  overhead of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (8):
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  percpu-refcount: Introduce percpu_ref_resurrect()
  block: Allow unfreezing of a queue while requests are in progress
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig   |   3 +
 block/Makefile  |   1 +
 block/blk-core.c| 270 
 block/blk-mq-debugfs.c  |  10 +-
 block/blk-mq.c  |   4 +-
 block/blk-pm.c  | 216 +
 block/blk-pm.h  |  69 
 block/elevator.c|  22 +--
 drivers/scsi/scsi_lib.c |  11 +-
 drivers/scsi/scsi_pm.c  |   1 +
 drivers/scsi/sd.c   |   1 +
 drivers/scsi/sr.c   |   1 +
 include/linux/blk-pm.h  |  24 +++
 include/linux/blkdev.h  |  37 ++---
 include/linux/percpu-refcount.h |   1 +
 lib/percpu-refcount.c   |  28 +++-
 16 files changed, 401 insertions(+), 298 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/b

[PATCH v11 1/8] block: Move power management code into a new source file

2018-09-26 Thread Bart Van Assche
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/Kconfig  |   3 +
 block/Makefile |   1 +
 block/blk-core.c   | 196 +
 block/blk-pm.c | 188 +++
 block/blk-pm.h |  43 +
 block/elevator.c   |  22 +
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c  |   1 +
 drivers/scsi/sr.c  |   1 +
 include/linux/blk-pm.h |  24 +
 include/linux/blkdev.h |  23 -
 11 files changed, 264 insertions(+), 239 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..85263e7bded6 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,7 @@ config BLK_MQ_RDMA
depends on BLOCK && INFINIBAND
default y
 
+config BLK_PM
+   def_bool BLOCK && PM
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_PM)   += blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..6d4dd176bd9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1726,16 +1727,6 @@ void part_round_stats(struct request_queue *q, int cpu, 
struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
req_flags_t rq_flags = req->rq_flags;
@@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *Initialize runtime-PM-related fields for @q and start auto suspend for
- *@dev. Drivers that want to take advantage of request-based runtime PM
- *should call this function after @dev has been initialized, and its
- *request queue @q has been allocated, and runtime PM for it can not happen
- *yet(either due to disabled/forbidden or its usage_count > 0). In most
- *cases, driver should call this function before any I/O has taken place.
- *
- *This function takes care of setting up using auto suspend for the device,
- *the autosuspend delay is set to -1 to make runtime suspend impossible
- *until an updated value is either set by user or by driver. Drivers do
- *not need to touch other autosuspend settings.
- *
- *The block layer runtime PM is request based, so only works for drivers
- *that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
-   q->dev = dev;
-   q->rpm_status = RPM_ACTIVE;
-   pm_runtime_set_autosuspend_delay(q->dev, -1);
-   pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *This function will check if runtime suspend is allowed for the device
- *by examining if there are any requests pending in the queue. If there
- *are requests pending, the device can not be runtime suspended; otherwise,
- *the queue's status will be updated to SUSPENDING and the driver can
- *pro

[PATCH v11 6/8] block: Allow unfreezing of a queue while requests are in progress

2018-09-26 Thread Bart Van Assche
A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
to zero. Make sure that this doesn't cause a kernel warning to appear
by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
former namely requires that the refcount it operates on is zero.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..96d501e8663c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -198,7 +198,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
-   percpu_ref_reinit(&q->q_usage_counter);
+   percpu_ref_resurrect(&q->q_usage_counter);
wake_up_all(&q->mq_freeze_wq);
}
 }
-- 
2.19.0.605.g01d371f741-goog



[PATCH v11 3/8] block: Split blk_pm_add_request() and blk_pm_put_request()

2018-09-26 Thread Bart Van Assche
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche 
Reviewed-by: Ming Lei 
Reviewed-by: Christoph Hellwig 
Cc: Martin K. Petersen 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c |  1 +
 block/blk-pm.h   | 36 +++-
 block/elevator.c |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1a691f5269bb..fd91e9bf2893 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
 
blk_req_zone_write_unlock(req);
blk_pm_put_request(req);
+   blk_pm_mark_last_busy(req);
 
elv_completed_request(q, req);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..a8564ea72a41 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,8 +6,23 @@
 #include 
 
 #ifdef CONFIG_PM
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+  q->rpm_status == RPM_SUSPENDING))
+   pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   pm_runtime_mark_last_busy(rq->q->dev);
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
+   lockdep_assert_held(rq->q->queue_lock);
+
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
 }
@@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request 
*rq)
 static inline void blk_pm_add_request(struct request_queue *q,
  struct request *rq)
 {
-   if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-   (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-   pm_request_resume(q->dev);
+   lockdep_assert_held(q->queue_lock);
+
+   if (q->dev && !(rq->rq_flags & RQF_PM))
+   q->nr_pending++;
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
+   lockdep_assert_held(rq->q->queue_lock);
+
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   --rq->q->nr_pending;
 }
 #else
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..1c992bf6cfb1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
+   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()

2018-09-26 Thread Bart Van Assche
On Wed, 2018-09-26 at 09:59 -0700, Tejun Heo wrote:
> Hello, Bart.
> 
> On Mon, Sep 24, 2018 at 01:43:35PM -0700, Bart Van Assche wrote:
> > Thanks for the review. But could you please clarify what "after ->percpu_ref
> > is called" refers to?
> 
> Oops, sorry, I meant the confirm_kill callback of
> percpu_ref_kill_and_confirm().

Thanks for the clarification. I will update the comment above
percpu_ref_resurrect().

Bart.



Re: [PATCH v10 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-26 Thread Bart Van Assche
On Wed, 2018-09-26 at 17:06 +0200, Johannes Thumshirn wrote:
> On Wed, Sep 26, 2018 at 04:57:32PM +0200, Christoph Hellwig wrote:
> > I don't think this actually works given that rpm_status only exists
> > if CONFIG_PM is set.
> 
> I think it'll work as GCC does constant propagation. There are
> actually some places in the kernel that follow this pattern.

This is what gcc on my development system thinks about that proposal:

In file included from ./arch/x86/include/asm/bug.h:83:0,
 from ./include/linux/bug.h:5,
 from ./include/linux/thread_info.h:12,
 from ./arch/x86/include/asm/preempt.h:7,
 from ./include/linux/preempt.h:81,
 from ./include/linux/spinlock.h:51,
 from ./include/linux/seqlock.h:36,
 from ./include/linux/time.h:6,
 from ./include/linux/stat.h:19,
 from ./include/linux/module.h:10,
 from block/blk-core.c:15:
block/blk-core.c: In function ‘elv_next_request’:
block/blk-core.c:2795:44: error: ‘struct request_queue’ has no member named 
‘rpm_status’; did you mean ‘stats’?
WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
^
./include/asm-generic/bug.h:69:25: note: in definition of macro ‘WARN_ON_ONCE’
  int __ret_warn_on = !!(condition);   \
 ^
scripts/Makefile.build:305: recipe for target 'block/blk-core.o' failed
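
The error above illustrates why constant propagation does not help here:
dead-code elimination happens only after the code has been parsed and
type-checked, so a reference to a member that exists solely under CONFIG_PM
still fails to build. A minimal sketch (hypothetical struct, mirroring the
rpm_status case):

	#include <linux/kconfig.h>

	struct example_queue {
	#ifdef CONFIG_PM
		int rpm_status;
	#endif
		int stats;
	};

	static int example_check(struct example_queue *q)
	{
		/* With CONFIG_PM=n this does not compile: the branch is dead,
		 * but q->rpm_status must still resolve to a real member.
		 */
		if (IS_ENABLED(CONFIG_PM) && q->rpm_status)
			return 1;
		return 0;
	}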

Bart.



Re: [PATCH v2] block: fix deadline elevator drain for zoned block devices

2018-09-26 Thread Bart Van Assche
On Wed, 2018-09-26 at 16:30 +0900, Damien Le Moal wrote:
> diff --git a/block/elevator.c b/block/elevator.c
> index 6a06b5d040e5..8cd81fd6339a 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -609,7 +609,7 @@ void elv_drain_elevator(struct request_queue *q)
>  
>   while (e->type->ops.sq.elevator_dispatch_fn(q, 1))
>   ;
> - if (q->nr_sorted && printed++ < 10) {
> + if (q->nr_sorted && printed++ < 10 && !blk_queue_is_zoned(q)) {
>   printk(KERN_ERR "%s: forced dispatching is broken "
>  "(nr_sorted=%u), please report this\n",
>  q->elevator->type->elevator_name, q->nr_sorted);

It seems wrong to me to perform the blk_queue_is_zoned() check after having
incremented the "printed" variable. Shouldn't that check be performed before
incrementing the "printed" variable to avoid that "printed" gets incremented
if we know that we are not going to report the error message?
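
Concretely, that would mean testing the queue type before the side-effecting
increment, along these lines (untested sketch):

	if (!blk_queue_is_zoned(q) && q->nr_sorted && printed++ < 10) {
		printk(KERN_ERR "%s: forced dispatching is broken "
		       "(nr_sorted=%u), please report this\n",
		       q->elevator->type->elevator_name, q->nr_sorted);
	}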

Thanks,

Bart.



Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Bart Van Assche
On Tue, 2018-09-25 at 08:39 -0600, Keith Busch wrote:
> On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> > But the issue is the left part of blk_mq_timeout_work is moved out of 
> > protection of q refcount.
> 
> I'm not sure what you mean by "left part". The only part that isn't
> outside the reference with this patch is the part Bart pointed out.
> 
> This looks like it may be fixed by either moving the refcount back up a
> level to all the callers of blk_mq_queue_tag_busy_iter, or add
> cancel_work_sync(>timeout_work) to __blk_mq_update_nr_hw_queues after
> the freeze.

Hi Keith,

How about applying the following (untested) patch on top of your patch?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 019f9b169887..099e203b5213 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -851,6 +851,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
return;
 
+   if (!percpu_ref_tryget(&q->q_usage_counter))
+   return;
+
if (next != 0) {
mod_timer(&q->timeout, next);
} else {
@@ -866,6 +869,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_mq_tag_idle(hctx);
}
}
+   blk_queue_exit(q);
 }

Thanks,

Bart.


Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-24 Thread Bart Van Assche

On 9/24/18 7:11 PM, jianchao.wang wrote:

Hi Keith

On 09/25/2018 05:09 AM, Keith Busch wrote:

-   /* A deadlock might occur if a request is stuck requiring a
-* timeout at the same time a queue freeze is waiting
-* completion, since the timeout code would not be able to
-* acquire the queue reference here.
-*
-* That's why we don't use blk_queue_enter here; instead, we use
-* percpu_ref_tryget directly, because we need to be able to
-* obtain a reference even in the short window between the queue
-* starting to freeze, by dropping the first reference in
-* blk_freeze_queue_start, and the moment the last request is
-* consumed, marked by the instant q_usage_counter reaches
-* zero.
-*/
-   if (!percpu_ref_tryget(&q->q_usage_counter))
+   if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
return;


We cannot discard the percpu_ref_tryget here.

There following code in blk_mq_timeout_work still need it:

if (next != 0) {
mod_timer(&q->timeout, next);
} else {
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_idle(hctx);
}
}


Hi Jianchao,

Had you noticed that the percpu_ref_tryget() call has been moved into
blk_mq_queue_tag_busy_iter()?

Bart.



Re: [PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()

2018-09-24 Thread Bart Van Assche
On Mon, 2018-09-24 at 11:01 -0700, Tejun Heo wrote:
> Hello, Bart.
> 
> On Fri, Sep 21, 2018 at 01:31:19PM -0700, Bart Van Assche wrote:
> > +void percpu_ref_resurrect(struct percpu_ref *ref)
> > +{
> > +   unsigned long __percpu *percpu_count;
> > unsigned long flags;
> >  
> > spin_lock_irqsave(&percpu_ref_switch_lock, flags);
> >  
> > -   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
> > +   WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD));
> > +   WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
> 
> So, this in itself is fine but I'm a bit worried that this might it
> easier to abuse percpu_ref's dying mode.  More specifically, it isn't
> a good idea to depend on percpu_ref's implementation details from
> outside - ie. the only guarantee there ever is that percpu_ref won't
> give out new refs after ->percpu_ref is called - e.g. users can't
> piggy back on implied rcu grace period.
> 
> Can you please note that in the function comment?  Provided that,
> 
> Acked-by: Tejun Heo 

Hi Tejun,

Thanks for the review. But could you please clarify what "after ->percpu_ref
is called" refers to?

Thanks,

Bart.



Re: Regression caused by f5bbbbe4d635

2018-09-24 Thread Bart Van Assche
On Mon, 2018-09-24 at 13:13 -0600, Keith Busch wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 85a1c1a59c72..28d128450621 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -848,22 +848,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   struct blk_mq_hw_ctx *hctx;
>   int i;
>  
> - /* A deadlock might occur if a request is stuck requiring a
> -  * timeout at the same time a queue freeze is waiting
> -  * completion, since the timeout code would not be able to
> -  * acquire the queue reference here.
> -  *
> -  * That's why we don't use blk_queue_enter here; instead, we use
> -  * percpu_ref_tryget directly, because we need to be able to
> -  * obtain a reference even in the short window between the queue
> -  * starting to freeze, by dropping the first reference in
> -  * blk_freeze_queue_start, and the moment the last request is
> -  * consumed, marked by the instant q_usage_counter reaches
> -  * zero.
> -  */
> - if (!percpu_ref_tryget(&q->q_usage_counter))
> - return;
> -
>   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
>  
>   if (next != 0) {
> @@ -881,7 +865,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   blk_mq_tag_idle(hctx);
>   }
>   }
> - blk_queue_exit(q);
>  }

Hi Keith,

The above introduces a behavior change: if the percpu_ref_tryget() call inside
blk_mq_queue_tag_busy_iter() fails then blk_mq_timeout_work() will now call
blk_mq_tag_idle(). I think that's wrong if the percpu_ref_tryget() call fails
due to the queue having been frozen. Please make blk_mq_queue_tag_busy_iter()
return a bool that indicates whether or not it has iterated over the request
queue.
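
Something along these lines would work (untested sketch; the per-hctx tag
iteration body is elided):

	bool blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
					void *priv)
	{
		struct blk_mq_hw_ctx *hctx;
		int i;

		/* Tell the caller whether the iteration ran at all. */
		if (!percpu_ref_tryget(&q->q_usage_counter))
			return false;

		queue_for_each_hw_ctx(q, hctx, i) {
			/* ... iterate the tags of each mapped hctx as before ... */
		}

		blk_queue_exit(q);
		return true;
	}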

Thanks,

Bart.


[PATCH v2] blk-mq: Document the functions that iterate over requests

2018-09-21 Thread Bart Van Assche
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix several minor spelling
and grammar mistakes in comments in these functions.

Signed-off-by: Bart Van Assche 
Reviewed-by: Johannes Thumshirn 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
---
Changes compared to v1: converted to kernel-doc format.

 block/blk-mq-tag.c | 71 +-
 1 file changed, 64 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..40d1667bceac 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,13 +232,26 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
 
/*
 * We can hit rq == NULL here, because the tagging functions
-* test and set the bit before assining ->rqs[].
+* test and set the bit before assigning ->rqs[].
 */
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
 }
 
+/**
+ * bt_for_each - iterate over the requests associated with a hardware queue
+ * @hctx:  Hardware queue to examine.
+ * @bt:sbitmap to examine. This is either the breserved_tags 
member
+ * or the bitmap_tags member of struct blk_mq_tags.
+ * @fn:Pointer to the function that will be called for each 
request
+ * associated with @hctx that has been assigned a driver tag.
+ * @fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
+ * where rq is a pointer to a request.
+ * @data:  Will be passed as third argument to @fn.
+ * @reserved:  Indicates whether @bt is the breserved_tags member or the
+ * bitmap_tags member of struct blk_mq_tags.
+ */
 static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
 {
@@ -280,6 +293,18 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
return true;
 }
 
+/**
+ * bt_tags_for_each - iterate over the requests in a tag map
+ * @tags:  Tag map to iterate over.
+ * @bt:sbitmap to examine. This is either the breserved_tags 
member
+ * or the bitmap_tags member of struct blk_mq_tags.
+ * @fn:Pointer to the function that will be called for each 
started
+ * request. @fn will be called as follows: @fn(rq, @data,
+ * @reserved) where rq is a pointer to a request.
+ * @data:  Will be passed as second argument to @fn.
+ * @reserved:  Indicates whether @bt is the breserved_tags member or the
+ * bitmap_tags member of struct blk_mq_tags.
+ */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue 
*bt,
 busy_tag_iter_fn *fn, void *data, bool reserved)
 {
@@ -294,6 +319,15 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, 
struct sbitmap_queue *bt,
sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
 }
 
+/**
+ * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
+ * @tags:  Tag map to iterate over.
+ * @fn:Pointer to the function that will be called for each 
started
+ * request. @fn will be called as follows: @fn(rq, @priv,
+ * reserved) where rq is a pointer to a request. 'reserved'
+ * indicates whether or not @rq is a reserved request.
+ * @priv:  Will be passed as second argument to @fn.
+ */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -302,6 +336,15 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
*tags,
 	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
 }
 
+/**
+ * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
+ * @tagset:Tag set to iterate over.
+ * @fn:Pointer to the function that will be called for each 
started
+ * request. @fn will be called as follows: @fn(rq, @priv,
+ * reserved) where rq is a pointer to a request. 'reserved'
+ * indicates whether or not @rq is a reserved request.
+ * @priv:  Will be passed as second argument to @fn.
+ */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -314,6 +357,20 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/**
+ * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
+ * @q: Request queue to examine.
+ * @fn:Pointer to the function that will be called for each 
request
+ * on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
+ * reserved) where rq is

[PATCH v10 8/8] blk-mq: Enable support for runtime power management

2018-09-21 Thread Bart Van Assche
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.
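
For illustration, a blk-mq driver can now opt in the same way the legacy
path did by wiring its queue up with blk_pm_runtime_init(). The sketch
below is not part of the patch; the function name, device pointer and
delay value are made up for the example:

#include <linux/blk-pm.h>
#include <linux/pm_runtime.h>

static void example_enable_runtime_pm(struct request_queue *q,
				      struct device *dev)
{
	/* Associate the queue with the device for block layer runtime PM. */
	blk_pm_runtime_init(q, dev);

	/*
	 * blk_pm_runtime_init() leaves the autosuspend delay at -1, so
	 * runtime suspend stays impossible until the driver or user space
	 * picks a delay.
	 */
	pm_runtime_set_autosuspend_delay(dev, 5000);
}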

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96d501e8663c..d384ab700afd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
 
+   blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 972fbc656846..f8fdae01bea2 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v10 3/8] block: Split blk_pm_add_request() and blk_pm_put_request()

2018-09-21 Thread Bart Van Assche
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c |  1 +
 block/blk-pm.h   | 36 +++-
 block/elevator.c |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1a691f5269bb..fd91e9bf2893 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
 
blk_req_zone_write_unlock(req);
blk_pm_put_request(req);
+   blk_pm_mark_last_busy(req);
 
elv_completed_request(q, req);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..a8564ea72a41 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,8 +6,23 @@
 #include 
 
 #ifdef CONFIG_PM
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+  q->rpm_status == RPM_SUSPENDING))
+   pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   pm_runtime_mark_last_busy(rq->q->dev);
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
+   lockdep_assert_held(rq->q->queue_lock);
+
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
 }
@@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request 
*rq)
 static inline void blk_pm_add_request(struct request_queue *q,
  struct request *rq)
 {
-   if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-   (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-   pm_request_resume(q->dev);
+   lockdep_assert_held(q->queue_lock);
+
+   if (q->dev && !(rq->rq_flags & RQF_PM))
+   q->nr_pending++;
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
+   lockdep_assert_held(rq->q->queue_lock);
+
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   --rq->q->nr_pending;
 }
 #else
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..1c992bf6cfb1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
+   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v10 1/8] block: Move power management code into a new source file

2018-09-21 Thread Bart Van Assche
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
 into . For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file 
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/Kconfig  |   3 +
 block/Makefile |   1 +
 block/blk-core.c   | 196 +
 block/blk-pm.c | 188 +++
 block/blk-pm.h |  43 +
 block/elevator.c   |  22 +
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c  |   1 +
 drivers/scsi/sr.c  |   1 +
 include/linux/blk-pm.h |  24 +
 include/linux/blkdev.h |  23 -
 11 files changed, 264 insertions(+), 239 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..85263e7bded6 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,7 @@ config BLK_MQ_RDMA
depends on BLOCK && INFINIBAND
default y
 
+config BLK_PM
+   def_bool BLOCK && PM
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_PM)   += blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..6d4dd176bd9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1726,16 +1727,6 @@ void part_round_stats(struct request_queue *q, int cpu, 
struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
req_flags_t rq_flags = req->rq_flags;
@@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *Initialize runtime-PM-related fields for @q and start auto suspend for
- *@dev. Drivers that want to take advantage of request-based runtime PM
- *should call this function after @dev has been initialized, and its
- *request queue @q has been allocated, and runtime PM for it can not happen
- *yet(either due to disabled/forbidden or its usage_count > 0). In most
- *cases, driver should call this function before any I/O has taken place.
- *
- *This function takes care of setting up using auto suspend for the device,
- *the autosuspend delay is set to -1 to make runtime suspend impossible
- *until an updated value is either set by user or by driver. Drivers do
- *not need to touch other autosuspend settings.
- *
- *The block layer runtime PM is request based, so only works for drivers
- *that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
-   q->dev = dev;
-   q->rpm_status = RPM_ACTIVE;
-   pm_runtime_set_autosuspend_delay(q->dev, -1);
-   pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *This function will check if runtime suspend is allowed for the device
- *by examining if there are any requests pending in the queue. If there
- *are requests pending, the device can not be runtime suspended; otherwise,
- *the queue's status will be updated to SUSPENDING and the driver can
- *proceed to s

[PATCH v10 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-21 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.
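
A sketch of the resulting behaviour at the allocation side (illustration
only; the function below and the use of REQ_OP_DRV_IN are just for this
example):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

static int example_issue_while_suspended(struct request_queue *q)
{
	struct request *rq;

	/*
	 * An allocation flagged BLK_MQ_REQ_PREEMPT, which blk_queue_enter()
	 * treats as a power management request in this series, still
	 * succeeds while the queue is runtime suspended; this is what lets
	 * the resume path itself make progress.
	 */
	rq = blk_get_request(q, REQ_OP_DRV_IN, BLK_MQ_REQ_PREEMPT);
	if (IS_ERR(rq))
		return PTR_ERR(rq);
	blk_put_request(rq);

	/*
	 * A regular allocation now blocks in blk_queue_enter() until the
	 * queue has been resumed, instead of entering the queue in
	 * RPM_SUSPENDED state.
	 */
	rq = blk_get_request(q, REQ_OP_READ, 0);
	if (IS_ERR(rq))
		return PTR_ERR(rq);
	blk_put_request(rq);
	return 0;
}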

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 -
 block/blk-pm.c   | 44 +++-
 2 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fec135ae52cf..16dd3a989753 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2746,30 +2746,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2815,11 +2791,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, >queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..972fbc656846 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -68,14 +71,40 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   /*
+* Increase the pm_only counter before checking whether any
+* non-PM blk_queue_enter() calls are in progress to avoid that any
+* new non-PM blk_queue_enter() calls succeed before the pm_only
+* counter is decreased again.
+*/
+   blk_set_pm_only(q);
+   ret = -EBUSY;
+   /* Switch q_usage_counter from per-cpu to atomic mode. */
+   blk_freeze_queue_start(q);
+   /*
+* Wait until atomic mode has been reached. Since that
+* involves calling call_rcu(), it is guaranteed that later
+* blk_queue_enter() calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
+	if (percpu_ref_is_zero(&q->q_usage_counter))
+   ret = 0;
+   /* Switch q_usage_counter back to per-cpu mode. */
+   blk_mq_unfreeze_queue(q);
+
spin_lock_irq(q->queue_lock);
-   if (q->nr_pending) {
-   ret = -EBUSY;
+   if (ret < 0)
pm_runtime_mark_last_busy(q->dev);
-   } else {
+   else
q->rpm_status = RPM_SUSPENDING;
-   }
spin_unlock_irq(q->queue_lock);
+
+   if (ret)
+   blk_clear_pm_only(q);
+
return ret;
 }
 EXPORT_SYMBOL(blk_pre_runtime_suspend);
@@ -106,6 +135,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int 
err)
pm_runtime_mark_last_busy(q->dev);
}
spin_unlock_irq(q->queue_lock);
+
+   if (err)
+   blk_clear_pm_only(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_suspend);
 
@@ -153,13 +185,15 @@ void blk_post_runtime_resume(struct request_queue *q, int 
err)
spin_lock_irq(q->queue_lock);
if (!err) {
q->rpm_status = RPM_ACTIVE;
-   __blk_run_queue(q);
pm_runtime_mark_last_bu

[PATCH v10 2/8] block, scsi: Change the preempt-only flag into a counter

2018-09-21 Thread Bart Van Assche
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.
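
The counter semantics can be summarised as follows (hypothetical caller,
for illustration only):

#include <linux/blkdev.h>

static void example_pm_only_nesting(struct request_queue *q)
{
	blk_set_pm_only(q);	/* first user: pm_only 0 -> 1 */
	blk_set_pm_only(q);	/* second, independent user: 1 -> 2 */

	/* Non-PM blk_queue_enter() callers wait while pm_only > 0. */

	blk_clear_pm_only(q);	/* 2 -> 1, queue stays in pm-only mode */
	blk_clear_pm_only(q);	/* 1 -> 0, mq_freeze_wq waiters are woken */
}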

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c| 35 ++-
 block/blk-mq-debugfs.c  | 10 +-
 drivers/scsi/scsi_lib.c | 11 +++
 include/linux/blkdev.h  | 14 +-
 4 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d4dd176bd9d..1a691f5269bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-   return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	atomic_inc(&q->pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-   blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-	wake_up_all(&q->mq_freeze_wq);
+   int pm_only;
+
+	pm_only = atomic_dec_return(&q->pm_only);
+   WARN_ON_ONCE(pm_only < 0);
+   if (pm_only == 0)
+		wake_up_all(&q->mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+   const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
while (true) {
bool success = false;
@@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
rcu_read_lock();
 		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
-* The code that sets the PREEMPT_ONLY flag is
-* responsible for ensuring that that flag is globally
-* visible before the queue is unfrozen.
+* The code that increments the pm_only counter is
+* responsible for ensuring that that counter is
+* globally visible before the queue is unfrozen.
 */
-   if (preempt || !blk_queue_preempt_only(q)) {
+   if (pm || !blk_queue_pm_only(q)) {
success = true;
} else {
 			percpu_ref_put(&q->q_usage_counter);
@@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
 		   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
+   (pm || !blk_queue_pm_only(q))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const 
unsigned long flags,
return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+
+	seq_printf(m, "%d\n", atomic_read(&q->pm_only));
+   return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *con

[PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()

2018-09-21 Thread Bart Van Assche
This function will be used in a later patch to switch the struct
request_queue q_usage_counter from killed back to live. In contrast
to percpu_ref_reinit(), this new function does not require that the
refcount is zero.
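
A usage sketch of the difference with percpu_ref_reinit() (the init and
release details below are only for the example):

#include <linux/gfp.h>
#include <linux/percpu-refcount.h>

static void example_release(struct percpu_ref *ref)
{
	/* Called once the last reference is dropped after a kill. */
}

static int example_kill_then_resurrect(struct percpu_ref *ref)
{
	int ret = percpu_ref_init(ref, example_release, 0, GFP_KERNEL);

	if (ret)
		return ret;

	percpu_ref_kill(ref);	/* new percpu_ref_tryget_live() calls fail */

	/*
	 * percpu_ref_reinit() would trigger a WARN here unless the count
	 * had already dropped to zero; percpu_ref_resurrect() only
	 * requires that the ref is dead, so it may be called while earlier
	 * users still hold references.
	 */
	percpu_ref_resurrect(ref);
	return 0;
}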

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c   | 28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 009cdf3d65b6..b297cd1cd4f1 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -108,6 +108,7 @@ void percpu_ref_switch_to_atomic_sync(struct percpu_ref 
*ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
+void percpu_ref_resurrect(struct percpu_ref *ref);
 void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..17fe3e996ddc 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -356,11 +356,35 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  */
 void percpu_ref_reinit(struct percpu_ref *ref)
 {
+   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+
+   percpu_ref_resurrect(ref);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+
+/**
+ * percpu_ref_resurrect - modify a percpu refcount from dead to live
+ * @ref: perpcu_ref to resurrect
+ *
+ * Modify @ref so that it's in the same state as before percpu_ref_kill() was
+ * called. @ref must have be dead but not exited.
+ *
+ * If @ref->release() frees @ref then the caller is responsible for
+ * guaranteeing that @ref->release() does not get called while this
+ * function is in progress.
+ *
+ * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
+ * this function is in progress.
+ */
+void percpu_ref_resurrect(struct percpu_ref *ref)
+{
+   unsigned long __percpu *percpu_count;
unsigned long flags;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
-   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+   WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD));
+	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
 
ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
percpu_ref_get(ref);
@@ -368,4 +392,4 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v10 6/8] block: Allow unfreezing of a queue while requests are in progress

2018-09-21 Thread Bart Van Assche
A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
to zero. Make sure that this doesn't cause a kernel warning to appear
by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
former namely requires that the refcount it operates on is zero.
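
The sequence this prepares for looks roughly as follows (compare
blk_pre_runtime_suspend() in the patch above; this is a sketch with a
made-up function name, not the final code):

#include <linux/blk-mq.h>
#include <linux/percpu-refcount.h>

static bool example_queue_is_idle(struct request_queue *q)
{
	bool idle;

	blk_freeze_queue_start(q);
	/* Wait until q_usage_counter has switched to atomic mode. */
	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
	idle = percpu_ref_is_zero(&q->q_usage_counter);
	/*
	 * Unfreeze immediately, possibly while requests are still in
	 * flight - hence percpu_ref_resurrect() instead of
	 * percpu_ref_reinit() in blk_mq_unfreeze_queue().
	 */
	blk_mq_unfreeze_queue(q);

	return idle;
}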

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..96d501e8663c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -198,7 +198,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
-		percpu_ref_reinit(&q->q_usage_counter);
+		percpu_ref_resurrect(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
}
 }
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v10 4/8] block: Schedule runtime resume earlier

2018-09-21 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 3 ++-
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..fec135ae52cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -956,7 +956,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
 		   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (pm || !blk_queue_pm_only(q))) ||
+   (pm || (blk_pm_request_resume(q),
+			!blk_queue_pm_only(q)))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v10 0/8] blk-mq: Implement runtime power management

2018-09-21 Thread Bart Van Assche
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default is
implementing runtime power management support for blk-mq.  This patch series
not only implements runtime power management for blk-mq but also fixes a
starvation issue in the power management code for the legacy block
layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v9:
- Left out the patches that document the functions that iterate over requests
  and also the patch that introduces blk_mq_queue_rq_iter().
- Simplified blk_pre_runtime_suspend(): left out the check whether no requests
  are in progress.
- Fixed the race between blk_queue_enter(), queue freezing and runtime power
  management that Ming had identified.
- Added a new patch that introduces percpu_ref_resurrect().

Changes compared to v8:
- Fixed the race that was reported by Jianchao.
- Fixed another spelling issue in a source code comment.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block for
  non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate over
  requests and one that introduces a new function that iterates over all
  requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_QUIESCE.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine
  whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (8):
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  percpu-refcount: Introduce percpu_ref_resurrect()
  block: Allow unfreezing of a queue while requests are in progress
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig   |   3 +
 block/Makefile  |   1 +
 block/blk-core.c| 270 
 block/blk-mq-debugfs.c  |  10 +-
 block/blk-mq.c  |   4 +-
 block/blk-pm.c  | 216 +
 block/blk-pm.h  |  69 
 block/elevator.c|  22 +--
 drivers/scsi/scsi_lib.c |  11 +-
 drivers/scsi/scsi_pm.c  |   1 +
 drivers/scsi/sd.c   |   1 +
 drivers/scsi/sr.c   |   1 +
 include/linux/blk-pm.h  |  24 +++
 include/linux/blkdev.h  |  37 ++---
 include/linux/percpu-refcount.h |   1 +
 lib/percpu-refcount.c   |  28 +++-
 16 files changed, 401 insertions(+), 298 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.19.0.444.g18242da7ef-goog



Re: [PATCH v9 1/8] blk-mq: Document the functions that iterate over requests

2018-09-20 Thread Bart Van Assche
On Thu, 2018-09-20 at 09:35 +0200, Johannes Thumshirn wrote:
> On Wed, Sep 19, 2018 at 03:45:23PM -0700, Bart Van Assche wrote: 
> > +/*
> > + * Call function @fn(@hctx, rq, @data, @reserved) for each request
> > queued on
> > + * @hctx that has been assigned a driver tag. @reserved indicates
> > whether @bt
> > + * is the breserved_tags member or the bitmap_tags member of
> > struct
> > + * blk_mq_tags.
> > + */
> >  static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct
> > sbitmap_queue *bt,
> > busy_iter_fn *fn, void *data, bool
> > reserved)
> 
> One minor nit and only if you have to re-send the series, can you
> please also add the usual kernel-doc headers to the comments, i.e.:

OK, I will use the kernel-doc format.

> Otherwise:
> Reviewed-by: Johannes Thumshirn 

Thanks!

Bart.



[PATCH v9 6/8] block: Schedule runtime resume earlier

2018-09-19 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 2 ++
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..18b874d5c9c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
if (success)
return 0;
 
+   blk_pm_request_resume(q);
+
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.19.0.397.gdd90340f6a-goog



[PATCH v9 4/8] block, scsi: Change the preempt-only flag into a counter

2018-09-19 Thread Bart Van Assche
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c| 35 ++-
 block/blk-mq-debugfs.c  | 10 +-
 drivers/scsi/scsi_lib.c | 11 +++
 include/linux/blkdev.h  | 14 +-
 4 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d4dd176bd9d..1a691f5269bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-   return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	atomic_inc(&q->pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-   blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-	wake_up_all(&q->mq_freeze_wq);
+   int pm_only;
+
+	pm_only = atomic_dec_return(&q->pm_only);
+   WARN_ON_ONCE(pm_only < 0);
+   if (pm_only == 0)
+		wake_up_all(&q->mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+   const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
while (true) {
bool success = false;
@@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
rcu_read_lock();
 		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
-* The code that sets the PREEMPT_ONLY flag is
-* responsible for ensuring that that flag is globally
-* visible before the queue is unfrozen.
+* The code that increments the pm_only counter is
+* responsible for ensuring that that counter is
+* globally visible before the queue is unfrozen.
 */
-   if (preempt || !blk_queue_preempt_only(q)) {
+   if (pm || !blk_queue_pm_only(q)) {
success = true;
} else {
 			percpu_ref_put(&q->q_usage_counter);
@@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
 		   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
+   (pm || !blk_queue_pm_only(q))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const 
unsigned long flags,
return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+
+	seq_printf(m, "%d\n", atomic_read(&q->pm_only));
+   return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *con

[PATCH v9 8/8] blk-mq: Enable support for runtime power management

2018-09-19 Thread Bart Van Assche
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..20fdd78b75c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
 
+   blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 3ad5a7334baa..7c0ac1424f46 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.19.0.397.gdd90340f6a-goog



[PATCH v9 1/8] blk-mq: Document the functions that iterate over requests

2018-09-19 Thread Bart Van Assche
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix three minor spelling
mistakes in comments in these functions.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq-tag.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..cb5db0c3cc32 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,13 +232,19 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
 
/*
 * We can hit rq == NULL here, because the tagging functions
-* test and set the bit before assining ->rqs[].
+* test and set the bit before assigning ->rqs[].
 */
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
 }
 
+/*
+ * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on
+ * @hctx that has been assigned a driver tag. @reserved indicates whether @bt
+ * is the breserved_tags member or the bitmap_tags member of struct
+ * blk_mq_tags.
+ */
 static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
 {
@@ -280,6 +286,11 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
return true;
 }
 
+/*
+ * Call function @fn(rq, @data, @reserved) for each request in @tags that has
+ * been started. @reserved indicates whether @bt is the breserved_tags member
+ * or the bitmap_tags member of struct blk_mq_tags.
+ */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue 
*bt,
 busy_tag_iter_fn *fn, void *data, bool reserved)
 {
@@ -294,6 +305,10 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, 
struct sbitmap_queue *bt,
 	sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each started request in @tags. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -302,6 +317,10 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
*tags,
bt_tags_for_each(tags, >bitmap_tags, fn, priv, false);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request in @tagset. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -314,6 +333,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue it shares tags with and that has been assigned a
+ * driver tag. 'reserved' indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
@@ -323,7 +347,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
/*
 * __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
 * queue_hw_ctx after freeze the queue. So we could use q_usage_counter
-* to avoid race with it. __blk_mq_update_nr_hw_queues will users
+* to avoid race with it. __blk_mq_update_nr_hw_queues will use
 * synchronize_rcu to ensure all of the users go out of the critical
 * section below and see zeroed q_usage_counter.
 */
@@ -337,7 +361,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
struct blk_mq_tags *tags = hctx->tags;
 
/*
-* If not software queues are currently mapped to this
+* If no software queues are currently mapped to this
 * hardware queue, there's nothing to check
 */
if (!blk_mq_hw_queue_mapped(hctx))
-- 
2.19.0.397.gdd90340f6a-goog



[PATCH v9 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-19 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 +-
 block/blk-pm.c   | 81 +---
 2 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b874d5c9c9..ae092ca121d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, >queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..3ad5a7334baa 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -40,6 +43,34 @@ void blk_pm_runtime_init(struct request_queue *q, struct 
device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+struct in_flight_data {
+   struct request_queue*q;
+   int in_flight;
+};
+
+static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq,
+   void *priv, bool reserved)
+{
+   struct in_flight_data *in_flight = priv;
+
+   if (rq->q == in_flight->q)
+   in_flight->in_flight++;
+}
+
+/*
+ * Count the number of requests that are in flight for request queue @q. Use
+ * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
+ * queues.
+ */
+static int blk_requests_in_flight(struct request_queue *q)
+{
+   struct in_flight_data in_flight = { .q = q };
+
+   if (q->mq_ops)
+		blk_mq_queue_rq_iter(q, blk_count_in_flight, &in_flight);
+   return q->nr_pending + in_flight.in_flight;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -68,14 +99,49 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   /*
+* Increase the pm_only counter before checking whether any
+* non-PM blk_queue_enter() calls are in progress to avoid that any
+* new non-PM blk_queue_enter() calls succeed before the pm_only
+* counter is decreased again.
+*/
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent request activity
+* occurred at least autosuspend_delay_ms ago. Since blk_queue_enter()
+* is called by the request allocation code before
+* pm_request_resume(), if no requests have a tag assigned it is safe
+* to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   blk_freeze_queue_start(q);
+   /*
+* Freezing a queue starts a transition of the queue
+* usag

[PATCH v9 0/8] blk-mq: Implement runtime power management

2018-09-19 Thread Bart Van Assche
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default is
implementing runtime power management support for blk-mq.  This patch series
not only implements runtime power management for blk-mq but also fixes a
starvation issue in the power management code for the legacy block
layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v8:
- Fixed the race that was reported by Jianchao and Ming.
- Fixed another spelling issue in a source code comment.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block for
  non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate over
  requests and one that introduces a new function that iterates over all
  requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_QUIESCE.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine
  whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (8):
  blk-mq: Document the functions that iterate over requests
  blk-mq: Introduce blk_mq_queue_rq_iter()
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig   |   3 +
 block/Makefile  |   1 +
 block/blk-core.c| 271 +---
 block/blk-mq-debugfs.c  |  10 +-
 block/blk-mq-tag.c  |  74 ++-
 block/blk-mq-tag.h  |   2 +
 block/blk-mq.c  |   2 +
 block/blk-pm.c  | 253 +
 block/blk-pm.h  |  69 ++
 block/elevator.c|  22 +---
 drivers/scsi/scsi_lib.c |  11 +-
 drivers/scsi/scsi_pm.c  |   1 +
 drivers/scsi/sd.c   |   1 +
 drivers/scsi/sr.c   |   1 +
 include/linux/blk-pm.h  |  24 
 include/linux/blkdev.h  |  37 ++
 16 files changed, 484 insertions(+), 298 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.19.0.397.gdd90340f6a-goog



[PATCH v9 2/8] blk-mq: Introduce blk_mq_queue_rq_iter()

2018-09-19 Thread Bart Van Assche
This function will be used in the patch "Make blk_get_request() block for
non-PM requests while suspended".

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq-tag.c | 44 
 block/blk-mq-tag.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cb5db0c3cc32..f95b41b5f07a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -374,6 +374,50 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
rcu_read_unlock();
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue that it shares tags with and that has been assigned a
+ * tag. 'reserved' indicates whether or not 'rq' is a reserved request. In
+ * contrast to blk_mq_queue_tag_busy_iter(), if an I/O scheduler has been
+ * associated with @q, this function also iterates over requests that have
+ * been assigned a scheduler tag but that have not yet been assigned a driver
+ * tag.
+ */
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, void 
*priv)
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   /*
+* __blk_mq_update_nr_hw_queues() will update nr_hw_queues and
+* queue_hw_ctx after having frozen the request queue. So we can use
+* q_usage_counter to avoid a race with that
+* function. __blk_mq_update_nr_hw_queues() uses synchronize_rcu() to
+* ensure that this function leaves the critical section below.
+*/
+   rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+   rcu_read_unlock();
+   return;
+   }
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   struct blk_mq_tags *tags = hctx->sched_tags ? : hctx->tags;
+
+   /*
+* If no software queues are currently mapped to this
+* hardware queue, there's nothing to check
+*/
+   if (!blk_mq_hw_queue_mapped(hctx))
+   continue;
+
+   if (tags->nr_reserved_tags)
+			bt_for_each(hctx, &tags->breserved_tags, fn, priv, 
true);
+		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+   }
+   rcu_read_unlock();
+}
+
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
bool round_robin, int node)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..25e62997ed6c 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn,
+ void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 struct blk_mq_hw_ctx *hctx)
-- 
2.19.0.397.gdd90340f6a-goog



Re: [PATCH v8 6/8] block: Schedule runtime resume earlier

2018-09-19 Thread Bart Van Assche
On Wed, 2018-09-19 at 12:05 +0800, Ming Lei wrote:
> Looks this patch may introduce the following race between queue
> freeze and
> runtime suspend:
> 
> --------------------------------------------------------------------------
> CPU0                      CPU1                       CPU2
> --------------------------------------------------------------------------
> blk_freeze_queue()
>                           blk_mq_alloc_request()
>                           blk_queue_enter()
>                           blk_pm_request_resume()
>                           wait_event()
>
>                                                      blk_pre_runtime_suspend()
>                                                      ->blk_set_pm_only
>                                                      ...
>                                                      blk_post_runtime_suspend()
>
> ...
> blk_mq_unfreeze_queue()
> --------------------------------------------------------------------------
> - CPU0: queue is frozen
> 
> - CPU1: one new request comes, and see queue is frozen, but queue isn't
> runtime-suspended yet, then blk_pm_request_resume() does nothing. So this
> allocation is blocked in wait_event().
> 
> - CPU2: runtime suspend comes, and queue becomes runtime-suspended now
> 
> - CPU0: queue is unfreeze, but the new request allocation in CPU1 may never
> be done because the queue is runtime suspended, and wait_event() won't
> return. And the expected result is that the queue becomes active and the
> allocation on CPU1 is done immediately.

Hello Ming,

Just like for the scenario Jianchao reported, I will address this by
only allowing the suspend to proceed if q_usage_counter == 0.

Bart.



Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-19 Thread Bart Van Assche
On Wed, 2018-09-19 at 10:21 +0800, jianchao.wang wrote:
> On 09/19/2018 01:44 AM, Bart Van Assche wrote:
> > There is only one blk_pm_request_resume() call and that call is
> > inside blk_queue_enter() after the pm_only counter has been
> > checked.
> > 
> > For the legacy block layer, nr_pending is increased after the
> > blk_queue_enter() call from inside blk_old_get_request() succeeded.
> > 
> > So I don't see how blk_pm_request_resume() or q->nr_pending++ could
> > escape from the preempt checking?
> 
> For example:
> 
> 
> blk_pre_runtime_suspend              generic_make_request
>                                        blk_queue_enter  // success here,
>                                                         // no blk_pm_request_resume here
>                                        blk_mq_make_request
> blk_set_pm_only
> 
> if (blk_requests_in_flight(q) == 0) {
>   synchronize_rcu();
>   if (blk_requests_in_flight(q) == 0)
>     ret = 0;
> }
>                                        blk_mq_get_request

Hello Jianchao,

I think you are right. I will add a q_usage_counter check before
setting 'ret' to zero.

Bart.


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Bart Van Assche

On 9/18/18 4:24 PM, Omar Sandoval wrote:

On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:

Can you have a look at the updated master branch of
https://github.com/bvanassche/blktests? That code should no longer fail if
unloading the nvme kernel module fails. Please note that you will need
kernel v4.18 to test these scripts - a KASAN complaint appears if I run
these tests against kernel v4.19-rc4.


Thanks, these pass now. Is it expected that my nvme device gets a
multipath device configured after running these tests?

$ lsblk
NAME     MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
vda      254:0    0  16G  0 disk
└─vda1   254:1    0  16G  0 part  /
vdb      254:16   0   8G  0 disk
vdc      254:32   0   8G  0 disk
vdd      254:48   0   8G  0 disk
nvme0n1  259:0    0   8G  0 disk
└─mpatha 253:0    0   8G  0 mpath


No, all multipath devices that were created during a test should be 
removed before that test finishes. I will look into this.



Also, can you please fix:

_have_kernel_option NVME_MULTIPATH && exit 1

to not exit on failure? It should use SKIP_REASON and return 1. You
might need to add something like _dont_have_kernel_option to properly
handle the case where the config is not found.


OK, I will change this.


Side note which isn't a blocker for merging is that there's a lot of
duplicated code between these helpers and the srp helpers. How hard
would it be to refactor that?


Are you perhaps referring to the code that is shared between the 
tests/srp/rc and tests/nvmeof-mp/rc shell scripts? The hardest part is 
probably to choose a location where to store these functions. Should I 
create a file with common code under common/, under tests/srp/, under 
tests/nvmeof-mp/ or perhaps somewhere else?


Thanks,

Bart.




Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Bart Van Assche

On 8/23/18 5:21 PM, Omar Sandoval wrote:

On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:

On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:

On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:

Moving these tests into the nvme directory is possible but will make it
harder to run the NVMeOF multipath tests separately. Are you fine with this?


Both ways have their ups and downsides, I agree.

Having two distinct groups requires running './check nvme nvmeof-mp' to
get full coverage with nvme.

Having it all in one group would require running './check nvme 18 19 20
21 22 23 24 ...' to get only the dm-mpath ones.

Honestly I hate both, but yours (the two distinct groups) is probably
easier to handle in the end, I have to admit.


Omar, do you have a preference for one of the two aforementioned approaches?


Let's keep it in a separate category, since lots of people running nvme
tests probably aren't interested in testing multipath.

A bunch of the tests failed with

modprobe: FATAL: Module nvme is in use.

Maybe related to my test VM having an nvme device?


Hello Omar,

Can you have a look at the updated master branch of 
https://github.com/bvanassche/blktests? That code should no longer fail 
if unloading the nvme kernel module fails. Please note that you will 
need kernel v4.18 to test these scripts - a KASAN complaint appears if I 
run these tests against kernel v4.19-rc4.


Thanks,

Bart.



[PATCH v8 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 ++---
 block/blk-pm.c   | 70 
 2 files changed, 73 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b874d5c9c9..ae092ca121d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, >queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..dc9bc45b0db5 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -40,6 +43,34 @@ void blk_pm_runtime_init(struct request_queue *q, struct 
device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+struct in_flight_data {
+   struct request_queue*q;
+   int in_flight;
+};
+
+static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq,
+   void *priv, bool reserved)
+{
+   struct in_flight_data *in_flight = priv;
+
+   if (rq->q == in_flight->q)
+   in_flight->in_flight++;
+}
+
+/*
+ * Count the number of requests that are in flight for request queue @q. Use
+ * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
+ * queues.
+ */
+static int blk_requests_in_flight(struct request_queue *q)
+{
+   struct in_flight_data in_flight = { .q = q };
+
+   if (q->mq_ops)
+   blk_mq_queue_rq_iter(q, blk_count_in_flight, _flight);
+   return q->nr_pending + in_flight.in_flight;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -68,14 +99,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent request activity
+* occurred at least autosuspend_delay_ms ago. Since blk_queue_enter()
+* is called by the request allocation code before
+* pm_request_resume(), if no requests have a tag assigned it is safe
+* to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   /*
+* Call synchronize_rcu() such that later blk_queue_enter()
+* calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   synchronize_rcu();
+   if (blk_requests_in_flight(q) == 0)
+   ret = 0;
+   }
+
spin_lock_irq(q->queue_lock);

[PATCH v8 8/8] blk-mq: Enable support for runtime power management

2018-09-18 Thread Bart Van Assche
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..20fdd78b75c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
 
+   blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index dc9bc45b0db5..2a7a9c0f5b99 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.18.0
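
Usage note (a sketch, not part of the patch): with runtime PM enabled for
blk-mq, a request-based driver still wires its callbacks through the existing
blk_{pre,post}_runtime_{suspend,resume}() helpers, roughly as below. The
example_* helpers and the 5 second autosuspend delay are hypothetical;
blk_pm_runtime_init() itself leaves the autosuspend delay at -1, so either
the driver or user space has to pick a real delay before the queue can
runtime suspend.

#include <linux/blk-pm.h>
#include <linux/blkdev.h>
#include <linux/pm_runtime.h>

/* Hypothetical driver-specific helpers: */
struct request_queue *example_dev_to_queue(struct device *dev);
int example_power_down_hardware(struct device *dev);
int example_power_up_hardware(struct device *dev);

static void example_setup_runtime_pm(struct request_queue *q,
				     struct device *dev)
{
	blk_pm_runtime_init(q, dev);
	pm_runtime_set_autosuspend_delay(dev, 5000);	/* example: 5 s idle */
	/* Drop a usage count reference; the exact balance is subsystem-specific. */
	pm_runtime_put_autosuspend(dev);
}

static int example_runtime_suspend(struct device *dev)
{
	struct request_queue *q = example_dev_to_queue(dev);
	int err;

	err = blk_pre_runtime_suspend(q);
	if (err)
		return err;
	err = example_power_down_hardware(dev);
	blk_post_runtime_suspend(q, err);
	return err;
}

static int example_runtime_resume(struct device *dev)
{
	struct request_queue *q = example_dev_to_queue(dev);
	int err;

	blk_pre_runtime_resume(q);
	err = example_power_up_hardware(dev);
	blk_post_runtime_resume(q, err);
	return err;
}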



[PATCH v8 6/8] block: Schedule runtime resume earlier

2018-09-18 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 2 ++
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..18b874d5c9c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
if (success)
return 0;
 
+   blk_pm_request_resume(q);
+
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.18.0



[PATCH v8 3/8] block: Move power management code into a new source file

2018-09-18 Thread Bart Van Assche
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/Kconfig  |   3 +
 block/Makefile |   1 +
 block/blk-core.c   | 196 +
 block/blk-pm.c | 188 +++
 block/blk-pm.h |  43 +
 block/elevator.c   |  22 +
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c  |   1 +
 drivers/scsi/sr.c  |   1 +
 include/linux/blk-pm.h |  24 +
 include/linux/blkdev.h |  23 -
 11 files changed, 264 insertions(+), 239 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..85263e7bded6 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,7 @@ config BLK_MQ_RDMA
depends on BLOCK && INFINIBAND
default y
 
+config BLK_PM
+   def_bool BLOCK && PM
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_PM)   += blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..6d4dd176bd9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1726,16 +1727,6 @@ void part_round_stats(struct request_queue *q, int cpu, 
struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
req_flags_t rq_flags = req->rq_flags;
@@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *Initialize runtime-PM-related fields for @q and start auto suspend for
- *@dev. Drivers that want to take advantage of request-based runtime PM
- *should call this function after @dev has been initialized, and its
- *request queue @q has been allocated, and runtime PM for it can not happen
- *yet(either due to disabled/forbidden or its usage_count > 0). In most
- *cases, driver should call this function before any I/O has taken place.
- *
- *This function takes care of setting up using auto suspend for the device,
- *the autosuspend delay is set to -1 to make runtime suspend impossible
- *until an updated value is either set by user or by driver. Drivers do
- *not need to touch other autosuspend settings.
- *
- *The block layer runtime PM is request based, so only works for drivers
- *that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
-   q->dev = dev;
-   q->rpm_status = RPM_ACTIVE;
-   pm_runtime_set_autosuspend_delay(q->dev, -1);
-   pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *This function will check if runtime suspend is allowed for the device
- *by examining if there are any requests pending in the queue. If there
- *are requests pending, the device can not be runtime suspended; otherwise,
- *the queue's status will be updated to SUSPENDING and the driver can
- *proceed to s

[PATCH v8 1/8] blk-mq: Document the functions that iterate over requests

2018-09-18 Thread Bart Van Assche
Make it easier to understand the functions that iterate over requests by
documenting their purpose. Fix two minor spelling
mistakes in comments in these functions.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq-tag.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..ef3acb4a80e0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,13 +232,19 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
 
/*
 * We can hit rq == NULL here, because the tagging functions
-* test and set the bit before assining ->rqs[].
+* test and set the bit before assigning ->rqs[].
 */
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
 }
 
+/*
+ * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on
+ * @hctx that has been assigned a driver tag. @reserved indicates whether @bt
+ * is the breserved_tags member or the bitmap_tags member of struct
+ * blk_mq_tags.
+ */
 static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
 {
@@ -280,6 +286,11 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
return true;
 }
 
+/*
+ * Call function @fn(rq, @data, @reserved) for each request in @tags that has
+ * been started. @reserved indicates whether @bt is the breserved_tags member
+ * or the bitmap_tags member of struct blk_mq_tags.
+ */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue 
*bt,
 busy_tag_iter_fn *fn, void *data, bool reserved)
 {
@@ -294,6 +305,10 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, 
struct sbitmap_queue *bt,
sbitmap_for_each_set(>sb, bt_tags_iter, _data);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each started request in @tags. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -302,6 +317,10 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
*tags,
bt_tags_for_each(tags, >bitmap_tags, fn, priv, false);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request in @tagset. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -314,6 +333,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue it shares tags with and that has been assigned a
+ * driver tag. 'reserved' indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
@@ -337,7 +361,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
struct blk_mq_tags *tags = hctx->tags;
 
/*
-* If not software queues are currently mapped to this
+* If no software queues are currently mapped to this
 * hardware queue, there's nothing to check
 */
if (!blk_mq_hw_queue_mapped(hctx))
-- 
2.18.0



[PATCH v8 5/8] block: Split blk_pm_add_request() and blk_pm_put_request()

2018-09-18 Thread Bart Van Assche
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c |  1 +
 block/blk-pm.h   | 36 +++-
 block/elevator.c |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1a691f5269bb..fd91e9bf2893 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
 
blk_req_zone_write_unlock(req);
blk_pm_put_request(req);
+   blk_pm_mark_last_busy(req);
 
elv_completed_request(q, req);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..a8564ea72a41 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,8 +6,23 @@
 #include 
 
 #ifdef CONFIG_PM
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+  q->rpm_status == RPM_SUSPENDING))
+   pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   pm_runtime_mark_last_busy(rq->q->dev);
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
+   lockdep_assert_held(rq->q->queue_lock);
+
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
 }
@@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request 
*rq)
 static inline void blk_pm_add_request(struct request_queue *q,
  struct request *rq)
 {
-   if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-   (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-   pm_request_resume(q->dev);
+   lockdep_assert_held(q->queue_lock);
+
+   if (q->dev && !(rq->rq_flags & RQF_PM))
+   q->nr_pending++;
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
+   lockdep_assert_held(rq->q->queue_lock);
+
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   --rq->q->nr_pending;
 }
 #else
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..1c992bf6cfb1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
+   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.18.0



[PATCH v8 4/8] block, scsi: Change the preempt-only flag into a counter

2018-09-18 Thread Bart Van Assche
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c| 35 ++-
 block/blk-mq-debugfs.c  | 10 +-
 drivers/scsi/scsi_lib.c | 11 +++
 include/linux/blkdev.h  | 14 +-
 4 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d4dd176bd9d..1a691f5269bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-   return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   atomic_inc(>pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-   blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(>mq_freeze_wq);
+   int pm_only;
+
+   pm_only = atomic_dec_return(>pm_only);
+   WARN_ON_ONCE(pm_only < 0);
+   if (pm_only == 0)
+   wake_up_all(>mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+   const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
while (true) {
bool success = false;
@@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
rcu_read_lock();
if (percpu_ref_tryget_live(>q_usage_counter)) {
/*
-* The code that sets the PREEMPT_ONLY flag is
-* responsible for ensuring that that flag is globally
-* visible before the queue is unfrozen.
+* The code that increments the pm_only counter is
+* responsible for ensuring that that counter is
+* globally visible before the queue is unfrozen.
 */
-   if (preempt || !blk_queue_preempt_only(q)) {
+   if (pm || !blk_queue_pm_only(q)) {
success = true;
} else {
percpu_ref_put(>q_usage_counter);
@@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
   (atomic_read(>mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
+   (pm || !blk_queue_pm_only(q))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const 
unsigned long flags,
return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+
+   seq_printf(m, "%d\n", atomic_read(>pm_only));
+   return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *con

[PATCH v8 2/8] blk-mq: Introduce blk_mq_queue_rq_iter()

2018-09-18 Thread Bart Van Assche
This function will be used in the patch "Make blk_get_request() block for
non-PM requests while suspended".

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq-tag.c | 44 
 block/blk-mq-tag.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index ef3acb4a80e0..cf8537017f78 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -374,6 +374,50 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
rcu_read_unlock();
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue that it shares tags with and that has been assigned a
+ * tag. 'reserved' indicates whether or not 'rq' is a reserved request. In
+ * contrast to blk_mq_queue_tag_busy_iter(), if an I/O scheduler has been
+ * associated with @q, this function also iterates over requests that have
+ * been assigned a scheduler tag but that have not yet been assigned a driver
+ * tag.
+ */
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, void 
*priv)
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   /*
+* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
+* queue_hw_ctx after freeze the queue. So we could use q_usage_counter
+* to avoid race with it. __blk_mq_update_nr_hw_queues will users
+* synchronize_rcu to ensure all of the users go out of the critical
+* section below and see zeroed q_usage_counter.
+*/
+   rcu_read_lock();
+   if (percpu_ref_is_zero(>q_usage_counter)) {
+   rcu_read_unlock();
+   return;
+   }
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   struct blk_mq_tags *tags = hctx->sched_tags ? : hctx->tags;
+
+   /*
+* If no software queues are currently mapped to this
+* hardware queue, there's nothing to check
+*/
+   if (!blk_mq_hw_queue_mapped(hctx))
+   continue;
+
+   if (tags->nr_reserved_tags)
+   bt_for_each(hctx, >breserved_tags, fn, priv, 
true);
+   bt_for_each(hctx, >bitmap_tags, fn, priv, false);
+   }
+   rcu_read_unlock();
+}
+
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
bool round_robin, int node)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..25e62997ed6c 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn,
+ void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 struct blk_mq_hw_ctx *hctx)
-- 
2.18.0



[PATCH v8 0/8] blk-mq: Implement runtime power management

2018-09-18 Thread Bart Van Assche
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default is
implementing runtime power management support for blk-mq.  This patch series
not only implements runtime power management for blk-mq but also fixes a
starvation issue in the power management code for the legacy block
layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block for
  non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate over
  requests and one that introduces a new function that iterates over all
  requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_SUSPENDED.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine
  whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_ref_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (8):
  blk-mq: Document the functions that iterate over requests
  blk-mq: Introduce blk_mq_queue_rq_iter()
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig   |   3 +
 block/Makefile  |   1 +
 block/blk-core.c| 271 +---
 block/blk-mq-debugfs.c  |  10 +-
 block/blk-mq-tag.c  |  72 ++-
 block/blk-mq-tag.h  |   2 +
 block/blk-mq.c  |   2 +
 block/blk-pm.c  | 242 +++
 block/blk-pm.h  |  69 ++
 block/elevator.c|  22 +---
 drivers/scsi/scsi_lib.c |  11 +-
 drivers/scsi/scsi_pm.c  |   1 +
 drivers/scsi/sd.c   |   1 +
 drivers/scsi/sr.c   |   1 +
 include/linux/blk-pm.h  |  24 
 include/linux/blkdev.h  |  37 ++
 16 files changed, 472 insertions(+), 297 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0



Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread Bart Van Assche

On 9/17/18 7:39 PM, jianchao.wang wrote:

On 09/18/2018 04:15 AM, Bart Van Assche wrote:

Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
Instead of maintaining the q->nr_pending counter, rely on
q->q_usage_counter.


Looks like we still depend on this nr_pending for blk-legacy.


That's right. I will update the commit message.


blk_mq_queue_tag_busy_iter() only accounts for the driver tags. This will only
work without an I/O scheduler.


+
  /**
   * blk_pre_runtime_suspend - Pre runtime suspend check
   * @q: the queue of the device
@@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
  
+	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent
+* pm_request_resume() call occurred at least autosuspend_delay_ms

^^^
pm_runtime_mark_last_busy ?


Since every pm_request_resume() call from the block layer is followed by 
a pm_runtime_mark_last_busy() call, and since the latter is called later, 
I think you are right. I will update the comment.



+* ago. Since blk_queue_enter() is called by the request allocation
+* code before pm_request_resume(), if no requests have a tag assigned
+* it is safe to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   /*
+* Call synchronize_rcu() such that later blk_queue_enter()
+* calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   synchronize_rcu();
+   if (blk_requests_in_flight(q) == 0)


This does not seem safe here.

For the blk-mq path:
someone may have escaped the preempt checking, missed the
blk_pm_request_resume() there, and entered generic_make_request(), but not
yet allocated a request or occupied any tag.

There could be a similar scenario for the blk-legacy path, where
q->nr_pending is only increased when the request is queued.

So I guess the q_usage_counter checking is still needed here.


There is only one blk_pm_request_resume() call and that call is inside 
blk_queue_enter() after the pm_only counter has been checked.


For the legacy block layer, nr_pending is increased after the 
blk_queue_enter() call from inside blk_old_get_request() succeeded.


So I don't see how blk_pm_request_resume() or q->nr_pending++ could 
escape from the preempt checking?


Thanks,

Bart.


Re: [PATCH v7 0/6] blk-mq: Implement runtime power management

2018-09-17 Thread Bart Van Assche

On 9/17/18 1:11 PM, Bart Van Assche wrote:

[ ... ]


Please ignore this e-mail thread - it contains a mixup of two versions 
of this patch series.


Bart.



[PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-17 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
Instead of maintaining the q->nr_pending counter, rely on
q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 ++---
 block/blk-pm.c   | 72 
 2 files changed, 75 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b874d5c9c9..ae092ca121d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, >queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..5f21bedcb307 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -40,6 +43,36 @@ void blk_pm_runtime_init(struct request_queue *q, struct 
device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+struct in_flight_data {
+   struct request_queue*q;
+   int in_flight;
+};
+
+static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq,
+   void *priv, bool reserved)
+{
+   struct in_flight_data *in_flight = priv;
+
+   if (rq->q == in_flight->q)
+   in_flight->in_flight++;
+}
+
+/*
+ * Count the number of requests that are in flight for request queue @q. Use
+ * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
+ * queues.  Use blk_mq_queue_tag_busy_iter() instead of
+ * blk_mq_tagset_busy_iter() because the latter only considers requests that
+ * have already been started.
+ */
+static int blk_requests_in_flight(struct request_queue *q)
+{
+   struct in_flight_data in_flight = { .q = q };
+
+   if (q->mq_ops)
+   blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, _flight);
+   return q->nr_pending + in_flight.in_flight;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent
+* pm_request_resume() call occurred at least autosuspend_delay_ms
+* ago. Since blk_queue_enter() is called by the request allocation
+* code before pm_request_resume(), if no requests have a tag assigned
+* it is safe to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   /*
+* Call synchronize_rcu() such that later blk_queue_enter()
+* calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   synchronize_rcu();
+  

[PATCH v7 4/6] block: Schedule runtime resume earlier

2018-09-17 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 2 ++
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..18b874d5c9c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
if (success)
return 0;
 
+   blk_pm_request_resume(q);
+
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.18.0


