Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
On Fri, 2018-12-07 at 09:35 -0700, Jens Axboe wrote:
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 29bfe8017a2d..9e5bda8800f8 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -377,6 +377,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>
> 	WARN_ON(e && (rq->tag != -1));
>
> +	/*
> +	 * It's illegal to insert a request into the scheduler that has
> +	 * been through ->queue_rq(). Warn for that case, and use a bypass
> +	 * insert to be safe.
> +	 */

Shouldn't this refer to requests that have been prepared instead of requests
that have been through ->queue_rq()? I think this function is called for
requests that are requeued. Requeued requests have been through ->queue_rq()
but are unprepared before being requeued.

Thanks,

Bart.
Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
On Thu, 2018-12-06 at 22:17 -0700, Jens Axboe wrote: > Instead of making special cases for what we can direct issue, and now > having to deal with DM solving the livelock while still retaining a BUSY > condition feedback loop, always just add a request that has been through > ->queue_rq() to the hardware queue dispatch list. These are safe to use > as no merging can take place there. Additionally, if requests do have > prepped data from drivers, we aren't dependent on them not sharing space > in the request structure to safely add them to the IO scheduler lists. How about making blk_mq_sched_insert_request() complain if a request is passed to it in which the RQF_DONTPREP flag has been set to avoid that this problem is reintroduced in the future? Otherwise this patch looks fine to me. Bart.
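The suggested guard can be sketched in userspace as follows. This is a minimal model of the idea, not kernel code: the flag value, the struct layout and the function names are illustrative, and the kernel's real check would be a WARN_ON_ONCE() inside blk_mq_sched_insert_request().

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative stand-in for the kernel's RQF_DONTPREP; the real value
 * lives in include/linux/blkdev.h. */
#define RQF_DONTPREP (1u << 7)

enum insert_path { INSERT_SCHEDULER, INSERT_BYPASS };

struct request {
	unsigned int rq_flags;
};

/* Model of the proposed check: a request that has already been prepared
 * by the driver (RQF_DONTPREP set) must not go through the I/O
 * scheduler; warn and fall back to a bypass insert instead. */
static enum insert_path sched_insert_path(const struct request *rq)
{
	if (rq->rq_flags & RQF_DONTPREP) {
		fprintf(stderr, "WARN: prepped request inserted into scheduler\n");
		return INSERT_BYPASS;
	}
	return INSERT_SCHEDULER;
}
```

A prepped request is routed to the bypass path, everything else goes to the scheduler as before.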
Re: [GIT PULL] Follow up block fix
On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote: > On 12/6/18 5:20 PM, Bart Van Assche wrote: > > Which branch does that tag correspond to? Even after having run git fetch > > --tags I can't find that tag in your repository. > > I pushed it before I sent the email, where are you looking? > > http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206 On my development system git has been configured to fetch from the following repository: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ Should I fetch your branches from the git.kernel.dk repository instead? Thanks, Bart.
Re: [GIT PULL] Follow up block fix
On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote: > Just a single followup fix to the corruption fix from yesterday. We have > an exported interface that does direct dispatch, with DM being the sole > user of it. Change that to do bypass insert always instead of attempting > direct dispatch. This fixes a deadlock that can happen on DM. > Additionally, it gets rid of any exported user of direct dispatch, and > enables DM to drop any BUSY handling from using the interface. > > Please pull! > > > git://git.kernel.dk/linux-block.git tags/for-linus-20181206 Hi Jens, Which branch does that tag correspond to? Even after having run git fetch --tags I can't find that tag in your repository. Additionally, the patch on your for-linus branch is missing the "Cc: stable" and "Reported-by:" tags that you had promised to add. Did I look at the wrong branch? Thanks, Bart.
Re: [PATCH] block: fix direct dispatch issue failure for clones
On Thu, 2018-12-06 at 15:21 -0700, Jens Axboe wrote: > On 12/6/18 3:20 PM, Jens Axboe wrote: > > After the direct dispatch corruption fix, we permanently disallow direct > > dispatch of non read/write requests. This works fine off the normal IO > > path, as they will be retried like any other failed direct dispatch > > request. But for the blk_insert_cloned_request() that only DM uses to > > bypass the bottom level scheduler, we always first attempt direct > > dispatch. For some types of requests, that's now a permanent failure, > > and no amount of retrying will make that succeed. > > > > Don't use direct dispatch off the cloned insert path, always just use > > bypass inserts. This still bypasses the bottom level scheduler, which is > > what DM wants. > > > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > > Signed-off-by: Jens Axboe > > Bart, I'll add your reported-by here of course, and also a stable CC > since the original patch went into stable. Feel free to add the following: Tested-by: Bart Van Assche
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
> If I merge Jens' for-next branch with Linus' master branch, boot the
> resulting kernel in a VM and run blktests/tests/srp/002 then that test
> never finishes. The same test passes against Linus' master branch. I
> think this is a regression. The following appears in the system log if
> I run that test:
>
> Call Trace:
> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
> Call Trace:
> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2151 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2154 blocked for more than 120 seconds.

Hi Jens,

My test results so far are as follows:
* With kernel v4.20-rc5 test srp/002 passes.
* With your for-next branch test srp/002 reports the symptoms reported in
  my e-mail.
* With Linus' master branch from this morning test srp/002 fails in the
  same way as your for-next branch.
* Also with Linus' master branch, test srp/002 passes if I revert the
  following commit: ffe81d45322c ("blk-mq: fix corruption with direct
  issue").

So it seems like that commit fixed one regression but introduced another
regression.

Bart.
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 11:12 -0700, Jens Axboe wrote:
> On 12/6/18 11:10 AM, Bart Van Assche wrote:
> > On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> > > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > > > which would result in a non-zero exit, which should be expected for
> > > > this test?
> > >
> > > Test srp/002 simulates network failures while running fio on top of
> > > dm-mpath. Since queue_if_no_path is enabled in multipath.conf dm-mpath
> > > will keep retrying the block layer requests it receives if these
> > > requests fail due to path removal. If fio reports "io_u error" for
> > > this test then that means that the test failed. I haven't seen fio
> > > reporting any I/O errors for this test with any upstream kernel that
> > > I tested in the past year or so.
> >
> > Hi Jens,
> >
> > Please also verify that the version of multipath-tools that you are
> > running includes the following patch:
>
> I installed from apt... Says this:
>
> # apt show multipath-tools
> Package: multipath-tools
> Version: 0.6.4-5+deb9u1
>
> I can run a self-compiled one, if this one doesn't work.

Please do. I use the following script to build and install multipath-tools
over the Debian binaries:

export LIB=/lib
apt-get install -y libaio-dev libdevmapper-dev libjson-c-dev librados-dev libreadline-dev libsystemd-dev liburcu-dev
make -s
make -s install

Bart.
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > which would result in a non-zero exit, which should be expected for this
> > test?
>
> Test srp/002 simulates network failures while running fio on top of
> dm-mpath. Since queue_if_no_path is enabled in multipath.conf dm-mpath
> will keep retrying the block layer requests it receives if these requests
> fail due to path removal. If fio reports "io_u error" for this test then
> that means that the test failed. I haven't seen fio reporting any I/O
> errors for this test with any upstream kernel that I tested in the past
> year or so.

Hi Jens,

Please also verify that the version of multipath-tools that you are running
includes the following patch:

From a2675025ae9f652b005345b9082f5042b32c992c Mon Sep 17 00:00:00 2001
From: Bart Van Assche
Date: Fri, 18 Nov 2016 13:33:44 -0800
Subject: [PATCH] Avoid that reloading a map sporadically triggers I/O errors

Avoid that reloading a map while there are no paths triggers a flush and
hence unwanted I/O errors if 'queue_if_no_path' is enabled.

Fixes: commit d569988e7528 ("libmultipath: Fixup 'DM_DEVICE_RELOAD' handling")
Signed-off-by: Bart Van Assche
---
 libmultipath/devmapper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
index 9b6b0537ba6d..1576dd017d6a 100644
--- a/libmultipath/devmapper.c
+++ b/libmultipath/devmapper.c
@@ -392,7 +392,7 @@ int dm_addmap_reload(struct multipath *mpp, char *params, int flush)
 				params, ADDMAP_RO, SKIP_KPARTX_OFF);
 	}
 	if (r)
-		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, flush,
+		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
 				 1, udev_flags, 0);
 	return r;
 }
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: > which would result in a non-zero exit, which should be expected for this > test? Hi Jens, Test srp/002 simulates network failures while running fio on top of dm-mpath. Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep retrying the block layer requests it receives if these requests fail due to path removal. If fio reports "io_u error" for this test then that means that the test failed. I haven't seen fio reporting any I/O errors for this test with any upstream kernel that I tested in the past year or so. Bart.
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote:
> On 12/6/18 9:47 AM, Bart Van Assche wrote:
> > If I merge Jens' for-next branch with Linus' master branch, boot the
> > resulting kernel in a VM and run blktests/tests/srp/002 then that test
> > never finishes. The same test passes against Linus' master branch. I
> > think this is a regression. The following appears in the system log if
> > I run that test:
>
> You are running that test on a dm device? Can you shed some light on
> how that dm device is setup?

Hi Jens,

The dm device referred to in my e-mail is a dm-mpath device created by test
srp/002. All parameters of that device are under control of that test script.

From the srp/002 test script:

DESCRIPTION="File I/O on top of multipath concurrently with logout and login (mq)"

From srp/multipath.conf:

defaults {
	find_multipaths		no
	user_friendly_names	yes
	queue_without_daemon	no
}

devices {
	device {
		vendor		"LIO-ORG|SCST_BIO|FUSIONIO"
		product		".*"
		features	"1 queue_if_no_path"
		path_checker	tur
	}
}

Bart.
for-next branch and blktests/srp
Hello,

If I merge Jens' for-next branch with Linus' master branch, boot the
resulting kernel in a VM and run blktests/tests/srp/002 then that test
never finishes. The same test passes against Linus' master branch. I think
this is a regression. The following appears in the system log if I run
that test:

Call Trace:
INFO: task kworker/0:1:12 blocked for more than 120 seconds.
Call Trace:
INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2151 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2154 blocked for more than 120 seconds.

My list-pending-block-requests script produces the following output:

dm-0
dm-0/hctx0/active:0
dm-0/hctx0/dispatch:b856a515 {.op=WRITE_ZEROES, .cmd_flags=SYNC, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1064, .internal_tag=-1}
dm-0/hctx0/dispatch:6a2dca0a {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1065, .internal_tag=-1}
dm-0/hctx0/dispatch:1baa5734 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1085, .internal_tag=-1}
dm-0/hctx0/dispatch:2d618841 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1025, .internal_tag=-1}
dm-0/hctx0/dispatch:c8e6e436 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1250, .internal_tag=-1}
dm-0/hctx0/dispatch:7af12ca3 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1057, .internal_tag=-1}
dm-0/hctx0/dispatch:2ad59d1e {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1060, .internal_tag=-1}
dm-0/hctx0/dispatch:51c35166 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1188, .internal_tag=-1}
dm-0/hctx0/dispatch:05da9610 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1253, .internal_tag=-1}
dm-0/hctx0/dispatch:a25483c5 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1254, .internal_tag=-1}
dm-0/hctx0/dispatch:7ef87035 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1255, .internal_tag=-1}
dm-0/hctx0/dispatch:8a8f15c0 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1261, .internal_tag=-1}
dm-0/hctx0/dispatch:d65ae959 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1263, .internal_tag=-1}
dm-0/hctx0/dispatch:88aac981 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1264, .internal_tag=-1}
dm-0/hctx0/dispatch:89489e70 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1265, .internal_tag=-1}
dm-0/hctx0/dispatch:7c69e2d2 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1154, .internal_tag=-1}
dm-0/hctx0/dispatch:e32aa9b5 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1155, .internal_tag=-1}
dm-0/hctx0/dispatch:5b88a332 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1157, .internal_tag=-1}
dm-0/hctx0/dispatch:1de14ef0 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1159, .internal_tag=-1}
dm-0/hctx0/dispatch:a380d0f3 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1161, .internal_tag=-1}
dm-0/hctx0/dispatch:c48cfa16 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1162, .internal_tag=-1}
dm-0/hctx0/dispatch:11d96382 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1163, .internal_tag=-1}
dm-0/hctx0/dispatch:b5efd5d7 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1167, .internal_tag=-1}
dm-0/hctx0/dispatch:80f5c476 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1153, .internal_tag=-1}
dm-0/hctx0/dispatch:c73c244d {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1266, .internal_tag=-1}
dm-0/hctx0/dispatch:69c43800 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1268, .internal_tag=-1}
dm-0/hctx0/dispatch:d2ec3c96 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1269, .internal_tag=-1}
dm-0/hctx0/dispatch:9b2099db {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1270, .internal_tag=-1}
dm-0/hctx0/dispatch:7d433090 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1272, .internal_tag=-1}
dm-0/hctx0/dispatch:afa7d2ea {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1274, .internal_tag=-1}
dm-0/hctx0/dispatch:c0f9d890 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1276, .internal_tag=-1}
for-next branch and throttling
Hello,

If I merge Jens' for-next branch with Linus' master branch and boot the
resulting kernel in a VM then the call trace shown below appears. This call
trace does not appear when building and booting the code from Linus' master
branch. Is this perhaps a regression?

WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 blk_throtl_bio+0xc00/0x1120
Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover i2c_piix4 pata_acpi
CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:blk_throtl_bio+0xc00/0x1120
Call Trace:
 ? create_task_io_context+0x1a2/0x1e0
 ? lock_downgrade+0x2e0/0x2e0
 ? kasan_check_read+0x11/0x20
 ? do_raw_spin_unlock+0xa8/0x140
 ? preempt_count_sub+0x18/0xd0
 generic_make_request_checks+0x432/0xcc0
 ? trace_event_raw_event_block_rq_requeue+0x290/0x290
 ? __lock_acquire+0x5b0/0x17e0
 ? __bio_associate_blkg.isra.33+0x1d0/0x3c0
 generic_make_request+0x151/0x950
 ? blk_queue_enter+0x7e0/0x7e0
 submit_bio+0x9b/0x250
 ? submit_bio+0x9b/0x250
 ? generic_make_request+0x950/0x950
 ? guard_bio_eod+0x120/0x2d0
 submit_bh_wbc+0x2df/0x310
 block_read_full_page+0x34c/0x4c0
 ? I_BDEV+0x20/0x20
 ? __bread_gfp+0x170/0x170
 ? add_to_page_cache_locked+0x20/0x20
 ? blkdev_writepages+0x10/0x10
 ? blkdev_writepages+0x10/0x10
 blkdev_readpage+0x18/0x20
 do_read_cache_page+0x5ed/0xbf0
 ? blkdev_writepages+0x10/0x10
 ? find_valid_gpt+0x1a8/0x8b0
 ? efi_partition+0x147/0x6c1
 ? check_partition+0x1b2/0x314
 ? blkdev_get+0x1f2/0x5a0
 ? __device_add_disk+0x6c1/0x7e0
 ? device_add_disk+0x13/0x20
 ? pagecache_get_page+0x420/0x420
 ? driver_probe_device+0x118/0x180
 ? __driver_attach+0x167/0x190
 ? bus_for_each_dev+0x105/0x160
 ? driver_attach+0x2b/0x30
 ? bus_add_driver+0x23d/0x350
 ? driver_register+0xdc/0x160
 ? register_virtio_driver+0x3c/0x60
 ? init+0x51/0x1000 [virtio_blk]
 ? do_one_initcall+0xb4/0x3a1
 ? do_init_module+0x103/0x360
 ? load_module+0x443e/0x4630
 ? __do_sys_finit_module+0x12d/0x1b0
 ? __x64_sys_finit_module+0x43/0x50
 ? do_syscall_64+0x71/0x210
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
 ? match_held_lock+0x20/0x250
 ? match_held_lock+0x35/0x250
 ? blkdev_writepages+0x10/0x10
 read_cache_page+0x3a/0x50
 read_dev_sector+0x64/0x170
 read_lba+0x1f9/0x3a0
 ? msdos_partition+0x1370/0x1370
 ? find_valid_gpt+0x1a8/0x8b0
 ? kmem_cache_alloc_trace+0x151/0x340
 ? find_valid_gpt+0x1a8/0x8b0
 find_valid_gpt+0x1c6/0x8b0
 ? is_gpt_valid.part.5+0x640/0x640
 ? __alloc_pages_nodemask+0x1c4/0x4d0
 ? lockdep_hardirqs_on+0x182/0x260
 ? get_page_from_freelist+0x648/0x2070
 ? mark_lock+0x69/0x8d0
 ? trace_hardirqs_on+0x24/0x130
 ? kasan_unpoison_shadow+0x35/0x50
 ? get_page_from_freelist+0x648/0x2070
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? get_page_from_freelist+0x1eea/0x2070
 ? set_precision+0x90/0x90
 ? string+0x135/0x180
 efi_partition+0x147/0x6c1
 ? format_decode+0xce/0x550
 ? enable_ptr_key_workfn+0x30/0x30
 ? find_valid_gpt+0x8b0/0x8b0
 ? vsnprintf+0x11d/0x820
 ? pointer+0x4d0/0x4d0
 ? snprintf+0x88/0xa0
 ? snprintf+0x88/0xa0
 ? vsprintf+0x20/0x20
 ? find_valid_gpt+0x8b0/0x8b0
 check_partition+0x1b2/0x314
 rescan_partitions+0x13b/0x440
 __blkdev_get+0x61f/0x930
 ? check_disk_change+0xa0/0xa0
 ? lock_downgrade+0x2e0/0x2e0
 ? var_wake_function+0x90/0x90
 blkdev_get+0x1f2/0x5a0
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 ? bd_may_claim+0x80/0x80
 ? bdget+0x26b/0x280
 ? blkdev_writepage+0x20/0x20
 ? kasan_check_read+0x11/0x20
 ? kobject_put+0x23/0x220
 __device_add_disk+0x6c1/0x7e0
 ? blk_alloc_devt+0x150/0x150
 ? __asan_storeN+0x12/0x20
 device_add_disk+0x13/0x20
 virtblk_probe+0x839/0xace [virtio_blk]
 ? virtblk_restore+0xd0/0xd0 [virtio_blk]
 ? mutex_unlock+0x12/0x20
 ? kernfs_activate+0xd0/0xe0
 ? kasan_check_write+0x14/0x20
 ? kernfs_put+0x2c/0x280
 ? vring_transport_features+0x19/0x70
 ? vp_finalize_features+0xe6/0x160
 ? vp_get_status+0x29/0x30
 ? virtio_finalize_features+0xbe/0xe0
 virtio_dev_probe+0x27d/0x390
 really_probe+0x191/0x560
 ? driver_probe_device+0x180/0x180
 driver_probe_device+0x118/0x180
 ? driver_probe_device+0x180/0x180
 __driver_attach+0x167/0x190
 bus_for_each_dev+0x105/0x160
 ? subsys_dev_iter_exit+0x10/0x10
 ? kasan_check_read+0x11/0x20
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 driver_attach+0x2b/0x30
 bus_add_driver+0x23d/0x350
 driver_register+0xdc/0x160
 ? 0xa007
 register_virtio_driver+0x3c/0x60
 init+0x51/0x1000 [virtio_blk]
 do_one_initcall+0xb4/0x3a1
 ? trace_event_raw_event_initcall_finish+0x120/0x120
 ? kasan_unpoison_shadow+0x35/0x50
 ? __asan_register_globals+0x7b/0xa0
 do_init_module+0x103/0x360
 load_module+0x443e/0x4630
 ? module_frob_arch_sections+0x20/0x20
 ? vfs_read+0xe6/0x1e0
 ? kernel_read+0x79/0xa0
 ? kasan_check_write+0x14/0x20
 ? kernel_read_file+0x259/0x310
 ? do_mmap+0x431/0x6c0
 __do_sys_finit_module+0x12d/0x1b0
 ?
Re: [PATCH v3] block: add documentation for io_timeout
On Wed, 2018-12-05 at 22:17 +0800, Weiping Zhang wrote: > +Description: > + io_timeout is a request’s timeouts at block layer in > + milliseconds. When the underlying driver starts processing > + a request, the generic block layer will start a timer, if > + this request cannot be completed in io_timeout milliseconds, > + a timeout event will occur. Sorry but I think this description is still somewhat confusing. How about changing that description into the following? io_timeout is the request timeout in milliseconds. If a request does not complete in this time then the block driver timeout handler is invoked. That timeout handler can decide to retry the request, to fail it or to start a device recovery strategy. Bart.
Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
On Thu, 2018-12-06 at 22:18 +0800, Weiping Zhang wrote: > Before this patch, even we set io_timeout to 30*HZ(default), but > blk_rq_timeout always return jiffies +5*HZ, > [1]. if there no pending request in timeout list, the timer callback > blk_rq_timed_out_timer will be called after 5*HZ, and then > blk_mq_check_expired will check is there exist some request > was delayed by compare jiffies and request->deadline, obvious > request is not timeout because we set request's timeouts is 30*HZ. > So for this case timer callback should be called at jiffies + 30 instead > of jiffies + 5*HZ. > > [2]. if there are pending request in timeout list, we compare request's > expiry and queue's expiry. If time_after(request->expire, queue->expire) > modify > queue->timeout.expire to request->expire, otherwise do nothing. > > So I think this patch just solve problem in [1], no other regression, or > what's > I missing here ? The blk_rq_timeout() function was introduced by commit 0d2602ca30e4 ("blk-mq: improve support for shared tags maps"). I think the purpose of that function is to make sure that the nr_active counter in struct blk_mq_hw_ctx gets updated at least once every five seconds. So there are two problems with this patch: - It reduces the frequency of 'nr_active' updates. I think that is wrong and also that it will negatively affect drivers that rely on this functionality, e.g. the SCSI core. - The patch description does not match the code changes in this patch. Bart.
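The timer-arming behavior under discussion can be modeled in a few lines of userspace C. This is a sketch of the concept, not the kernel function: HZ and the names are illustrative, and the real blk_rq_timeout() operates on jiffies inside blk-timeout.c. The point is that the timer never fires later than 5 seconds from now, even for a 30-second request timeout; the longer-lived request is simply found "not yet expired" when the timer fires and the timer is re-armed.

```c
#include <assert.h>

#define HZ 250UL			/* jiffies per second, illustrative */
#define BLK_MAX_TIMEOUT (5 * HZ)	/* the 5 s cap being debated */

/* Model of blk_rq_timeout(): cap how far in the future the queue's
 * timeout timer may be armed, so periodic work (such as the shared-tags
 * nr_active accounting) keeps running at least every BLK_MAX_TIMEOUT. */
static unsigned long blk_rq_timeout_model(unsigned long now, unsigned long expiry)
{
	unsigned long maxt = now + BLK_MAX_TIMEOUT;

	return expiry > maxt ? maxt : expiry;
}
```

With a 30 s request timeout the timer still fires at now + 5 s; shorter expiries are left alone.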
Re: [PATCH v3] block: add documentation for io_timeout
On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote: > Weiping Zhang 于2018年12月5日周三 下午10:49写道: > > Christoph Hellwig 于2018年12月5日周三 下午10:40写道: > > > Can you please also send a patch to not show this attribute for > > > drivers without a timeout handler? Thanks! > > Is there a simple way do that ? How about checking the timeout member of struct blk_mq_ops for blk-mq and checking the rq_timed_out_fn member in struct request_queue for the legacy block layer? > Shall we return -ENOTSUPP when user read/write this attribute when > driver has no timeout handler ? A much more elegant solution is to introduce a sysfs attribute group for the io_timeout attribute and to make that group visible only if a timeout handler has been defined. See e.g. disk_attr_group in block/genhd.c for an example. Bart.
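The attribute-group pattern suggested above can be sketched with a userspace model. This is illustrative only: the struct layouts and names are simplifications, not the kernel's sysfs types (where is_visible() lives in struct attribute_group and returns a mode). The idea is that visibility is decided per attribute by a callback, so io_timeout can be hidden when the driver defines no timeout handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct attribute {
	const char *name;
};

struct attribute_group {
	struct attribute **attrs;	/* NULL-terminated list */
	bool (*is_visible)(const struct attribute *attr, const void *ctx);
};

struct queue_model {
	bool has_timeout_handler;	/* stands in for "mq_ops->timeout != NULL" */
};

static bool io_timeout_visible(const struct attribute *attr, const void *ctx)
{
	const struct queue_model *q = ctx;

	(void)attr;
	return q->has_timeout_handler;	/* hide io_timeout without a handler */
}

/* How many attributes of the group would actually be exposed? */
static int visible_count(const struct attribute_group *g, const void *ctx)
{
	int n = 0;

	for (struct attribute **a = g->attrs; *a; a++)
		if (!g->is_visible || g->is_visible(*a, ctx))
			n++;
	return n;
}

static int demo_io_timeout_group(bool has_handler)
{
	static struct attribute io_timeout = { .name = "io_timeout" };
	struct attribute *attrs[] = { &io_timeout, NULL };
	struct attribute_group g = { .attrs = attrs, .is_visible = io_timeout_visible };
	struct queue_model q = { .has_timeout_handler = has_handler };

	return visible_count(&g, &q);
}
```

With a timeout handler present the group exposes io_timeout; without one it exposes nothing, which mirrors how disk_attr_group-style visibility callbacks work.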
Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote: > @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req) >* than an existing one, modify the timer. Round up to next nearest >* second. >*/ > - expiry = blk_rq_timeout(round_jiffies_up(expiry)); > + expiry = round_jiffies_up(expiry); If you would have read the comment above this code, you would have known that this patch does not do what you think it does and additionally that it introduces a regression. Bart.
Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
On 12/4/18 2:00 AM, Kashyap Desai wrote:
> Problem statement :
> Whenever we try to get an outstanding request via scsi_host_find_tag,
> the block layer will return stale entries instead of the actual
> outstanding request. Kernel panic if the stale entry is inaccessible
> or the memory is reused.
>
> Fix :
> Undo the request mapping in blk_mq_put_driver_tag once the request is
> returned.
>
> More detail :
> Whenever each SDEV entry is created, the block layer allocates separate
> tags and static requests. Those requests are not valid after the SDEV is
> deleted from the system. On the fly, the block layer maps static rqs to
> rqs as below from blk_mq_get_driver_tag():
>
> data.hctx->tags->rqs[rq->tag] = rq;
>
> The above mapping is active for in-use requests and it is the same
> mapping which is referred to in the function scsi_host_find_tag(). After
> running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some entries
> which will never be reset in the block layer. There would be a kernel
> panic if a request pointing to "data.hctx->tags->rqs[rq->tag]" is part
> of an "sdev" which has been removed, since all the memory allocations of
> requests associated with that sdev might be reused or inaccessible to
> the driver.
>
> Kernel panic snippet -
>
> BUG: unable to handle kernel paging request at ff800010
> IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
> PGD aa4414067 PUD 0
> Oops: [#1] SMP
> Call Trace:
> [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
> [] scsih_shutdown+0x55/0x100 [mpt3sas]
>
> Cc:
> Signed-off-by: Kashyap Desai
> Signed-off-by: Sreekanth Reddy
> ---
>  block/blk-mq.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9497b47..57432be 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>  					   struct request *rq)
>  {
> +	hctx->tags->rqs[rq->tag] = NULL;
>  	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
>  	rq->tag = -1;

No SCSI driver should call scsi_host_find_tag() after a request has finished.
The above patch introduces yet another race and hence can't be a proper fix.

Bart.
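The stale-entry problem and the effect of the proposed clear can be modeled in userspace. This is a simplified sketch: the array, sizes and function names stand in for hctx->tags->rqs and the real tag helpers, and, as noted in the reply, clearing the slot only narrows the race between the lookup and the put rather than closing it.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 8

struct request {
	int tag;
};

static struct request *rqs[NR_TAGS];	/* stands in for hctx->tags->rqs */

static void get_driver_tag(struct request *rq, int tag)
{
	rq->tag = tag;
	rqs[tag] = rq;			/* data.hctx->tags->rqs[rq->tag] = rq */
}

static void put_driver_tag(struct request *rq)
{
	rqs[rq->tag] = NULL;		/* the patch's extra clear */
	rq->tag = -1;
}

/* scsi_host_find_tag()-style lookup. */
static struct request *find_tag(int tag)
{
	return (tag >= 0 && tag < NR_TAGS) ? rqs[tag] : NULL;
}

/* A lookup while the request is in flight finds it... */
static int demo_lookup_in_flight(void)
{
	struct request rq;
	int found;

	get_driver_tag(&rq, 3);
	found = (find_tag(3) == &rq);
	put_driver_tag(&rq);
	return found;
}

/* ...and after the tag is put, the lookup yields NULL rather than a
 * pointer to memory that may have been freed or reused. */
static int demo_lookup_after_put(void)
{
	struct request rq;

	get_driver_tag(&rq, 3);
	put_driver_tag(&rq);
	return find_tag(3) == NULL;
}
```

Without the clear in put_driver_tag(), demo_lookup_after_put() would return the dangling pointer, which is exactly the panic scenario described above.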
Re: block: sbitmap related lockdep warning
On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote: > On 12/3/18 3:02 AM, Ming Lei wrote: > > Hi, > > > > Just found there is sbmitmap related lockdep warning, not take a close > > look yet, maybe > > it is caused by recent sbitmap change. > > > > [1] test > > - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 > > submit_queues=1 hw_queue_depth=1 > > - then run fio on the 4 null_blk devices > > This is a false positive - lockdep thinks that ->swap_lock needs to be > IRQ safe since it's called with IRQs disabled from the > blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ > context. I wonder how to teach lockdep about that... There is probably a better solution, but one possible solution is to disable lockdep checking for swap_lock by using lockdep_set_novalidate_class(). Bart.
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/1/18 9:11 AM, Hannes Reinecke wrote:
> On 12/1/18 5:48 PM, Christoph Hellwig wrote:
> > On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> > > On 11/30/18 1:26 PM, Keith Busch wrote:
> > > > A driver may wish to iterate every tagged request, not just ones
> > > > that satisfy blk_mq_request_started(). The intended use is so a
> > > > driver may terminate entered requests on quiesced queues.
> > >
> > > How about we just move the started check into the handler passed in
> > > for those that care about it? Much saner to make the interface
> > > iterate everything, and leave whatever state check to the callback.
> >
> > So we used to do that, and I changed it back in May to test for
> > MQ_RQ_IN_FLIGHT, and then Ming changed it to check
> > blk_mq_request_started. So this is clearly a minefield of sorts..
> >
> > Note that at least mtip32xx, nbd, skd and the various nvme transports
> > want to use the function to terminate all requests in the error path,
> > and it would be great to have one single understood, documented and
> > debugged helper for that in the core, so this is a vote for moving
> > more of the logic in your second helper into the core code. skd will
> > need actually use ->complete to release resources for that, though
> > and mtip plays some odd abort bits. If it weren't for the interesting
> > abort behavior in nvme-fc that means we could even unexport the
> > low-level interface.
>
> Yes, I'm very much in favour of this, too. We always have this IMO
> slightly weird notion of stopping the queue, set some error flags in
> the driver, then _restarting_ the queue, just so that the driver then
> sees the error flag and terminates the requests. Which I always found
> quite counter-intuitive. So having a common helper for terminating
> requests for queue errors would be very welcomed here. But when we
> have that we really should audit all drivers to ensure they do the
> right thin (tm).

Would calling blk_abort_request() for all outstanding requests be
sufficient to avoid that the queue has to be stopped and restarted in the
nvme-fc driver?

Bart.
Re: [PATCH] block: fix single range discard merge
On Fri, 2018-11-30 at 10:20 -0700, Jens Axboe wrote: > On 11/30/18 10:18 AM, Bart Van Assche wrote: > > On Sat, 2018-12-01 at 00:38 +0800, Ming Lei wrote: > > > Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler > > > attached") > > > > Since this patch fixes a bug introduced in kernel v4.16, does it need > > a "Cc: stable" tag? > > Like the other one, isn't stable implied with Fixes in there? You'd want > a stable backport for any kernel that has that patchset. I think that's > a stronger hint than stable cc. (+Greg KH) Hi Greg, Would it be possible to clarify what your preferences are for adding a "Cc: stable" tag? Thanks, Bart.
Re: [PATCH 01/27] aio: fix failure to put the file pointer
On Fri, 2018-11-30 at 10:08 -0700, Jens Axboe wrote: > On 11/30/18 10:07 AM, Bart Van Assche wrote: > > On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote: > > > If the ioprio capability check fails, we return without putting > > > the file pointer. > > > > > > Fixes: d9a08a9e616b ("fs: Add aio iopriority support") > > > Reviewed-by: Johannes Thumshirn > > > Reviewed-by: Christoph Hellwig > > > Signed-off-by: Jens Axboe > > > --- > > > fs/aio.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/fs/aio.c b/fs/aio.c > > > index b984918be4b7..205390c0c1bb 100644 > > > --- a/fs/aio.c > > > +++ b/fs/aio.c > > > @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct > > > iocb *iocb) > > > ret = ioprio_check_cap(iocb->aio_reqprio); > > > if (ret) { > > > pr_debug("aio ioprio check cap error: %d\n", ret); > > > + fput(req->ki_filp); > > > return ret; > > > } > > > > Since this patch fixes a bug that was introduced in kernel v4.18, does this > > patch need a "Cc: stable" tag? > > The fixes should take care of that by itself, I hope. Hi Jens, My understanding is that patches that have a "Cc: stable" tag are guaranteed to be integrated in a stable kernel sooner or later. Without that tag it depends on the stable kernel maintainer whether or not these patches get picked up. I think the "AUTOSEL" tag Sasha Levin uses indicates that a patch was picked up for one of his stable kernels and that it did not have a "Cc: stable" tag. I'm not sure Greg KH picks up patches that only have a "Fixes:" tag but no "Cc: stable" tag. Bart.
Re: [PATCH] block: fix single range discard merge
On Sat, 2018-12-01 at 00:38 +0800, Ming Lei wrote: > Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached") Since this patch fixes a bug introduced in kernel v4.16, does it need a "Cc: stable" tag? Thanks, Bart.
Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> We can't wait for polled events to complete, as they may require active
> polling from whoever submitted it. If that is the same task that is
> submitting new IO, we could deadlock waiting for IO to complete that
> this task is supposed to be completing itself.
>
> Signed-off-by: Jens Axboe
> ---
>  fs/block_dev.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 6de8d35f6e41..ebc3d5a0f424 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -402,8 +402,16 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>
>  		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  		if (!nr_pages) {
> -			if (iocb->ki_flags & IOCB_HIPRI)
> +			if (iocb->ki_flags & IOCB_HIPRI) {
>  				bio->bi_opf |= REQ_HIPRI;
> +				/*
> +				 * For async polled IO, we can't wait for
> +				 * requests to complete, as they may also be
> +				 * polled and require active reaping.
> +				 */
> +				if (!is_sync)
> +					bio->bi_opf |= REQ_NOWAIT;
> +			}
>
>  			qc = submit_bio(bio);
>  			WRITE_ONCE(iocb->ki_cookie, qc);

Setting REQ_NOWAIT from inside the block layer will make the code that
submits requests harder to review. Have you considered making this code
fail the I/O if REQ_NOWAIT has not been set, and requiring the context
that submits the I/O to set REQ_NOWAIT itself?

Thanks,

Bart.
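The flag logic in the quoted hunk can be modeled in isolation. This is an illustrative userspace sketch, not the kernel code: the flag values are made up, and the real decision happens inside __blkdev_direct_IO(). The rule it captures is that a polled (HIPRI) submission that is asynchronous also gets REQ_NOWAIT, because the submitter may be the only task that would ever reap the completion, so blocking during submission could deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's definitions differ. */
#define IOCB_HIPRI	(1u << 0)
#define REQ_HIPRI	(1u << 0)
#define REQ_NOWAIT	(1u << 1)

/* Model of the bi_opf computation in the patched hunk. */
static unsigned int polled_bio_opf(unsigned int ki_flags, bool is_sync)
{
	unsigned int opf = 0;

	if (ki_flags & IOCB_HIPRI) {
		opf |= REQ_HIPRI;
		if (!is_sync)		/* async polled IO must not block */
			opf |= REQ_NOWAIT;
	}
	return opf;
}
```

Synchronous polled IO keeps only REQ_HIPRI; only the async polled case picks up REQ_NOWAIT, which is the behavior the review comment questions.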
Re: [PATCH 01/27] aio: fix failure to put the file pointer
On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote: > If the ioprio capability check fails, we return without putting > the file pointer. > > Fixes: d9a08a9e616b ("fs: Add aio iopriority support") > Reviewed-by: Johannes Thumshirn > Reviewed-by: Christoph Hellwig > Signed-off-by: Jens Axboe > --- > fs/aio.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/aio.c b/fs/aio.c > index b984918be4b7..205390c0c1bb 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb > *iocb) > ret = ioprio_check_cap(iocb->aio_reqprio); > if (ret) { > pr_debug("aio ioprio check cap error: %d\n", ret); > + fput(req->ki_filp); > return ret; > } Since this patch fixes a bug that was introduced in kernel v4.18, does this patch need a "Cc: stable" tag? Thanks, Bart.
Re: [PATCH v2] block: add documentation for io_timeout
On Thu, 2018-11-29 at 18:22 +0800, Weiping Zhang wrote: > add documentation for /sys/block//queue/io_timeout Patch descriptions should consist of full sentences. That means that these should start with a capital letter and end with a period. > + > +What:/sys/block//queue/io_timeout > +Date:November 2018 > +Contact: Weiping Zhang > +Description: > + io_timeout is the timeout in milliseconds of a request in > + block layer. The block layer will start a timer when low > + level device driver start the request, and cancel timer > + when request completes. The grammar of that explanation needs to be improved ... I can spot three grammatical issues. Thanks, Bart.
Re: [PATCH] block: add documentation for io_timeout
On Thu, 2018-11-29 at 00:54 +0800, Weiping Zhang wrote: > add documentation for /sys/block//queue/io_timeout > > Signed-off-by: Weiping Zhang > --- > Documentation/ABI/testing/sysfs-block | 9 + > Documentation/block/queue-sysfs.txt | 6 ++ > 2 files changed, 15 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-block > b/Documentation/ABI/testing/sysfs-block > index dea212db9df3..f37cca16 100644 > --- a/Documentation/ABI/testing/sysfs-block > +++ b/Documentation/ABI/testing/sysfs-block > @@ -271,3 +271,12 @@ Description: > size of 512B sectors of the zones of the device, with > the eventual exception of the last zone of the device > which may be smaller. > + > +What:/sys/block//queue/io_timeout > +Date:November 2018 > +Contact: Weiping Zhang > +Description: > + io_timeout is the timeout in millisecods of a request in > + block layer. The block layer will start a timer when low > + level device driver start the request, and cancle timer > + when request completes. > diff --git a/Documentation/block/queue-sysfs.txt > b/Documentation/block/queue-sysfs.txt > index 2c1e67058fd3..cbd44fe056fa 100644 > --- a/Documentation/block/queue-sysfs.txt > +++ b/Documentation/block/queue-sysfs.txt > @@ -67,6 +67,12 @@ If set to a value larger than 0, the kernel will put the > process issuing > IO to sleep for this amont of microseconds before entering classic > polling. > > +io_timeout (RW) > +--- > +This is the timeout in millisecods of a request in block layer. > +The block layer will start a timer when low level device driver start > +the request, and cancle timer when request completes. > + > iostats (RW) > - > This file is used to control (on/off) the iostats accounting of the Please run this patch through a spelling checker. I think "millisecods" should be changed into "milliseconds" and also that "cancle" should be changed into "cancel". Thanks, Bart.
Re: [RFC PATCH v2] block: add io timeout to sysfs
On Mon, 2018-11-19 at 22:11 +0800, Weiping Zhang wrote: > Give a interface to adjust io timeout by device. > > Signed-off-by: Weiping Zhang > --- > > Changes since v1: > * make sure timeout > 0 > > block/blk-sysfs.c | 27 +++ > 1 file changed, 27 insertions(+) Documentation for new block layer sysfs attributes should be added in Documentation/ABI/testing/sysfs-block and also in Documentation/block/queue-sysfs.txt. Please add such documentation for this new attribute. > +static ssize_t queue_io_timeout_store(struct request_queue *q, const char > *page, > + size_t count) > +{ > + unsigned int val; > + int err; > + > + err = kstrtou32(page, 10, &val); > + if (err || val == 0) > + return -EINVAL; > + > + blk_queue_rq_timeout(q, val); > + > + return count; > +} Setting the block layer timeout to a very high value (e.g. hours) may make it look like a request got stuck without users having an easy way of figuring out what is going on. I'm wondering whether this function should restrict the upper bound for block layer timeouts. How about limiting timeout values to ten minutes? Thanks, Bart.
Re: [PATCH] kyber: fix wrong strlcpy() size in trace_kyber_latency()
On 11/11/18 9:25 AM, Omar Sandoval wrote: From: Omar Sandoval When copying to the latency type, we should be passing LATENCY_TYPE_LEN, not DOMAIN_LEN. This isn't a problem in practice because we only pass "total" or "I/O", but let's fix it. Reported-by: Jordan Glover Signed-off-by: Omar Sandoval --- include/trace/events/kyber.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h index a9834c37ac40..7aaa298375ad 100644 --- a/include/trace/events/kyber.h +++ b/include/trace/events/kyber.h @@ -32,7 +32,7 @@ TRACE_EVENT(kyber_latency, TP_fast_assign( __entry->dev = disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent))); strlcpy(__entry->domain, domain, DOMAIN_LEN); - strlcpy(__entry->type, type, DOMAIN_LEN); + strlcpy(__entry->type, type, LATENCY_TYPE_LEN); __entry->percentile = percentile; __entry->numerator = numerator; __entry->denominator = denominator; Can both strlcpy() invocations be changed such that the third argument is a sizeof() expression instead of an explicit constant? I think that would make the Kyber tracing code easier to verify for humans. Thanks, Bart.
Re: [PATCH 3/6] skd_main: don't use req->special
On 11/9/18 10:32 AM, Christoph Hellwig wrote: Add a retries field to the internal request structure instead, which gets set to zero on the first submission. Signed-off-by: Christoph Hellwig --- drivers/block/skd_main.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c index a0196477165f..88f29b06d90e 100644 --- a/drivers/block/skd_main.c +++ b/drivers/block/skd_main.c @@ -181,6 +181,7 @@ struct skd_request_context { struct fit_completion_entry_v1 completion; struct fit_comp_error_info err_info; + int retries; blk_status_t status; }; @@ -495,6 +496,11 @@ static blk_status_t skd_mq_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(skdev->state != SKD_DRVR_STATE_ONLINE)) return skd_fail_all(q) ? BLK_STS_IOERR : BLK_STS_RESOURCE; + if (!(req->rq_flags & RQF_DONTPREP)) { + skreq->retries = 0; + req->rq_flags |= RQF_DONTPREP; + } + blk_mq_start_request(req); WARN_ONCE(tag >= skd_max_queue_depth, "%#x > %#x (nr_requests = %lu)\n", @@ -1426,7 +1432,7 @@ static void skd_resolve_req_exception(struct skd_device *skdev, break; case SKD_CHECK_STATUS_REQUEUE_REQUEST: - if ((unsigned long) ++req->special < SKD_MAX_RETRIES) { + if ((unsigned long) skreq->retries < SKD_MAX_RETRIES) { skd_log_skreq(skdev, skreq, "retry"); blk_mq_requeue_request(req, true); break; Hi Christoph, Where has the code been moved to that increments the new skreq->retries counter? Thanks, Bart.
Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not
On Thu, 2018-11-08 at 10:50 -0700, Jens Axboe wrote: > How about this incremental? > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c > index 097e9a67d5f5..87bc5df72d48 100644 > --- a/block/blk-mq-tag.c > +++ b/block/blk-mq-tag.c > @@ -248,7 +248,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int > bitnr, void *data) > * @fn: Pointer to the function that will be called for each > request > * associated with @hctx that has been assigned a driver tag. > * @fn will be called as follows: @fn(@hctx, rq, @data, @reserved) > - * where rq is a pointer to a request. > + * where rq is a pointer to a request. Return true to continue > + * iterating tags, false to stop. > * @data:Will be passed as third argument to @fn. > * @reserved:Indicates whether @bt is the breserved_tags member or > the > * bitmap_tags member of struct blk_mq_tags. > @@ -301,7 +302,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned > int bitnr, void *data) > * or the bitmap_tags member of struct blk_mq_tags. > * @fn: Pointer to the function that will be called for each > started > * request. @fn will be called as follows: @fn(rq, @data, > - * @reserved) where rq is a pointer to a request. > + * @reserved) where rq is a pointer to a request. Return true > + * to continue iterating tags, false to stop. > * @data:Will be passed as second argument to @fn. > * @reserved:Indicates whether @bt is the breserved_tags member or > the > * bitmap_tags member of struct blk_mq_tags. > @@ -326,7 +328,8 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, > struct sbitmap_queue *bt, > * @fn: Pointer to the function that will be called for each > started > * request. @fn will be called as follows: @fn(rq, @priv, > * reserved) where rq is a pointer to a request. 'reserved' > - * indicates whether or not @rq is a reserved request. > + * indicates whether or not @rq is a reserved request. Return > + * true to continue iterating tags, false to stop. 
> * @priv:Will be passed as second argument to @fn. > */ > static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, > @@ -343,7 +346,8 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags > *tags, > * @fn: Pointer to the function that will be called for each > started > * request. @fn will be called as follows: @fn(rq, @priv, > * reserved) where rq is a pointer to a request. 'reserved' > - * indicates whether or not @rq is a reserved request. > + * indicates whether or not @rq is a reserved request. Return > + * true to continue iterating tags, false to stop. > * @priv:Will be passed as second argument to @fn. > */ > void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, This looks fine to me. Thanks! Bart.
Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not
On Thu, 2018-11-08 at 09:31 -0700, Jens Axboe wrote: > On 11/8/18 9:28 AM, Bart Van Assche wrote: > > On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote: > > > --- a/block/blk-mq-debugfs.c > > > +++ b/block/blk-mq-debugfs.c > > > @@ -424,13 +424,15 @@ struct show_busy_params { > > > * Note: the state of a request may change while this function is in > > > progress, > > > * e.g. due to a concurrent blk_mq_finish_request() call. > > > */ > > > -static void hctx_show_busy_rq(struct request *rq, void *data, bool > > > reserved) > > > +static bool hctx_show_busy_rq(struct request *rq, void *data, bool > > > reserved) > > > { > > > > Please update the kdoc header above hctx_show_busy_rq() such that it > > reflects > > the new behavior. I'm referring to the "will be called for each request" > > part. > > Otherwise this patch looks fine to me. > > Took a look at the comment, and what do you want changed? There's no change > in behavior for hctx_show_busy_rq(), it loops all requests just like before. > We just return true to ensure we continue iterating. Oops, I added my reply below the wrong function. I wanted to refer to the following comment above blk_mq_queue_tag_busy_iter(): * @fn: Pointer to the function that will be called for each request Additionally, how about similar comments above bt_for_each(), bt_tags_for_each() blk_mq_all_tag_busy_iter() and blk_mq_tagset_busy_iter()? I think this patch affects all these functions. Thanks, Bart.
Re: [PATCH 2/2] blk-mq: provide a helper to check if a queue is busy
On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote: > +static bool blk_mq_check_busy(struct blk_mq_hw_ctx *hctx, struct request *rq, > + void *priv, bool reserved) > +{ > + /* > + * If we find a request, we know the queue is busy. Return false > + * to stop the iteration. > + */ > + if (rq->q == hctx->queue) { > + bool *busy = (bool *) priv; I think the "(bool *)" cast can be left out. Anyway: Reviewed-by: Bart Van Assche
Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not
On Thu, 2018-11-08 at 09:06 -0700, Jens Axboe wrote: > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -424,13 +424,15 @@ struct show_busy_params { > * Note: the state of a request may change while this function is in > progress, > * e.g. due to a concurrent blk_mq_finish_request() call. > */ > -static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved) > +static bool hctx_show_busy_rq(struct request *rq, void *data, bool reserved) > { Please update the kdoc header above hctx_show_busy_rq() such that it reflects the new behavior. I'm referring to the "will be called for each request" part. Otherwise this patch looks fine to me. Bart.
Re: [PATCH blktests] fix discontiguous-io compile error on 32 bit systems
On 11/5/18 6:19 PM, yuyufen wrote: I am sorry that I did not see the discussion before this. And you are right. Please ignore this patch. Don't worry - this can happen :-) Bart.
Re: [PATCH blktests] fix discontiguous-io compile error on 32 bit systems
On Thu, 2018-11-01 at 14:35 +0800, Yufen Yu wrote: > When make discontiguous-io.cpp with -m32, g++ compiler reports > error for std::min(long unsigned int, size_t) has diffent > arguments type. > > fixes: fd21728886e7 ("Add the discontiguous-io test program") > Signed-off-by: Yufen Yu > --- > src/discontiguous-io.cpp | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp > index 5e0ee0f..855aba9 100644 > --- a/src/discontiguous-io.cpp > +++ b/src/discontiguous-io.cpp > @@ -291,7 +291,7 @@ int main(int argc, char **argv) > unsigned char *p = &*buf.begin(); > for (int i = 0; i < len / 4; i++) > iov.append(p + 4 + i * 8, > -std::min(4ul, len - i * 4)); > +std::min((size_t)4, len - i * 4)); > } else { > iov.append(&*buf.begin(), buf.size()); > } Are you reading the messages posted on linux-block? An alternative that I like better has been discussed in this e-mail thread: https://www.spinics.net/lists/linux-block/msg32181.html Bart.
Re: [PATCH blktests -v2] Fix build failure for discontiguous-io on 32-bit platforms
On Tue, 2018-10-30 at 12:02 -0400, Theodore Y. Ts'o wrote: > On Tue, Oct 30, 2018 at 08:02:55AM -0700, Bart Van Assche wrote: > > Details about how the build fails on 32-bit systems would have been welcome > > in the commit message. Anyway: > > > > Reviewed-by: Bart Van Assche > > For the record, here's the failure. It's the usual incomprehensible > C++ error reporting. :-) > > [ ... ] > discontiguous-io.cpp: In function 'int main(int, char**)': > discontiguous-io.cpp:294:34: error: no matching function for call to > 'min(long unsigned int, size_t)' > std::min(4ul, len - i * 4)); > ^ > [ ... ] Thanks Ted for having provided the full compiler output. However, I don't think that the entire error message should go in the commit message. To me, the part of the compiler error message I cited tells the whole story :-) Bart.
Re: [PATCH blktests -v2] Fix build failure for discontiguous-io on 32-bit platforms
On Tue, 2018-10-30 at 10:36 -0400, Theodore Ts'o wrote: > Signed-off-by: Theodore Ts'o > --- > src/discontiguous-io.cpp | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp > index 5e0ee0f..f51a305 100644 > --- a/src/discontiguous-io.cpp > +++ b/src/discontiguous-io.cpp > @@ -251,7 +251,7 @@ int main(int argc, char **argv) > const char *dev; > int c; > std::vector buf; > - size_t len = 512; > + unsigned long len = 512; > > while ((c = getopt(argc, argv, "hl:o:sw")) != EOF) { > switch (c) { Details about how the build fails on 32-bit systems would have been welcome in the commit message. Anyway: Reviewed-by: Bart Van Assche
Re: [PATCH blktests] Fix build failure for discontiguous-io on 32-bit platforms
On Mon, 2018-10-29 at 17:08 -0400, Theodore Y. Ts'o wrote: > On Mon, Oct 29, 2018 at 09:26:43AM -0700, Bart Van Assche wrote: > > > > Have you considered to change the data type of 'len' from size_t into > > unsigned long > > instead of inserting this cast? That would make it clear that no integer > > truncation > > happens in the iov.append() call. > > Well the potential integer truncation that could happen is here: > > case 'l': len = strtoul(optarg, NULL, 0); break; If the data type of 'len' would be changed into unsigned long, how could that assignment cause integer truncation since strtoul() returns an unsigned long value? > I have a sneaking suspicion that ib_srp has zero change of > working on 32-bit systems (which is what uses this function) so this > was more about making sure blktests would build when I'm building the > 32-bit version of my test appliance VM. What makes you think that ib_srp won't work on 32-bit systems? I don't think that anyone uses this driver on a 32-bit system. I'm not aware of any aspect of that driver that restricts it to 64-bit systems only either. > (For that matter, I've been banging my head against a brick wall > trying to make the srp tests work in either a KVM or GCE environment, > even with a 64-bit VM, but that's a different issue. It's not high > priority for me, at the moment, but eventually I would like the > {kvm,gce,android}-xfstest test appliance VM's generated by the > xfstests-bld repo to be able to run all of blktests.) Can you be more specific about what didn't work? Were you unable to start the tests or did a particular test fail? > What offends me more is that later on there are scopes when len gets > shadowed with a "ssize_t len;" definition --- I whould have thought > gcc or clang would have complained bitterly about that, but I guess > not. I agree that it would be an improvement to get rid of variable shadowing. Once that has been done the -Wshadow compiler flag can be added. 
I will look into this if nobody else starts looking into this soon. Bart.
Re: [PATCH blktests] Fix build failure for discontiguous-io on 32-bit platforms
On Mon, 2018-10-29 at 12:15 -0400, Theodore Ts'o wrote: > Signed-off-by: Theodore Ts'o > --- > src/discontiguous-io.cpp | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/discontiguous-io.cpp b/src/discontiguous-io.cpp > index 5e0ee0f..a59c18d 100644 > --- a/src/discontiguous-io.cpp > +++ b/src/discontiguous-io.cpp > @@ -291,7 +291,7 @@ int main(int argc, char **argv) > unsigned char *p = &*buf.begin(); > for (int i = 0; i < len / 4; i++) > iov.append(p + 4 + i * 8, > -std::min(4ul, len - i * 4)); > +std::min(4ul, (unsigned long) len - > i * 4)); > } else { > iov.append(&*buf.begin(), buf.size()); > } Hi Ted, Have you considered to change the data type of 'len' from size_t into unsigned long instead of inserting this cast? That would make it clear that no integer truncation happens in the iov.append() call. Thanks, Bart.
Re: [PATCH 2/5] block: move .dma_alignment into q->limits
On Thu, 2018-10-18 at 21:18 +0800, Ming Lei wrote: > Turns out q->dma_alignement should be stack limit because now bvec table ^^ dma_alignment? > is immutalbe, the underlying queue's dma alignment has to be perceptible ^ ^^^ immutable?observable? > by stack driver, so IO buffer can be allocated as dma aligned before ^ stacked? Anyway: Reviewed-by: Bart Van Assche
Re: [PATCH 0/5] block: introduce helpers for allocating io buffer from slab
On Thu, 2018-10-18 at 07:03 -0700, Matthew Wilcox wrote: > On Thu, Oct 18, 2018 at 09:18:12PM +0800, Ming Lei wrote: > > Filesystems may allocate io buffer from slab, and use this buffer to > > submit bio. This way may break storage drivers if they have special > > requirement on DMA alignment. > > Before we go down this road, could we have a discussion about what > hardware actually requires this? Storage has this weird assumption that > I/Os must be at least 512 byte aligned in memory, and I don't know where > this idea comes from. Network devices can do arbitrary byte alignment. > Even USB controllers can do arbitrary byte alignment. Sure, performance > is going to suck and there are definite risks on some architectures > with doing IOs that are sub-cacheline aligned, but why is storage such a > special snowflake that we assume that host controllers are only capable > of doing 512-byte aligned DMAs? > > I just dragged out the NCR53c810 data sheet from 1993, and it's capable of > doing arbitrary alignment of DMA. NVMe is capable of 4-byte aligned DMA. > What hardware needs this 512 byte alignment? How about starting with modifying the queue_dma_alignment() function? The current implementation of that function is as follows: static inline int queue_dma_alignment(struct request_queue *q) { return q ? q->dma_alignment : 511; } In other words, for block drivers that do not set the DMA alignment explicitly it is assumed that these drivers need 512 byte alignment. I think the "512 byte alignment as default" was introduced in 2002. From Thomas Gleixner's history tree, commit ad519c6902fb: +static inline int queue_dma_alignment(request_queue_t *q) +{ + int retval = 511; + + if (q && q->dma_alignment) + retval = q->dma_alignment; + + return retval; +} + +static inline int bdev_dma_aligment(struct block_device *bdev) +{ + return queue_dma_alignment(bdev_get_queue(bdev)); +} + Bart.
Re: [PATCH v2] block: BFQ default for single queue devices
On 10/17/18 3:05 AM, Jan Kara wrote: Well, the problem with this is that big distro people really don't care much because they already use udev for tuning the IO scheduler. So whatever defaults the kernel is going to pick likely won't be seen by distro customers. Embedded people seem to be driving this effort because they either don't run udev or they feel not all their teams building new products have enough expertise to come up with a proper set of rules... What's missing in this discussion is a definition of "embedded system". Is that a system like a streaming player for TV channels that neither has a keyboard nor a display or a system that can run multiple apps simultaneously like a smartphone? I think the difference matters because some embedded devices hardly do any background I/O nor load any executable code from storage after boot. So at least for some embedded devices the problem discussed in this e-mail thread does not exist. Bart.
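[Editor's sketch] The udev-based tuning Jan refers to typically looks like the rule below. The file name and the chosen scheduler are illustrative; distros ship their own variants.

```
# /etc/udev/rules.d/60-iosched.rules (illustrative)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
	ATTR{queue/scheduler}="mq-deadline"
```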
Re: [PATCH v2] block: BFQ default for single queue devices
On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: > + * For blk-mq devices, we default to using: > + * - "none" for multiqueue devices (nr_hw_queues != 1) > + * - "bfq", if available, for single queue devices > + * - "mq-deadline" if "bfq" is not available for single queue devices > + * - "none" for single queue devices as well as last resort For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. Since this patch is an attempt to improve performance, I'd like to see measurement data for one or more recent SATA SSDs before a decision is taken about what to do with this patch. Thanks, Bart.
[PATCH] blk-mq-debugfs: Also show requests that have not yet been started
When debugging e.g. the SCSI timeout handler it is important that requests that have not yet been started or that already have completed are also reported through debugfs. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Martin K. Petersen --- block/blk-mq-debugfs.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index a5ea86835fcb..41b86f50d126 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -431,8 +431,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved) { const struct show_busy_params *params = data; - if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx && - blk_mq_rq_state(rq) != MQ_RQ_IDLE) + if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx) __blk_mq_debugfs_rq_show(params->m, list_entry_rq(&rq->queuelist)); } -- 2.19.0.605.g01d371f741-goog
Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started
On Wed, 2018-10-03 at 16:12 -0600, Jens Axboe wrote: > On 10/3/18 3:42 PM, Bart Van Assche wrote: > > On Fri, 2018-01-12 at 22:11 +0000, Bart Van Assche wrote: > > > /* > > > + * Show "busy" requests - these are the requests owned by the block > > > driver. > > > + * The test list_empty(&rq->queuelist) is used to figure out whether or > > > not > > > + * a request is owned by the block driver. That test works because the > > > block > > > + * layer core uses list_del_init() consistently to remove a request from > > > one > > > + * of the request lists. > > > + * > > > * Note: the state of a request may change while this function is in > > > progress, > > > * e.g. due to a concurrent blk_mq_finish_request() call. > > > */ > > > @@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, > > > void *data, bool reserved) > > > const struct show_busy_params *params = data; > > > > > > if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx && > > > - blk_mq_rq_state(rq) != MQ_RQ_IDLE) > > > + list_empty(&rq->queuelist)) > > > __blk_mq_debugfs_rq_show(params->m, > > > list_entry_rq(&rq->queuelist)); > > > } > > > > Hello Jens, > > > > Can you share your opinion about the above patch? > > I just can't convince myself that the list check is super useful. The request > could be on any number of lists, either not yet seen by the driver, or > maybe sitting in dispatch. You're only going to be finding these > requests if the tag is allocated, which means that it's already in some > sort of non-idle state. > > So enlighten me why we need the list check at all? Wouldn't it be better > to simply remove it, and ensure that the debug print includes things > like list state, rq state, etc? Hello Jens, I have tried to leave the list_empty() check out but if I do that then requests that have the state "idle" (allocated but not yet started) also show up. 
I think these should be left out from the output produced by reading the "busy" attribute because such requests are not interesting when analyzing an I/O lockup: nullb0/hctx1/busy:abe67123 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT|STATS, .state=idle, .tag=63, .internal_tag=-1} Thanks, Bart.
Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started
On Fri, 2018-01-12 at 22:11 +0000, Bart Van Assche wrote: > On Fri, 2018-01-12 at 15:05 -0700, Jens Axboe wrote: > > On 1/12/18 3:00 PM, Bart Van Assche wrote: > > > On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote: > > > > On 1/12/18 2:52 PM, Bart Van Assche wrote: > > > > > When debugging e.g. the SCSI timeout handler it is important that > > > > > requests that have not yet been started or that already have > > > > > completed are also reported through debugfs. > > > > > > > > > > This patch depends on a patch that went upstream recently, namely > > > > > commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer > > > > > dereference"). > > > > > > > > Why don't we just kill the check, and dump any request that has a > > > > matching hctx? We already know the bit was set, so just print > > > > all of them. > > > > > > It is very helpful during debugging that requests owned by a block driver > > > and > > > requests owned by the block layer core show up in different debugfs files. > > > Removing the check completely would cause all requests to show up in the > > > same > > > debugfs file and would make interpreting the contents of these debugfs > > > files > > > much harder. > > > > Yeah, we'd have to make it just one file at that point. I'm not hugely > > against the queuelist check, but probably warrants a comment as it's not > > immediately clear (as opposed to the idle check, or the previous START > > bit check). > > How about the below? > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c > index 19db3f583bf1..da859dac442b 100644 > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -394,6 +394,12 @@ struct show_busy_params { > }; > > /* > + * Show "busy" requests - these are the requests owned by the block driver. > + * The test list_empty(&rq->queuelist) is used to figure out whether or not > + * a request is owned by the block driver. 
That test works because the block > + * layer core uses list_del_init() consistently to remove a request from one > + * of the request lists. > + * > * Note: the state of a request may change while this function is in > progress, > * e.g. due to a concurrent blk_mq_finish_request() call. > */ > @@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, void > *data, bool reserved) > const struct show_busy_params *params = data; > > if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx && > - blk_mq_rq_state(rq) != MQ_RQ_IDLE) > + list_empty(&rq->queuelist)) > __blk_mq_debugfs_rq_show(params->m, > list_entry_rq(&rq->queuelist)); > } Hello Jens, Can you share your opinion about the above patch? Thanks, Bart.
[PATCH] block: Finish renaming REQ_DISCARD into REQ_OP_DISCARD
From: Bart Van Assche Some time ago REQ_DISCARD was renamed into REQ_OP_DISCARD. Some comments and documentation files were not updated however. Update these comments and documentation files. See also commit 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code"). Signed-off-by: Bart Van Assche Cc: Mike Christie Cc: Martin K. Petersen Cc: Philipp Reisner Cc: Lars Ellenberg --- Documentation/blockdev/zram.txt| 2 +- Documentation/device-mapper/log-writes.txt | 2 +- drivers/block/drbd/drbd_int.h | 2 +- drivers/block/drbd/drbd_main.c | 2 +- drivers/block/drbd/drbd_protocol.h | 4 ++-- drivers/block/drbd/drbd_req.c | 2 +- drivers/block/drbd/drbd_worker.c | 2 +- drivers/target/target_core_spc.c | 6 +++--- 8 files changed, 11 insertions(+), 11 deletions(-) diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 875b2b56b87f..3c1b5ab54bc0 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -190,7 +190,7 @@ whitespace: notify_free Depending on device usage scenario it may account a) the number of pages freed because of swap slot free notifications or b) the number of pages freed because of - REQ_DISCARD requests sent by bio. The former ones are + REQ_OP_DISCARD requests sent by bio. The former ones are sent to a swap block device when a swap slot is freed, which implies that this disk is being used as a swap disk. The latter ones are sent by filesystem mounted with diff --git a/Documentation/device-mapper/log-writes.txt b/Documentation/device-mapper/log-writes.txt index f4ebcbaf50f3..b638d124be6a 100644 --- a/Documentation/device-mapper/log-writes.txt +++ b/Documentation/device-mapper/log-writes.txt @@ -38,7 +38,7 @@ inconsistent file system. Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as they complete as those requests will obviously bypass the device cache. -Any REQ_DISCARD requests are treated like WRITE requests. 
Otherwise we would +Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would have all the DISCARD requests, and then the WRITE requests and then the FLUSH request. Consider the following example: diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index 8a4c1328e6f9..1e47db57b9d2 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -429,7 +429,7 @@ enum { __EE_CALL_AL_COMPLETE_IO, __EE_MAY_SET_IN_SYNC, - /* is this a TRIM aka REQ_DISCARD? */ + /* is this a TRIM aka REQ_OP_DISCARD? */ __EE_IS_TRIM, /* In case a barrier failed, diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 15a9ffce9012..55fd104f1ed4 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -1673,7 +1673,7 @@ static u32 bio_flags_to_wire(struct drbd_connection *connection, return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0; } -/* Used to send write or TRIM aka REQ_DISCARD requests +/* Used to send write or TRIM aka REQ_OP_DISCARD requests * R_PRIMARY -> Peer (P_DATA, P_TRIM) */ int drbd_send_dblock(struct drbd_peer_device *peer_device, struct drbd_request *req) diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h index c3081f93051c..48dabbb21e11 100644 --- a/drivers/block/drbd/drbd_protocol.h +++ b/drivers/block/drbd/drbd_protocol.h @@ -57,7 +57,7 @@ enum drbd_packet { P_PROTOCOL_UPDATE = 0x2d, /* data sock: is used in established connections */ /* 0x2e to 0x30 reserved, used in drbd 9 */ - /* REQ_DISCARD. We used "discard" in different contexts before, + /* REQ_OP_DISCARD. We used "discard" in different contexts before, * which is why I chose TRIM here, to disambiguate. 
*/ P_TRIM= 0x31, @@ -126,7 +126,7 @@ struct p_header100 { #define DP_UNPLUG 8 /* not used anymore */ #define DP_FUA 16 /* equals REQ_FUA */ #define DP_FLUSH 32 /* equals REQ_PREFLUSH */ -#define DP_DISCARD 64 /* equals REQ_DISCARD */ +#define DP_DISCARD 64 /* equals REQ_OP_DISCARD */ #define DP_SEND_RECEIVE_ACK 128 /* This is a proto B write request */ #define DP_SEND_WRITE_ACK 256 /* This is a proto C write request */ #define DP_WSAME512 /* equiv. REQ_WRITE_SAME */ diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c index 19cac36e9737..1c4da17e902e 100644 --- a/drivers/block/drbd/drbd_req.c +++ b/drivers/block/drbd/drbd_req.c @@ -650,7 +650,7 @@ int __req_mod(struct drbd_request *req, enum drbd_req_event what, case DISCARD_COMPLETED_NOTSUPP:
Re: [PATCH blktests 0/3] Add NVMeOF multipath tests
On Tue, 2018-09-18 at 17:18 -0700, Omar Sandoval wrote: > On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote: > > On 9/18/18 4:24 PM, Omar Sandoval wrote: > > > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote: > > > > Can you have a look at the updated master branch of > > > > https://github.com/bvanassche/blktests? That code should no longer fail > > > > if > > > > unloading the nvme kernel module fails. Please note that you will need > > > > kernel v4.18 to test these scripts - a KASAN complaint appears if I run > > > > these tests against kernel v4.19-rc4. > > > > > > Thanks, these pass now. Is it expected that my nvme device gets a > > > multipath device configured after running these tests? > > > > > > $ lsblk > > > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT > > > vda 254:00 16G 0 disk > > > └─vda1 254:10 16G 0 part / > > > vdb 254:16 0 8G 0 disk > > > vdc 254:32 0 8G 0 disk > > > vdd 254:48 0 8G 0 disk > > > nvme0n1 259:00 8G 0 disk > > > └─mpatha 253:00 8G 0 mpath > > > > No, all multipath devices that were created during a test should be removed > > before that test finishes. I will look into this. > > > > > Also, can you please fix: > > > > > > _have_kernel_option NVME_MULTIPATH && exit 1 > > > > > > to not exit on failure? It should use SKIP_REASON and return 1. You > > > might need to add something like _dont_have_kernel_option to properly > > > handle the case where the config is not found. > > > > OK, I will change this. > > > > > Side note which isn't a blocker for merging is that there's a lot of > > > duplicated code between these helpers and the srp helpers. How hard > > > would it be to refactor that? > > > > Are you perhaps referring to the code that is shared between the > > tests/srp/rc tests/nvmeof-mp/rc shell scripts? > > Yes, those. > > > The hardest part is probably > > to chose a location where to store these functions. 
> > Should I create a file with common code under common/, under tests/srp/,
> > under tests/nvmeof-mp/ or perhaps somewhere else?
>
> Just put it under common.

Hi Omar,

All feedback mentioned above has been addressed. The following pull request
has been updated: https://github.com/osandov/blktests/pull/33. Please let me
know if you want me to post these patches on the linux-block mailing list.

Note: neither the upstream kernel v4.18 nor v4.19-rc4 is stable enough to
pass all nvmeof-mp tests if kernel debugging options like KASAN are enabled.
Additionally, the NVMe device_add_disk() race condition often causes
multipathd to refuse to consider /dev/nvme... devices.

The output on my test setup is as follows (all tests pass):

# ./check -q nvmeof-mp
nvmeof-mp/001 (Log in and log out) [passed]
    runtime  1.528s  ...  1.909s
nvmeof-mp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
    runtime  38.968s  ...  38.571s
nvmeof-mp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [passed]
    runtime  38.632s  ...  37.529s
nvmeof-mp/005 (Direct I/O with large transfer sizes and bs=4M) [passed]
    runtime  13.382s  ...  13.684s
nvmeof-mp/006 (Direct I/O with large transfer sizes and bs=8M) [passed]
    runtime  13.511s  ...  13.480s
nvmeof-mp/009 (Buffered I/O with large transfer sizes and bs=4M) [passed]
    runtime  13.665s  ...  13.763s
nvmeof-mp/010 (Buffered I/O with large transfer sizes and bs=8M) [passed]
    runtime  13.442s  ...  13.900s
nvmeof-mp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
    runtime  37.988s  ...  37.945s
nvmeof-mp/012 (dm-mpath on top of multiple I/O schedulers) [passed]
    runtime  21.659s  ...  21.733s

Thanks,

Bart.
Re: [PATCHv3 0/5] genhd: register default groups with device_add_disk()
On Fri, 2018-09-21 at 07:48 +0200, Christoph Hellwig wrote: > Can you resend this with the one easy fixup pointed out? It would > be good to finally get the race fix merged. Seconded. I also would like to see these patches being merged upstream. Bart.
[PATCH v11 5/8] percpu-refcount: Introduce percpu_ref_resurrect()
This function will be used in a later patch to switch the struct
request_queue q_usage_counter from killed back to live. In contrast to
percpu_ref_reinit(), this new function does not require that the refcount
be zero.

Signed-off-by: Bart Van Assche
Acked-by: Tejun Heo
Reviewed-by: Ming Lei
Cc: Christoph Hellwig
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c           | 28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 009cdf3d65b6..b297cd1cd4f1 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -108,6 +108,7 @@ void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
+void percpu_ref_resurrect(struct percpu_ref *ref);
 void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..de10b8c0bff6 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -356,11 +356,35 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  */
 void percpu_ref_reinit(struct percpu_ref *ref)
 {
+	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+
+	percpu_ref_resurrect(ref);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+
+/**
+ * percpu_ref_resurrect - modify a percpu refcount from dead to live
+ * @ref: percpu_ref to resurrect
+ *
+ * Modify @ref so that it's in the same state as before percpu_ref_kill() was
+ * called. @ref must be dead but must not yet have exited.
+ *
+ * If @ref->release() frees @ref then the caller is responsible for
+ * guaranteeing that @ref->release() does not get called while this
+ * function is in progress.
+ *
+ * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
+ * this function is in progress.
+ */
+void percpu_ref_resurrect(struct percpu_ref *ref)
+{
+	unsigned long __percpu *percpu_count;
 	unsigned long flags;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
-	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+	WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD));
+	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
 
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
@@ -368,4 +392,4 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-- 
2.19.0.605.g01d371f741-goog
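For readers following along outside the kernel tree, the difference between percpu_ref_reinit() and the new percpu_ref_resurrect() can be illustrated with a small userspace model. This is only a sketch of the live/dead state machine: the toy_* names are invented here, and the real implementation uses a per-cpu/atomic split count plus RCU rather than a plain integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a percpu_ref: a combined count plus the __PERCPU_REF_DEAD bit. */
struct toy_ref {
	long count;	/* stands in for the per-cpu + atomic count */
	bool dead;	/* stands in for __PERCPU_REF_DEAD */
};

static void toy_ref_init(struct toy_ref *ref) { ref->count = 1; ref->dead = false; }
static void toy_ref_get(struct toy_ref *ref) { ref->count++; }
static void toy_ref_put(struct toy_ref *ref) { ref->count--; }
static void toy_ref_kill(struct toy_ref *ref) { ref->dead = true; ref->count--; }

/* reinit: only legal once the refcount has dropped to zero. */
static int toy_ref_reinit(struct toy_ref *ref)
{
	if (!ref->dead || ref->count != 0)
		return -1;	/* the kernel would WARN_ON_ONCE() here */
	ref->dead = false;
	ref->count = 1;
	return 0;
}

/* resurrect: only requires the ref to be dead, not drained. */
static int toy_ref_resurrect(struct toy_ref *ref)
{
	if (!ref->dead)
		return -1;
	ref->dead = false;
	ref->count++;	/* the percpu_ref_get() in the patch above */
	return 0;
}
```

The point is the relaxed precondition: resurrect only checks the DEAD flag, which is what allows a later patch to unfreeze a queue while requests are still in flight.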
[PATCH v11 4/8] block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a request has
been queued, schedule asynchronous resume during request allocation. The new
pm_request_resume() calls occur after blk_queue_enter() has increased the
q_usage_counter request queue member. This change is needed for a later patch
that will make request allocation block while the queue status is not
RPM_ACTIVE.

Signed-off-by: Bart Van Assche
Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c | 3 ++-
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..fec135ae52cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -956,7 +956,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 
 		wait_event(q->mq_freeze_wq,
 			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (pm || !blk_queue_pm_only(q))) ||
+			    (pm || (blk_pm_request_resume(q),
+				    !blk_queue_pm_only(q)))) ||
 			   blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 	trace_block_rq_insert(q, rq);
 
 	blk_pm_add_request(q, rq);
-	blk_pm_request_resume(q);
 
 	rq->q = q;
-- 
2.19.0.605.g01d371f741-goog
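The new wait condition relies on the C comma operator so that blk_pm_request_resume() runs as a side effect every time the condition is re-evaluated, while the value of the subexpression remains the pm-only check. A minimal userspace illustration follows; fake_request_resume(), may_enter() and the pm_only variable are stand-ins invented here, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

static int resume_attempts;	/* counts calls to the resume stand-in */
static int pm_only = 1;		/* cleared after two "resume" attempts below */

static void fake_request_resume(void)
{
	if (++resume_attempts >= 2)
		pm_only = 0;	/* pretend the asynchronous resume completed */
}

/*
 * Models the patched condition in blk_queue_enter():
 *   pm || (blk_pm_request_resume(q), !blk_queue_pm_only(q))
 * The comma operator evaluates the resume call for its side effect and then
 * yields the result of the pm-only check.
 */
static bool may_enter(bool pm)
{
	return pm || (fake_request_resume(), !pm_only);
}
```

Note the short-circuit: a PM request (pm == true) never triggers a resume attempt, exactly as in the patched kernel code.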
[PATCH v11 8/8] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with
RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power
management for blk-mq.

Signed-off-by: Bart Van Assche
Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96d501e8663c..d384ab700afd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	const int sched_tag = rq->internal_tag;
 
+	blk_pm_mark_last_busy(rq);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 972fbc656846..f8fdae01bea2 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
-		return;
-	}
-
	q->dev = dev;
	q->rpm_status = RPM_ACTIVE;
	pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.19.0.605.g01d371f741-goog
[PATCH v11 7/8] block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Reviewed-by: Ming Lei Reviewed-by: Christoph Hellwig Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 - block/blk-pm.c | 44 +++- 2 files changed, 47 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index fec135ae52cf..16dd3a989753 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2746,30 +2746,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2815,11 +2791,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, >queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index 9b636960d285..972fbc656846 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -68,14 +71,40 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + /* +* Increase the pm_only counter before checking whether any +* non-PM blk_queue_enter() calls are in progress to avoid that any +* new non-PM blk_queue_enter() calls succeed before the pm_only +* counter is decreased again. +*/ + blk_set_pm_only(q); + ret = -EBUSY; + /* Switch q_usage_counter from per-cpu to atomic mode. */ + blk_freeze_queue_start(q); + /* +* Wait until atomic mode has been reached. Since that +* involves calling call_rcu(), it is guaranteed that later +* blk_queue_enter() calls see the pm-only state. See also +* http://lwn.net/Articles/573497/. +*/ + percpu_ref_switch_to_atomic_sync(>q_usage_counter); + if (percpu_ref_is_zero(>q_usage_counter)) + ret = 0; + /* Switch q_usage_counter back to per-cpu mode. 
*/ + blk_mq_unfreeze_queue(q); + spin_lock_irq(q->queue_lock); - if (q->nr_pending) { - ret = -EBUSY; + if (ret < 0) pm_runtime_mark_last_busy(q->dev); - } else { + else q->rpm_status = RPM_SUSPENDING; - } spin_unlock_irq(q->queue_lock); + + if (ret) + blk_clear_pm_only(q); + return ret; } EXPORT_SYMBOL(blk_pre_runtime_suspend); @@ -106,6 +135,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int err) pm_runtime_mark_last_busy(q->dev); } spin_unlock_irq(q->queue_lock); + + if (err) + blk_clear_pm_only(q); } EXPORT_SYMBOL(blk_post_runtime_suspend); @@ -153,13 +185,15 @@ void blk_post_runtime_resume(struct request_queue *q, int err) spin_lock_irq(q->queue_lock); if (!err) { q->rpm_status = RPM_ACTIVE; - __blk_run_queue(q); pm_
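The ordering this patch depends on in blk_pre_runtime_suspend() — first enter pm-only mode, then look at the number of in-flight requests — can be modeled in a few lines of userspace C. This is a single-threaded sketch with invented toy_* names, not the kernel code; the real version uses q_usage_counter, a queue freeze, and percpu_ref_is_zero() instead of a plain integer.

```c
#include <assert.h>

/* Toy queue: a pm-only counter and an in-flight request count. */
struct toy_queue {
	int pm_only;	/* stands in for the blk_set_pm_only() counter */
	int in_flight;	/* stands in for q_usage_counter */
};

/* Models blk_queue_enter(): non-PM requests are held off in pm-only mode. */
static int toy_enter(struct toy_queue *q, int pm)
{
	if (!pm && q->pm_only)
		return -1;
	q->in_flight++;
	return 0;
}

static void toy_exit(struct toy_queue *q) { q->in_flight--; }

/* Returns 0 when the queue may be runtime suspended. Setting pm_only
 * *before* the in-flight check closes the window in which a new non-PM
 * request could slip in between the check and the suspend. */
static int toy_pre_runtime_suspend(struct toy_queue *q)
{
	q->pm_only++;			/* blk_set_pm_only() */
	if (q->in_flight == 0)		/* percpu_ref_is_zero() after freeze */
		return 0;
	q->pm_only--;			/* blk_clear_pm_only() on failure */
	return -16;			/* -EBUSY */
}
```
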
[PATCH v11 2/8] block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche Acked-by: Martin K. Petersen Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Reviewed-by: Ming Lei Cc: Jianchao Wang Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c| 35 ++- block/blk-mq-debugfs.c | 10 +- drivers/scsi/scsi_lib.c | 11 +++ include/linux/blkdev.h | 14 +- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d4dd176bd9d..1a691f5269bb 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_pm_only - increment pm_only counter * @q: request queue pointer - * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not - * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +void blk_set_pm_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + atomic_inc(>pm_only); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_pm_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_pm_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); - wake_up_all(>mq_freeze_wq); + int pm_only; + + pm_only = atomic_dec_return(>pm_only); + WARN_ON_ONCE(pm_only < 0); + if (pm_only == 0) + wake_up_all(>mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + const bool pm = flags & BLK_MQ_REQ_PREEMPT; while (true) { bool success = false; @@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) rcu_read_lock(); if (percpu_ref_tryget_live(>q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that increments the pm_only counter is +* responsible for ensuring that that counter is +* globally visible before the queue is unfrozen. 
*/ - if (preempt || !blk_queue_preempt_only(q)) { + if (pm || !blk_queue_pm_only(q)) { success = true; } else { percpu_ref_put(>q_usage_counter); @@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(>mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || + (pm || !blk_queue_pm_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index cb1e6cf7ac48..a5ea86835fcb 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags, return 0; } +static int queue_pm_only_show(void *data, struct seq_file *m) +{ + struct request_queue *q = data; + + seq_printf(m, "%d\n", atomic_read(>pm_only)); + return 0; +} + #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name static const char *con
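The effect of replacing the PREEMPT_ONLY bit with a counter can be sketched in userspace C with <stdatomic.h>. Function names mirror the patch, but this is a model invented for illustration, not the kernel code; the wakeups variable stands in for wake_up_all(&q->mq_freeze_wq).

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int pm_only;	/* replaces the single QUEUE_FLAG_PREEMPT_ONLY bit */
static int wakeups;		/* stands in for wake_up_all(&q->mq_freeze_wq) */

/* Any context (e.g. scsi_device_quiesce() or runtime suspend) may enter
 * pm-only mode independently. */
static void set_pm_only(void)
{
	atomic_fetch_add(&pm_only, 1);
}

/* Normal requests are admitted again only once every holder has cleared. */
static void clear_pm_only(void)
{
	int v = atomic_fetch_sub(&pm_only, 1) - 1;	/* atomic_dec_return() */

	assert(v >= 0);			/* WARN_ON_ONCE(pm_only < 0) */
	if (v == 0)
		wakeups++;		/* wake blocked blk_queue_enter() callers */
}

static int pm_only_active(void) { return atomic_load(&pm_only) != 0; }
```

With a plain flag, the first context to clear it would have re-admitted normal requests while the other context still needed the queue quiesced; the counter makes the two uses compose.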
[PATCH v11 0/8] blk-mq: Implement runtime power management
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default is
implementing runtime power management support for blk-mq. This patch series
not only implements runtime power management for blk-mq but also fixes a
starvation issue in the power management code for the legacy block layer.
Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v10:
- Added a comment in the percpu-refcount patch in this series as Tejun asked.
- Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
- Left out the patches that document the functions that iterate over requests
  and also the patch that introduces blk_mq_queue_rq_iter().
- Simplified blk_pre_runtime_suspend(): left out the check whether no
  requests are in progress.
- Fixed the race between blk_queue_enter(), queue freezing and runtime power
  management that Ming had identified.
- Added a new patch that introduces percpu_ref_resurrect().

Changes compared to v8:
- Fixed the race that was reported by Jianchao.
- Fixed another spelling issue in a source code comment.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block for
  non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate over
  requests and one that introduces a new function that iterates over all
  requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_QUIESCE.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to
  determine whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*()
  calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (8):
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  percpu-refcount: Introduce percpu_ref_resurrect()
  block: Allow unfreezing of a queue while requests are in progress
  block: Make blk_get_request() block for non-PM requests while suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig                   |   3 +
 block/Makefile                  |   1 +
 block/blk-core.c                | 270
 block/blk-mq-debugfs.c          |  10 +-
 block/blk-mq.c                  |   4 +-
 block/blk-pm.c                  | 216 +
 block/blk-pm.h                  |  69
 block/elevator.c                |  22 +--
 drivers/scsi/scsi_lib.c         |  11 +-
 drivers/scsi/scsi_pm.c          |   1 +
 drivers/scsi/sd.c               |   1 +
 drivers/scsi/sr.c               |   1 +
 include/linux/blk-pm.h          |  24 +++
 include/linux/blkdev.h          |  37 ++---
 include/linux/percpu-refcount.h |   1 +
 lib/percpu-refcount.c           |  28 +++-
 16 files changed, 401 insertions(+), 298 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/b
[PATCH v11 1/8] block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from into . For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche Reviewed-by: Ming Lei Reviewed-by: Christoph Hellwig Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c | 196 + block/blk-pm.c | 188 +++ block/blk-pm.h | 43 + block/elevator.c | 22 + drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 + include/linux/blkdev.h | 23 - 11 files changed, 264 insertions(+), 239 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h diff --git a/block/Kconfig b/block/Kconfig index 1f2469a0123c..85263e7bded6 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -228,4 +228,7 @@ config BLK_MQ_RDMA depends on BLOCK && INFINIBAND default y +config BLK_PM + def_bool BLOCK && PM + source block/Kconfig.iosched diff --git a/block/Makefile b/block/Makefile index 572b33f32c07..27eac600474f 100644 --- a/block/Makefile +++ b/block/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o +obj-$(CONFIG_BLK_PM) += blk-pm.o diff --git a/block/blk-core.c b/block/blk-core.c index 4dbc93f43b38..6d4dd176bd9d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -42,6 +42,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" +#include "blk-pm.h" #include "blk-rq-qos.h" #ifdef CONFIG_DEBUG_FS @@ -1726,16 +1727,6 @@ void 
part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) } EXPORT_SYMBOL_GPL(part_round_stats); -#ifdef CONFIG_PM -static void blk_pm_put_request(struct request *rq) -{ - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); -} -#else -static inline void blk_pm_put_request(struct request *rq) {} -#endif - void __blk_put_request(struct request_queue *q, struct request *req) { req_flags_t rq_flags = req->rq_flags; @@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); -#ifdef CONFIG_PM -/** - * blk_pm_runtime_init - Block layer runtime PM initialization routine - * @q: the queue of the device - * @dev: the device the queue belongs to - * - * Description: - *Initialize runtime-PM-related fields for @q and start auto suspend for - *@dev. Drivers that want to take advantage of request-based runtime PM - *should call this function after @dev has been initialized, and its - *request queue @q has been allocated, and runtime PM for it can not happen - *yet(either due to disabled/forbidden or its usage_count > 0). In most - *cases, driver should call this function before any I/O has taken place. - * - *This function takes care of setting up using auto suspend for the device, - *the autosuspend delay is set to -1 to make runtime suspend impossible - *until an updated value is either set by user or by driver. Drivers do - *not need to touch other autosuspend settings. - * - *The block layer runtime PM is request based, so only works for drivers - *that use request as their IO unit instead of those directly use bio's. 
- */ -void blk_pm_runtime_init(struct request_queue *q, struct device *dev) -{ - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - - q->dev = dev; - q->rpm_status = RPM_ACTIVE; - pm_runtime_set_autosuspend_delay(q->dev, -1); - pm_runtime_use_autosuspend(q->dev); -} -EXPORT_SYMBOL(blk_pm_runtime_init); - -/** - * blk_pre_runtime_suspend - Pre runtime suspend check - * @q: the queue of the device - * - * Description: - *This function will check if runtime suspend is allowed for the device - *by examining if there are any requests pending in the queue. If there - *are requests pending, the device can not be runtime suspended; otherwise, - *the queue's status will be updated to SUSPENDING and the driver can - *pro
[PATCH v11 6/8] block: Allow unfreezing of a queue while requests are in progress
A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop to zero.
Make sure that this doesn't cause a kernel warning to appear by switching
from percpu_ref_reinit() to percpu_ref_resurrect(). The former requires that
the refcount it operates on be zero.

Signed-off-by: Bart Van Assche
Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..96d501e8663c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -198,7 +198,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->q_usage_counter);
+		percpu_ref_resurrect(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
-- 
2.19.0.605.g01d371f741-goog
[PATCH v11 3/8] block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two
new functions and thereby separate legacy block layer code from code that
works for both the legacy block layer and blk-mq. A later patch will add
calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche
Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Cc: Martin K. Petersen
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c |  1 +
 block/blk-pm.h   | 36 +++-
 block/elevator.c |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1a691f5269bb..fd91e9bf2893 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_req_zone_write_unlock(req);
 
 	blk_pm_put_request(req);
+	blk_pm_mark_last_busy(req);
 
 	elv_completed_request(q, req);
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..a8564ea72a41 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,8 +6,23 @@
 #include
 
 #ifdef CONFIG_PM
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+		       q->rpm_status == RPM_SUSPENDING))
+		pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+		pm_runtime_mark_last_busy(rq->q->dev);
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
+	lockdep_assert_held(rq->q->queue_lock);
+
 	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
 		rq->q->nr_pending--;
 }
@@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request *rq)
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-		pm_request_resume(q->dev);
+	lockdep_assert_held(q->queue_lock);
+
+	if (q->dev && !(rq->rq_flags & RQF_PM))
+		q->nr_pending++;
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-		pm_runtime_mark_last_busy(rq->q->dev);
+	lockdep_assert_held(rq->q->queue_lock);
+
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+		--rq->q->nr_pending;
 }
 #else
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..1c992bf6cfb1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 	trace_block_rq_insert(q, rq);
 
 	blk_pm_add_request(q, rq);
+	blk_pm_request_resume(q);
 
 	rq->q = q;
-- 
2.19.0.605.g01d371f741-goog
Re: [PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()
On Wed, 2018-09-26 at 09:59 -0700, Tejun Heo wrote: > Hello, Bart. > > On Mon, Sep 24, 2018 at 01:43:35PM -0700, Bart Van Assche wrote: > > Thanks for the review. But could you please clarify what "after ->percpu_ref > > is called" refers to? > > Oops, sorry, I meant the confirm_kill callback of > percpu_ref_kill_and_confirm(). Thanks for the clarification. I will update the comment above percpu_ref_resurrect(). Bart.
Re: [PATCH v10 7/8] block: Make blk_get_request() block for non-PM requests while suspended
On Wed, 2018-09-26 at 17:06 +0200, Johannes Thumshirn wrote: > On Wed, Sep 26, 2018 at 04:57:32PM +0200, Christoph Hellwig wrote: > > I don't think this actually works given that rpm_status only exists > > if CONFIG_PM is set. > > I think it'll work as GCC does constant propagation. There are > actually some places in the kernel that follow this pattern. This is what gcc on my development system thinks about that proposal: In file included from ./arch/x86/include/asm/bug.h:83:0, from ./include/linux/bug.h:5, from ./include/linux/thread_info.h:12, from ./arch/x86/include/asm/preempt.h:7, from ./include/linux/preempt.h:81, from ./include/linux/spinlock.h:51, from ./include/linux/seqlock.h:36, from ./include/linux/time.h:6, from ./include/linux/stat.h:19, from ./include/linux/module.h:10, from block/blk-core.c:15: block/blk-core.c: In function ‘elv_next_request’: block/blk-core.c:2795:44: error: ‘struct request_queue’ has no member named ‘rpm_status’; did you mean ‘stats’? WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); ^ ./include/asm-generic/bug.h:69:25: note: in definition of macro ‘WARN_ON_ONCE’ int __ret_warn_on = !!(condition); \ ^ scripts/Makefile.build:305: recipe for target 'block/blk-core.o' failed Bart.
Re: [PATCH v2] block: fix deadline elevator drain for zoned block devices
On Wed, 2018-09-26 at 16:30 +0900, Damien Le Moal wrote: > diff --git a/block/elevator.c b/block/elevator.c > index 6a06b5d040e5..8cd81fd6339a 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -609,7 +609,7 @@ void elv_drain_elevator(struct request_queue *q) > > while (e->type->ops.sq.elevator_dispatch_fn(q, 1)) > ; > - if (q->nr_sorted && printed++ < 10) { > + if (q->nr_sorted && printed++ < 10 && !blk_queue_is_zoned(q)) { > printk(KERN_ERR "%s: forced dispatching is broken " > "(nr_sorted=%u), please report this\n", > q->elevator->type->elevator_name, q->nr_sorted); It seems wrong to me to perform the blk_queue_is_zoned() check after having incremented the "printed" variable. Shouldn't that check be performed before the increment, so that "printed" is not incremented when we already know that the error message will not be reported? Thanks, Bart.
Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks
On Tue, 2018-09-25 at 08:39 -0600, Keith Busch wrote: > On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote: > > But the issue is the left part of blk_mq_timeout_work is moved out of > > protection of q refcount. > > I'm not sure what you mean by "left part". The only part that isn't > outside the reference with this patch is the part Bart pointed out. > > This looks like it may be fixed by either moving the refcount back up a > level to all the callers of blk_mq_queue_tag_busy_iter, or add > cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after > the freeze. Hi Keith, How about applying the following (untested) patch on top of your patch? diff --git a/block/blk-mq.c b/block/blk-mq.c index 019f9b169887..099e203b5213 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -851,6 +851,9 @@ static void blk_mq_timeout_work(struct work_struct *work) if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next)) return; + if (!percpu_ref_tryget(&q->q_usage_counter)) + return; + if (next != 0) { mod_timer(&q->timeout, next); } else { @@ -866,6 +869,7 @@ static void blk_mq_timeout_work(struct work_struct *work) blk_mq_tag_idle(hctx); } } + blk_queue_exit(q); } Thanks, Bart.
Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks
On 9/24/18 7:11 PM, jianchao.wang wrote: Hi Keith On 09/25/2018 05:09 AM, Keith Busch wrote: - /* A deadlock might occur if a request is stuck requiring a -* timeout at the same time a queue freeze is waiting -* completion, since the timeout code would not be able to -* acquire the queue reference here. -* -* That's why we don't use blk_queue_enter here; instead, we use -* percpu_ref_tryget directly, because we need to be able to -* obtain a reference even in the short window between the queue -* starting to freeze, by dropping the first reference in -* blk_freeze_queue_start, and the moment the last request is -* consumed, marked by the instant q_usage_counter reaches -* zero. -*/ - if (!percpu_ref_tryget(&q->q_usage_counter)) + if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next)) return; We cannot discard the percpu_ref_tryget here. The following code in blk_mq_timeout_work still needs it: if (next != 0) { mod_timer(&q->timeout, next); } else { queue_for_each_hw_ctx(q, hctx, i) { /* the hctx may be unmapped, so check it here */ if (blk_mq_hw_queue_mapped(hctx)) blk_mq_tag_idle(hctx); } } Hi Jianchao, Had you noticed that the percpu_ref_tryget() call has been moved into blk_mq_queue_tag_busy_iter()? Bart.
Re: [PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()
On Mon, 2018-09-24 at 11:01 -0700, Tejun Heo wrote: > Hello, Bart. > > On Fri, Sep 21, 2018 at 01:31:19PM -0700, Bart Van Assche wrote: > > +void percpu_ref_resurrect(struct percpu_ref *ref) > > +{ > > + unsigned long __percpu *percpu_count; > > unsigned long flags; > > > > spin_lock_irqsave(&percpu_ref_switch_lock, flags); > > > > - WARN_ON_ONCE(!percpu_ref_is_zero(ref)); > > + WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)); > > + WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count)); > > So, this in itself is fine but I'm a bit worried that this might make it > easier to abuse percpu_ref's dying mode. More specifically, it isn't > a good idea to depend on percpu_ref's implementation details from > outside - ie. the only guarantee there ever is that percpu_ref won't > give out new refs after ->percpu_ref is called - e.g. users can't > piggy back on implied rcu grace period. > > Can you please note that in the function comment? Provided that, > > Acked-by: Tejun Heo Hi Tejun, Thanks for the review. But could you please clarify what "after ->percpu_ref is called" refers to? Thanks, Bart.
Re: Regression caused by f5bbbbe4d635
On Mon, 2018-09-24 at 13:13 -0600, Keith Busch wrote: > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 85a1c1a59c72..28d128450621 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -848,22 +848,6 @@ static void blk_mq_timeout_work(struct work_struct *work) > struct blk_mq_hw_ctx *hctx; > int i; > > - /* A deadlock might occur if a request is stuck requiring a > - * timeout at the same time a queue freeze is waiting > - * completion, since the timeout code would not be able to > - * acquire the queue reference here. > - * > - * That's why we don't use blk_queue_enter here; instead, we use > - * percpu_ref_tryget directly, because we need to be able to > - * obtain a reference even in the short window between the queue > - * starting to freeze, by dropping the first reference in > - * blk_freeze_queue_start, and the moment the last request is > - * consumed, marked by the instant q_usage_counter reaches > - * zero. > - */ > - if (!percpu_ref_tryget(&q->q_usage_counter)) > - return; > - > blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next); > > if (next != 0) { > @@ -881,7 +865,6 @@ static void blk_mq_timeout_work(struct work_struct *work) > blk_mq_tag_idle(hctx); > } > } > - blk_queue_exit(q); > } Hi Keith, The above introduces a behavior change: if the percpu_ref_tryget() call inside blk_mq_queue_tag_busy_iter() fails then blk_mq_timeout_work() will now call blk_mq_tag_idle(). I think that's wrong if the percpu_ref_tryget() call fails due to the queue having been frozen. Please make blk_mq_queue_tag_busy_iter() return a bool that indicates whether or not it has iterated over the request queue. Thanks, Bart.
[PATCH v2] blk-mq: Document the functions that iterate over requests
Make it easier to understand the purpose of the functions that iterate over requests by documenting their purpose. Fix several minor spelling and grammar mistakes in comments in these functions. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke --- Changes compared to v1: converted to kernel-doc format. block/blk-mq-tag.c | 71 +- 1 file changed, 64 insertions(+), 7 deletions(-) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 94e1ed667b6e..40d1667bceac 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -232,13 +232,26 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) /* * We can hit rq == NULL here, because the tagging functions -* test and set the bit before assining ->rqs[]. +* test and set the bit before assigning ->rqs[]. */ if (rq && rq->q == hctx->queue) iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } +/** + * bt_for_each - iterate over the requests associated with a hardware queue + * @hctx: Hardware queue to examine. + * @bt: sbitmap to examine. This is either the breserved_tags member + * or the bitmap_tags member of struct blk_mq_tags. + * @fn: Pointer to the function that will be called for each request + * associated with @hctx that has been assigned a driver tag. + * @fn will be called as follows: @fn(@hctx, rq, @data, @reserved) + * where rq is a pointer to a request. + * @data: Will be passed as third argument to @fn. + * @reserved: Indicates whether @bt is the breserved_tags member or the + * bitmap_tags member of struct blk_mq_tags. + */ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, busy_iter_fn *fn, void *data, bool reserved) { @@ -280,6 +293,18 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) return true; } +/** + * bt_tags_for_each - iterate over the requests in a tag map + * @tags: Tag map to iterate over. 
+ * @bt: sbitmap to examine. This is either the breserved_tags member + * or the bitmap_tags member of struct blk_mq_tags. + * @fn: Pointer to the function that will be called for each started + * request. @fn will be called as follows: @fn(rq, @data, + * @reserved) where rq is a pointer to a request. + * @data: Will be passed as second argument to @fn. + * @reserved: Indicates whether @bt is the breserved_tags member or the + * bitmap_tags member of struct blk_mq_tags. + */ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, busy_tag_iter_fn *fn, void *data, bool reserved) { @@ -294,6 +319,15 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); } +/** + * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map + * @tags: Tag map to iterate over. + * @fn: Pointer to the function that will be called for each started + * request. @fn will be called as follows: @fn(rq, @priv, + * reserved) where rq is a pointer to a request. 'reserved' + * indicates whether or not @rq is a reserved request. + * @priv: Will be passed as second argument to @fn. + */ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, void *priv) { @@ -302,6 +336,15 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false); } +/** + * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set + * @tagset: Tag set to iterate over. + * @fn: Pointer to the function that will be called for each started + * request. @fn will be called as follows: @fn(rq, @priv, + * reserved) where rq is a pointer to a request. 'reserved' + * indicates whether or not @rq is a reserved request. + * @priv: Will be passed as second argument to @fn. 
+ */ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv) { @@ -314,6 +357,20 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, } EXPORT_SYMBOL(blk_mq_tagset_busy_iter); +/** + * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag + * @q: Request queue to examine. + * @fn:Pointer to the function that will be called for each request + * on @q. @fn will be called as follows: @fn(hctx, rq, @priv, + * reserved) where rq is
[PATCH v10 8/8] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power management for blk-mq. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-mq.c | 2 ++ block/blk-pm.c | 6 -- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 96d501e8663c..d384ab700afd 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -33,6 +33,7 @@ #include "blk-mq.h" #include "blk-mq-debugfs.h" #include "blk-mq-tag.h" +#include "blk-pm.h" #include "blk-stat.h" #include "blk-mq-sched.h" #include "blk-rq-qos.h" @@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq) struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); const int sched_tag = rq->internal_tag; + blk_pm_mark_last_busy(rq); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) diff --git a/block/blk-pm.c b/block/blk-pm.c index 972fbc656846..f8fdae01bea2 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -30,12 +30,6 @@ */ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) { - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - q->dev = dev; q->rpm_status = RPM_ACTIVE; pm_runtime_set_autosuspend_delay(q->dev, -1); -- 2.19.0.444.g18242da7ef-goog
[PATCH v10 3/8] block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two new functions and thereby separate legacy block layer code from code that works for both the legacy block layer and blk-mq. A later patch will add calls to the new functions in the blk-mq code. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 1 + block/blk-pm.h | 36 +++- block/elevator.c | 1 + 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 1a691f5269bb..fd91e9bf2893 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_req_zone_write_unlock(req); blk_pm_put_request(req); + blk_pm_mark_last_busy(req); elv_completed_request(q, req); diff --git a/block/blk-pm.h b/block/blk-pm.h index 1ffc8ef203ec..a8564ea72a41 100644 --- a/block/blk-pm.h +++ b/block/blk-pm.h @@ -6,8 +6,23 @@ #include #ifdef CONFIG_PM +static inline void blk_pm_request_resume(struct request_queue *q) +{ + if (q->dev && (q->rpm_status == RPM_SUSPENDED || + q->rpm_status == RPM_SUSPENDING)) + pm_request_resume(q->dev); +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + pm_runtime_mark_last_busy(rq->q->dev); +} + static inline void blk_pm_requeue_request(struct request *rq) { + lockdep_assert_held(rq->q->queue_lock); + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) rq->q->nr_pending--; } @@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request *rq) static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { - if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 && - (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)) - pm_request_resume(q->dev); + lockdep_assert_held(q->queue_lock); + + if (q->dev && 
!(rq->rq_flags & RQF_PM)) + q->nr_pending++; } static inline void blk_pm_put_request(struct request *rq) { - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); + lockdep_assert_held(rq->q->queue_lock); + + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + --rq->q->nr_pending; } #else +static inline void blk_pm_request_resume(struct request_queue *q) +{ +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ +} + static inline void blk_pm_requeue_request(struct request *rq) { } diff --git a/block/elevator.c b/block/elevator.c index e18ac68626e3..1c992bf6cfb1 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); + blk_pm_request_resume(q); rq->q = q; -- 2.19.0.444.g18242da7ef-goog
[PATCH v10 1/8] block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of the <linux/blkdev.h> header file and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c | 196 + block/blk-pm.c | 188 +++ block/blk-pm.h | 43 + block/elevator.c | 22 + drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 + include/linux/blkdev.h | 23 - 11 files changed, 264 insertions(+), 239 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h diff --git a/block/Kconfig b/block/Kconfig index 1f2469a0123c..85263e7bded6 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -228,4 +228,7 @@ config BLK_MQ_RDMA depends on BLOCK && INFINIBAND default y +config BLK_PM + def_bool BLOCK && PM + source block/Kconfig.iosched diff --git a/block/Makefile b/block/Makefile index 572b33f32c07..27eac600474f 100644 --- a/block/Makefile +++ b/block/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o +obj-$(CONFIG_BLK_PM) += blk-pm.o diff --git a/block/blk-core.c b/block/blk-core.c index 4dbc93f43b38..6d4dd176bd9d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -42,6 +42,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" +#include "blk-pm.h" #include "blk-rq-qos.h" #ifdef CONFIG_DEBUG_FS @@ -1726,16 +1727,6 @@ void 
part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) } EXPORT_SYMBOL_GPL(part_round_stats); -#ifdef CONFIG_PM -static void blk_pm_put_request(struct request *rq) -{ - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); -} -#else -static inline void blk_pm_put_request(struct request *rq) {} -#endif - void __blk_put_request(struct request_queue *q, struct request *req) { req_flags_t rq_flags = req->rq_flags; @@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); -#ifdef CONFIG_PM -/** - * blk_pm_runtime_init - Block layer runtime PM initialization routine - * @q: the queue of the device - * @dev: the device the queue belongs to - * - * Description: - *Initialize runtime-PM-related fields for @q and start auto suspend for - *@dev. Drivers that want to take advantage of request-based runtime PM - *should call this function after @dev has been initialized, and its - *request queue @q has been allocated, and runtime PM for it can not happen - *yet(either due to disabled/forbidden or its usage_count > 0). In most - *cases, driver should call this function before any I/O has taken place. - * - *This function takes care of setting up using auto suspend for the device, - *the autosuspend delay is set to -1 to make runtime suspend impossible - *until an updated value is either set by user or by driver. Drivers do - *not need to touch other autosuspend settings. - * - *The block layer runtime PM is request based, so only works for drivers - *that use request as their IO unit instead of those directly use bio's. 
- */ -void blk_pm_runtime_init(struct request_queue *q, struct device *dev) -{ - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - - q->dev = dev; - q->rpm_status = RPM_ACTIVE; - pm_runtime_set_autosuspend_delay(q->dev, -1); - pm_runtime_use_autosuspend(q->dev); -} -EXPORT_SYMBOL(blk_pm_runtime_init); - -/** - * blk_pre_runtime_suspend - Pre runtime suspend check - * @q: the queue of the device - * - * Description: - *This function will check if runtime suspend is allowed for the device - *by examining if there are any requests pending in the queue. If there - *are requests pending, the device can not be runtime suspended; otherwise, - *the queue's status will be updated to SUSPENDING and the driver can - *proceed to s
[PATCH v10 7/8] block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 - block/blk-pm.c | 44 +++- 2 files changed, 47 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index fec135ae52cf..16dd3a989753 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2746,30 +2746,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2815,11 +2791,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, &q->queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index 9b636960d285..972fbc656846 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -68,14 +71,40 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + /* +* Increase the pm_only counter before checking whether any +* non-PM blk_queue_enter() calls are in progress to avoid that any +* new non-PM blk_queue_enter() calls succeed before the pm_only +* counter is decreased again. +*/ + blk_set_pm_only(q); + ret = -EBUSY; + /* Switch q_usage_counter from per-cpu to atomic mode. */ + blk_freeze_queue_start(q); + /* +* Wait until atomic mode has been reached. Since that +* involves calling call_rcu(), it is guaranteed that later +* blk_queue_enter() calls see the pm-only state. See also +* http://lwn.net/Articles/573497/. +*/ + percpu_ref_switch_to_atomic_sync(&q->q_usage_counter); + if (percpu_ref_is_zero(&q->q_usage_counter)) + ret = 0; + /* Switch q_usage_counter back to per-cpu mode. 
*/ + blk_mq_unfreeze_queue(q); + spin_lock_irq(q->queue_lock); - if (q->nr_pending) { - ret = -EBUSY; + if (ret < 0) pm_runtime_mark_last_busy(q->dev); - } else { + else q->rpm_status = RPM_SUSPENDING; - } spin_unlock_irq(q->queue_lock); + + if (ret) + blk_clear_pm_only(q); + return ret; } EXPORT_SYMBOL(blk_pre_runtime_suspend); @@ -106,6 +135,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int err) pm_runtime_mark_last_busy(q->dev); } spin_unlock_irq(q->queue_lock); + + if (err) + blk_clear_pm_only(q); } EXPORT_SYMBOL(blk_post_runtime_suspend); @@ -153,13 +185,15 @@ void blk_post_runtime_resume(struct request_queue *q, int err) spin_lock_irq(q->queue_lock); if (!err) { q->rpm_status = RPM_ACTIVE; - __blk_run_queue(q); pm_runtime_mark_last_bu
[PATCH v10 2/8] block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c| 35 ++- block/blk-mq-debugfs.c | 10 +- drivers/scsi/scsi_lib.c | 11 +++ include/linux/blkdev.h | 14 +- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d4dd176bd9d..1a691f5269bb 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_pm_only - increment pm_only counter * @q: request queue pointer - * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not - * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +void blk_set_pm_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + atomic_inc(&q->pm_only); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_pm_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_pm_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); - wake_up_all(&q->mq_freeze_wq); + int pm_only; + + pm_only = atomic_dec_return(&q->pm_only); + WARN_ON_ONCE(pm_only < 0); + if (pm_only == 0) + wake_up_all(&q->mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + const bool pm = flags & BLK_MQ_REQ_PREEMPT; while (true) { bool success = false; @@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) rcu_read_lock(); if (percpu_ref_tryget_live(&q->q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that increments the pm_only counter is +* responsible for ensuring that that counter is +* globally visible before the queue is unfrozen. 
*/ - if (preempt || !blk_queue_preempt_only(q)) { + if (pm || !blk_queue_pm_only(q)) { success = true; } else { percpu_ref_put(&q->q_usage_counter); } @@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || + (pm || !blk_queue_pm_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index cb1e6cf7ac48..a5ea86835fcb 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags, return 0; } +static int queue_pm_only_show(void *data, struct seq_file *m) +{ + struct request_queue *q = data; + + seq_printf(m, "%d\n", atomic_read(&q->pm_only)); + return 0; +} + #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name static const char *con
[PATCH v10 5/8] percpu-refcount: Introduce percpu_ref_resurrect()
This function will be used in a later patch to switch the struct request_queue q_usage_counter from killed back to live. In contrast to percpu_ref_reinit(), this new function does not require that the refcount is zero. Signed-off-by: Bart Van Assche Cc: Tejun Heo Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn --- include/linux/percpu-refcount.h | 1 + lib/percpu-refcount.c | 28 ++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index 009cdf3d65b6..b297cd1cd4f1 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -108,6 +108,7 @@ void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref); void percpu_ref_switch_to_percpu(struct percpu_ref *ref); void percpu_ref_kill_and_confirm(struct percpu_ref *ref, percpu_ref_func_t *confirm_kill); +void percpu_ref_resurrect(struct percpu_ref *ref); void percpu_ref_reinit(struct percpu_ref *ref); /** diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index 9f96fa7bc000..17fe3e996ddc 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -356,11 +356,35 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm); */ void percpu_ref_reinit(struct percpu_ref *ref) { + WARN_ON_ONCE(!percpu_ref_is_zero(ref)); + + percpu_ref_resurrect(ref); +} +EXPORT_SYMBOL_GPL(percpu_ref_reinit); + +/** + * percpu_ref_resurrect - modify a percpu refcount from dead to live + * @ref: percpu_ref to resurrect + * + * Modify @ref so that it's in the same state as before percpu_ref_kill() was + * called. @ref must be dead but not yet exited. + * + * If @ref->release() frees @ref then the caller is responsible for + * guaranteeing that @ref->release() does not get called while this + * function is in progress. + * + * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while + * this function is in progress. 
+ */ +void percpu_ref_resurrect(struct percpu_ref *ref) +{ + unsigned long __percpu *percpu_count; unsigned long flags; spin_lock_irqsave(&percpu_ref_switch_lock, flags); - WARN_ON_ONCE(!percpu_ref_is_zero(ref)); + WARN_ON_ONCE(!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)); + WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count)); ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD; percpu_ref_get(ref); @@ -368,4 +392,4 @@ void percpu_ref_reinit(struct percpu_ref *ref) spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); } -EXPORT_SYMBOL_GPL(percpu_ref_reinit); +EXPORT_SYMBOL_GPL(percpu_ref_resurrect); -- 2.19.0.444.g18242da7ef-goog
[PATCH v10 6/8] block: Allow unfreezing of a queue while requests are in progress
A later patch will call blk_freeze_queue_start() followed by blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop to zero. Make sure that this doesn't cause a kernel warning to appear by switching from percpu_ref_reinit() to percpu_ref_resurrect(), because the former requires that the refcount it operates on is zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn --- block/blk-mq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 85a1c1a59c72..96d501e8663c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -198,7 +198,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q) freeze_depth = atomic_dec_return(&q->mq_freeze_depth); WARN_ON_ONCE(freeze_depth < 0); if (!freeze_depth) { - percpu_ref_reinit(&q->q_usage_counter); + percpu_ref_resurrect(&q->q_usage_counter); wake_up_all(&q->mq_freeze_wq); } } -- 2.19.0.444.g18242da7ef-goog
[PATCH v10 4/8] block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 3 ++- block/elevator.c | 1 - 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index fd91e9bf2893..fec135ae52cf 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -956,7 +956,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && - (pm || !blk_queue_pm_only(q))) || + (pm || (blk_pm_request_resume(q), + !blk_queue_pm_only(q)))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/block/elevator.c b/block/elevator.c index 1c992bf6cfb1..e18ac68626e3 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); - blk_pm_request_resume(q); rq->q = q; -- 2.19.0.444.g18242da7ef-goog
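The comma operator in the new wait_event() condition is easy to miss. The following userspace sketch (invented names, not kernel code) models why it guarantees that an asynchronous resume has been requested before the pm-only state is rechecked for a non-PM request:

```c
#include <stdbool.h>

static bool pm_only;           /* models blk_queue_pm_only(q) */
static bool resume_requested;  /* models an async pm_request_resume() */

static void request_resume(void)
{
	resume_requested = true;
}

/* Models the rewritten wait_event() condition: for a non-PM request,
 * the comma operator forces request_resume() to run before pm_only is
 * rechecked, so a waiter can never sleep indefinitely on a
 * runtime-suspended queue without having kicked off a resume first. */
static bool may_enter(bool pm)
{
	return pm || (request_resume(), !pm_only);
}
```

A PM request (`pm == true`) short-circuits and never touches the pm-only state, matching the behavior in the patch.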
[PATCH v10 0/8] blk-mq: Implement runtime power management
Hello Jens, One of the pieces that is missing before blk-mq can be made the default is implementing runtime power management support for blk-mq. This patch series not only implements runtime power management for blk-mq but also fixes a starvation issue in the power management code for the legacy block layer. Please consider this patch series for the upstream kernel. Thanks, Bart. Changes compared to v9: - Left out the patches that document the functions that iterate over requests and also the patch that introduces blk_mq_queue_rq_iter(). - Simplified blk_pre_runtime_suspend(): left out the check whether no requests are in progress. - Fixed the race between blk_queue_enter(), queue freezing and runtime power management that Ming had identified. - Added a new patch that introduces percpu_ref_resurrect(). Changes compared to v8: - Fixed the race that was reported by Jianchao. - Fixed another spelling issue in a source code comment. Changes compared to v7: - Addressed Jianchao's feedback about patch "Make blk_get_request() block for non-PM requests while suspended". - Added two new patches - one that documents the functions that iterate over requests and one that introduces a new function that iterates over all requests associated with a queue. Changes compared to v6: - Left out the patches that split RQF_PREEMPT in three flags. - Left out the patch that introduces the SCSI device state SDEV_SUSPENDED. - Left out the patch that introduces blk_pm_runtime_exit(). - Restored the patch that changes the PREEMPT_ONLY flag into a counter. Changes compared to v5: - Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain validation. - Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain validation. - Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain validation, use that state for domain validation only and introduce a new state for runtime suspend, namely SDEV_QUIESCE. - Reallow system suspend during SCSI domain validation. 
- Moved the runtime resume call from the request allocation code into blk_queue_enter(). - Instead of relying on q_usage_counter, iterate over the tag set to determine whether or not any requests are in flight. Changes compared to v4: - Dropped the patches "Give RQF_PREEMPT back its original meaning" and "Serialize queue freezing and blk_pre_runtime_suspend()". - Replaced "percpu_ref_read()" with "percpu_is_in_use()". - Inserted pm_request_resume() calls in the block layer request allocation code such that the context that submits a request no longer has to call pm_runtime_get(). Changes compared to v3: - Avoid adverse interactions between system-wide suspend/resume and runtime power management by changing the PREEMPT_ONLY flag into a counter. - Give RQF_PREEMPT back its original meaning, namely that it is only set for ide_preempt requests. - Remove the flag BLK_MQ_REQ_PREEMPT. - Removed the pm_request_resume() call. Changes compared to v2: - Fixed the build for CONFIG_BLOCK=n. - Added a patch that introduces percpu_ref_read() in the percpu-counter code. - Added a patch that makes it easier to detect missing pm_runtime_get*() calls. - Addressed Jianchao's feedback including the comment about runtime overhead of switching a per-cpu counter to atomic mode. Changes compared to v1: - Moved the runtime power management code into a separate file. - Addressed Ming's feedback. 
Bart Van Assche (8): block: Move power management code into a new source file block, scsi: Change the preempt-only flag into a counter block: Split blk_pm_add_request() and blk_pm_put_request() block: Schedule runtime resume earlier percpu-refcount: Introduce percpu_ref_resurrect() block: Allow unfreezing of a queue while requests are in progress block: Make blk_get_request() block for non-PM requests while suspended blk-mq: Enable support for runtime power management block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c| 270 block/blk-mq-debugfs.c | 10 +- block/blk-mq.c | 4 +- block/blk-pm.c | 216 + block/blk-pm.h | 69 block/elevator.c| 22 +-- drivers/scsi/scsi_lib.c | 11 +- drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 +++ include/linux/blkdev.h | 37 ++--- include/linux/percpu-refcount.h | 1 + lib/percpu-refcount.c | 28 +++- 16 files changed, 401 insertions(+), 298 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h -- 2.19.0.444.g18242da7ef-goog
Re: [PATCH v9 1/8] blk-mq: Document the functions that iterate over requests
On Thu, 2018-09-20 at 09:35 +0200, Johannes Thumshirn wrote:
> On Wed, Sep 19, 2018 at 03:45:23PM -0700, Bart Van Assche wrote:
> > +/*
> > + * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on
> > + * @hctx that has been assigned a driver tag. @reserved indicates whether @bt
> > + * is the breserved_tags member or the bitmap_tags member of struct
> > + * blk_mq_tags.
> > + */
> > static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
> > 			busy_iter_fn *fn, void *data, bool reserved)
>
> One minor nit and only if you have to re-send the series, can you
> please also add the usual kernel-doc headers to the comments, i.e.:

OK, I will use the kernel-doc format.

> Otherwise:
> Reviewed-by: Johannes Thumshirn

Thanks!

Bart.
[PATCH v9 6/8] block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 2 ++ block/elevator.c | 1 - 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index fd91e9bf2893..18b874d5c9c9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) if (success) return 0; + blk_pm_request_resume(q); + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; diff --git a/block/elevator.c b/block/elevator.c index 1c992bf6cfb1..e18ac68626e3 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); - blk_pm_request_resume(q); rq->q = q; -- 2.19.0.397.gdd90340f6a-goog
[PATCH v9 4/8] block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 35 ++- block/blk-mq-debugfs.c | 10 +- drivers/scsi/scsi_lib.c | 11 +++ include/linux/blkdev.h | 14 +- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d4dd176bd9d..1a691f5269bb 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_pm_only - increment pm_only counter * @q: request queue pointer - * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not - * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +void blk_set_pm_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + atomic_inc(&q->pm_only); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_pm_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_pm_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); - wake_up_all(&q->mq_freeze_wq); + int pm_only; + + pm_only = atomic_dec_return(&q->pm_only); + WARN_ON_ONCE(pm_only < 0); + if (pm_only == 0) + wake_up_all(&q->mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + const bool pm = flags & BLK_MQ_REQ_PREEMPT; while (true) { bool success = false; @@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) rcu_read_lock(); if (percpu_ref_tryget_live(&q->q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that increments the pm_only counter is +* responsible for ensuring that that counter is +* globally visible before the queue is unfrozen. 
*/ - if (preempt || !blk_queue_preempt_only(q)) { + if (pm || !blk_queue_pm_only(q)) { success = true; } else { percpu_ref_put(&q->q_usage_counter); } @@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || + (pm || !blk_queue_pm_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index cb1e6cf7ac48..a5ea86835fcb 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags, return 0; } +static int queue_pm_only_show(void *data, struct seq_file *m) +{ + struct request_queue *q = data; + + seq_printf(m, "%d\n", atomic_read(&q->pm_only)); + return 0; +} + #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name static const char *con
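The flag-to-counter conversion above can be captured in a few lines of userspace C (a sketch with invented names, not the kernel implementation): setting pm-only now nests, and waiters are woken only when the last clear brings the counter back to zero.

```c
#include <assert.h>

static int pm_only;   /* models q->pm_only */
static int wakeups;   /* counts wake_up_all(&q->mq_freeze_wq) calls */

/* Models blk_set_pm_only(): may now be called from several contexts
 * (e.g. runtime PM and SCSI quiesce) without one clobbering the other. */
static void set_pm_only(void)
{
	pm_only++;
}

/* Models blk_clear_pm_only(): waiters are only woken once the counter
 * drops back to zero, mirroring the atomic_dec_return() in the patch. */
static void clear_pm_only(void)
{
	assert(pm_only > 0);  /* the WARN_ON_ONCE(pm_only < 0) case */
	if (--pm_only == 0)
		wakeups++;
}
```

With the old boolean flag, the second of two nested "set" calls would have been lost on the first "clear"; the counter makes the pairing safe.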
[PATCH v9 8/8] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power management for blk-mq. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-mq.c | 2 ++ block/blk-pm.c | 6 -- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 85a1c1a59c72..20fdd78b75c7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -33,6 +33,7 @@ #include "blk-mq.h" #include "blk-mq-debugfs.h" #include "blk-mq-tag.h" +#include "blk-pm.h" #include "blk-stat.h" #include "blk-mq-sched.h" #include "blk-rq-qos.h" @@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq) struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); const int sched_tag = rq->internal_tag; + blk_pm_mark_last_busy(rq); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) diff --git a/block/blk-pm.c b/block/blk-pm.c index 3ad5a7334baa..7c0ac1424f46 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -30,12 +30,6 @@ */ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) { - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - q->dev = dev; q->rpm_status = RPM_ACTIVE; pm_runtime_set_autosuspend_delay(q->dev, -1); -- 2.19.0.397.gdd90340f6a-goog
[PATCH v9 1/8] blk-mq: Document the functions that iterate over requests
Make it easier to understand the purpose of the functions that iterate over requests by documenting their purpose. Fix three minor spelling mistakes in comments in these functions. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn --- block/blk-mq-tag.c | 30 +++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 94e1ed667b6e..cb5db0c3cc32 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -232,13 +232,19 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) /* * We can hit rq == NULL here, because the tagging functions -* test and set the bit before assining ->rqs[]. +* test and set the bit before assigning ->rqs[]. */ if (rq && rq->q == hctx->queue) iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } +/* + * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on + * @hctx that has been assigned a driver tag. @reserved indicates whether @bt + * is the breserved_tags member or the bitmap_tags member of struct + * blk_mq_tags. + */ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, busy_iter_fn *fn, void *data, bool reserved) { @@ -280,6 +286,11 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) return true; } +/* + * Call function @fn(rq, @data, @reserved) for each request in @tags that has + * been started. @reserved indicates whether @bt is the breserved_tags member + * or the bitmap_tags member of struct blk_mq_tags. + */ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, busy_tag_iter_fn *fn, void *data, bool reserved) { @@ -294,6 +305,10 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); } +/* + * Call @fn(rq, @priv, reserved) for each started request in @tags. 
'reserved' + * indicates whether or not 'rq' is a reserved request. + */ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, void *priv) { @@ -302,6 +317,10 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false); } +/* + * Call @fn(rq, @priv, reserved) for each request in @tagset. 'reserved' + * indicates whether or not 'rq' is a reserved request. + */ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv) { @@ -314,6 +333,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, } EXPORT_SYMBOL(blk_mq_tagset_busy_iter); +/* + * Call @fn(rq, @priv, reserved) for each request associated with request + * queue @q or any queue it shares tags with and that has been assigned a + * driver tag. 'reserved' indicates whether or not 'rq' is a reserved request. + */ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) { @@ -323,7 +347,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, /* * __blk_mq_update_nr_hw_queues will update the nr_hw_queues and * queue_hw_ctx after freeze the queue. So we could use q_usage_counter -* to avoid race with it. __blk_mq_update_nr_hw_queues will users +* to avoid race with it. __blk_mq_update_nr_hw_queues will use * synchronize_rcu to ensure all of the users go out of the critical * section below and see zeroed q_usage_counter. */ @@ -337,7 +361,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, struct blk_mq_tags *tags = hctx->tags; /* -* If not software queues are currently mapped to this +* If no software queues are currently mapped to this * hardware queue, there's nothing to check */ if (!blk_mq_hw_queue_mapped(hctx)) -- 2.19.0.397.gdd90340f6a-goog
[PATCH v9 7/8] block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 +- block/blk-pm.c | 81 +--- 2 files changed, 84 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 18b874d5c9c9..ae092ca121d5 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, &q->queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index 9b636960d285..3ad5a7334baa 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -40,6 +43,34 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) } EXPORT_SYMBOL(blk_pm_runtime_init); +struct in_flight_data { + struct request_queue *q; + int in_flight; +}; + +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq, + void *priv, bool reserved) +{ + struct in_flight_data *in_flight = priv; + + if (rq->q == in_flight->q) + in_flight->in_flight++; +} + +/* + * Count the number of requests that are in flight for request queue @q. Use + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq + * queues. + */ +static int blk_requests_in_flight(struct request_queue *q) +{ + struct in_flight_data in_flight = { .q = q }; + + if (q->mq_ops) + blk_mq_queue_rq_iter(q, blk_count_in_flight, &in_flight); + return q->nr_pending + in_flight.in_flight; +} + /** * blk_pre_runtime_suspend - Pre runtime suspend check * @q: the queue of the device @@ -68,14 +99,49 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + /* +* Increase the pm_only counter before checking whether any +* non-PM blk_queue_enter() calls are in progress to avoid that any +* new non-PM blk_queue_enter() calls succeed before the pm_only +* counter is decreased again. +*/ + blk_set_pm_only(q); + /* +* This function only gets called if the most recent request activity +* occurred at least autosuspend_delay_ms ago. 
Since blk_queue_enter() +* is called by the request allocation code before +* pm_request_resume(), if no requests have a tag assigned it is safe +* to suspend the device. +*/ + ret = -EBUSY; + if (blk_requests_in_flight(q) == 0) { + blk_freeze_queue_start(q); + /* +* Freezing a queue starts a transition of the queue +* usag
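The in-flight accounting introduced above boils down to a filtered count over an iterator callback. Below is a userspace sketch of the blk_count_in_flight()/blk_requests_in_flight() pattern; the names and the flat request array are invented for illustration (the real code walks the tag bitmaps of each hardware queue).

```c
#include <stddef.h>

struct fake_request {
	const void *q;   /* the queue this request belongs to */
};

struct in_flight_data {
	const void *q;
	int in_flight;
};

typedef void (*iter_fn)(struct fake_request *rq, void *priv);

/* Stands in for blk_mq_queue_rq_iter(): visit every tagged request. */
static void for_each_tagged(struct fake_request *rqs, size_t n,
			    iter_fn fn, void *priv)
{
	for (size_t i = 0; i < n; i++)
		fn(&rqs[i], priv);
}

/* Analogue of blk_count_in_flight(): tag sets may be shared between
 * queues, so only count requests owned by the queue being suspended. */
static void count_cb(struct fake_request *rq, void *priv)
{
	struct in_flight_data *d = priv;

	if (rq->q == d->q)
		d->in_flight++;
}

/* Analogue of blk_requests_in_flight() for the blk-mq case. */
static int requests_in_flight(struct fake_request *rqs, size_t n,
			      const void *q)
{
	struct in_flight_data d = { .q = q, .in_flight = 0 };

	for_each_tagged(rqs, n, count_cb, &d);
	return d.in_flight;
}
```

The ownership check in the callback is the important detail: with shared tag sets, iterating the tags visits requests of sibling queues too, so an unfiltered count could block a suspend forever.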
[PATCH v9 0/8] blk-mq: Implement runtime power management
Hello Jens, One of the pieces that is missing before blk-mq can be made the default is implementing runtime power management support for blk-mq. This patch series not only implements runtime power management for blk-mq but also fixes a starvation issue in the power management code for the legacy block layer. Please consider this patch series for the upstream kernel. Thanks, Bart. Changes compared to v8: - Fixed the race that was reported by Jianchao and Ming. - Fixed another spelling issue in a source code comment. Changes compared to v7: - Addressed Jianchao's feedback about patch "Make blk_get_request() block for non-PM requests while suspended". - Added two new patches - one that documents the functions that iterate over requests and one that introduces a new function that iterates over all requests associated with a queue. Changes compared to v6: - Left out the patches that split RQF_PREEMPT in three flags. - Left out the patch that introduces the SCSI device state SDEV_SUSPENDED. - Left out the patch that introduces blk_pm_runtime_exit(). - Restored the patch that changes the PREEMPT_ONLY flag into a counter. Changes compared to v5: - Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain validation. - Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain validation. - Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain validation, use that state for domain validation only and introduce a new state for runtime suspend, namely SDEV_QUIESCE. - Reallow system suspend during SCSI domain validation. - Moved the runtime resume call from the request allocation code into blk_queue_enter(). - Instead of relying on q_usage_counter, iterate over the tag set to determine whether or not any requests are in flight. Changes compared to v4: - Dropped the patches "Give RQF_PREEMPT back its original meaning" and "Serialize queue freezing and blk_pre_runtime_suspend()". - Replaced "percpu_ref_read()" with "percpu_is_in_use()". 
- Inserted pm_request_resume() calls in the block layer request allocation code such that the context that submits a request no longer has to call pm_runtime_get(). Changes compared to v3: - Avoid adverse interactions between system-wide suspend/resume and runtime power management by changing the PREEMPT_ONLY flag into a counter. - Give RQF_PREEMPT back its original meaning, namely that it is only set for ide_preempt requests. - Remove the flag BLK_MQ_REQ_PREEMPT. - Removed the pm_request_resume() call. Changes compared to v2: - Fixed the build for CONFIG_BLOCK=n. - Added a patch that introduces percpu_ref_read() in the percpu-counter code. - Added a patch that makes it easier to detect missing pm_runtime_get*() calls. - Addressed Jianchao's feedback including the comment about runtime overhead of switching a per-cpu counter to atomic mode. Changes compared to v1: - Moved the runtime power management code into a separate file. - Addressed Ming's feedback. Bart Van Assche (8): blk-mq: Document the functions that iterate over requests blk-mq: Introduce blk_mq_queue_rq_iter() block: Move power management code into a new source file block, scsi: Change the preempt-only flag into a counter block: Split blk_pm_add_request() and blk_pm_put_request() block: Schedule runtime resume earlier block: Make blk_get_request() block for non-PM requests while suspended blk-mq: Enable support for runtime power management block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c| 271 +--- block/blk-mq-debugfs.c | 10 +- block/blk-mq-tag.c | 74 ++- block/blk-mq-tag.h | 2 + block/blk-mq.c | 2 + block/blk-pm.c | 253 + block/blk-pm.h | 69 ++ block/elevator.c| 22 +--- drivers/scsi/scsi_lib.c | 11 +- drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 include/linux/blkdev.h | 37 ++ 16 files changed, 484 insertions(+), 298 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 
include/linux/blk-pm.h -- 2.19.0.397.gdd90340f6a-goog
[PATCH v9 2/8] blk-mq: Introduce blk_mq_queue_rq_iter()
This function will be used in the patch "Make blk_get_request() block for non-PM requests while suspended". Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-mq-tag.c | 44 block/blk-mq-tag.h | 2 ++ 2 files changed, 46 insertions(+) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index cb5db0c3cc32..f95b41b5f07a 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -374,6 +374,50 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, rcu_read_unlock(); } +/* + * Call @fn(rq, @priv, reserved) for each request associated with request + * queue @q or any queue that it shares tags with and that has been assigned a + * tag. 'reserved' indicates whether or not 'rq' is a reserved request. In + * contrast to blk_mq_queue_tag_busy_iter(), if an I/O scheduler has been + * associated with @q, this function also iterates over requests that have + * been assigned a scheduler tag but that have not yet been assigned a driver + * tag. + */ +void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + /* +* __blk_mq_update_nr_hw_queues() will update nr_hw_queues and +* queue_hw_ctx after having frozen the request queue. So we can use +* q_usage_counter to avoid a race with that +* function. __blk_mq_update_nr_hw_queues() uses synchronize_rcu() to +* ensure that this function leaves the critical section below. +*/ + rcu_read_lock(); + if (percpu_ref_is_zero(&q->q_usage_counter)) { + rcu_read_unlock(); + return; + } + + queue_for_each_hw_ctx(q, hctx, i) { + struct blk_mq_tags *tags = hctx->sched_tags ? 
: hctx->tags; + + /* +* If no software queues are currently mapped to this +* hardware queue, there's nothing to check +*/ + if (!blk_mq_hw_queue_mapped(hctx)) + continue; + + if (tags->nr_reserved_tags) + bt_for_each(hctx, &tags->breserved_tags, fn, priv, true); + bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false); + } + rcu_read_unlock(); +} + static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth, bool round_robin, int node) { diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..25e62997ed6c 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx, extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool); void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv); +void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, + void *priv); static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt, struct blk_mq_hw_ctx *hctx) -- 2.19.0.397.gdd90340f6a-goog
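The key detail that distinguishes blk_mq_queue_rq_iter() from blk_mq_queue_tag_busy_iter() is the tag-set selection, sketched below in userspace C (invented struct names): when an I/O scheduler is attached, sched_tags is walked so that requests holding only a scheduler tag are visited too.

```c
#include <stddef.h>

struct fake_tags {
	int nr_reserved_tags;
};

struct fake_hctx {
	struct fake_tags *tags;        /* driver tags */
	struct fake_tags *sched_tags;  /* NULL without an I/O scheduler */
};

/* Models the ternary in blk_mq_queue_rq_iter(): prefer scheduler tags
 * when they exist, since a request receives a scheduler tag at
 * allocation time but may not have been assigned a driver tag yet. */
static struct fake_tags *tags_to_iterate(struct fake_hctx *hctx)
{
	return hctx->sched_tags ? hctx->sched_tags : hctx->tags;
}
```

This is why the new iterator is suitable for the in-flight check in the suspend path: a request that has passed allocation but not yet ->queue_rq() would be invisible to a driver-tag-only walk.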
Re: [PATCH v8 6/8] block: Schedule runtime resume earlier
On Wed, 2018-09-19 at 12:05 +0800, Ming Lei wrote:
> Looks this patch may introduce the following race between queue freeze and
> runtime suspend:
>
> CPU0                    CPU1                    CPU2
> blk_freeze_queue()
>                         blk_mq_alloc_request()
>                         blk_queue_enter()
>                         blk_pm_request_resume()
>                         wait_event()
>                                                 blk_pre_runtime_suspend()
>                                                 ->blk_set_pm_only
>                                                 ...
>                                                 blk_post_runtime_suspend()
>                                                 ...
> blk_mq_unfreeze_queue()
>
> - CPU0: queue is frozen
>
> - CPU1: one new request comes in and sees that the queue is frozen. The queue
> isn't runtime-suspended yet, so blk_pm_request_resume() does nothing and the
> allocation blocks in wait_event().
>
> - CPU2: runtime suspend comes in, and the queue becomes runtime-suspended now
>
> - CPU0: the queue is unfrozen, but the new request allocation on CPU1 may
> never complete because the queue is runtime-suspended and wait_event() won't
> return. The expected result is that the queue becomes active and the
> allocation on CPU1 completes immediately.

Hello Ming,

Just like for the scenario Jianchao reported, I will address this by only
allowing the suspend to proceed if q_usage_counter == 0.

Bart.
Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended
On Wed, 2018-09-19 at 10:21 +0800, jianchao.wang wrote:
> On 09/19/2018 01:44 AM, Bart Van Assche wrote:
> > There is only one blk_pm_request_resume() call and that call is
> > inside blk_queue_enter() after the pm_only counter has been checked.
> >
> > For the legacy block layer, nr_pending is increased after the
> > blk_queue_enter() call from inside blk_old_get_request() succeeded.
> >
> > So I don't see how blk_pm_request_resume() or q->nr_pending++ could
> > escape from the preempt checking?
>
> For example:
>
> blk_pre_runtime_suspend        generic_make_request
>                                blk_queue_enter // success here,
>                                                // no blk_pm_request_resume here
>                                blk_mq_make_request
> blk_set_pm_only
>
> if (blk_requests_in_flight(q) == 0) {
>     synchronize_rcu();
>     if (blk_requests_in_flight(q) == 0)
>         ret = 0;
> }
>                                blk_mq_get_request

Hello Jianchao,

I think you are right. I will add a q_usage_counter check before setting
'ret' to zero.

Bart.
Re: [PATCH blktests 0/3] Add NVMeOF multipath tests
On 9/18/18 4:24 PM, Omar Sandoval wrote:
> On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
>> Can you have a look at the updated master branch of
>> https://github.com/bvanassche/blktests? That code should no longer fail
>> if unloading the nvme kernel module fails. Please note that you will
>> need kernel v4.18 to test these scripts - a KASAN complaint appears if I
>> run these tests against kernel v4.19-rc4.
>
> Thanks, these pass now. Is it expected that my nvme device gets a
> multipath device configured after running these tests?
>
> $ lsblk
> NAME      MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> vda       254:0    0  16G  0 disk
> └─vda1    254:1    0  16G  0 part  /
> vdb       254:16   0   8G  0 disk
> vdc       254:32   0   8G  0 disk
> vdd       254:48   0   8G  0 disk
> nvme0n1   259:0    0   8G  0 disk
> └─mpatha  253:0    0   8G  0 mpath

No, all multipath devices that were created during a test should be removed before that test finishes. I will look into this.

> Also, can you please fix:
>
> _have_kernel_option NVME_MULTIPATH && exit 1
>
> to not exit on failure? It should use SKIP_REASON and return 1. You might
> need to add something like _dont_have_kernel_option to properly handle
> the case where the config is not found.

OK, I will change this.

> Side note which isn't a blocker for merging is that there's a lot of
> duplicated code between these helpers and the srp helpers. How hard would
> it be to refactor that?

Are you perhaps referring to the code that is shared between the tests/srp/rc and tests/nvmeof-mp/rc shell scripts? The hardest part is probably to choose a location where to store these functions. Should I create a file with common code under common/, under tests/srp/, under tests/nvmeof-mp/ or perhaps somewhere else?

Thanks, Bart.
Re: [PATCH blktests 0/3] Add NVMeOF multipath tests
On 8/23/18 5:21 PM, Omar Sandoval wrote:
> On Thu, Aug 23, 2018 at 01:53:33AM +0000, Bart Van Assche wrote:
>> On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:
>>> On Mon, Aug 20, 2018 at 03:46:45PM +0000, Bart Van Assche wrote:
>>>> Moving these tests into the nvme directory is possible but will make
>>>> it harder to run the NVMeOF multipath tests separately. Are you fine
>>>> with this?
>>>
>>> Both ways have their up and downsides, I agree. Having two distinct
>>> groups requires running './check nvme nvmeof-mp' to get full coverage
>>> with nvme. Having it all in one group would require running './check
>>> nvme 18 19 20 21 22 23 24 ...' to get only the dm-mpath ones.
>>>
>>> Honestly I hate both, but yours (the two distinct groups) is probably
>>> easier to handle in the end, I have to admit.
>>
>> Omar, do you have a preference for one of the two aforementioned
>> approaches?
>
> Let's keep it in a separate category, since lots of people running nvme
> tests probably aren't interested in testing multipath.
>
> A bunch of the tests failed with
>
> modprobe: FATAL: Module nvme is in use.
>
> Maybe related to my test VM having an nvme device?

Hello Omar,

Can you have a look at the updated master branch of https://github.com/bvanassche/blktests? That code should no longer fail if unloading the nvme kernel module fails. Please note that you will need kernel v4.18 to test these scripts - a KASAN complaint appears if I run these tests against kernel v4.19-rc4.

Thanks, Bart.
[PATCH v8 7/8] block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 ++--- block/blk-pm.c | 70 2 files changed, 73 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 18b874d5c9c9..ae092ca121d5 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, >queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index 9b636960d285..dc9bc45b0db5 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -40,6 +43,34 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) } EXPORT_SYMBOL(blk_pm_runtime_init); +struct in_flight_data { + struct request_queue*q; + int in_flight; +}; + +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq, + void *priv, bool reserved) +{ + struct in_flight_data *in_flight = priv; + + if (rq->q == in_flight->q) + in_flight->in_flight++; +} + +/* + * Count the number of requests that are in flight for request queue @q. Use + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq + * queues. + */ +static int blk_requests_in_flight(struct request_queue *q) +{ + struct in_flight_data in_flight = { .q = q }; + + if (q->mq_ops) + blk_mq_queue_rq_iter(q, blk_count_in_flight, _flight); + return q->nr_pending + in_flight.in_flight; +} + /** * blk_pre_runtime_suspend - Pre runtime suspend check * @q: the queue of the device @@ -68,14 +99,38 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + blk_set_pm_only(q); + /* +* This function only gets called if the most recent request activity +* occurred at least autosuspend_delay_ms ago. Since blk_queue_enter() +* is called by the request allocation code before +* pm_request_resume(), if no requests have a tag assigned it is safe +* to suspend the device. +*/ + ret = -EBUSY; + if (blk_requests_in_flight(q) == 0) { + /* +* Call synchronize_rcu() such that later blk_queue_enter() +* calls see the pm-only state. See also +* http://lwn.net/Articles/573497/. 
+*/ + synchronize_rcu(); + if (blk_requests_in_flight(q) == 0) + ret = 0; + } + spin_lock_irq(q->queue_lock);
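The suspend side of this patch follows a publish-then-recheck pattern: first make the pm-only state visible, then wait for concurrent submitters to notice it, then re-check the in-flight count. Below is a minimal userspace sketch of that pattern; plain C11 atomics stand in for blk_set_pm_only(), q_usage_counter and synchronize_rcu(), and every name is illustrative rather than the kernel's API.

```c
/*
 * Userspace sketch (NOT kernel code) of the publish-then-recheck pattern
 * used by blk_pre_runtime_suspend(): publish the pm-only state, let
 * concurrent submitters notice it, then re-check the in-flight count.
 * All names here are illustrative, not the kernel's API.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int pm_only;   /* stands in for the pm-only counter */
static atomic_int in_flight; /* stands in for blk_requests_in_flight() */

/* Submission path: fails once pm-only mode is visible. */
static bool enter_queue(void)
{
	if (atomic_load(&pm_only))
		return false; /* the kernel would block and resume here */
	atomic_fetch_add(&in_flight, 1);
	return true;
}

static void exit_queue(void)
{
	atomic_fetch_sub(&in_flight, 1);
}

/* Suspend path: may only succeed when no request can still be active. */
static bool try_suspend(void)
{
	bool idle;

	atomic_store(&pm_only, 1);
	idle = atomic_load(&in_flight) == 0;
	/*
	 * synchronize_rcu() would go here: wait until every submitter has
	 * either seen pm_only or finished bumping in_flight.
	 */
	idle = idle && atomic_load(&in_flight) == 0;
	if (!idle)
		atomic_store(&pm_only, 0); /* kernel: blk_clear_pm_only() */
	return idle; /* false corresponds to -EBUSY */
}
```

The double check around the grace period is what closes the race that Jianchao reported: a submitter that slipped past the pm-only test before it was published is guaranteed to have incremented the in-flight count by the time the second check runs.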
[PATCH v8 8/8] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power management for blk-mq. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-mq.c | 2 ++ block/blk-pm.c | 6 -- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 85a1c1a59c72..20fdd78b75c7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -33,6 +33,7 @@ #include "blk-mq.h" #include "blk-mq-debugfs.h" #include "blk-mq-tag.h" +#include "blk-pm.h" #include "blk-stat.h" #include "blk-mq-sched.h" #include "blk-rq-qos.h" @@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq) struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); const int sched_tag = rq->internal_tag; + blk_pm_mark_last_busy(rq); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) diff --git a/block/blk-pm.c b/block/blk-pm.c index dc9bc45b0db5..2a7a9c0f5b99 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -30,12 +30,6 @@ */ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) { - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - q->dev = dev; q->rpm_status = RPM_ACTIVE; pm_runtime_set_autosuspend_delay(q->dev, -1); -- 2.18.0
[PATCH v8 6/8] block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 2 ++ block/elevator.c | 1 - 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index fd91e9bf2893..18b874d5c9c9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) if (success) return 0; + blk_pm_request_resume(q); + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; diff --git a/block/elevator.c b/block/elevator.c index 1c992bf6cfb1..e18ac68626e3 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); - blk_pm_request_resume(q); rq->q = q; -- 2.18.0
[PATCH v8 3/8] block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from into . For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c | 196 + block/blk-pm.c | 188 +++ block/blk-pm.h | 43 + block/elevator.c | 22 + drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 + include/linux/blkdev.h | 23 - 11 files changed, 264 insertions(+), 239 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h diff --git a/block/Kconfig b/block/Kconfig index 1f2469a0123c..85263e7bded6 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -228,4 +228,7 @@ config BLK_MQ_RDMA depends on BLOCK && INFINIBAND default y +config BLK_PM + def_bool BLOCK && PM + source block/Kconfig.iosched diff --git a/block/Makefile b/block/Makefile index 572b33f32c07..27eac600474f 100644 --- a/block/Makefile +++ b/block/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o +obj-$(CONFIG_BLK_PM) += blk-pm.o diff --git a/block/blk-core.c b/block/blk-core.c index 4dbc93f43b38..6d4dd176bd9d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -42,6 +42,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" +#include "blk-pm.h" #include "blk-rq-qos.h" #ifdef CONFIG_DEBUG_FS @@ -1726,16 +1727,6 @@ void 
part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) } EXPORT_SYMBOL_GPL(part_round_stats); -#ifdef CONFIG_PM -static void blk_pm_put_request(struct request *rq) -{ - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); -} -#else -static inline void blk_pm_put_request(struct request *rq) {} -#endif - void __blk_put_request(struct request_queue *q, struct request *req) { req_flags_t rq_flags = req->rq_flags; @@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); -#ifdef CONFIG_PM -/** - * blk_pm_runtime_init - Block layer runtime PM initialization routine - * @q: the queue of the device - * @dev: the device the queue belongs to - * - * Description: - *Initialize runtime-PM-related fields for @q and start auto suspend for - *@dev. Drivers that want to take advantage of request-based runtime PM - *should call this function after @dev has been initialized, and its - *request queue @q has been allocated, and runtime PM for it can not happen - *yet(either due to disabled/forbidden or its usage_count > 0). In most - *cases, driver should call this function before any I/O has taken place. - * - *This function takes care of setting up using auto suspend for the device, - *the autosuspend delay is set to -1 to make runtime suspend impossible - *until an updated value is either set by user or by driver. Drivers do - *not need to touch other autosuspend settings. - * - *The block layer runtime PM is request based, so only works for drivers - *that use request as their IO unit instead of those directly use bio's. 
- */ -void blk_pm_runtime_init(struct request_queue *q, struct device *dev) -{ - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - - q->dev = dev; - q->rpm_status = RPM_ACTIVE; - pm_runtime_set_autosuspend_delay(q->dev, -1); - pm_runtime_use_autosuspend(q->dev); -} -EXPORT_SYMBOL(blk_pm_runtime_init); - -/** - * blk_pre_runtime_suspend - Pre runtime suspend check - * @q: the queue of the device - * - * Description: - *This function will check if runtime suspend is allowed for the device - *by examining if there are any requests pending in the queue. If there - *are requests pending, the device can not be runtime suspended; otherwise, - *the queue's status will be updated to SUSPENDING and the driver can - *proceed to s
[PATCH v8 1/8] blk-mq: Document the functions that iterate over requests
Make it easier to understand the purpose of the functions that iterate over requests by documenting their purpose. Fix two minor spelling mistakes in comments in these functions. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn --- block/blk-mq-tag.c | 28 ++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 94e1ed667b6e..ef3acb4a80e0 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -232,13 +232,19 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) /* * We can hit rq == NULL here, because the tagging functions -* test and set the bit before assining ->rqs[]. +* test and set the bit before assigning ->rqs[]. */ if (rq && rq->q == hctx->queue) iter_data->fn(hctx, rq, iter_data->data, reserved); return true; } +/* + * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on + * @hctx that has been assigned a driver tag. @reserved indicates whether @bt + * is the breserved_tags member or the bitmap_tags member of struct + * blk_mq_tags. + */ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt, busy_iter_fn *fn, void *data, bool reserved) { @@ -280,6 +286,11 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) return true; } +/* + * Call function @fn(rq, @data, @reserved) for each request in @tags that has + * been started. @reserved indicates whether @bt is the breserved_tags member + * or the bitmap_tags member of struct blk_mq_tags. + */ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, busy_tag_iter_fn *fn, void *data, bool reserved) { @@ -294,6 +305,10 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, sbitmap_for_each_set(>sb, bt_tags_iter, _data); } +/* + * Call @fn(rq, @priv, reserved) for each started request in @tags. 
'reserved' + * indicates whether or not 'rq' is a reserved request. + */ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, void *priv) { @@ -302,6 +317,10 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, bt_tags_for_each(tags, >bitmap_tags, fn, priv, false); } +/* + * Call @fn(rq, @priv, reserved) for each request in @tagset. 'reserved' + * indicates whether or not 'rq' is a reserved request. + */ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv) { @@ -314,6 +333,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, } EXPORT_SYMBOL(blk_mq_tagset_busy_iter); +/* + * Call @fn(rq, @priv, reserved) for each request associated with request + * queue @q or any queue it shares tags with and that has been assigned a + * driver tag. 'reserved' indicates whether or not 'rq' is a reserved request. + */ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) { @@ -337,7 +361,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, struct blk_mq_tags *tags = hctx->tags; /* -* If not software queues are currently mapped to this +* If no software queues are currently mapped to this * hardware queue, there's nothing to check */ if (!blk_mq_hw_queue_mapped(hctx)) -- 2.18.0
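The iterators documented above all share one shape: walk a bitmap of tag numbers and invoke a caller-supplied callback for each set bit. A stripped-down userspace sketch of that shape follows; a plain unsigned long stands in for the sbitmap, and all names are illustrative.

```c
/*
 * Userspace sketch of the callback-iteration shape shared by
 * bt_for_each() and bt_tags_for_each(): walk a bitmap of tag numbers
 * and invoke a caller-supplied function for each set bit.  A plain
 * unsigned long stands in for the sbitmap; names are illustrative.
 */
#include <assert.h>
#include <stdbool.h>

typedef void (*tag_iter_fn)(unsigned int tag, void *data, bool reserved);

/* Analogue of bt_for_each(): @reserved tells @fn which tag set this is. */
static void tags_for_each(unsigned long bitmap, tag_iter_fn fn, void *data,
			  bool reserved)
{
	for (unsigned int tag = 0; tag < 8 * sizeof(bitmap); tag++)
		if (bitmap & (1UL << tag))
			fn(tag, data, reserved);
}

/* Example callback in the style of busy_iter_fn: count busy tags. */
static void count_busy(unsigned int tag, void *data, bool reserved)
{
	(void)tag;
	(void)reserved;
	++*(int *)data;
}
```

The kernel versions additionally have to tolerate a tag bit being set before ->rqs[] is assigned, which is why bt_iter() rechecks rq for NULL.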
[PATCH v8 5/8] block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two new functions and thereby separate legacy block layer code from code that works for both the legacy block layer and blk-mq. A later patch will add calls to the new functions in the blk-mq code. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 1 + block/blk-pm.h | 36 +++- block/elevator.c | 1 + 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 1a691f5269bb..fd91e9bf2893 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_req_zone_write_unlock(req); blk_pm_put_request(req); + blk_pm_mark_last_busy(req); elv_completed_request(q, req); diff --git a/block/blk-pm.h b/block/blk-pm.h index 1ffc8ef203ec..a8564ea72a41 100644 --- a/block/blk-pm.h +++ b/block/blk-pm.h @@ -6,8 +6,23 @@ #include #ifdef CONFIG_PM +static inline void blk_pm_request_resume(struct request_queue *q) +{ + if (q->dev && (q->rpm_status == RPM_SUSPENDED || + q->rpm_status == RPM_SUSPENDING)) + pm_request_resume(q->dev); +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + pm_runtime_mark_last_busy(rq->q->dev); +} + static inline void blk_pm_requeue_request(struct request *rq) { + lockdep_assert_held(rq->q->queue_lock); + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) rq->q->nr_pending--; } @@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request *rq) static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { - if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 && - (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)) - pm_request_resume(q->dev); + lockdep_assert_held(q->queue_lock); + + if (q->dev && 
!(rq->rq_flags & RQF_PM)) + q->nr_pending++; } static inline void blk_pm_put_request(struct request *rq) { - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); + lockdep_assert_held(rq->q->queue_lock); + + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + --rq->q->nr_pending; } #else +static inline void blk_pm_request_resume(struct request_queue *q) +{ +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ +} + static inline void blk_pm_requeue_request(struct request *rq) { } diff --git a/block/elevator.c b/block/elevator.c index e18ac68626e3..1c992bf6cfb1 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); + blk_pm_request_resume(q); rq->q = q; -- 2.18.0
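Splitting pm_runtime_mark_last_busy() out into blk_pm_mark_last_busy() isolates the one piece of bookkeeping that both blk-mq and the legacy path need: every completed non-PM request pushes back the idle deadline that runtime suspend waits for. A hedged userspace sketch of that bookkeeping follows; the delay value and every name are invented for illustration.

```c
/*
 * Userspace sketch (NOT kernel code) of the autosuspend bookkeeping fed
 * by blk_pm_mark_last_busy(): completing a non-PM request records a
 * "last busy" time, and the device may only runtime-suspend once the
 * autosuspend delay has elapsed since then.  PM requests (RQF_PM) must
 * not push the deadline back.  All names and the delay are made up.
 */
#include <assert.h>
#include <stdbool.h>
#include <time.h>

static time_t last_busy;
static const double autosuspend_delay_s = 5.0;

static void mark_last_busy(time_t now, bool pm_request)
{
	if (!pm_request) /* RQF_PM completions don't reset the timer */
		last_busy = now;
}

static bool may_autosuspend(time_t now)
{
	return difftime(now, last_busy) >= autosuspend_delay_s;
}
```

The RQF_PM exemption mirrors the `!(rq->rq_flags & RQF_PM)` test in the patch: a resume request executed on a suspending device must not itself keep the device awake.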
[PATCH v8 4/8] block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c| 35 ++- block/blk-mq-debugfs.c | 10 +- drivers/scsi/scsi_lib.c | 11 +++ include/linux/blkdev.h | 14 +- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d4dd176bd9d..1a691f5269bb 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_pm_only - increment pm_only counter * @q: request queue pointer - * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not - * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +void blk_set_pm_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + atomic_inc(>pm_only); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_pm_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_pm_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); - wake_up_all(>mq_freeze_wq); + int pm_only; + + pm_only = atomic_dec_return(>pm_only); + WARN_ON_ONCE(pm_only < 0); + if (pm_only == 0) + wake_up_all(>mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + const bool pm = flags & BLK_MQ_REQ_PREEMPT; while (true) { bool success = false; @@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) rcu_read_lock(); if (percpu_ref_tryget_live(>q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that increments the pm_only counter is +* responsible for ensuring that that counter is +* globally visible before the queue is unfrozen. 
*/ - if (preempt || !blk_queue_preempt_only(q)) { + if (pm || !blk_queue_pm_only(q)) { success = true; } else { percpu_ref_put(>q_usage_counter); @@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(>mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || + (pm || !blk_queue_pm_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index cb1e6cf7ac48..a5ea86835fcb 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags, return 0; } +static int queue_pm_only_show(void *data, struct seq_file *m) +{ + struct request_queue *q = data; + + seq_printf(m, "%d\n", atomic_read(>pm_only)); + return 0; +} + #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name static const char *con
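The switch from a one-bit PREEMPT_ONLY queue flag to a pm_only counter matters because the gate can now be taken by more than one context (for instance runtime suspend and system-wide suspend) and must only reopen when the last holder releases it. A small userspace sketch of those counter semantics follows; C11 atomics stand in for the kernel's atomic_t, and names are illustrative.

```c
/*
 * Userspace sketch of the pm-only counter semantics: unlike a boolean
 * flag, the gate may be taken by several contexts and only reopens when
 * the last holder releases it.  Names are illustrative.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int pm_only_cnt;

static void set_pm_only(void)
{
	atomic_fetch_add(&pm_only_cnt, 1);
}

/*
 * Returns true when the last holder dropped the gate - the point where
 * the kernel calls wake_up_all(&q->mq_freeze_wq).
 */
static bool clear_pm_only(void)
{
	int prev = atomic_fetch_sub(&pm_only_cnt, 1);

	assert(prev > 0); /* mirrors WARN_ON_ONCE(pm_only < 0) */
	return prev == 1;
}

static bool queue_pm_only(void)
{
	return atomic_load(&pm_only_cnt) > 0;
}
```

With the old test-and-set flag, the second context to "set" the flag would see it already set and the first context to clear it would reopen the gate too early; a counter makes nesting well defined.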
[PATCH v8 2/8] blk-mq: Introduce blk_mq_queue_rq_iter()
This function will be used in the patch "Make blk_get_request() block for non-PM requests while suspended". Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-mq-tag.c | 44 block/blk-mq-tag.h | 2 ++ 2 files changed, 46 insertions(+) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index ef3acb4a80e0..cf8537017f78 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -374,6 +374,50 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, rcu_read_unlock(); } +/* + * Call @fn(rq, @priv, reserved) for each request associated with request + * queue @q or any queue that it shares tags with and that has been assigned a + * tag. 'reserved' indicates whether or not 'rq' is a reserved request. In + * contrast to blk_mq_queue_tag_busy_iter(), if an I/O scheduler has been + * associated with @q, this function also iterates over requests that have + * been assigned a scheduler tag but that have not yet been assigned a driver + * tag. + */ +void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + /* +* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and +* queue_hw_ctx after freeze the queue. So we could use q_usage_counter +* to avoid race with it. __blk_mq_update_nr_hw_queues will users +* synchronize_rcu to ensure all of the users go out of the critical +* section below and see zeroed q_usage_counter. +*/ + rcu_read_lock(); + if (percpu_ref_is_zero(>q_usage_counter)) { + rcu_read_unlock(); + return; + } + + queue_for_each_hw_ctx(q, hctx, i) { + struct blk_mq_tags *tags = hctx->sched_tags ? 
: hctx->tags; + + /* +* If no software queues are currently mapped to this +* hardware queue, there's nothing to check +*/ + if (!blk_mq_hw_queue_mapped(hctx)) + continue; + + if (tags->nr_reserved_tags) + bt_for_each(hctx, >breserved_tags, fn, priv, true); + bt_for_each(hctx, >bitmap_tags, fn, priv, false); + } + rcu_read_unlock(); +} + static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth, bool round_robin, int node) { diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 61deab0b5a5a..25e62997ed6c 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx, extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool); void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv); +void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, + void *priv); static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt, struct blk_mq_hw_ctx *hctx) -- 2.18.0
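The key detail in blk_mq_queue_rq_iter() is the choice of tag set: with an I/O scheduler attached, a request can hold a scheduler tag before it is assigned a driver tag, so iterating hctx->tags alone would miss it. A toy sketch of that selection follows; bitmasks stand in for the two tag sets, and the struct and function names are illustrative, not the kernel's.

```c
/*
 * Toy sketch of the tag-set selection in blk_mq_queue_rq_iter(): prefer
 * the scheduler tags when a scheduler is attached, since requests live
 * there before they get a driver tag.  Bitmasks stand in for the tag
 * sets; names are illustrative.  __builtin_popcountl() is a GCC/Clang
 * builtin.
 */
#include <assert.h>

struct hctx_sketch {
	unsigned long sched_tags; /* 0 here models "no scheduler attached" */
	unsigned long tags;       /* driver tags */
};

/*
 * Count the requests the iterator would visit, mirroring
 * "tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags".
 */
static int requests_visible(const struct hctx_sketch *h)
{
	unsigned long set = h->sched_tags ? h->sched_tags : h->tags;

	return (int)__builtin_popcountl(set);
}
```

This is what distinguishes the new iterator from blk_mq_queue_tag_busy_iter(), which only sees requests that already own a driver tag.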
[PATCH v8 0/8] blk-mq: Implement runtime power management
Hello Jens, One of the pieces that is missing before blk-mq can be made the default is implementing runtime power management support for blk-mq. This patch series not only implements runtime power management for blk-mq but also fixes a starvation issue in the power management code for the legacy block layer. Please consider this patch series for the upstream kernel. Thanks, Bart. Changes compared to v7: - Addressed Jianchao's feedback about patch "Make blk_get_request() block for non-PM requests while suspended". - Added two new patches - one that documents the functions that iterate over requests and one that introduces a new function that iterates over all requests associated with a queue. Changes compared to v6: - Left out the patches that split RQF_PREEMPT in three flags. - Left out the patch that introduces the SCSI device state SDEV_SUSPENDED. - Left out the patch that introduces blk_pm_runtime_exit(). - Restored the patch that changes the PREEMPT_ONLY flag into a counter. Changes compared to v5: - Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain validation. - Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain validation. - Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain validation, use that state for domain validation only and introduce a new state for runtime suspend, namely SDEV_QUIESCE. - Reallow system suspend during SCSI domain validation. - Moved the runtime resume call from the request allocation code into blk_queue_enter(). - Instead of relying on q_usage_counter, iterate over the tag set to determine whether or not any requests are in flight. Changes compared to v4: - Dropped the patches "Give RQF_PREEMPT back its original meaning" and "Serialize queue freezing and blk_pre_runtime_suspend()". - Replaced "percpu_ref_read()" with "percpu_is_in_use()". 
- Inserted pm_request_resume() calls in the block layer request allocation code such that the context that submits a request no longer has to call pm_runtime_get(). Changes compared to v3: - Avoid adverse interactions between system-wide suspend/resume and runtime power management by changing the PREEMPT_ONLY flag into a counter. - Give RQF_PREEMPT back its original meaning, namely that it is only set for ide_preempt requests. - Remove the flag BLK_MQ_REQ_PREEMPT. - Removed the pm_request_resume() call. Changes compared to v2: - Fixed the build for CONFIG_BLOCK=n. - Added a patch that introduces percpu_ref_read() in the percpu-counter code. - Added a patch that makes it easier to detect missing pm_runtime_get*() calls. - Addressed Jianchao's feedback including the comment about runtime overhead of switching a per-cpu counter to atomic mode. Changes compared to v1: - Moved the runtime power management code into a separate file. - Addressed Ming's feedback. Bart Van Assche (6): block: Move power management code into a new source file block, scsi: Change the preempt-only flag into a counter block: Split blk_pm_add_request() and blk_pm_put_request() block: Schedule runtime resume earlier block: Make blk_get_request() block for non-PM requests while suspended blk-mq: Enable support for runtime power management Bart Van Assche (8): blk-mq: Document the functions that iterate over requests blk-mq: Introduce blk_mq_queue_rq_iter() block: Move power management code into a new source file block, scsi: Change the preempt-only flag into a counter block: Split blk_pm_add_request() and blk_pm_put_request() block: Schedule runtime resume earlier block: Make blk_get_request() block for non-PM requests while suspended blk-mq: Enable support for runtime power management block/Kconfig | 3 + block/Makefile | 1 + block/blk-core.c| 271 +--- block/blk-mq-debugfs.c | 10 +- block/blk-mq-tag.c | 72 ++- block/blk-mq-tag.h | 2 + block/blk-mq.c | 2 + block/blk-pm.c | 242 +++ block/blk-pm.h | 
69 ++ block/elevator.c| 22 +--- drivers/scsi/scsi_lib.c | 11 +- drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 include/linux/blkdev.h | 37 ++ 16 files changed, 472 insertions(+), 297 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h -- 2.18.0
Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended
On 9/17/18 7:39 PM, jianchao.wang wrote:
> On 09/18/2018 04:15 AM, Bart Van Assche wrote:
>> Instead of allowing requests that are not power management requests to
>> enter the queue in runtime suspended status (RPM_SUSPENDED), make the
>> blk_get_request() caller block. This change fixes a starvation issue: it
>> is now guaranteed that power management requests will be executed no
>> matter how many blk_get_request() callers are waiting. Instead of
>> maintaining the q->nr_pending counter, rely on q->q_usage_counter.
>
> Looks like we still depend on this nr_pending for blk-legacy.

That's right. I will update the commit message.

> blk_mq_queue_tag_busy_iter only accounts the driver tags. This will only
> work w/o io scheduler.
>
>> +
>>  /**
>>   * blk_pre_runtime_suspend - Pre runtime suspend check
>>   * @q: the queue of the device
>> @@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  	if (!q->dev)
>>  		return ret;
>>
>> +	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>> +
>> +	blk_set_pm_only(q);
>> +	/*
>> +	 * This function only gets called if the most recent
>> +	 * pm_request_resume() call occurred at least autosuspend_delay_ms
>
>           ^^^ pm_runtime_mark_last_busy ?

Since every pm_request_resume() call from the block layer is followed by a
pm_runtime_mark_last_busy() call and since the latter is called later I
think you are right. I will update the comment.

>> +	 * ago. Since blk_queue_enter() is called by the request allocation
>> +	 * code before pm_request_resume(), if no requests have a tag assigned
>> +	 * it is safe to suspend the device.
>> +	 */
>> +	ret = -EBUSY;
>> +	if (blk_requests_in_flight(q) == 0) {
>> +		/*
>> +		 * Call synchronize_rcu() such that later blk_queue_enter()
>> +		 * calls see the pm-only state. See also
>> +		 * http://lwn.net/Articles/573497/.
>> +		 */
>> +		synchronize_rcu();
>> +		if (blk_requests_in_flight(q) == 0)
>
> Seems not safe here.
>
> For blk-mq path: Someone may have escaped from the preempt checking,
> missed the blk_pm_request_resume there, entered into
> generic_make_request, but have not allocated a request and occupied any
> tag.
>
> There could be a similar scenario for the blk-legacy path: the
> q->nr_pending is increased when the request is queued.
>
> So I guess the q_usage_counter checking is still needed here.

There is only one blk_pm_request_resume() call and that call is inside
blk_queue_enter() after the pm_only counter has been checked. For the
legacy block layer, nr_pending is increased after the blk_queue_enter()
call from inside blk_old_get_request() succeeded. So I don't see how
blk_pm_request_resume() or q->nr_pending++ could escape from the preempt
checking?

Thanks,

Bart.
Re: [PATCH v7 0/6] blk-mq: Implement runtime power management
On 9/17/18 1:11 PM, Bart Van Assche wrote:
[ ... ]

Please ignore this e-mail thread - it contains a mixup of two versions of
this patch series.

Bart.
[PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests to
enter the queue in runtime suspended status (RPM_SUSPENDED), make the
blk_get_request() caller block. This change fixes a starvation issue: it
is now guaranteed that power management requests will be executed no
matter how many blk_get_request() callers are waiting. Instead of
maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call
pm_runtime_mark_last_busy() every time a request finishes instead of
only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c | 37 ++---
 block/blk-pm.c   | 72
 2 files changed, 75 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b874d5c9c9..ae092ca121d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
 	}
 }

-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-	switch (rq->q->rpm_status) {
-	case RPM_RESUMING:
-	case RPM_SUSPENDING:
-		return rq->rq_flags & RQF_PM;
-	case RPM_SUSPENDED:
-		return false;
-	default:
-		return true;
-	}
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-	return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
 	struct hd_struct *part;
@@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct request_queue *q)
 	while (1) {
 		list_for_each_entry(rq, &q->queue_head, queuelist) {
-			if (blk_pm_allow_request(rq))
-				return rq;
-
-			if (rq->rq_flags & RQF_SOFTBARRIER)
-				break;
+#ifdef CONFIG_PM
+			/*
+			 * If a request gets queued in state RPM_SUSPENDED
+			 * then that's a kernel bug.
+			 */
+			WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+			return rq;
 		}

 		/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..5f21bedcb307 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0

+#include <linux/blk-mq.h>
 #include <linux/blk-pm.h>
 #include <linux/blkdev.h>
 #include <linux/pm_runtime.h>
+#include "blk-mq.h"
+#include "blk-mq-tag.h"

 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -40,6 +43,36 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);

+struct in_flight_data {
+	struct request_queue	*q;
+	int			in_flight;
+};
+
+static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				void *priv, bool reserved)
+{
+	struct in_flight_data *in_flight = priv;
+
+	if (rq->q == in_flight->q)
+		in_flight->in_flight++;
+}
+
+/*
+ * Count the number of requests that are in flight for request queue @q. Use
+ * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
+ * queues. Use blk_mq_queue_tag_busy_iter() instead of
+ * blk_mq_tagset_busy_iter() because the latter only considers requests that
+ * have already been started.
+ */
+static int blk_requests_in_flight(struct request_queue *q)
+{
+	struct in_flight_data in_flight = { .q = q };
+
+	if (q->mq_ops)
+		blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, &in_flight);
+	return q->nr_pending + in_flight.in_flight;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;

+	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+	blk_set_pm_only(q);
+	/*
+	 * This function only gets called if the most recent
+	 * pm_request_resume() call occurred at least autosuspend_delay_ms
+	 * ago. Since blk_queue_enter() is called by the request allocation
+	 * code before pm_request_resume(), if no requests have a tag assigned
+	 * it is safe to suspend the device.
+	 */
+	ret = -EBUSY;
+	if (blk_requests_in_flight(q) == 0) {
+		/*
+		 * Call synchronize_rcu() such that later blk_queue_enter()
+		 * calls see the pm-only state. See also
+		 * http://lwn.net/Articles/573497/.
+		 */
+		synchronize_rcu();
+
[PATCH v7 4/6] block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a request
has been queued, schedule asynchronous resume during request allocation.
The new pm_request_resume() calls occur after blk_queue_enter() has
increased the q_usage_counter request queue member. This change is
needed for a later patch that will make request allocation block while
the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c | 2 ++
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..18b874d5c9c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		if (success)
 			return 0;

+		blk_pm_request_resume(q);
+
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;

diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 	trace_block_rq_insert(q, rq);

 	blk_pm_add_request(q, rq);
-	blk_pm_request_resume(q);

 	rq->q = q;
--
2.18.0