Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices
On Thu, Jul 13, 2017 at 08:02:41AM -0600, Jens Axboe wrote:
> Because it sucks as an interface - there's no way to apply different
> settings to different devices, and using the kernel command line to
> control something like this is ugly. It never should have been done.
>
> The sysfs interface, either manually, scripted, or through udev,
> makes a lot more sense.

Not that I disagree with your reasons, but not being able to select a
mq scheduler confuses quite some users out there.

Byte,
        Johannes

--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices
> On 13 Jul 2017, at 16:02, Jens Axboe wrote:
>
> On 07/13/2017 04:11 AM, Johannes Thumshirn wrote:
>> On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote:
>>> No, that boot option was a mistake, let's not propagate that to mq
>>> scheduling as well.
>>
>> Can you please explain why? I as well have requests from our users to
>> select the mq schedulers on the command line.
>
> Because it sucks as an interface - there's no way to apply different
> settings to different devices, and using the kernel command line to
> control something like this is ugly. It never should have been done.
>
> The sysfs interface, either manually, scripted, or through udev,
> makes a lot more sense.

One doubt: with the new interface, and using, I guess, udev, is it
still possible to control which I/O scheduler is actually used during
all of the boot process, or at least almost all of it?

Thanks,
Paolo

> --
> Jens Axboe
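For reference, the sysfs route Jens mentions can be driven by hand or by a udev rule that fires at device discovery (and therefore early in boot, once udev is running). The device name, scheduler, and rule filename below are illustrative, not prescriptions:

```shell
# One-off, at runtime (device name illustrative):
echo mq-deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler   # active scheduler shown in [brackets]

# Persistently, via a udev rule, e.g. /etc/udev/rules.d/60-ioscheduler.rules:
#   ACTION=="add|change", KERNEL=="sd[a-z]", \
#       ATTR{queue/scheduler}="mq-deadline"
```

This is a config sketch only; which schedulers are offered depends on the kernel build and on whether the device uses blk-mq.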
[PATCH] efi: add gpt_overprovisioned kernel command line parameter
Allow a GPT partition table to be used if it is overprovisioned, i.e.
the last usable LBA in the GPT header lies beyond the device
boundaries. This behaviour is enabled by adding gpt_overprovisioned to
the kernel command line. It will not expose any individual partitions
that go beyond the device boundaries, only the ones that are under
that limit.

Default off; one can perform an exploratory boot of the kernel and
triage or back up the properly provisioned partitions. This would
allow the system to at least partially boot, for example if the boot,
root or system filesystems reside on properly provisioned partitions.

Examples: helpful should a RAID array be resized downwards, or during
embedded development should an overprovisioned partition table update
be flashed to a device by accident.

Signed-off-by: Mark Salyzyn
---
 block/partitions/efi.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 39f70d968754..e18ee5cf138c 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -117,6 +117,15 @@ force_gpt_fn(char *str)
 }
 __setup("gpt", force_gpt_fn);

+static int overprovisioned_gpt;
+static int __init
+overprovisioned_gpt_fn(char *str)
+{
+	overprovisioned_gpt = 1;
+	return 1;
+}
+__setup("gpt_overprovisioned", overprovisioned_gpt_fn);
+
 /**
  * efi_crc32() - EFI version of crc32 function
@@ -420,7 +429,8 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
			 (unsigned long long)lastlba);
-		goto fail;
+		if (!overprovisioned_gpt)
+			goto fail;
 	}
 	if (le64_to_cpu((*gpt)->last_usable_lba) < le64_to_cpu((*gpt)->first_usable_lba)) {
 		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
--
2.13.2.932.g7449e964c-goog
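For completeness, enabling this on a typical distro means appending the parameter to the bootloader's kernel command line; a hedged illustration (paths, variable names, and the regeneration command depend on the distro):

```shell
# One-off: edit the kernel command line at the boot menu and append:
#   gpt_overprovisioned
# Persistently, e.g. via GRUB (illustrative):
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} gpt_overprovisioned"
# then regenerate the config, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```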
Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()
On Thu, Jul 13 2017, Shaohua Li wrote:
> On Thu, Jul 13, 2017 at 05:20:52PM +0800, Ming Lei wrote:
>> On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote:
>> > On Thu, Jul 13 2017, Ming Lei wrote:
>> > > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
>> > >> On Wed, Jul 12 2017, Ming Lei wrote:
>> > >> > We will support multipage bvec soon, so initialize the bvec
>> > >> > table using the standard way instead of writing the
>> > >> > table directly. Otherwise it won't work any more once
>> > >> > multipage bvec is enabled.
>> > >> >
>> > >> > Acked-by: Guoqing Jiang
>> > >> > Signed-off-by: Ming Lei
>> > >> > ---
>> > >> >  drivers/md/md.c     | 21 +
>> > >> >  drivers/md/md.h     |  3 +++
>> > >> >  drivers/md/raid1.c  | 16 ++--
>> > >> >  drivers/md/raid10.c |  4 ++--
>> > >> >  4 files changed, 28 insertions(+), 16 deletions(-)
>> > >> >
>> > >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > >> > index 8cdca0296749..cc8dcd928dde 100644
>> > >> > --- a/drivers/md/md.c
>> > >> > +++ b/drivers/md/md.c
>> > >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
>> > >> >  }
>> > >> >  EXPORT_SYMBOL(md_reload_sb);
>> > >> >
>> > >> > +/* generally called after bio_reset() for resetting bvec */
>> > >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
>> > >> > +			       int size)
>> > >> > +{
>> > >> > +	int idx = 0;
>> > >> > +
>> > >> > +	/* initialize bvec table again */
>> > >> > +	do {
>> > >> > +		struct page *page = resync_fetch_page(rp, idx);
>> > >> > +		int len = min_t(int, size, PAGE_SIZE);
>> > >> > +
>> > >> > +		/*
>> > >> > +		 * won't fail because the vec table is big
>> > >> > +		 * enough to hold all these pages
>> > >> > +		 */
>> > >> > +		bio_add_page(bio, page, len, 0);
>> > >> > +		size -= len;
>> > >> > +	} while (idx++ < RESYNC_PAGES && size > 0);
>> > >> > +}
>> > >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
>> > >>
>> > >> I really don't think this is a good idea.
>> > >> This code is specific to raid1/raid10. It is not generic md code. So
>> > >> it doesn't belong here.
>> > >
>> > > We already added 'struct resync_pages' in drivers/md/md.h, so I think
>> > > it is reasonable to add this function into drivers/md/md.c
>> >
>> > Alternative perspective: it was unreasonable to add "resync_pages" to
>> > md.h.
>> >
>> > >> If you want to remove code duplication, then work on moving all raid1
>> > >> functionality into raid10.c, then discard raid1.c
>> > >
>> > > This patch is for avoiding new code duplication, not for removing current
>> > > duplication.
>> > >
>> > >> Or at the very least, have a separate "raid1-10.c" file for the common
>> > >> code.
>> > >
>> > > You suggested it last time, but it looks like overkill to be taken. But
>> > > if someone wants to refactor raid1 and raid10, I think it can be a good
>> > > start, but it still doesn't belong in this patch.
>> >
>> > You are trying to create common code for raid1 and raid10. This does
>> > not belong in md.c.
>> > If you really want to have a single copy of common code, then it exactly
>> > is the role of this patch to create a place to put it.
>> > I'm not saying you should put all common code in raid1-10.c. Just the
>> > function that you have identified.
>>
>> I really don't want to waste time on this kind of thing; I can do
>> either one frankly.
>>
>> Shaohua, could you share which way you would like to merge? I can do it
>> either way.
>
> I don't have a strong preference, but Neil's suggestion does make the code a
> little better. Of course, only put the function into raid1-10.c right now.

To make it as easy as possible, I would suggest creating raid1-10.c
containing just this function (and maybe the definitions from md.h),
declaring the function "static", and #including raid1-10.c into
raid1.c and raid10.c, i.e. no worrying about modules and exporting
symbols.

Anyone who cares (and that might even be me) could then move
functionality gradually out of raid1.c and raid10.c into raid1-10.c.
Maybe there would come a tipping point where it is easy to just discard
raid1.c and raid10.c and finish the job.

Thanks,
NeilBrown
[for-4.14 RFC PATCH 0/2] dm rq: eliminate historic blk-mq and .request_fn queue stacking restrictions
Hi,

Please review the 2 patches in this series. I'm left on the fence about
whether there is any point re-enabling "dm-mq stacked on old .request_fn
device(s)" -- especially given the awkward details documented in the 1/2
patch header. I welcome any feedback on this, thanks!

BTW, I tested these changes using mptest: git://github.com/snitm/mptest.git
with the following permutations (the 'runtest' script was tweaked before
each permutation):

echo N > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo Y > /sys/module/dm_mod/parameters/use_blk_mq

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

echo N > /sys/module/scsi_mod/parameters/use_blk_mq
echo Y > /sys/module/dm_mod/parameters/use_blk_mq

Mike Snitzer (2):
  dm rq: avoid deadlock if dm-mq is stacked on old .request_fn device(s)
  dm rq: eliminate historic blk-mq and .request_fn queue stacking
    restrictions

 drivers/md/dm-mpath.c         | 16 ++--
 drivers/md/dm-rq.c            | 13 +
 drivers/md/dm-table.c         | 31 +++
 drivers/md/dm-target.c        |  4 ++--
 drivers/md/dm.h               |  1 -
 include/linux/device-mapper.h |  3 ++-
 6 files changed, 26 insertions(+), 42 deletions(-)

--
2.10.1
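The four module-parameter permutations above can also be cycled in a loop; a minimal sketch only (assumes root, a checked-out mptest tree, and that 'runtest' tolerates being re-run without the per-permutation tweaks mentioned above):

```shell
#!/bin/sh
# Walk all four scsi_mod/dm_mod use_blk_mq permutations and run mptest
# for each. Sketch: module parameters may only be writable before the
# corresponding queues are created.
for scsi in N Y; do
	for dm in N Y; do
		echo "$scsi" > /sys/module/scsi_mod/parameters/use_blk_mq
		echo "$dm"   > /sys/module/dm_mod/parameters/use_blk_mq
		echo "== scsi_mod=$scsi dm_mod=$dm =="
		./runtest
	done
done
```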
[for-4.14 RFC PATCH 2/2] dm rq: eliminate historic blk-mq and .request_fn queue stacking restrictions
Currently if dm_mod.use_blk_mq=Y (or a DM multipath table is loaded with
queue_mode=mq) and all underlying devices are not blk-mq, DM core will
fail with the error:

  "table load rejected: all devices are not blk-mq request-stackable"

This all-blk-mq-or-nothing approach is too cut-throat because it
prevents access to data stored on what could have been a previously
working multipath setup (e.g. if the user decides to try
dm_mod.use_blk_mq=Y or queue_mode=mq only to find their underlying
devices aren't blk-mq).

This restriction, and others like not being able to stack a top-level
blk-mq request_queue on top of old .request_fn device(s), can be removed
thanks to commit eb8db831be ("dm: always defer request allocation to the
owner of the request_queue"). Now that request-based DM will always rely
on the target (multipath) to call blk_get_request() to create a clone
request, it is possible to support all 4 permutations of stacking old
.request_fn and blk-mq request_queues.

Depends-on: eb8db831be ("dm: always defer request allocation to the owner of the request_queue")
Reported-by: Ewan Milne
Signed-off-by: Mike Snitzer
---
 drivers/md/dm-rq.c    |  5 -
 drivers/md/dm-table.c | 31 +++
 drivers/md/dm.h       |  1 -
 3 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 95bb44c..d64677b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -782,11 +782,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	struct dm_target *immutable_tgt;
 	int err;

-	if (!dm_table_all_blk_mq_devices(t)) {
-		DMERR("request-based dm-mq may only be stacked on blk-mq device(s)");
-		return -EINVAL;
-	}
-
 	md->tag_set = kzalloc_node(sizeof(struct blk_mq_tag_set), GFP_KERNEL, md->numa_node_id);
 	if (!md->tag_set)
 		return -ENOMEM;

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index a39bcd9..e630768 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -46,7 +46,6 @@ struct dm_table {
 	bool integrity_supported:1;
 	bool singleton:1;
-	bool all_blk_mq:1;
 	unsigned integrity_added:1;

 	/*
@@ -910,7 +909,6 @@ static int dm_table_determine_type(struct dm_table *t)
 {
 	unsigned i;
 	unsigned bio_based = 0, request_based = 0, hybrid = 0;
-	unsigned sq_count = 0, mq_count = 0;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -985,11 +983,9 @@ static int dm_table_determine_type(struct dm_table *t)
 		int srcu_idx;
 		struct dm_table *live_table = dm_get_live_table(t->md, &srcu_idx);

-		/* inherit live table's type and all_blk_mq */
-		if (live_table) {
+		/* inherit live table's type */
+		if (live_table)
 			t->type = live_table->type;
-			t->all_blk_mq = live_table->all_blk_mq;
-		}
 		dm_put_live_table(t->md, srcu_idx);
 		return 0;
 	}
@@ -999,25 +995,9 @@ static int dm_table_determine_type(struct dm_table *t)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);

 		if (!blk_queue_stackable(q)) {
-			DMERR("table load rejected: including"
-			      " non-request-stackable devices");
+			DMERR("table load rejected: includes non-request-stackable devices");
 			return -EINVAL;
 		}
-
-		if (q->mq_ops)
-			mq_count++;
-		else
-			sq_count++;
-	}
-	if (sq_count && mq_count) {
-		DMERR("table load rejected: not all devices are blk-mq request-stackable");
-		return -EINVAL;
-	}
-	t->all_blk_mq = mq_count > 0;
-
-	if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) {
-		DMERR("table load rejected: all devices are not blk-mq request-stackable");
-		return -EINVAL;
 	}

 	return 0;
@@ -1067,11 +1047,6 @@ bool dm_table_request_based(struct dm_table *t)
 	return __table_type_request_based(dm_table_get_type(t));
 }

-bool dm_table_all_blk_mq_devices(struct dm_table *t)
-{
-	return t->all_blk_mq;
-}
-
 static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *md)
 {
 	enum dm_queue_mode type = dm_table_get_type(t);

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 38c84c0..c484c4d 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -70,7 +70,6 @@ struct dm_target *dm_table_get_immutable_target(struct dm_table *t);
 struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
 bool dm_table_bio_based(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
-bool dm_t
[for-4.14 RFC PATCH 1/2] dm rq: avoid deadlock if dm-mq is stacked on old .request_fn device(s)
Conditionally use __blk_put_request() or blk_put_request() instead of
just blk_put_request() in multipath_release_clone(). Otherwise a
deadlock will occur, because scsi_end_request() takes the clone
request's queue_lock around its call to blk_finish_request(), and the
later blk_put_request() also tries to take it:

[12749.916332] queued_spin_lock_slowpath+0xb/0xf
[12749.916335] _raw_spin_lock_irqsave+0x37/0x40
[12749.916339] blk_put_request+0x39/0x60
[12749.916342] multipath_release_clone+0xe/0x10 [dm_multipath]
[12749.916350] dm_softirq_done+0x156/0x240 [dm_mod]
[12749.916353] __blk_mq_complete_request+0x90/0x140
[12749.916355] blk_mq_complete_request+0x16/0x20
[12749.916360] dm_complete_request+0x23/0x40 [dm_mod]
[12749.916365] end_clone_request+0x1d/0x20 [dm_mod]
[12749.916367] blk_finish_request+0x83/0x120
[12749.916370] scsi_end_request+0x12d/0x1d0
[12749.916371] scsi_io_completion+0x13c/0x630
[12749.916374] ? set_next_entity+0x7c/0x780
[12749.916376] scsi_finish_command+0xd9/0x120
[12749.916378] scsi_softirq_done+0x12a/0x150
[12749.916380] blk_done_softirq+0x9e/0xd0
[12749.916382] __do_softirq+0xc9/0x269
[12749.916384] run_ksoftirqd+0x29/0x50
[12749.916385] smpboot_thread_fn+0x110/0x160
[12749.916387] kthread+0x109/0x140
[12749.916389] ? sort_range+0x30/0x30
[12749.916390] ? kthread_park+0x60/0x60
[12749.916391] ret_from_fork+0x25/0x30

This "fix" is gross in that the long-term fitness of stacking blk-mq DM
multipath (dm-mq) on top of old .request_fn devices is questionable. The
above stack trace shows just how ugly it is to have old .request_fn SCSI
code cascade into blk-mq code during DM multipath request completion.
Signed-off-by: Mike Snitzer
---
 drivers/md/dm-mpath.c         | 16 ++--
 drivers/md/dm-rq.c            |  8 +---
 drivers/md/dm-target.c        |  4 ++--
 include/linux/device-mapper.h |  3 ++-
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 0e8ab5b..34cf7b6 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -520,9 +520,21 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 	return DM_MAPIO_REMAPPED;
 }

-static void multipath_release_clone(struct request *clone)
+static void multipath_release_clone(struct dm_target *ti, struct request *clone)
 {
-	blk_put_request(clone);
+	struct multipath *m = ti->private;
+	struct request_queue *q = clone->q;
+
+	if (!q->mq_ops && m->queue_mode == DM_TYPE_MQ_REQUEST_BASED) {
+		/*
+		 * dm-mq on .request_fn already holds clone->q->queue_lock
+		 * via blk_finish_request()...
+		 * - true for .request_fn SCSI, but is it _always_ true?
+		 */
+		lockdep_assert_held(q->queue_lock);
+		__blk_put_request(q, clone);
+	} else
+		blk_put_request(clone);
 }

 /*

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index c6ebc5b..95bb44c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -220,11 +220,12 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 {
 	int rw = rq_data_dir(clone);
 	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;

 	blk_rq_unprep_clone(clone);
-	tio->ti->type->release_clone_rq(clone);
+	ti->type->release_clone_rq(ti, clone);

 	rq_end_stats(md, rq);
 	if (!rq->q->mq_ops)
@@ -267,6 +268,7 @@ static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)
 {
+	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	int rw = rq_data_dir(rq);
@@ -274,7 +276,7 @@ static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_
 	rq_end_stats(md, rq);
 	if (tio->clone) {
 		blk_rq_unprep_clone(tio->clone);
-		tio->ti->type->release_clone_rq(tio->clone);
+		ti->type->release_clone_rq(ti, tio->clone);
 	}

 	if (!rq->q->mq_ops)
@@ -488,7 +490,7 @@ static int map_request(struct dm_rq_target_io *tio)
 	case DM_MAPIO_REMAPPED:
 		if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
 			/* -ENOMEM */
-			ti->type->release_clone_rq(clone);
+			ti->type->release_clone_rq(ti, clone);
 			return DM_MAPIO_REQUEUE;
 		}

diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index c0d7e60..adbd17b 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -138,12 +138,12 @@ static int io_err_clone_and_map_rq(struct dm_target *ti, struct request *rq,
 		return DM_MAPI
Re: 4.12 NULL pointer dereference in kmem_cache_free on USB storage removal
On Thu, 2017-07-13 at 23:24 +0300, Meelis Roos wrote:
> [258062.320700] RIP: 0010:kmem_cache_free+0x12/0x160
> [258062.320886] Call Trace:
> [258062.320897]  scsi_exit_rq+0x4d/0x60
> [258062.320909]  free_request_size+0x1c/0x30
> [258062.320923]  mempool_destroy+0x1d/0x60
> [258062.320935]  blk_exit_rl+0x1b/0x40
> [258062.320946]  __blk_release_queue+0x7d/0x120
> [258062.320959]  process_one_work+0x1af/0x340
> [258062.320972]  worker_thread+0x43/0x3e0
> [258062.320984]  kthread+0xfe/0x130
> [258062.320995]  ? create_worker+0x170/0x170
> [258062.321007]  ? kthread_create_on_node+0x40/0x40
> [258062.321022]  ret_from_fork+0x22/0x30

Hello Meelis,

Thank you for your report. Can you apply commit 8e6882545d8c ("scsi:
Avoid that scsi_exit_rq() triggers a use-after-free") on top of kernel
v4.12 and retest? That commit has been tagged "Cc: stable" so I hope
that it will be included in kernel v4.12.1. However, that kernel is
not yet available, unfortunately ...

Thanks,

Bart.
Re: [v4.12-rc6 regression] commit dc9edc44de6c introduced use-after-free
On Thu, 2017-06-29 at 19:34 +0800, Eryu Guan wrote:
> Hi all,
>
> I got a use-after-free report from a kasan-enabled kernel when running
> fstests xfs/279 (generic/108 can trigger it too). I appended the
> console log at the end of this email.
>
> git bisect pointed to dc9edc44de6c ("block: Fix a blk_exit_rl()
> regression") as the first bad commit, and reverting that commit on top
> of the v4.12-rc7 kernel does resolve the use-after-free.
>
> I can reproduce it by simply inserting & removing the scsi_debug
> module:
>
> modprobe scsi_debug
> modprobe -r scsi_debug
>
> If you need more info please let me know.
>
> Thanks,
> Eryu
>
> [  101.977744] run fstests xfs/279 at 2017-06-29 19:08:59
> [  102.458699] scsi host5: scsi_debug: version 1.86 [20160430]
> [  102.458699]   dev_size_mb=128, opts=0x0, submit_queues=1, statistics=0
> [  102.472103] scsi 5:0:0:0: Direct-Access     Linux    scsi_debug       0186 PQ: 0 ANSI: 7
> [  102.503428] sd 5:0:0:0: Attached scsi generic sg5 type 0
> [  102.505414] sd 5:0:0:0: [sde] 262144 512-byte logical blocks: (134 MB/128 MiB)
> [  102.505418] sd 5:0:0:0: [sde] 4096-byte physical blocks
> [  102.506568] sd 5:0:0:0: [sde] Write Protect is off
> [  102.508874] sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> [  102.535845] sd 5:0:0:0: [sde] Attached SCSI disk
> [  104.876076] sd 5:0:0:0: [sde] Synchronizing SCSI cache
> [  104.92] ==
> [  104.932796] BUG: KASAN: use-after-free in scsi_exit_rq+0xf3/0x120
> [  104.938886] Read of size 1 at addr 88022d574580 by task kworker/3:1/78
> [  104.945755]
> [  104.947254] CPU: 3 PID: 78 Comm: kworker/3:1 Not tainted 4.12.0-rc6.kasan #98
> [  104.954382] Hardware name: IBM System x3550 M3 -[7944OEJ]-/90Y4784, BIOS -[D6E150CUS-1.11]- 02/08/2011
> [  104.964117] Workqueue: events __blk_release_queue
> [  104.968819] Call Trace:
> [  104.971271]  dump_stack+0x63/0x89
> [  104.974588]  print_address_description+0x78/0x290
> [  104.979291]  ? scsi_exit_rq+0xf3/0x120
> [  104.983042]  kasan_report+0x230/0x340
> [  104.986706]  __asan_report_load1_noabort+0x19/0x20
> [  104.991496]  scsi_exit_rq+0xf3/0x120
> [  104.995074]  free_request_size+0x44/0x60
> [  104.998999]  mempool_destroy.part.6+0x9b/0x150
> [  105.003444]  mempool_destroy+0x13/0x20
> [  105.007195]  blk_exit_rl+0x3b/0x60
> [  105.010599]  __blk_release_queue+0x14c/0x410
> [  105.014874]  process_one_work+0x5be/0xe90
> [  105.018883]  worker_thread+0xe4/0xe70
> [  105.022547]  ? pci_mmcfg_check_reserved+0x110/0x110
> [  105.027423]  kthread+0x2d3/0x3d0
> [  105.030653]  ? process_one_work+0xe90/0xe90
> [  105.034836]  ? kthread_create_on_node+0xb0/0xb0
> [  105.039366]  ret_from_fork+0x25/0x30
> [  105.042940]
> [  105.044436] Allocated by task 2763:
> [  105.047927]  save_stack_trace+0x1b/0x20
> [  105.051761]  save_stack+0x46/0xd0
> [  105.055074]  kasan_kmalloc+0xad/0xe0
> [  105.058653]  __kmalloc+0x105/0x1f0
> [  105.062057]  scsi_host_alloc+0x6d/0x11b0
> [  105.065980]  0xa0ad5ba6
> [  105.069123]  driver_probe_device+0x5d2/0xc70
> [  105.073393]  __device_attach_driver+0x1d3/0x2a0
> [  105.077920]  bus_for_each_drv+0x114/0x1c0
> [  105.081928]  __device_attach+0x1bf/0x290
> [  105.085850]  device_initial_probe+0x13/0x20
> [  105.090031]  bus_probe_device+0x19b/0x240
> [  105.094038]  device_add+0x842/0x1420
> [  105.097616]  device_register+0x1a/0x20
> [  105.101365]  0xa0adf185
> [  105.104507]  0xa0920a55
> [  105.107650]  do_one_initcall+0x91/0x210
> [  105.111487]  do_init_module+0x1bb/0x549
> [  105.115323]  load_module+0x4ea8/0x5f50
> [  105.119073]  SYSC_finit_module+0x169/0x1a0
> [  105.123169]  SyS_finit_module+0xe/0x10
> [  105.126919]  do_syscall_64+0x18a/0x410
> [  105.130669]  return_from_SYSCALL_64+0x0/0x6a
> [  105.134937]
> [  105.136432] Freed by task 2823:
> [  105.139573]  save_stack_trace+0x1b/0x20
> [  105.143407]  save_stack+0x46/0xd0
> [  105.146721]  kasan_slab_free+0x72/0xc0
> [  105.150471]  kfree+0x96/0x1a0
> [  105.153440]  scsi_host_dev_release+0x2cb/0x430
> [  105.157883]  device_release+0x76/0x1d0
> [  105.161634]  kobject_put+0x192/0x3f0
> [  105.165209]  put_device+0x17/0x20
> [  105.168524]  scsi_host_put+0x15/0x20
> [  105.172100]  0xa0ad8e0b
> [  105.175242]  device_release_driver_internal+0x26a/0x4e0
> [  105.180463]  device_release_driver+0x12/0x20
> [  105.184733]  bus_remove_device+0x2d0/0x590
> [  105.188830]  device_del+0x526/0x8d0
> [  105.192317]  device_unregister+0x1a/0xa0
> [  105.196239]  0xa0ad6381
> [  105.199379]  0xa0ae8924
> [  105.202520]  SyS_delete_module+0x38e/0x440
> [  105.206617]  do_syscall_64+0x18a/0x410
> [  105.210366]  return_from_SYSCALL_64+0x0/0x6a
> [  105.214634]
> [  105.216130] The buggy address belongs to the object at 88022d574400
> [  105.216130]  which belongs to the cache kmalloc-2048 of size 2048
> [  105.228808] The buggy address is
4.12 NULL pointer dereference in kmem_cache_free on USB storage removal
Just found this in dmesg, on disconnecting unmounted camera USB storage.

[219369.449387] usb 3-2: new full-speed USB device number 3 using xhci_hcd
[219369.598434] usb 3-2: New USB device found, idVendor=054c, idProduct=05be
[219369.598437] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[219369.598439] usb 3-2: Product: USB Bus-powered Device
[219369.598440] usb 3-2: Manufacturer: Sony
[219369.598442] usb 3-2: SerialNumber: C67320732569
[219369.605665] input: Sony USB Bus-powered Device as /devices/pci:00/:00:14.0/usb3/3-2/3-2:1.0/0003:054C:05BE.0002/input/input14
[219369.605843] hid-generic 0003:054C:05BE.0002: input,hidraw1: USB HID v1.11 Device [Sony USB Bus-powered Device] on usb-:00:14.0-2/input0
[219369.626538] usb 3-2: USB disconnect, device number 3
[219373.529247] usb 3-2: new high-speed USB device number 4 using xhci_hcd
[219373.671251] usb 3-2: New USB device found, idVendor=054c, idProduct=0529
[219373.671254] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[219373.671255] usb 3-2: Product: DSC-RX100
[219373.671257] usb 3-2: Manufacturer: Sony
[219373.671258] usb 3-2: SerialNumber: C67320732569
[219373.672166] usb-storage 3-2:1.0: USB Mass Storage device detected
[219373.672280] scsi host7: usb-storage 3-2:1.0
[219374.687670] scsi 7:0:0:0: Direct-Access     Sony     DSC              1.00 PQ: 0 ANSI: 0
[219374.688100] sd 7:0:0:0: Attached scsi generic sg7 type 0
[219374.796049] sd 7:0:0:0: [sdg] 62676992 512-byte logical blocks: (32.1 GB/29.9 GiB)
[219374.796934] sd 7:0:0:0: [sdg] Write Protect is off
[219374.796937] sd 7:0:0:0: [sdg] Mode Sense: 03 00 00 00
[219374.798338] sd 7:0:0:0: [sdg] No Caching mode page found
[219374.798343] sd 7:0:0:0: [sdg] Assuming drive cache: write through
[219374.811127]  sdg: sdg1
[219374.817442] sd 7:0:0:0: [sdg] Attached SCSI removable disk
[219378.245104] FAT-fs (sdg1): utf8 is not a recommended IO charset for FAT filesystems, filesystem will be case sensitive!
[219378.286284] FAT-fs (sdg1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[255861.349304] usb 3-3: USB disconnect, device number 2
[255861.706636] usb 3-3: new low-speed USB device number 5 using xhci_hcd
[255861.851692] usb 3-3: New USB device found, idVendor=046d, idProduct=c00e
[255861.851697] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[255861.851699] usb 3-3: Product: USB-PS/2 Optical Mouse
[255861.851701] usb 3-3: Manufacturer: Logitech
[255861.855516] input: Logitech USB-PS/2 Optical Mouse as /devices/pci:00/:00:14.0/usb3/3-3/3-3:1.0/0003:046D:C00E.0003/input/input15
[255861.855767] hid-generic 0003:046D:C00E.0003: input,hidraw0: USB HID v1.10 Mouse [Logitech USB-PS/2 Optical Mouse] on usb-:00:14.0-3/input0
[258062.212706] usb 3-2: USB disconnect, device number 4
[258062.320255] BUG: unable to handle kernel NULL pointer dereference at 0009
[258062.320290] IP: kmem_cache_free+0x12/0x160
[258062.320303] PGD 0
[258062.320303] P4D 0
[258062.320324] Oops:  [#1] SMP
[258062.320334] Modules linked in: nls_utf8 nls_cp775 vfat fat joydev uinput snd_hrtimer uas usb_storage xt_CT iptable_raw xt_length xt_mark iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c iptable_filter cpufreq_powersave cpufreq_userspace usbhid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crct10dif_pclmul i915 crc32c_intel iosf_mbi i2c_algo_bit drm_kms_helper drm sr_mod cdrom intel_gtt agpgart xhci_pci e1000e syscopyarea snd_hda_intel xhci_hcd ehci_pci snd_hda_codec sysfillrect ehci_hcd pcspkr serio_raw sysimgblt snd_hda_core e100 ptp usbcore lpc_ich fb_sys_fops
[258062.320536]  mii sg pps_core usb_common snd_hwdep mfd_core i2c_i801 nuvoton_cir rc_core video coretemp hwmon 8021q garp mrp stp llc snd_seq_dummy snd_seq_oss snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore parport_pc lp parport ip_tables x_tables autofs4
[258062.320623] CPU: 2 PID: 21200 Comm: kworker/2:2 Not tainted 4.12.0 #181
[258062.320641] Hardware name:  /DH77KC, BIOS KCH7710H.86A.0110.2013.0513.1018 05/13/2013
[258062.320668] Workqueue: events __blk_release_queue
[258062.320682] task: 88010da0ea00 task.stack: c900029f8000
[258062.320700] RIP: 0010:kmem_cache_free+0x12/0x160
[258062.320713] RSP: 0018:c900029fbdb8 EFLAGS: 00010202
[258062.320729] RAX: 813ab050 RBX: 8802151e1880 RCX: 0003
[258062.320749] RDX: 880213736840 RSI: 8802151e1880 RDI:
[258062.320768] RBP: c900029fbdc8 R08: 880213736920 R09: 000180800040
[258062.320788] R10: c900029fbe20 R11: 0c00 R12:
[2580
Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
On Thu, 2017-07-13 at 18:23 +0800, Ming Lei wrote:
> Please discuss the congestion control in patch 4 and 5.

Hello Ming,

Since there is a significant risk that this patch series will introduce
performance and/or functional regressions, I will wait with reviewing
this patch series further until it has been shown that there is no less
risky alternative to realize the same performance improvement. I think
Jens had already suggested a possible alternative, namely only starting
a queue if it really has to be started.

Bart.
Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
On Thu, 2017-07-13 at 23:32 +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 02:56:38PM +, Bart Van Assche wrote:
> > hctx_may_queue() severely limits the queue depth if many LUNs are
> > associated with the same SCSI host. I think that this is a performance
> > regression compared to scsi-sq and that this performance regression
> > should be fixed.
>
> IMO, it is hard to evaluate/compare perf between scsi-mq vs scsi-sq:
>
> - how many LUNs do you run IO on concurrently?
> - do you evaluate the perf on a single LUN or on multiple LUNs?
>
> BTW, active_queues is a runtime variable which accounts for the actual
> active queues in use.

Hello Ming,

What I described can be verified easily by running fio against a single
LUN of a SCSI host with which a large number of LUNs are associated.

BTW, something I overlooked in my analysis of the active_queues variable
is that it is not only decremented if a queue is destroyed but also if a
queue is idle during (block layer request timeout) seconds. So the
problem I described will only occur if some software submits I/O
requests periodically to all SCSI LUNs with a shorter interval than the
I/O timeout. I'm not sure any Linux software does this today. As an
example, the time between path checks by the multipathd TUR checker
typically exceeds the timeout of a single TUR check.

Bart.
Re: [PATCH 2/2] block: delete bio_uninit
On Thu, Jul 13, 2017 at 09:50:23AM -0700, Shaohua Li wrote:
> > As said in the last mail I think there is no point in having this call..
>
> I'm hesitant to do this. bio_associate_blkcg/bio_associate_current can be
> called at any time for a bio, so we don't just attach cgroup info to a bio
> at submit time (maybe the bio_associate_blkcg/bio_associate_current
> callers do always submit, but I haven't audited that yet).

bio_associate_current is only called from blk_throtl_assoc_bio, which
is only called from blk_throtl_bio, which is only called from
blkcg_bio_issue_check, which is only called from
generic_make_request_checks, which is only called from
generic_make_request, which at that point consumes the bio from the
caller's perspective.

bio_associate_blkcg might be a different story, but then we should
treat both very differently.

> The other reason is I'd like blk_throtl_bio_endio to be called only once
> for the whole bio, not for each split bio, so this depends on bio_free to
> free cgroup info for chained bios, which always do bio_put.

Then we'll need to fix that. We really should not require every caller
of bio_init to pair it with a new uninit call, which would be a whole
lot more work.
Re: [PATCH 2/2] block: delete bio_uninit
On Thu, Jul 13, 2017 at 09:45:19AM +0200, Christoph Hellwig wrote: > > static void bio_free(struct bio *bio) > > { > > struct bio_set *bs = bio->bi_pool; > > void *p; > > > > - bio_uninit(bio); > > + bio_disassociate_task(bio); > > As said in the last mail I think there is no point in having this call.. I'm hesitant to do this. bio_associate_blkcg/bio_associate_current can be called at any time for a bio, so we don't just attach cgroup info to a bio at submit time (maybe the bio_associate_blkcg/bio_associate_current callers always submit, but I didn't audit that yet). The other reason is I'd like blk_throtl_bio_endio to be called only once for the whole bio, not for each split bio, so this depends on bio_free to free cgroup info for chained bios, which always do bio_put. > > @@ -294,7 +289,7 @@ void bio_reset(struct bio *bio) > > { > > unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS); > > > > - bio_uninit(bio); > > + bio_disassociate_task(bio); > > .. and this one. And I suspect it would make sense to have all these > changes in a single patch. This one is likely ok, but again I didn't audit yet. I'm OK with these changes being in a single patch. Thanks, Shaohua
Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()
On Thu, Jul 13, 2017 at 05:20:52PM +0800, Ming Lei wrote: > On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote: > > On Thu, Jul 13 2017, Ming Lei wrote: > > > > > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote: > > >> On Wed, Jul 12 2017, Ming Lei wrote: > > >> > > >> > We will support multipage bvec soon, so initialize bvec > > >> > table using the standard way instead of writing the > > >> > table directly. Otherwise it won't work any more once > > >> > multipage bvec is enabled. > > >> > > > >> > Acked-by: Guoqing Jiang > > >> > Signed-off-by: Ming Lei > > >> > --- > > >> > drivers/md/md.c | 21 + > > >> > drivers/md/md.h | 3 +++ > > >> > drivers/md/raid1.c | 16 ++-- > > >> > drivers/md/raid10.c | 4 ++-- > > >> > 4 files changed, 28 insertions(+), 16 deletions(-) > > >> > > > >> > diff --git a/drivers/md/md.c b/drivers/md/md.c > > >> > index 8cdca0296749..cc8dcd928dde 100644 > > >> > --- a/drivers/md/md.c > > >> > +++ b/drivers/md/md.c > > >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr) > > >> > } > > >> > EXPORT_SYMBOL(md_reload_sb); > > >> > > > >> > +/* generally called after bio_reset() for resetting bvec */ > > >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages > > >> > *rp, > > >> > + int size) > > >> > +{ > > >> > + int idx = 0; > > >> > + > > >> > + /* initialize bvec table again */ > > >> > + do { > > >> > + struct page *page = resync_fetch_page(rp, idx); > > >> > + int len = min_t(int, size, PAGE_SIZE); > > >> > + > > >> > + /* > > >> > + * won't fail because the vec table is big > > >> > + * enough to hold all these pages > > >> > + */ > > >> > + bio_add_page(bio, page, len, 0); > > >> > + size -= len; > > >> > + } while (idx++ < RESYNC_PAGES && size > 0); > > >> > +} > > >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages); > > >> > > >> I really don't think this is a good idea. > > >> This code is specific to raid1/raid10. It is not generic md code. So > > >> it doesn't belong here. 
> > > > > > We already added 'struct resync_pages' in drivers/md/md.h, so I think > > > it is reasonable to add this function into drivers/md/md.c > > > > Alternative perspective: it was unreasonable to add "resync_pages" to > > md.h. > > > > > > > >> > > >> If you want to remove code duplication, then work on moving all raid1 > > >> functionality into raid10.c, then discard raid1.c > > > > > > This patch is for avoiding new code duplication, not for removing current > > > duplication. > > > > > >> > > >> Or at the very least, have a separate "raid1-10.c" file for the common > > >> code. > > > > > > You suggested it last time, but it looked like overkill to me. But if > > > someone wants to refactor raid1 and raid10, I think it can be a good > > > start, > > > but that still doesn't belong in this patch. > > > > You are trying to create common code for raid1 and raid10. This does > > not belong in md.c. > > If you really want to have a single copy of common code, then it exactly > > is the role of this patch to create a place to put it. > > I'm not saying you should put all common code in raid1-10.c. Just the > > function that you have identified. > > I really don't want to waste time on this kind of thing, I can do > either one frankly. > > Shaohua, could you tell me which way you'd like this merged? I can do it > either way. I don't have a strong preference, but Neil's suggestion does make the code a little better. Of course, only put this function into raid1-10.c for now. Thanks, Shaohua
Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues
On 07/13/2017 10:30 AM, weiping zhang wrote: > On Thu, Jul 13, 2017 at 10:14:31AM -0600, Jens Axboe wrote: >> On 07/13/2017 10:10 AM, weiping zhang wrote: >>> One hw queue only has one hctx, so reduce the number of allocated hctx pointers to nr_hw_queues. >> >> What happens if a device registers with eg 4 hw queues, then later >> calls blk_mq_update_nr_hw_queues()? >> > > You mean, 10 CPUs, and the device has 10 hw queues but only registers 4, > and then updates to 10, so there is no place to store the new hctxs. Am I > understanding it right? If yes, could you tell me why a driver would do that? Thanks. It doesn't matter why the driver would do that; the fact is that this is a feature we support and it's part of the API. NBD uses it, for instance. Your patch breaks it to save a bit of memory. So I will not include it. -- Jens Axboe
Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues
On Thu, Jul 13, 2017 at 10:14:31AM -0600, Jens Axboe wrote: > On 07/13/2017 10:10 AM, weiping zhang wrote: > > One hw queue only has one hctx, so reduce the number of allocated hctx pointers to nr_hw_queues. > > What happens if a device registers with eg 4 hw queues, then later > calls blk_mq_update_nr_hw_queues()? > You mean, 10 CPUs, and the device has 10 hw queues but only registers 4, and then updates to 10, so there is no place to store the new hctxs. Am I understanding it right? If yes, could you tell me why a driver would do that? Thanks. -- weiping
Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues
On 07/13/2017 10:10 AM, weiping zhang wrote: > One hw queue only has one hctx, so reduce the number of allocated hctx pointers to nr_hw_queues. What happens if a device registers with eg 4 hw queues, then later calls blk_mq_update_nr_hw_queues()? -- Jens Axboe
[PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues
One hw queue only has one hctx, so reduce the number of allocated hctx pointers to nr_hw_queues. Signed-off-by: weiping zhang --- block/blk-mq.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 05dfa3f..8c98f59 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2320,8 +2320,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, /* init q->mq_kobj and sw queues' kobjects */ blk_mq_sysfs_init(q); - q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)), - GFP_KERNEL, set->numa_node); + q->queue_hw_ctx = kzalloc_node(set->nr_hw_queues * + sizeof(*(q->queue_hw_ctx)), GFP_KERNEL, set->numa_node); if (!q->queue_hw_ctx) goto err_percpu; -- 2.9.4
Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
On Thu, Jul 13, 2017 at 02:56:38PM +, Bart Van Assche wrote: > On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote: > > On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote: > > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote: > > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote: > > > > > What happens with fluid congestion boundaries, with shared tags? > > > > > > > > The approach in this patch should work, but the threshold may not > > > > be accurate in this way, one simple method is to use the average > > > > tag weight in EWMA, like this: > > > > > > > > sbitmap_weight() / hctx->tags->active_queues > > > > > > Hello Ming, > > > > > > That approach would result in a severe performance degradation. > > > "active_queues" > > > namely represents the number of queues against which I/O ever has been > > > queued. > > > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 > > > LUNs are > > > responding and if the queue depth would also be 64 then the approach you > > > proposed will reduce the effective queue depth per LUN from 64 to 1. > > > > No, this approach does _not_ reduce the effective queue depth, it only > > stops the queue for a while when the queue is busy enough. > > > > In this case, there may be no congestion, because blk-mq allows at most > > queue_depth/active_queues tags to be assigned to each LUN, please see > > hctx_may_queue(). > > Hello Ming, > > hctx_may_queue() severely limits the queue depth if many LUNs are associated > with the same SCSI host. I think that this is a performance regression > compared to scsi-sq and that this performance regression should be fixed. IMO, it is hard to evaluate/compare perf between scsi-mq and scsi-sq: - how many LUNs do you run IO on concurrently? - do you evaluate the perf on a single LUN or on multiple LUNs? BTW, active_queues is a runtime variable which tracks the number of queues actually in use. -- Ming
Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote: > On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote: > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote: > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote: > > > > What happens with fluid congestion boundaries, with shared tags? > > > > > > The approach in this patch should work, but the threshold may not > > > be accurate in this way, one simple method is to use the average > > > tag weight in EWMA, like this: > > > > > > sbitmap_weight() / hctx->tags->active_queues > > > > Hello Ming, > > > > That approach would result in a severe performance degradation. > > "active_queues" > > namely represents the number of queues against which I/O ever has been > > queued. > > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs > > are > > responding and if the queue depth would also be 64 then the approach you > > proposed will reduce the effective queue depth per LUN from 64 to 1. > > No, this approach does _not_ reduce the effective queue depth, it only > stops the queue for a while when the queue is busy enough. > > In this case, there may be no congestion, because blk-mq allows at most > queue_depth/active_queues tags to be assigned to each LUN, please see > hctx_may_queue(). Hello Ming, hctx_may_queue() severely limits the queue depth if many LUNs are associated with the same SCSI host. I think that this is a performance regression compared to scsi-sq and that this performance regression should be fixed. Bart.
Re: [PATCH] blk-mq-debugfs: add mapping show for hw queue to cpu
On Wed, Jul 12, 2017 at 11:25:12AM -0600, Jens Axboe wrote: > On 07/12/2017 11:13 AM, weiping zhang wrote: > > On Wed, Jul 12, 2017 at 10:57:37AM -0600, Jens Axboe wrote: > >> On 07/12/2017 10:54 AM, weiping zhang wrote: > >>> A mapping show as following: > >>> > >>> hctx cpus > >>> hctx0 0 1 > >>> hctx1 2 > >>> hctx2 3 > >>> hctx3 4 5 > >> > >> We already have that information in the /sys/block//mq/X/cpu_list > >> > >> where X is the hardware queue number. Why do we need it in debugfs as > >> well, presented differently? > > > > this attribute gives a more obvious view, with a single "cat" command. > > /sys/block//mq/X/cpu_list does not easily give the mapping for all hctxs > > with one command. > > also /sys/kernel/debug/block/xxx/hctxN/cpuX can export that info, but > > neither of those two methods is as obvious. > > The point is the information is there, and I'm sure most people can > work out the shell logic to collect this info. I see zero benefit to > adding this to debugfs. > Hi Jens, It seems nobody else wants this patch, please skip it. Thanks for your reply ^_^. -- weiping
Re: [PATCH] nbd: kill unused ret in recv_work
On 07/13/2017 05:20 AM, Kefeng Wang wrote: > No need to return a value from the queue work function, kill the unused ret variable. Applied, thanks. -- Jens Axboe
Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices
On 07/13/2017 04:11 AM, Johannes Thumshirn wrote: > On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote: >> No, that boot option was a mistake, let's not propagate that to mq >> scheduling as well. > > Can you please explain why? I as well have requests from our users to > select the mq schedulers on the command line. Because it sucks as an interface - there's no way to apply different settings to different devices, and using the kernel command line to control something like this is ugly. It never should have been done. The sysfs interface, either manually, scripted, or through udev, makes a lot more sense. -- Jens Axboe
Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices
On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote: > No, that boot option was a mistake, let's not propagate that to mq > scheduling as well. Can you please explain why? I as well have requests from our users to select the mq schedulers on the command line. Thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH] nbd: kill unused ret in recv_work
No need to return a value from the queue work function, kill the unused ret variable. Signed-off-by: Kefeng Wang --- drivers/block/nbd.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index dea7d85..87a0a29 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -626,7 +626,6 @@ static void recv_work(struct work_struct *work) struct nbd_device *nbd = args->nbd; struct nbd_config *config = nbd->config; struct nbd_cmd *cmd; - int ret = 0; while (1) { cmd = nbd_read_stat(nbd, args->index); @@ -636,7 +635,6 @@ static void recv_work(struct work_struct *work) mutex_lock(&nsock->tx_lock); nbd_mark_nsock_dead(nbd, nsock, 1); mutex_unlock(&nsock->tx_lock); - ret = PTR_ERR(cmd); break; } -- 1.8.3.1
Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote: > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote: > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote: > > > What happens with fluid congestion boundaries, with shared tags? > > > > The approach in this patch should work, but the threshold may not > > be accurate in this way, one simple method is to use the average > > tag weight in EWMA, like this: > > > > sbitmap_weight() / hctx->tags->active_queues > > Hello Ming, > > That approach would result in a severe performance degradation. > "active_queues" > namely represents the number of queues against which I/O ever has been queued. > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs > are > responding and if the queue depth would also be 64 then the approach you > proposed will reduce the effective queue depth per LUN from 64 to 1. No, this approach does _not_ reduce the effective queue depth, it only stops the queue for a while when the queue is busy enough. In this case, there may be no congestion, because blk-mq allows at most queue_depth/active_queues tags to be assigned to each LUN, please see hctx_may_queue(). Then get_driver_tag() can return at most one pending tag to the request_queue (LUN). The algorithm in this patch only starts to work when congestion happens; that is, it only runs when BLK_STS_RESOURCE is returned from .queue_rq(). This approach avoids dispatching requests to a busy queue unnecessarily, so that we don't burn CPU needlessly, and request merging improves in the meantime. -- Ming
Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
On Wed, Jul 12, 2017 at 03:12:37PM +, Bart Van Assche wrote: > On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote: > > On Tue, Jul 11, 2017 at 07:57:53PM +, Bart Van Assche wrote: > > > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote: > > > > Now SCSI won't stop queue, and not necessary to use > > > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues() > > > > instead. > > > > > > > > Cc: "James E.J. Bottomley" > > > > Cc: "Martin K. Petersen" > > > > Cc: linux-s...@vger.kernel.org > > > > Signed-off-by: Ming Lei > > > > --- > > > > drivers/scsi/scsi_lib.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > > > > index f6097b89d5d3..91d890356b78 100644 > > > > --- a/drivers/scsi/scsi_lib.c > > > > +++ b/drivers/scsi/scsi_lib.c > > > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev) > > > > static void scsi_kick_queue(struct request_queue *q) > > > > { > > > > if (q->mq_ops) > > > > - blk_mq_start_hw_queues(q); > > > > + blk_mq_run_hw_queues(q, false); > > > > else > > > > blk_run_queue(q); > > > > } > > > > > > Hello Ming, > > > > > > Now that we have separate flags to represent the "stopped" and "quiesced" > > > states, wouldn't it be better not to modify scsi_kick_queue() but instead > > > to > > > stop a SCSI hardware queue again if scsi_queue_rq() returns > > > BLK_STS_RESOURCE? > > > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") > > > and > > > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped > > > queues"). > > > > As you can see in the following patches, all stop/start queue APIs will > > be removed, and the 'stopped' state will become an internal state. > > > > It doesn't make sense to clear 'stopped' for SCSI anymore, since the > > queue won't actually be stopped. So we need this change. 
> > Hello Ming, > > As Jens already noticed, this approach won't work properly for concurrent I/O > to multiple LUNs associated with the same SCSI host. This approach won't work > properly for dm-mpath either. Sorry but I'm not convinced that it's possible We can deal with special cases by passing a flag, so that we don't need to expose the 'stopped' flag to drivers. Again, it has caused lots of trouble. If you check Linus' tree, most of the start/stop queue APIs have been replaced with quiesce/unquiesce. We can remove the others too. > to replace the stop/start queue API for all block drivers by an algorithm > that is based on estimating the queue depth. Anyway, this patch is correct, and it doesn't make sense to clear 'stopped' in SCSI, since SCSI doesn't stop the queue at all even without introducing congestion control, right? Please discuss the congestion control in patches 4 and 5. -- Ming
Re: [PATCH 16/19] bcache: increase the number of open buckets
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote: > From: Tang Junhui > > Currently, we only allocate 6 open buckets for each cache set, > but usually we attach about 10 or so backend devices to > each cache set, and each bcache device is accessed by > about 10 or so threads in the top application layer. So 6 open buckets > are too few; it leads to the same thread writing data > to different buckets, which causes inefficient write-back, > uses buckets inefficiently, and makes them very easy to run > out of. > > I added a debug message in bch_open_buckets_alloc() to print alloc bucket > info, and tested with ten bcache devices on a cache set, with each > bcache device accessed by ten threads. > > From the debug messages, we can see that, after the modification, one > bucket is more likely to be assigned to the same thread, and the data from > the same thread are more likely to be written to the same bucket. Usually the > same thread always writes/reads the same backend device, so this is good > for write-back and also improves the usage efficiency of buckets. > > Signed-off-by: Tang Junhui Nice catch for performance! Reviewed-by: Coly Li Thanks. > --- > drivers/md/bcache/alloc.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c > index ca4abe1..cacbe2d 100644 > --- a/drivers/md/bcache/alloc.c > +++ b/drivers/md/bcache/alloc.c > @@ -68,6 +68,8 @@ > #include > #include > > +#define MAX_OPEN_BUCKETS 128 > + > /* Bucket heap / gen */ > > uint8_t bch_inc_gen(struct cache *ca, struct bucket *b) > @@ -671,7 +673,7 @@ int bch_open_buckets_alloc(struct cache_set *c) > > spin_lock_init(&c->data_bucket_lock); > > - for (i = 0; i < 6; i++) { > + for (i = 0; i < MAX_OPEN_BUCKETS; i++) { > struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL); > if (!b) > return -ENOMEM; > -- Coly Li
Re: [PATCH 18/19] bcache: silence static checker warning
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote: > From: Dan Carpenter > > In olden times, closure_return() used to have a hidden return built in. > We removed the hidden return but forgot to add a new return here. If > "c" were NULL we would oops on the next line, but fortunately "c" is > never NULL. Let's just remove the if statement. > > Signed-off-by: Dan Carpenter > --- > drivers/md/bcache/super.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index 24cb9b7..243391d 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -1381,9 +1381,6 @@ static void cache_set_flush(struct closure *cl) > struct btree *b; > unsigned i; > > - if (!c) > - closure_return(cl); > - > bch_cache_accounting_destroy(&c->accounting); > > kobject_put(&c->internal); > Agree, cache_set_flush() is only called from a continue_at() in __cache_set_unregister(). In that path, c is never NULL. Reviewed-by: Coly Li Thanks. -- Coly Li
Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()
On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote: > On Thu, Jul 13 2017, Ming Lei wrote: > > > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote: > >> On Wed, Jul 12 2017, Ming Lei wrote: > >> > >> > We will support multipage bvec soon, so initialize bvec > >> > table using the standard way instead of writing the > >> > table directly. Otherwise it won't work any more once > >> > multipage bvec is enabled. > >> > > >> > Acked-by: Guoqing Jiang > >> > Signed-off-by: Ming Lei > >> > --- > >> > drivers/md/md.c | 21 + > >> > drivers/md/md.h | 3 +++ > >> > drivers/md/raid1.c | 16 ++-- > >> > drivers/md/raid10.c | 4 ++-- > >> > 4 files changed, 28 insertions(+), 16 deletions(-) > >> > > >> > diff --git a/drivers/md/md.c b/drivers/md/md.c > >> > index 8cdca0296749..cc8dcd928dde 100644 > >> > --- a/drivers/md/md.c > >> > +++ b/drivers/md/md.c > >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr) > >> > } > >> > EXPORT_SYMBOL(md_reload_sb); > >> > > >> > +/* generally called after bio_reset() for resetting bvec */ > >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp, > >> > + int size) > >> > +{ > >> > +int idx = 0; > >> > + > >> > +/* initialize bvec table again */ > >> > +do { > >> > +struct page *page = resync_fetch_page(rp, idx); > >> > +int len = min_t(int, size, PAGE_SIZE); > >> > + > >> > +/* > >> > + * won't fail because the vec table is big > >> > + * enough to hold all these pages > >> > + */ > >> > +bio_add_page(bio, page, len, 0); > >> > +size -= len; > >> > +} while (idx++ < RESYNC_PAGES && size > 0); > >> > +} > >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages); > >> > >> I really don't think this is a good idea. > >> This code is specific to raid1/raid10. It is not generic md code. So > >> it doesn't belong here. 
> > > > We already added 'struct resync_pages' in drivers/md/md.h, so I think > > it is reasonable to add this function into drivers/md/md.c > > Alternative perspective: it was unreasonable to add "resync_pages" to > md.h. > > > > >> > >> If you want to remove code duplication, then work on moving all raid1 > >> functionality into raid10.c, then discard raid1.c > > > > This patch is for avoiding new code duplication, not for removing current > > duplication. > > > >> > >> Or at the very least, have a separate "raid1-10.c" file for the common > >> code. > > > > You suggested it last time, but it looked like overkill to me. But if > > someone wants to refactor raid1 and raid10, I think it can be a good start, > > but that still doesn't belong in this patch. > > You are trying to create common code for raid1 and raid10. This does > not belong in md.c. > If you really want to have a single copy of common code, then it exactly > is the role of this patch to create a place to put it. > I'm not saying you should put all common code in raid1-10.c. Just the > function that you have identified. I really don't want to waste time on this kind of thing, I can do either one frankly. Shaohua, could you tell me which way you'd like this merged? I can do it either way. -- Ming
Re: [PATCH 2/2] block: delete bio_uninit
> static void bio_free(struct bio *bio) > { > struct bio_set *bs = bio->bi_pool; > void *p; > > - bio_uninit(bio); > + bio_disassociate_task(bio); As said in the last mail I think there is no point in having this call.. > @@ -294,7 +289,7 @@ void bio_reset(struct bio *bio) > { > unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS); > > - bio_uninit(bio); > + bio_disassociate_task(bio); .. and this one. And I suspect it would make sense to have all these changes in a single patch.
Re: linux-next scsi-mq hang in suspend-resume
On Wed, Jul 12, 2017 at 10:50:19AM -0600, Jens Axboe wrote: > On 07/12/2017 08:51 AM, Tomi Sarvela wrote: > > Hello there, > > > > I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, > > and couple of weeks ago we took linux-next for a ride to see what kind > > of integration problems there might pop up when pulling 4.13-rc1. > > Latest results can be seen at > > > > https://intel-gfx-ci.01.org/CI/next-issues.html > > https://intel-gfx-ci.01.org/CI/next-all.html > > > > The purple blocks are hangs, starting from 20170628 (20170627 was > > untestable due to locking changes which were reverted). Traces were > > pointing to ext4 but bisecting between good 20170626 and bad 20170628 > > pointed to: > > > > commit 5c279bd9e40624f4ab6e688671026d6005b066fa > > Date: Fri Jun 16 10:27:55 2017 +0200 > > > > scsi: default to scsi-mq > > > > Reproduction is 100% or close to it when running two i-g-t tests as a > > testlist. I'm assuming that it creates the correct amount or pattern > > of actions to the device. The testlist consists of the following > > lines: > > > > igt@gem_exec_gttfill@basic > > igt@gem_exec_suspend@basic-s3 > > > > Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. > > Configuration option was copied over on testhosts and 20170712 was re- > > tested, that's why today looks so much greener. > > > > More information including traces and reproduction instructions at > > https://bugzilla.kernel.org/show_bug.cgi?id=196223 > > > > I can run patchsets through the farm, if needed. In addition, daily > > linux-next tags are automatically tested and results published. > > Christoph, any ideas? Smells like something in SCSI, my notebook > with nvme/blk-mq suspend/resumes just fine. There isn't much mq-specific scsi code, so it's probably an interaction of both. I'll see if the bugzilla has enough data to reproduce it locally. 
Although I really wish people wouldn't use #TY^$Y^$ bugzilla and just post the important data to the list :( > > -- > Jens Axboe > ---end quoted text---