how to enlarge value of max_sectors_kb
From: Tang Junhui

There is a machine with a very small max_sectors_kb size:

[root@ceph151 queue]# pwd
/sys/block/sdd/queue
[root@ceph151 queue]# cat max_hw_sectors_kb
256
[root@ceph151 queue]# cat max_sectors_kb
256

The performance is very low when I run big I/Os. I cannot raise it directly through sysfs; I only need it a little bigger, so how can I enlarge it?

The system info:

[root@ceph151 queue]# uname -r
3.10.0-327.22.2.el7.x86_64
[root@ceph151 queue]# lsmod | grep scsi
scsi_transport_fc      64056  1 lpfc
scsi_tgt               20027  1 scsi_transport_fc
[root@ceph151 queue]# lsscsi
[0:0:0:0]    disk    ASR7805  system            V1.0  /dev/sda
[0:1:8:0]    disk    ATA      Hitachi HUA72201  JP4O  /dev/sdb
[0:1:9:0]    disk    ATA      Hitachi HUA72201  JP4O  /dev/sdc
[0:1:10:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdd
[0:1:11:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sde
[0:1:12:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdf
[0:1:13:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdg
[0:1:14:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdh
[0:1:15:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdi
[0:1:16:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdj
[0:1:17:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdk
[0:1:18:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdl
[0:1:19:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdm
[0:1:20:0]   disk    TOSHIBA  AL13SEB600        0101  -
[0:1:21:0]   disk    TOSHIBA  AL13SEB600        0101  -
[0:3:0:0]    enclosu ZTE CORP 720381 EvBd 255   1     -

Any comment would be welcome.

Thanks,
Tang Junhui
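A note for readers hitting the same wall: max_sectors_kb is writable through sysfs, but the kernel clamps it to max_hw_sectors_kb, which is set by the HBA driver. If the hardware limit itself is 256 (as above), raising it needs a driver or firmware change rather than a sysfs write. A minimal sketch of the clamping involved, with the sysfs reads/writes shown only as comments (the device name and target size are illustrative, not from the original mail):

```shell
# Illustrative sketch only: clamp a desired max_sectors_kb to the hw limit.
# On a real system you would read the limit with
#   cat /sys/block/sdd/queue/max_hw_sectors_kb
# and apply the clamped result with
#   echo $new > /sys/block/sdd/queue/max_sectors_kb
hw=256        # example: the max_hw_sectors_kb reported in this thread
want=1024     # example: desired I/O size in KiB
new=$(( want < hw ? want : hw ))
echo "max_sectors_kb would be set to $new"
```

Since a request can never exceed max_hw_sectors_kb, on this machine the echo alone cannot go past 256 KiB; the limit most likely comes from the HBA driver (lpfc is loaded here), so that driver is the place to look next.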
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 8:00pm -0500, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > It was 50 ms before it was 100 ms.  No real explanation for these
> > values other than they seem to make Bart's IB SRP testbed happy?
>
> But that constant was not introduced by me in the dm code.

No, actually it was (not that there's anything wrong with that):

commit 06eb061f48594aa369f6e852b352410298b317a8
Author: Bart Van Assche
Date:   Fri Apr 7 16:50:44 2017 -0700

    dm mpath: requeue after a small delay if blk_get_request() fails

    If blk_get_request() returns ENODEV then multipath_clone_and_map()
    causes a request to be requeued immediately. This can cause a kworker
    thread to spend 100% of the CPU time of a single core in
    __blk_mq_run_hw_queue() and also can cause device removal to never
    finish.

    Avoid this by only requeuing after a delay if blk_get_request() fails.
    Additionally, reduce the requeue delay.

    Cc: sta...@vger.kernel.org # 4.9+
    Signed-off-by: Bart Van Assche
    Signed-off-by: Mike Snitzer

Note that this commit actually details a different case where a blk_get_request() (in existing code) return of -ENODEV is a very compelling case to use DM_MAPIO_DELAY_REQUEUE. So I'll revisit what is appropriate in multipath_clone_and_map() on Monday.

I need a break... have a good weekend Bart.
Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
On Fri, Jan 12, 2018 at 07:04:28PM +, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86bf502a8e51..fcddf5a62581 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> >  		if (queue_dying) {
> >  			atomic_inc(&m->pg_init_in_progress);
> >  			activate_or_offline_path(pgpath);
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  		}
> > -		return DM_MAPIO_DELAY_REQUEUE;
> > +
> > +		/*
> > +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> > +		 * we needn't to deal with it by DELAY_REQUEUE. More
> > +		 * importantly, we have to return DM_MAPIO_REQUEUE
> > +		 * so that blk-mq can get the queue busy feedback,
> > +		 * otherwise I/O merge can be hurt.
> > +		 */
> > +		if (q->mq_ops)
> > +			return DM_MAPIO_REQUEUE;
> > +		else
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  	}
>
> Sorry but the approach of this patch looks wrong to me. I'm afraid that this
> approach will cause 100% CPU consumption if the underlying .queue_rq()
> function returns BLK_STS_RESOURCE for another reason than what can be detected
> by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
> implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
> $ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
> 204
>
> Isn't this a severe regression?

No, it is not. This DM_MAPIO_REQUEUE will be converted to BLK_STS_RESOURCE in dm-rq, then it is returned to blk-mq. That means that when we run out of resources for dispatching the underlying request, we tell blk-mq the truth. This approach has been used by other blk-mq drivers, such as NVMe; more importantly, we have to provide this real feedback to blk-mq to fix the performance issue.

The blk_get_request(GFP_ATOMIC) is just like a kmalloc(ATOMIC), and memory is shared system-wide too.
You can see there are lots of kmalloc(ATOMIC) calls in the NVMe I/O path; when one fails, BLK_STS_RESOURCE is returned to blk-mq. blk-mq has a built-in SCHED_RESTART mechanism to handle running out of resources when BLK_STS_RESOURCE is observed.

We only implement .get_budget() in SCSI, and other blk-mq drivers (NVMe, ...) often return BLK_STS_RESOURCE to blk-mq too, so wrt. this usage there isn't any difference between dm-mpath and the other cases, right?

Thanks,
Ming
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> It was 50 ms before it was 100 ms.  No real explanation for these
> values other than they seem to make Bart's IB SRP testbed happy?

But that constant was not introduced by me in the dm code. See e.g. the following commits:

commit d548b34b062b60b4f4df295a0b4823dfda1f1fc4
Author: Mike Snitzer
Date:   Thu Mar 5 22:21:10 2015 -0500

    dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms

    Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
    DM's use of a 100ms delay; such an extended delay is a liability when
    there is reason to re-kick the queue.

    Signed-off-by: Mike Snitzer

commit 7eaceaccab5f40bbfda044629a6298616aeaed50
Author: Jens Axboe
Date:   Thu Mar 10 08:52:07 2011 +0100

    block: remove per-queue plugging

    Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

Bart.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 2:53pm -0500, Elliott, Robert (Persistent Memory) wrote:
> > From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> > ow...@vger.kernel.org] On Behalf Of Bart Van Assche
> ...
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case. It may be possible to move the delayed queue rerun from the
> > dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> > wrong to rerun the queue immediately in case a SCSI target system returns
> > "BUSY".
>
> Seeing SCSI BUSY mentioned here...
>
> On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
> magic value of 100 ms without explanation in the patch description or
> in the added code.

It was 50 ms before it was 100 ms. No real explanation for these values other than they seem to make Bart's IB SRP testbed happy?

> But, dm_requeue_original_request() already seems to have chosen that value
> for similar purposes:
> 	unsigned long delay_ms = delay_requeue ? 100 : 0;
>
> So, making them share a #define would indicate there's a reason for that
> particular value. Any change to the value would be picked up everywhere.
>
> All the other callers of blk_mq_delay_run_hw_queue() use macros:
> drivers/md/dm-rq.c:      blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> drivers/nvme/host/fc.c:  blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c: blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c: blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Sure, I'll add a #define.

> Those both happen to be set to 3 (ms).

As for the value of 100: we're dealing with path faults, so retrying them extremely fast could be wasted effort. But there is obviously no one size fits all. As storage gets faster, 100 ms could prove to be very bad. But without tests to verify a change, there won't be one.

Thanks,
Mike
Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
Hello all,

We just noticed that discards to DRBD devices backed by dm-thin devices are fully allocating the thin blocks. This behavior did not exist before:

  ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout

The problem exists somewhere between
  [working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
and
  [broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES

Note that c20cfc27 works as expected, but 45c21793 discards blocks being zeroed on the dm-thin backing device. All commits between those two produce the following error:

  blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error

Also note that issuing a blkdiscard to the backing device directly discards as you would expect; this is only a problem when sending discards through DRBD. Is there an easy way to solve this in the short term, even if the ultimate fix is more involved?

Thank you for your help!

-Eric

--
Eric Wheeler

On Wed, 5 Apr 2017, Christoph Hellwig wrote:
> drbd always wants its discard wire operations to zero the blocks, so
> use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> reinventing it poorly.
>
> Signed-off-by: Christoph Hellwig
> Reviewed-by: Hannes Reinecke
> ---
>  drivers/block/drbd/drbd_debugfs.c  |   3 --
>  drivers/block/drbd/drbd_int.h      |   6 ---
>  drivers/block/drbd/drbd_receiver.c | 102 ++---
>  drivers/block/drbd/drbd_req.c      |   6 +--
>  4 files changed, 7 insertions(+), 110 deletions(-)
>
> diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
> index de5c3ee8a790..494837e59f23 100644
> --- a/drivers/block/drbd/drbd_debugfs.c
> +++ b/drivers/block/drbd/drbd_debugfs.c
> @@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
>  	seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, , "in-AL");
>  	seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, , "C");
>  	seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, , "set-in-sync");
> -
> -	if (f & EE_IS_TRIM)
> -		__seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, , "zero-out", "trim");
>  	seq_print_rq_state_bit(m, f & EE_WRITE_SAME, , "write-same");
>  	seq_putc(m, '\n');
>  }
> diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
> index 724d1c50fc52..d5da45bb03a6 100644
> --- a/drivers/block/drbd/drbd_int.h
> +++ b/drivers/block/drbd/drbd_int.h
> @@ -437,9 +437,6 @@ enum {
>
>  	/* is this a TRIM aka REQ_DISCARD? */
>  	__EE_IS_TRIM,
> -	/* our lower level cannot handle trim,
> -	 * and we want to fall back to zeroout instead */
> -	__EE_IS_TRIM_USE_ZEROOUT,
>
>  	/* In case a barrier failed,
>  	 * we need to resubmit without the barrier flag.
>  	 */
> @@ -482,7 +479,6 @@ enum {
>  #define EE_CALL_AL_COMPLETE_IO	(1<<__EE_CALL_AL_COMPLETE_IO)
>  #define EE_MAY_SET_IN_SYNC	(1<<__EE_MAY_SET_IN_SYNC)
>  #define EE_IS_TRIM		(1<<__EE_IS_TRIM)
> -#define EE_IS_TRIM_USE_ZEROOUT	(1<<__EE_IS_TRIM_USE_ZEROOUT)
>  #define EE_RESUBMITTED		(1<<__EE_RESUBMITTED)
>  #define EE_WAS_ERROR		(1<<__EE_WAS_ERROR)
>  #define EE_HAS_DIGEST		(1<<__EE_HAS_DIGEST)
> @@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
>  extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
>
>  /* drbd_receiver.c */
> -extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
> -	sector_t start, unsigned int nr_sectors, bool discard);
>  extern int drbd_receiver(struct drbd_thread *thi);
>  extern int drbd_ack_receiver(struct drbd_thread *thi);
>  extern void drbd_send_ping_wf(struct work_struct *ws);
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index dc9a6dcd431c..bc1d296581f9 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
>  	drbd_info(resource, "Method to ensure write ordering: %s\n",
>  		  write_ordering_str[resource->write_ordering]);
>  }
>
> -/*
> - * We *may* ignore the discard-zeroes-data setting, if so configured.
> - *
> - * Assumption is that it "discard_zeroes_data=0" is only because the backend
> - * may ignore partial unaligned discards.
> - *
> - * LVM/DM thin as of at least
> - * LVM version:     2.02.115(2)-RHEL7 (2015-01-28)
> - * Library version: 1.02.93-RHEL7 (2015-01-28)
> - * Driver version:  4.29.0
> - * still behaves this way.
> - *
> - * For unaligned (wrt. alignment and granularity) or too small discards,
> - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> - * but discard all the aligned full chunks.
> - *
> - * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
> - */
> -int
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 6:42pm -0500, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> > @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >  	if (error && blk_path_error(error)) {
> >  		struct multipath *m = ti->private;
> >
> > -		r = DM_ENDIO_REQUEUE;
> > +		if (r == BLK_STS_RESOURCE)
> > +			r = DM_ENDIO_DELAY_REQUEUE;
> > +		else
> > +			r = DM_ENDIO_REQUEUE;
>
> Did you perhaps intend "error == BLK_STS_RESOURCE"?

Yes, it was a quick patch to get your thoughts.

> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 9ba8453..da83f64 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_ENDIO_DONE		0
> >  #define DM_ENDIO_INCOMPLETE	1
> >  #define DM_ENDIO_REQUEUE	2
> > +#define DM_ENDIO_DELAY_REQUEUE	3
> >
> >  /*
> >   * Definitions of return values from target map function.
> >   */
> > @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_MAPIO_SUBMITTED	0
> >  #define DM_MAPIO_REMAPPED	1
> >  #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
> > -#define DM_MAPIO_DELAY_REQUEUE	3
> > +#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
> >  #define DM_MAPIO_KILL		4
> >
> >  #define dm_sector_div64(x, y)( \
>
> Please consider introducing enumeration types for the DM_ENDIO_* and the
> DM_MAPIO_* constants such that the compiler can catch what I reported above.

OK, point taken.

> Otherwise this patch looks fine to me.

Cool, thanks.
Mike
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>  	if (error && blk_path_error(error)) {
>  		struct multipath *m = ti->private;
>
> -		r = DM_ENDIO_REQUEUE;
> +		if (r == BLK_STS_RESOURCE)
> +			r = DM_ENDIO_DELAY_REQUEUE;
> +		else
> +			r = DM_ENDIO_REQUEUE;

Did you perhaps intend "error == BLK_STS_RESOURCE"?

> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 9ba8453..da83f64 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_ENDIO_DONE		0
>  #define DM_ENDIO_INCOMPLETE	1
>  #define DM_ENDIO_REQUEUE	2
> +#define DM_ENDIO_DELAY_REQUEUE	3
>
>  /*
>   * Definitions of return values from target map function.
>   */
> @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_MAPIO_SUBMITTED	0
>  #define DM_MAPIO_REMAPPED	1
>  #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
> -#define DM_MAPIO_DELAY_REQUEUE	3
> +#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
>  #define DM_MAPIO_KILL		4
>
>  #define dm_sector_div64(x, y)( \

Please consider introducing enumeration types for the DM_ENDIO_* and the DM_MAPIO_* constants such that the compiler can catch what I reported above. Otherwise this patch looks fine to me.

Thanks,
Bart.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 1:54pm -0500, Bart Van Assche wrote:
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

This is my 3rd reply to this email.. 3rd time's the charm? ;)

Here is a patch that will kick the hw queues via blk_mq_requeue_work(), indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay if BLK_STS_RESOURCE is returned from the target.

Your thoughts on this patch as an alternative to commit 6077c2d7060?

Thanks,
Mike

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index d8c7259..ab2cfdc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		r = DM_ENDIO_REQUEUE;
+		if (r == BLK_STS_RESOURCE)
+			r = DM_ENDIO_DELAY_REQUEUE;
+		else
+			r = DM_ENDIO_REQUEUE;
 
 		if (pgpath)
 			fail_path(pgpath);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6b01298..ab0ed2d 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
 		/* The target wants to requeue the I/O */
 		dm_requeue_original_request(tio, false);
 		break;
+	case DM_ENDIO_DELAY_REQUEUE:
+		/* The target wants to requeue the I/O after a delay */
+		dm_requeue_original_request(tio, true);
+		break;
 	default:
 		DMWARN("unimplemented target endio return value: %d", r);
 		BUG();
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 9ba8453..da83f64 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_ENDIO_DONE		0
 #define DM_ENDIO_INCOMPLETE	1
 #define DM_ENDIO_REQUEUE	2
+#define DM_ENDIO_DELAY_REQUEUE	3
 
 /*
  * Definitions of return values from target map function.
  */
@@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_MAPIO_SUBMITTED	0
 #define DM_MAPIO_REMAPPED	1
 #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE	3
+#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL		4
 
 #define dm_sector_div64(x, y)( \
Re: [PATCH v2 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device
On Thu, 2018-01-11 at 10:23 +0800, Ming Lei wrote:
> > not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().
>
> Given it is error handling, do we need to prevent the .queuecommand() call
> in scsi_send_eh_cmnd()? Could you share with us what the actual observed
> issue is from the user's view?

Please have a look at the kernel bug report in the description of patch 4/4 of this series.

> > Hence surround the .queuecommand() call from the SCSI error handler with
> > blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced().
> >
> > Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
> > code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
> > option since scsi_send_eh_cmnd() can be called if all requests are
> > allocated and if no requests will make progress without aborting any
> > of these requests.
>
> If we need to prevent the .queuecommand() in scsi_send_eh_cmnd(), what
> do you think of the approach of requeuing the EH command via
> scsi_mq_requeue_cmd()/scsi_requeue_cmd()? The EH request would then be
> dispatched once the queue becomes unquiesced or the STOPPED flag is
> cleared.

Calling scsi_mq_requeue_cmd() would trigger scsi_mq_uninit_cmd(), blk_mq_put_driver_tag(), blk_mq_get_driver_tag() and scsi_mq_prep_fn(). That could cause scsi_eh_scmd_add() to be called multiple times for the same SCSI command. However, that would break one of the assumptions scsi_eh_scmd_add() is based on, namely that that function gets called only once per SCSI command that is in progress.

Bart.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 1:54pm -0500, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explanation of the several.
>
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request to be executed has been cleared. For the dm-mpath
> driver, examples of such conditions are: no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenarios a delayed queue run should
> not be triggered. For other scenarios, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

The thing is, the 2 scenarios you are most concerned about have _nothing_ to do with dm_mq_queue_rq() at all. They occur as an error in the IO path _after_ the request is successfully retrieved with blk_get_request() and then submitted.

> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case.

So it really makes me question why you think commit 6077c2d706097c0 addresses the issue you think it does. And it gives me more conviction to remove 6077c2d706097c0. It may help just by virtue of blindly kicking the queue if blk_get_request() fails (regardless of whether the target is responding with BUSY or not). Very unsatisfying, to say the least.

I think it'd be much more beneficial for dm-mpath.c:multipath_end_io() to be trained to respond more intelligently to BLK_STS_RESOURCE. E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding on the path. This is one case where Ming said the queue would be re-run, as detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=5b18cff4baedde77e0d69bd62a13ae78f9488d89

And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means of kicking the queue more efficiently. _BUT_ I'm not seeing any external blk-mq interface that exposes this capability to a blk-mq driver. As is, BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq (blk_mq_sched_mark_restart_hctx). So I have to do more homework here...

Ming or Jens: might you be able to shed some light on how dm-mpath would/could set BLK_MQ_S_SCHED_RESTART? A new function added that can be called from q->softirq_done_fn in response to BLK_STS_RESOURCE? Or is this level of control (and blk-mq detail) something that a blk-mq driver should need to care about?

Anyway, we may need to get to that level of blk-mq-driver-controlled retry optimization. But Bart: again, this is a vastly different feedback loop than the stabbing in the dark your commit 6077c2d706097c0 is doing.

Mike
Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started
On Fri, 2018-01-12 at 15:05 -0700, Jens Axboe wrote:
> On 1/12/18 3:00 PM, Bart Van Assche wrote:
> > On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote:
> > > On 1/12/18 2:52 PM, Bart Van Assche wrote:
> > > > When debugging e.g. the SCSI timeout handler it is important that
> > > > requests that have not yet been started or that already have
> > > > completed are also reported through debugfs.
> > > >
> > > > This patch depends on a patch that went upstream recently, namely
> > > > commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
> > > > dereference").
> > >
> > > Why don't we just kill the check, and dump any request that has a
> > > matching hctx? We already know the bit was set, so just print
> > > all of them.
> >
> > It is very helpful during debugging that requests owned by a block driver
> > and requests owned by the block layer core show up in different debugfs
> > files. Removing the check completely would cause all requests to show up
> > in the same debugfs file and would make interpreting the contents of
> > these debugfs files much harder.
>
> Yeah, we'd have to make it just one file at that point. I'm not hugely
> against the queuelist check, but probably warrants a comment as it's not
> immediately clear (as opposed to the idle check, or the previous START
> bit check).

How about the below?

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 19db3f583bf1..da859dac442b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -394,6 +394,12 @@ struct show_busy_params {
 };
 
 /*
+ * Show "busy" requests - these are the requests owned by the block driver.
+ * The test list_empty(&rq->queuelist) is used to figure out whether or not
+ * a request is owned by the block driver. That test works because the block
+ * layer core uses list_del_init() consistently to remove a request from one
+ * of the request lists.
+ *
  * Note: the state of a request may change while this function is in progress,
  * e.g. due to a concurrent blk_mq_finish_request() call.
  */
@@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;
 
 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+	    list_empty(&rq->queuelist))
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }

Thanks,
Bart.
Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started
On 1/12/18 3:00 PM, Bart Van Assche wrote: > On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote: >> On 1/12/18 2:52 PM, Bart Van Assche wrote: >>> When debugging e.g. the SCSI timeout handler it is important that >>> requests that have not yet been started or that already have >>> completed are also reported through debugfs. >>> >>> This patch depends on a patch that went upstream recently, namely >>> commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer >>> dereference"). >> >> Why don't we just kill the check, and dump any request that has a >> matching hctx? We already know the bit was set, so just print >> all of them. > > It is very helpful during debugging that requests owned by a block driver and > requests owned by the block layer core show up in different debugfs files. > Removing the check completely would cause all requests to show up in the same > debugfs file and would make interpreting the contents of these debugfs files > much harder. Yeah, we'd have to make it just one file at that point. I'm not hugely against the queuelist check, but probably warrants a comment as it's not immediately clear (as opposed to the idle check, or the previous START bit check). -- Jens Axboe
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/12/18 2:19 PM, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
>> You're really not making it easy for folks to run this :-)
>
> My hope is that the ib_srp and ib_srpt patches will be accepted upstream
> soon. As long as these are not upstream, anyone who wants to retrieve
> these patches is welcome to clone
> https://github.com/bvanassche/linux/tree/block-scsi-for-next,
> a kernel tree with all my pending patches.
>
>> Do you have the matching blk-mq debugfs output for the device?
>
> Sorry but I only retrieved the blk-mq debugfs output several minutes after
> the hang started so I'm not sure the state information is relevant.
> Anyway, I have attached it to this e-mail. The most remarkable part is
> the following:
>
> ./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118,
> abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT,
> complete=0, .tag=-1, .internal_tag=217}

Two things come to mind here:

1) We forgot to add RQF_STARTED to the debugfs bits, I just rectified that.

2) You are using a scheduler (which one?). The request was inserted, and
retrieved by the driver, then requeued. After this requeue, apparently
nothing happened. The queue should have been re-run, but that didn't
happen. What are the queue/hctx states?

-- 
Jens Axboe
Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started
On 1/12/18 2:52 PM, Bart Van Assche wrote: > When debugging e.g. the SCSI timeout handler it is important that > requests that have not yet been started or that already have > completed are also reported through debugfs. > > This patch depends on a patch that went upstream recently, namely > commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer > dereference"). Why don't we just kill the check, and dump any request that has a matching hctx? We already know the bit was set, so just print all of them. -- Jens Axboe
[PATCH] blk-mq-debugfs: Also show requests that have not yet been started
When debugging e.g. the SCSI timeout handler it is important that requests that have not yet been started or that already have completed are also reported through debugfs.

This patch depends on a patch that went upstream recently, namely commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer dereference").

Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Martin K. Petersen
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 19db3f583bf1..ccd5ef08c3f0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -402,7 +402,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 	const struct show_busy_params *params = data;
 
 	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+	    list_empty(&rq->queuelist))
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.1
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
> You're really not making it easy for folks to run this :-)

My hope is that the ib_srp and ib_srpt patches will be accepted upstream soon. As long as these are not upstream, anyone who wants to retrieve these patches is welcome to clone https://github.com/bvanassche/linux/tree/block-scsi-for-next, a kernel tree with all my pending patches.

> Do you have the matching blk-mq debugfs output for the device?

Sorry but I only retrieved the blk-mq debugfs output several minutes after the hang started so I'm not sure the state information is relevant. Anyway, I have attached it to this e-mail. The most remarkable part is the following:

./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, .tag=-1, .internal_tag=217}

The hexadecimal number at the start is the request_queue pointer (I modified the blk-mq-debugfs code such that queues are registered with their address just after creation and until a name is assigned). This is a dm-mpath queue.
Bart../9ddfa913/hctx0/cpu3/completed:29 0 ./9ddfa913/hctx0/cpu3/merged:0 ./9ddfa913/hctx0/cpu3/dispatched:30 0 ./9ddfa913/hctx0/cpu2/completed:0 0 ./9ddfa913/hctx0/cpu2/merged:0 ./9ddfa913/hctx0/cpu2/dispatched:0 0 ./9ddfa913/hctx0/cpu1/completed:0 0 ./9ddfa913/hctx0/cpu1/merged:0 ./9ddfa913/hctx0/cpu1/dispatched:0 0 ./9ddfa913/hctx0/cpu0/completed:0 0 ./9ddfa913/hctx0/cpu0/merged:0 ./9ddfa913/hctx0/cpu0/dispatched:0 0 ./9ddfa913/hctx0/active:0 ./9ddfa913/hctx0/run:92 ./9ddfa913/hctx0/queued:30 ./9ddfa913/hctx0/dispatched: 00 ./9ddfa913/hctx0/dispatched: 197 ./9ddfa913/hctx0/dispatched: 20 ./9ddfa913/hctx0/dispatched: 40 ./9ddfa913/hctx0/dispatched: 80 ./9ddfa913/hctx0/dispatched: 160 ./9ddfa913/hctx0/dispatched: 32+ 0 ./9ddfa913/hctx0/io_poll:considered=0 ./9ddfa913/hctx0/io_poll:invoked=0 ./9ddfa913/hctx0/io_poll:success=0 ./9ddfa913/hctx0/sched_tags_bitmap:: ./9ddfa913/hctx0/sched_tags_bitmap:0010: 0002 ./9ddfa913/hctx0/sched_tags:nr_tags=256 ./9ddfa913/hctx0/sched_tags:nr_reserved_tags=0 ./9ddfa913/hctx0/sched_tags:active_queues=0 ./9ddfa913/hctx0/sched_tags: ./9ddfa913/hctx0/sched_tags:bitmap_tags: ./9ddfa913/hctx0/sched_tags:depth=256 ./9ddfa913/hctx0/sched_tags:busy=1 ./9ddfa913/hctx0/sched_tags:bits_per_word=64 ./9ddfa913/hctx0/sched_tags:map_nr=4 ./9ddfa913/hctx0/sched_tags:alloc_hint={45, 56, 144, 218} ./9ddfa913/hctx0/sched_tags:wake_batch=8 ./9ddfa913/hctx0/sched_tags:wake_index=0 ./9ddfa913/hctx0/sched_tags:ws={ ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive}, ./9ddfa913/hctx0/sched_tags:} ./9ddfa913/hctx0/sched_tags:round_robin=0 
./9ddfa913/hctx0/tags_bitmap:: ./9ddfa913/hctx0/tags_bitmap:0010: ./9ddfa913/hctx0/tags_bitmap:0020: ./9ddfa913/hctx0/tags_bitmap:0030: ./9ddfa913/hctx0/tags_bitmap:0040: ./9ddfa913/hctx0/tags_bitmap:0050: ./9ddfa913/hctx0/tags_bitmap:0060: ./9ddfa913/hctx0/tags_bitmap:0070: ./9ddfa913/hctx0/tags_bitmap:0080: ./9ddfa913/hctx0/tags_bitmap:0090: ./9ddfa913/hctx0/tags_bitmap:00a0: ./9ddfa913/hctx0/tags_bitmap:00b0: ./9ddfa913/hctx0/tags_bitmap:00c0: ./9ddfa913/hctx0/tags_bitmap:00d0: ./9ddfa913/hctx0/tags_bitmap:00e0:
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/12/18 1:57 PM, Bart Van Assche wrote: > On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote: >> Currently, blk-mq timeout path synchronizes against the usual >> issue/completion path using a complex scheme involving atomic >> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence >> rules. Unfortunatley, it contains quite a few holes. > > Hello Tejun, > > With this patch series applied I see weird hangs in blk_mq_get_tag() when I > run the srp-test software. If I pull Jens' latest for-next branch and revert > this patch series then the srp-test software runs successfully. Note: if you > don't have InfiniBand hardware available then you will need the RDMA/CM > patches for the SRP initiator and target drivers that have been posted > recently on the linux-rdma mailing list to run the srp-test software. You're really not making it easy for folks to run this :-) > This is how I run the srp-test software in a VM: > > ./run_tests -c -d -r 10 > > Here is an example of what SysRq-w reported when the hang occurred: > > sysrq: SysRq : Show Blocked State > taskPC stack pid father > kworker/u8:0D12864 5 2 0x8000 > Workqueue: events_unbound sd_probe_async [sd_mod] > Call Trace: > ? __schedule+0x2b4/0xbb0 > schedule+0x2d/0x90 > io_schedule+0xd/0x30 > blk_mq_get_tag+0x169/0x290 > ? finish_wait+0x80/0x80 > blk_mq_get_request+0x16a/0x4f0 > blk_mq_alloc_request+0x59/0xc0 > blk_get_request_flags+0x3f/0x260 > scsi_execute+0x33/0x1e0 [scsi_mod] > read_capacity_16.part.35+0x9c/0x460 [sd_mod] > sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod] > sd_probe_async+0xf2/0x1a0 [sd_mod] > process_one_work+0x21c/0x6d0 > worker_thread+0x35/0x380 > ? process_one_work+0x6d0/0x6d0 > kthread+0x117/0x130 > ? kthread_create_worker_on_cpu+0x40/0x40 > ret_from_fork+0x24/0x30 > systemd-udevd D13672 1048285 0x0100 > Call Trace: > ? __schedule+0x2b4/0xbb0 > schedule+0x2d/0x90 > io_schedule+0xd/0x30 > generic_file_read_iter+0x32f/0x970 > ? 
page_cache_tree_insert+0x100/0x100 > __vfs_read+0xcc/0x120 > vfs_read+0x96/0x140 > SyS_read+0x40/0xa0 > do_syscall_64+0x5f/0x1b0 > entry_SYSCALL64_slow_path+0x25/0x25 Do you have the matching blk-mq debugfs output for the device? -- Jens Axboe
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote: > Currently, blk-mq timeout path synchronizes against the usual > issue/completion path using a complex scheme involving atomic > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence > rules. Unfortunatley, it contains quite a few holes. Hello Tejun, With this patch series applied I see weird hangs in blk_mq_get_tag() when I run the srp-test software. If I pull Jens' latest for-next branch and revert this patch series then the srp-test software runs successfully. Note: if you don't have InfiniBand hardware available then you will need the RDMA/CM patches for the SRP initiator and target drivers that have been posted recently on the linux-rdma mailing list to run the srp-test software. This is how I run the srp-test software in a VM: ./run_tests -c -d -r 10 Here is an example of what SysRq-w reported when the hang occurred: sysrq: SysRq : Show Blocked State taskPC stack pid father kworker/u8:0D12864 5 2 0x8000 Workqueue: events_unbound sd_probe_async [sd_mod] Call Trace: ? __schedule+0x2b4/0xbb0 schedule+0x2d/0x90 io_schedule+0xd/0x30 blk_mq_get_tag+0x169/0x290 ? finish_wait+0x80/0x80 blk_mq_get_request+0x16a/0x4f0 blk_mq_alloc_request+0x59/0xc0 blk_get_request_flags+0x3f/0x260 scsi_execute+0x33/0x1e0 [scsi_mod] read_capacity_16.part.35+0x9c/0x460 [sd_mod] sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod] sd_probe_async+0xf2/0x1a0 [sd_mod] process_one_work+0x21c/0x6d0 worker_thread+0x35/0x380 ? process_one_work+0x6d0/0x6d0 kthread+0x117/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x24/0x30 systemd-udevd D13672 1048285 0x0100 Call Trace: ? __schedule+0x2b4/0xbb0 schedule+0x2d/0x90 io_schedule+0xd/0x30 generic_file_read_iter+0x32f/0x970 ? 
page_cache_tree_insert+0x100/0x100 __vfs_read+0xcc/0x120 vfs_read+0x96/0x140 SyS_read+0x40/0xa0 do_syscall_64+0x5f/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f8ce6d08d11 RSP: 002b:7fff96dec288 EFLAGS: 0246 ORIG_RAX: RAX: ffda RBX: 5651de7f6e10 RCX: 7f8ce6d08d11 RDX: 0040 RSI: 5651de7f6e38 RDI: 0007 RBP: 5651de7ea500 R08: 7f8ce6cf1c20 R09: 5651de7f6e10 R10: 006f R11: 0246 R12: 01ff R13: 01ff0040 R14: 5651de7ea550 R15: 0040 systemd-udevd D13496 1049285 0x0100 Call Trace: ? __schedule+0x2b4/0xbb0 schedule+0x2d/0x90 io_schedule+0xd/0x30 blk_mq_get_tag+0x169/0x290 ? finish_wait+0x80/0x80 blk_mq_get_request+0x16a/0x4f0 blk_mq_make_request+0x105/0x8e0 ? generic_make_request+0xd6/0x3d0 generic_make_request+0x103/0x3d0 ? submit_bio+0x57/0x110 submit_bio+0x57/0x110 mpage_readpages+0x13b/0x160 ? I_BDEV+0x10/0x10 ? rcu_read_lock_sched_held+0x66/0x70 ? __alloc_pages_nodemask+0x2e8/0x360 __do_page_cache_readahead+0x2a4/0x370 ? force_page_cache_readahead+0xaf/0x110 force_page_cache_readahead+0xaf/0x110 generic_file_read_iter+0x743/0x970 ? find_held_lock+0x2d/0x90 ? _raw_spin_unlock+0x29/0x40 __vfs_read+0xcc/0x120 vfs_read+0x96/0x140 SyS_read+0x40/0xa0 do_syscall_64+0x5f/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f8ce6d08d11 RSP: 002b:7fff96dec8b8 EFLAGS: 0246 ORIG_RAX: RAX: ffda RBX: 7f8ce7085010 RCX: 7f8ce6d08d11 RDX: 0004 RSI: 7f8ce7085038 RDI: 000f RBP: 5651de7ec840 R08: R09: 7f8ce7085010 R10: 7f8ce7085028 R11: 0246 R12: R13: 0004 R14: 5651de7ec890 R15: 0004 systemd-udevd D13672 1055285 0x0100 Call Trace: ? __schedule+0x2b4/0xbb0 schedule+0x2d/0x90 io_schedule+0xd/0x30 blk_mq_get_tag+0x169/0x290 ? finish_wait+0x80/0x80 blk_mq_get_request+0x16a/0x4f0 blk_mq_make_request+0x105/0x8e0 ? generic_make_request+0xd6/0x3d0 generic_make_request+0x103/0x3d0 ? submit_bio+0x57/0x110 submit_bio+0x57/0x110 mpage_readpages+0x13b/0x160 ? I_BDEV+0x10/0x10 ? rcu_read_lock_sched_held+0x66/0x70 ? __alloc_pages_nodemask+0x2e8/0x360 __do_page_cache_readahead+0x2a4/0x370 ? 
force_page_cache_readahead+0xaf/0x110 force_page_cache_readahead+0xaf/0x110 generic_file_read_iter+0x743/0x970 __vfs_read+0xcc/0x120 vfs_read+0x96/0x140 SyS_read+0x40/0xa0 do_syscall_64+0x5f/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f8ce6d08d11 RSP: 002b:7fff96dec848 EFLAGS: 0246 ORIG_RAX: RAX: ffda RBX: 5651de7ec300 RCX: 7f8ce6d08d11 RDX: 0100 RSI: 5651de7ec328 RDI: 000f RBP: 5651de7ea500 R08: R09: 5651de7ec300 R10: 5651de7ec318 R11: 0246 R12: 01ffe000 R13: 01ffe100 R14: 5651de7ea550 R15: 0100 Please let me know if you need more information. Bart.
RE: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
> -----Original Message-----
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Bart Van Assche
...
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would
> be wrong to rerun the queue immediately in case a SCSI target system
> returns "BUSY".

Seeing SCSI BUSY mentioned here...

On its own, patch 6077c2d706 seems to add an arbitrarily selected magic
value of 100 ms, without explanation in the patch description or in the
added code. But dm_requeue_original_request() already seems to have chosen
that value for similar purposes:

	unsigned long delay_ms = delay_requeue ? 100 : 0;

So, making them share a #define would indicate there's a reason for that
particular value, and any change to the value would be picked up
everywhere.

The other callers of blk_mq_delay_run_hw_queue() use macros, while dm-rq
uses a bare constant:

drivers/md/dm-rq.c:     blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
drivers/nvme/host/fc.c: blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

NVMEFC_QUEUE_DELAY and SCSI_QUEUE_DELAY both happen to be set to 3 (ms).

---
Robert Elliott, HPE Persistent Memory
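Robert's suggestion above, hoisting the shared magic value into one named constant, can be sketched as follows. The macro name is hypothetical (it does not exist in the kernel tree); the point is that both the requeue path and the queue-rerun path would read the same symbol:

```c
#include <stdbool.h>

/* Hypothetical shared constant replacing the bare "100" that appears in
 * both dm_requeue_original_request() and the dm-rq queue-rerun call. */
#define DM_MPATH_REQUEUE_DELAY_MS 100

/* Sketch of dm_requeue_original_request()'s delay selection: requeue
 * either immediately or after the shared delay. */
static unsigned long requeue_delay_ms(bool delay_requeue)
{
	return delay_requeue ? DM_MPATH_REQUEUE_DELAY_MS : 0;
}
```

Tuning the delay then becomes a one-line change, and the constant's definition is the natural place to document why 100 ms was picked.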
Re: [PATCH 1/2] genirq/affinity: assign vectors to all possible CPUs
On Fri, 12 Jan 2018, Ming Lei wrote:
> From: Christoph Hellwig
>
> Currently we assign managed interrupt vectors to all present CPUs. This
> works fine for systems where we only online/offline CPUs. But in case of
> systems that support physical CPU hotplug (or the virtualized version of
> it) this means the additional CPUs covered for in the ACPI tables or on
> the command line are not catered for. To fix this we'd either need to
> introduce new hotplug CPU states just for this case, or we can start
> assigning vectors to possible but not present CPUs.
>
> Reported-by: Christian Borntraeger
> Tested-by: Christian Borntraeger
> Tested-by: Stefan Haberland
> Cc: linux-ker...@vger.kernel.org
> Cc: Thomas Gleixner

FWIW,

Acked-by: Thomas Gleixner
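The core idea of the patch, spreading vectors over all *possible* CPUs so that a later physical hot-add already has a mapping, can be illustrated with a naive round-robin sketch. The real irq_create_affinity_masks() additionally groups CPUs by NUMA node; this only shows the possible-versus-present distinction, and the function name is invented for illustration:

```c
/* Naive round-robin spread of nvec interrupt vectors over every possible
 * CPU. The key point is that CPUs that are possible but not yet present
 * (e.g. cpu >= ncpu_present) still get a vector assignment, so nothing
 * has to be recomputed when such a CPU is hot-added later. */
static void spread_vectors(int nvec, int ncpu_possible, int cpu_to_vec[])
{
	for (int cpu = 0; cpu < ncpu_possible; cpu++)
		cpu_to_vec[cpu] = cpu % nvec;
}
```

With, say, 4 vectors and 8 possible CPUs (of which only 4 are present), CPUs 4 through 7 already map onto vectors 0 through 3 before they are ever onlined.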
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 1:54pm -0500,
Bart Van Assche wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explanation of the several.
>
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know, a block layer request queue needs
> to be rerun if one or more requests are waiting and a previous condition
> that prevented those requests from being executed has been cleared. For
> the dm-mpath driver, examples of such conditions are: no tags available, a
> path that is busy (see also pgpath_busy()), path initialization that is in
> progress (pg_init_in_progress), or a request that completes with an error
> status, e.g. if the SCSI core calls __blk_mq_end_request(req, error) with
> error != 0. For some of these conditions, e.g. path initialization
> completing, a callback function in the dm-mpath driver is called and it is
> possible to rerun the queue explicitly. I agree that for such scenarios a
> delayed queue run should not be triggered. For other scenarios, e.g. if a
> SCSI initiator submits a SCSI request over a fabric and the SCSI target
> replies with "BUSY", the SCSI core will end the I/O request with status
> BLK_STS_RESOURCE after the maximum number of retries has been reached (see
> also scsi_io_completion()). In that last case, if a SCSI target sends a
> "BUSY" reply over the wire back to the initiator, there is no way for the
> SCSI initiator to figure out whether it can queue another request other
> than to resubmit the request. The worst possible strategy is to resubmit a
> request immediately, because that will cause a significant fraction of the
> fabric bandwidth to be used just for replying "BUSY" to requests that
> can't be processed immediately.
>
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case.
> It may be possible to move the delayed queue rerun from
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would
> be wrong to rerun the queue immediately in case a SCSI target system
> returns "BUSY".

OK, thank you very much for this.  Really helps.

For starters, multipath_clone_and_map() could do a fair amount more with
the insight that a SCSI "BUSY" was transmitted back. If both blk-mq being
out of tags and a SCSI "BUSY" simply return BLK_STS_RESOURCE, then
dm-mpath doesn't have the ability to behave more intelligently.

Anyway, armed with this info I'll have a think about what we might do to
tackle this problem head on.

Thanks,
Mike
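The resubmit-on-BUSY trade-off described above (immediate resubmission burns fabric bandwidth bouncing BUSY replies, a fixed delay stalls the queue) is a classic retry-backoff problem. A minimal user-space sketch of one alternative policy, capped exponential backoff; this is purely illustrative and is not what dm-rq does (dm-rq uses a fixed 100 ms delay):

```c
/* Hypothetical capped exponential backoff for requeueing after a target
 * "BUSY" reply: 1 ms, 2 ms, 4 ms, ... capped at 100 ms as the number of
 * consecutive BUSY replies grows. An immediate requeue (0 ms) would just
 * bounce BUSY replies back and forth across the fabric. */
static unsigned int busy_requeue_delay_ms(unsigned int consecutive_busy)
{
	unsigned int delay = 1;

	while (consecutive_busy-- > 1 && delay < 100)
		delay *= 2;
	return delay < 100 ? delay : 100;
}
```

A policy like this responds quickly when the target's BUSY condition is transient, while converging to the same long delay as today's fixed constant when the target stays saturated.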
Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86bf502a8e51..fcddf5a62581 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +			return DM_MAPIO_DELAY_REQUEUE;
>  		}
> -		return DM_MAPIO_DELAY_REQUEUE;
> +
> +		/*
> +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> +		 * we needn't deal with it by DELAY_REQUEUE. More
> +		 * importantly, we have to return DM_MAPIO_REQUEUE
> +		 * so that blk-mq can get the queue-busy feedback,
> +		 * otherwise I/O merging can be hurt.
> +		 */
> +		if (q->mq_ops)
> +			return DM_MAPIO_REQUEUE;
> +		else
> +			return DM_MAPIO_DELAY_REQUEUE;
>  	}

Sorry, but the approach of this patch looks wrong to me. I'm afraid that
this approach will cause 100% CPU consumption if the underlying .queue_rq()
function returns BLK_STS_RESOURCE for a reason other than what can be
detected by the .get_budget() call. This can happen if e.g. a SCSI LLD
.queuecommand() implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI
LLDs can do this:

$ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
204

Isn't this a severe regression?

Bart.
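The return-value policy the patch introduces (delayed requeue for a dying queue or the legacy request path, immediate requeue feedback for blk-mq so SCHED_RESTART and busy feedback apply) condenses to a small decision function. The enum names mirror the DM ones but this is an illustrative sketch, not the kernel code:

```c
enum mapio { MAPIO_REQUEUE, MAPIO_DELAY_REQUEUE };

/* Sketch of the requeue decision in the patched multipath_clone_and_map():
 * a dying queue and the legacy (non-mq) path keep the delayed requeue;
 * blk-mq gets an immediate requeue so the busy feedback reaches the
 * scheduler and SCHED_RESTART can rerun the queue. */
static enum mapio clone_fail_disposition(int queue_dying, int is_blk_mq)
{
	if (queue_dying || !is_blk_mq)
		return MAPIO_DELAY_REQUEUE;
	return MAPIO_REQUEUE;
}
```

Bart's objection above is precisely about the MAPIO_REQUEUE branch: it is only safe if something (budget accounting, SCHED_RESTART) is guaranteed to rerun the queue, which does not hold for every BLK_STS_RESOURCE source.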
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> OK, you have the stage: please give me a pointer to your best
> explanation of the several.

Since the previous discussion about this topic occurred more than a month
ago it could take more time to look up an explanation than to explain it
again. Anyway, here we go. As you know, a block layer request queue needs to
be rerun if one or more requests are waiting and a previous condition that
prevented those requests from being executed has been cleared. For the
dm-mpath driver, examples of such conditions are: no tags available, a path
that is busy (see also pgpath_busy()), path initialization that is in
progress (pg_init_in_progress), or a request that completes with an error
status, e.g. if the SCSI core calls __blk_mq_end_request(req, error) with
error != 0. For some of these conditions, e.g. path initialization
completing, a callback function in the dm-mpath driver is called and it is
possible to rerun the queue explicitly. I agree that for such scenarios a
delayed queue run should not be triggered. For other scenarios, e.g. if a
SCSI initiator submits a SCSI request over a fabric and the SCSI target
replies with "BUSY", the SCSI core will end the I/O request with status
BLK_STS_RESOURCE after the maximum number of retries has been reached (see
also scsi_io_completion()). In that last case, if a SCSI target sends a
"BUSY" reply over the wire back to the initiator, there is no way for the
SCSI initiator to figure out whether it can queue another request other
than to resubmit the request. The worst possible strategy is to resubmit a
request immediately, because that will cause a significant fraction of the
fabric bandwidth to be used just for replying "BUSY" to requests that can't
be processed immediately.

The intention of commit 6077c2d706097c0 was to address the last mentioned
case. It may be possible to move the delayed queue rerun from
dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would
be wrong to rerun the queue immediately in case a SCSI target system
returns "BUSY".

Bart.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 12:46pm -0500,
Bart Van Assche wrote:

> On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> > You've not explained it many times.
>
> That's not correct. I have already posted a detailed and easy to
> understand explanation several times.

OK, you have the stage: please give me a pointer to your best
explanation of the several.

> > We cannot get a straight answer from you.
>
> That is a gross and incorrect statement. Please calm down.

I'm perfectly calm. I'm just tired of how you sow controversy.
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
On 1/11/18 7:53 PM, Ming Lei wrote:
> Hi,
>
> These two patches support physical CPU hotplug, so that we can make blk-mq
> scale well when a new physical CPU is added or removed; this use case
> is normal for the VM world.
>
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
>
> https://marc.info/?l=linux-block&m=151092973417143&w=2
>
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possisble CPU

Applied, thanks Ming.

-- 
Jens Axboe
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> You've not explained it many times.

That's not correct. I have already posted a detailed and easy to
understand explanation several times.

> We cannot get a straight answer from you.

That is a gross and incorrect statement. Please calm down.

Bart.
Re: [GIT PULL] one nvme fix for Linux 4.15
On 1/12/18 8:09 AM, Christoph Hellwig wrote: > The following changes since commit 254beb84faccbe2f4eda0b51924857bdfb679969: > > nvme-fcloop: avoid possible uninitialized variable warning (2017-12-29 > 10:37:21 +0100) > > are available in the git repository at: > > git://git.infradead.org/nvme.git nvme-4.15 > > for you to fetch changes up to 6b018235b4daabae96d855219fae59c3fb8be417: > > nvme-fabrics: initialize default host->id in nvmf_host_default() > (2018-01-08 10:52:03 +0100) Pulled, thanks. -- Jens Axboe
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, Jan 12 2018 at 12:26pm -0500,
Bart Van Assche wrote:

> On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> > This is going upstream for 4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
>
> That is really gross. I have explained many times in detail on the dm-devel
> list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it
> is wrong if you remove it. Isn't it the responsibility of you as the device
> mapper kernel maintainer to avoid regressions instead of introducing
> regressions?

Please stop, seriously.

You've not explained it many times. We cannot get a straight answer from
you. No analysis that establishes that if an underlying dm-mq multipath
path is out of tags (shared or otherwise) then dm-rq core _must_ run the
hw queue after a delay.

Commit 6077c2d7060 just papers over a real blk-mq problem (that may now be
fixed). Your assertion that blk-mq would otherwise need to poll (and waste
resources so it can immediately retry) ignores that blk-mq _should_ make
progress as requests complete or if/when a path is recovered, etc.

So I'm not going to accept your dysfunctional reasoning on this, sorry.
_THAT_ is my job as a maintainer.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> This is going upstream for 4.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

That is really gross. I have explained many times in detail on the dm-devel
list why that blk_mq_delay_run_hw_queue() call is necessary, so I think it
is wrong to remove it. Isn't it your responsibility as the device mapper
kernel maintainer to avoid regressions instead of introducing them?

Bart.
Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance
On Thu, Jan 11 2018 at 10:33pm -0500,
Ming Lei wrote:

> On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> > On Thu, Jan 11 2018 at 8:42pm -0500,
> > Ming Lei wrote:
> >
> > > On Thu, Jan 11, 2018 at 10:37:37PM +0000, Bart Van Assche wrote:
> > > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > > Bart, if for some reason we regress for some workload you're able
> > > > > to more readily test we can deal with it. But I'm too encouraged
> > > > > by Ming's performance improvements to hold these changes back any
> > > > > further.
> > > >
> > > > Sorry Mike, but I think Ming's measurement results are not
> > > > sufficient to motivate upstreaming of these patches. What I remember
> > > > from previous versions of this patch series is that Ming measured
> > > > the performance impact of this patch series only for the Emulex FC
> > > > driver (lpfc). That driver limits queue depth to 3. Instead of
> > > > modifying the dm code, that driver needs to be fixed such that it
> > > > allows larger queue depths.
> > > >
> > > > Additionally, some time ago I had explained in detail why I think
> > > > that patch 2/5 in this series is wrong and why it will introduce
> > > > fairness regressions in multi-LUN workloads. I think we need
> > > > performance measurements for this patch series for at least the
> > > > following two configurations before this patch series can be
> > > > considered for upstream inclusion:
> > > > * The performance impact of this patch series for SCSI devices with
> > > >   a realistic queue depth (e.g. 64 or 128).
> > >
> > > Please look at the cover letter or patch 5's commit log: it mentions
> > > that dm-mpath over virtio-scsi is tested, and the default queue depth
> > > of virtio-scsi is 128.
> > > > * The performance impact of this patch series for a SCSI host with
> > > >   which multiple LUNs are associated and for which the target system
> > > >   often replies with the SCSI "BUSY" status.
> > >
> > > I don't think this patch is related to this case; this patch just
> > > provides the BUSY feedback from the underlying queue to blk-mq via
> > > dm-rq.
> >
> > Hi Ming,
> >
> > Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> >
> > Specifically:
> > "That patch [commit 6077c2d706] can be considered as a first step that
> > can be refined further, namely by modifying the dm-rq code further such
> > that dm-rq queues are only rerun after the busy condition has been
> > cleared. The patch at the start of this thread is easier to review and
> > easier to test than any patch that would only rerun dm-rq queues after
> > the busy condition has been cleared."
> >
> > Do you have time to reason through Bart's previous proposal for how to
> > better address the specific issue that was documented to be fixed in the
> > header for commit 6077c2d706 ?
>
> Hi Mike,
>
> Recently we fixed one such issue in blk-mq (commit eb619fdb2d4cb8b3d34),
> and I strongly suspect that may fix Bart's case.

Wow, kind of staggering that there was no mention of this fix prior to
now. Seems _very_ relevant (like the real fix).

> Let's look at commit 6077c2d706 again: I don't know the idea behind the
> fix, which just adds an arbitrary 100 ms delay, and this delay degrades
> performance a lot, since no hctx can be scheduled any more during the
> delay.

I'd love to try to reproduce the issue documented in commit 6077c2d706
but sadly I cannot get that srp-test to work on my system. Just fails
with:

# while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test srp-test/tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (0 min; mq) SRP login failed Test srp-test/tests/02-mq failed Think the login is failing due to target not being setup properly? Yeap, now from reading this thread, it is clear that srp-test only works if you have an IB HCA: https://patchwork.kernel.org/patch/10041381/ > So again it is definitely a workaround, no any reasoning, no any theory, just > say it is a fix. Are there anyone who can explain the story? > > IMO, the blk_get_request() in dm-mpath is just like a kmalloc(ATOMIC) which > is used in other blk-mq drivers' IO path too, so don't know why dm-mq is so > strange to require such ugly 100ms delay. The header for commit 6077c2d706 notes that same action that Jens took to unwedge the request stalls (in the patchwork thread I referenced above), with: echo run > /sys/kernel/debug/block/nullb1/state vs what Bart noted in commit 6077c2d706: echo run >/sys/kernel/debug/block/dm-0/state Really does feel like the issue Jens' shared tags fix addressed _is_ the root cause that commit 6077c2d706 worked around.
[for-4.16 PATCH v6 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue
The original commit e9a823fb34a8b (block: fix warning when I/O elevator is
changed as request_queue is being removed) is pretty conflated. "conflated"
because the resource being protected by q->sysfs_lock isn't the queue_flags
(it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store) from
racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
   before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED:
   in case elv_iosched_store() loses the race with blk_unregister_queue(),
   it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed. To do so blk_mq_unregister_dev()
must not also take q->sysfs_lock. So rename __blk_mq_unregister_dev() to
blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as
request_queue is being removed")
Signed-off-by: Mike Snitzer
---
 block/blk-mq-sysfs.c |  9 +
 block/blk-sysfs.c    | 13 ++---
 2 files changed, 11 insertions(+), 11 deletions(-)

v6: blk_mq_unregister_dev now requires q->sysfs_lock be held.
Ming: I am not seeing any lockdep complaints with this. I've tested
bio-based, blk-mq and old .request_fn request-based.
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79969c3c234f..a54b4b070f1c 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -248,7 +248,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 	return ret;
 }
 
-static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
+void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
@@ -265,13 +265,6 @@ static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 	q->mq_sysfs_init_done = false;
 }
 
-void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
-{
-	mutex_lock(&q->sysfs_lock);
-	__blk_mq_unregister_dev(dev, q);
-	mutex_unlock(&q->sysfs_lock);
-}
-
 void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
 {
 	kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 870484eaed1f..9272452ff456 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return;
 
+	/*
+	 * Protect against the 'queue' kobj being accessed
+	 * while/after it is removed.
+	 */
 	mutex_lock(&q->sysfs_lock);
-	queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
-	mutex_unlock(&q->sysfs_lock);
 
-	wbt_exit(q);
+	spin_lock_irq(q->queue_lock);
+	queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+	spin_unlock_irq(q->queue_lock);
 
+	wbt_exit(q);
 	if (q->mq_ops)
 		blk_mq_unregister_dev(disk_to_dev(disk), q);
@@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
 	kobject_del(&q->kobj);
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 	kobject_put(&disk_to_dev(disk)->kobj);
+
+	mutex_unlock(&q->sysfs_lock);
 }
-- 
2.15.0
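The invariant the patch above enforces, that an elevator switch observes QUEUE_FLAG_REGISTERED under the same q->sysfs_lock that blk_unregister_queue() holds while removing the kobject, can be shown with a single-threaded sketch. The names mirror the kernel's but this is illustrative only (no real locking or kobjects):

```c
#include <stdbool.h>

/* Toy model of the race protection: both operations run under
 * q->sysfs_lock in the real code, so an elevator switch either completes
 * before unregistration or sees registered == false and bails without
 * touching the (now gone) 'queue' kobject. */
struct toy_queue {
	bool registered;	/* QUEUE_FLAG_REGISTERED */
	bool kobj_present;	/* the 'queue' kobj */
};

static void toy_unregister_queue(struct toy_queue *q)
{
	/* under sysfs_lock in the real code */
	q->registered = false;
	q->kobj_present = false;	/* kobject_del(&q->kobj) */
}

static int toy_elevator_change(struct toy_queue *q)
{
	/* under sysfs_lock in the real code */
	if (!q->registered)
		return -1;	/* lost the race with unregistration */
	return 0;		/* safe: kobj still present */
}
```

Because the flag is cleared and the kobject removed under one critical section, the switch can never see registered == true while kobj_present == false.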
Re: [for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue
On Fri, Jan 12 2018 at 10:17am -0500,
Ming Lei wrote:

> On Fri, Jan 12, 2018 at 10:06:04AM -0500, Mike Snitzer wrote:
> > The original commit e9a823fb34a8b (block: fix warning when I/O elevator
> > is changed as request_queue is being removed) is pretty conflated.
> > "conflated" because the resource being protected by q->sysfs_lock isn't
> > the queue_flags (it is the 'queue' kobj).
> >
> > q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
> > from racing with blk_unregister_queue():
> > 1) By holding q->sysfs_lock first, __elevator_change() can complete
> > before a racing blk_unregister_queue().
> > 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
> > in case elv_iosched_store() loses the race with blk_unregister_queue();
> > it needs a way to know the 'queue' kobj isn't there.
> >
> > Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
> > held until after the 'queue' kobj is removed.
>
> This way will cause deadlock, see below.

Ngh... I thought I tested blk-mq with this patch applied, apparently not.

> > Also, blk_unregister_queue() should use q->queue_lock to protect against
> > any concurrent writes to q->queue_flags -- even though chances are the
> > queue is being cleaned up so no concurrent writes are likely.
> >
> > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as
> > request_queue is being removed")
> > Signed-off-by: Mike Snitzer
> > ---
> >  block/blk-sysfs.c | 13 ++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 870484eaed1f..9272452ff456 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
> >  	if (WARN_ON(!q))
> >  		return;
> >
> > +	/*
> > +	 * Protect against the 'queue' kobj being accessed
> > +	 * while/after it is removed.
> > +	 */
> >  	mutex_lock(&q->sysfs_lock);
> > -	queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> > -	mutex_unlock(&q->sysfs_lock);
> >
> > -	wbt_exit(q);
> > +	spin_lock_irq(q->queue_lock);
> > +	queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> > +	spin_unlock_irq(q->queue_lock);
> >
> > +	wbt_exit(q);
> >
> >  	if (q->mq_ops)
> >  		blk_mq_unregister_dev(disk_to_dev(disk), q);
>
> void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
> {
> 	mutex_lock(&q->sysfs_lock);
> 	__blk_mq_unregister_dev(dev, q);
> 	mutex_unlock(&q->sysfs_lock);
> }
>
> > @@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
> >  	kobject_del(&q->kobj);
> >  	blk_trace_remove_sysfs(disk_to_dev(disk));
> >  	kobject_put(&disk_to_dev(disk)->kobj);
> > +
> > +	mutex_unlock(&q->sysfs_lock);
> >  }
>
> Except for the above, I remember there is also a lockdep warning between
> sysfs_lock and the driver core's lock if the sysfs_lock is extended in
> this way (I tried it before, but forget the details now), so please just
> hold queue_lock inside the sysfs lock.

No good deed goes unpunished.  I'll fix this up.

Mike
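The deadlock Ming points out is the classic non-recursive-lock pattern: blk_unregister_queue() would hold q->sysfs_lock and then call blk_mq_unregister_dev(), which (in the v5 patch) tried to take the same mutex again. A toy lock in plain C makes the failure mode concrete; a real kernel mutex would simply hang here instead of returning an error:

```c
/* Toy non-recursive mutex. Locking it while it is already held models
 * the self-deadlock: in the kernel the second mutex_lock(&q->sysfs_lock)
 * would block forever, since the owner is waiting on itself. */
struct toy_mutex { int held; };

static int toy_lock(struct toy_mutex *m)
{
	if (m->held)
		return -1;	/* caller already holds it: self-deadlock */
	m->held = 1;
	return 0;
}

static void toy_unlock(struct toy_mutex *m)
{
	m->held = 0;
}
```

The v6 fix resolves this by making blk_mq_unregister_dev() a leaf that expects the caller to already hold the lock, which is the usual convention (a `__`-prefixed or `_locked` variant) for callees invoked inside a critical section.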
Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436
> Il giorno 12 gen 2018, alle ore 09:29, Paolo Valente >ha scritto: > > > >> Il giorno 12 gen 2018, alle ore 05:18, Ming Lei ha >> scritto: >> >> On Thu, Jan 11, 2018 at 08:40:54AM -0700, Jens Axboe wrote: >>> On 1/11/18 2:41 AM, Paolo Valente wrote: > Il giorno 11 gen 2018, alle ore 10:30, Paolo Valente > ha scritto: > > > >> Il giorno 10 gen 2018, alle ore 05:58, Ming Lei ha >> scritto: >> >> Hi Paolo, >> >> Looks this one is introduced in recent merge, and it is triggered >> in test of IO vs. removing device on the latest for-next of block >> tree: >> >> [ 296.151615] BUG: unable to handle kernel NULL pointer dereference at >> 0436 >> [ 296.152302] IP: percpu_counter_add_batch+0x25/0x9d >> [ 296.152698] PGD 0 P4D 0 >> [ 296.152916] Oops: [#1] PREEMPT SMP >> [ 296.153233] Dumping ftrace buffer: >> [ 296.153517](ftrace buffer empty) >> [ 296.153817] Modules linked in: scsi_debug(E) null_blk(E) isofs(E) >> iTCO_wdt(E) iTCO_vendor_support(E) i2c_i801(E) i2c_core(E) lpc_ich(E) >> mfd_core(E) ip_tables(E) sr_mod(E) cdrom(E) sd_mod(E) usb_storage(E) >> ahci(E) libahci(E) libata(E) crc32c_intel(E) virtio_scsi(E) >> qemu_fw_cfg(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last >> unloaded: scsi_debug] >> [ 296.156199] CPU: 2 PID: 2001 Comm: scsi_id Tainted: GE >> 4.15.0-rc7790529290ab3_my_v4.15-rc-block-for-next #4 >> [ 296.156961] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >> 1.9.3-1.fc25 04/01/2014 >> [ 296.157546] RIP: 0010:percpu_counter_add_batch+0x25/0x9d >> [ 296.157920] RSP: 0018:c90001133b28 EFLAGS: 00010002 >> [ 296.158285] RAX: 0002 RBX: ff6e RCX: >> 0001 >> [ 296.158779] RDX: 3fff RSI: 81e95429 RDI: >> 81e5e64c >> [ 296.159276] RBP: 0416 R08: 0001 R09: >> 0001 >> [ 296.159770] R10: c90001133aa0 R11: 88007e96caf0 R12: >> 3fff >> [ 296.160273] R13: 0001 R14: 0001 R15: >> e8d0b180 >> [ 296.160780] FS: 7f635988af80() GS:88007cf0() >> knlGS: >> [ 296.161325] CS: 0010 DS: ES: CR0: 80050033 >> [ 296.161744] CR2: 0436 CR3: 717f1005 CR4: >> 
003606e0 >> [ 296.162258] DR0: DR1: DR2: >> >> [ 296.162764] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [ 296.163246] Call Trace: >> [ 296.163445] bfqg_stats_update_io_add+0x35/0xdc >> [ 296.163754] bfq_insert_requests+0xbb9/0xbf2 >> [ 296.164049] ? blk_rq_bio_prep+0x51/0x5d >> [ 296.164353] ? blk_rq_append_bio+0x32/0x78 >> [ 296.164633] ? blk_rq_map_user_iov+0x11b/0x1a0 >> [ 296.164940] blk_mq_sched_insert_request+0xaf/0x12f >> [ 296.165273] blk_execute_rq+0x4b/0x93 >> [ 296.165586] sg_io+0x236/0x38a >> [ 296.165800] scsi_cmd_ioctl+0x1d4/0x381 >> [ 296.166065] ? seccomp_run_filters+0xee/0x12d >> [ 296.166362] ? preempt_count_add+0x6d/0x8c >> [ 296.166643] sd_ioctl+0xbb/0xde [sd_mod] >> [ 296.166915] blkdev_ioctl+0x7f2/0x850 >> [ 296.167167] block_ioctl+0x39/0x3c >> [ 296.167401] vfs_ioctl+0x1b/0x28 >> [ 296.167622] do_vfs_ioctl+0x4bc/0x542 >> [ 296.167877] ? syscall_trace_enter+0x164/0x261 >> [ 296.168180] SyS_ioctl+0x3e/0x5a >> [ 296.168402] do_syscall_64+0x71/0x168 >> [ 296.168654] entry_SYSCALL64_slow_path+0x25/0x25 >> [ 296.168970] RIP: 0033:0x7f63593a1dc7 >> [ 296.169214] RSP: 002b:7fff87210878 EFLAGS: 0246 ORIG_RAX: >> 0010 >> [ 296.169823] RAX: ffda RBX: 7fff872108b0 RCX: >> 7f63593a1dc7 >> [ 296.170303] RDX: 7fff872108b0 RSI: 2285 RDI: >> 0004 >> [ 296.170783] RBP: 7fff87210f00 R08: 2006 R09: >> fe00 >> [ 296.171266] R10: 7fff87210910 R11: 0246 R12: >> >> [ 296.171745] R13: 7fff872108b0 R14: 7fff872108ba R15: >> 7fff872112b0 >> [ 296.172228] Code: 5d 41 5c 41 5d c3 41 55 41 54 49 89 f5 55 53 48 89 >> fd bf 01 00 00 00 41 89 d4 e8 8c 19 d4 ff 48 c7 c7 29 54 e9 81 e8 fd 3d >> ff ff <48> 8b 45 20 49 63 d4 65 8b 18 48 63 db 4c 01 eb 48 39 d3 7d 0b >> [ 296.173594] RIP: percpu_counter_add_batch+0x25/0x9d RSP: >> c90001133b28 >> [ 296.174054] CR2: 0436 >> [ 296.174283] ---[ end trace 87b58f9235daac7e ]--- >> [ 296.174673] Kernel panic - not syncing:
Re: [for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue
On Fri, Jan 12, 2018 at 10:06:04AM -0500, Mike Snitzer wrote: > The original commit e9a823fb34a8b (block: fix warning when I/O elevator > is changed as request_queue is being removed) is pretty conflated. > "conflated" because the resource being protected by q->sysfs_lock isn't > the queue_flags (it is the 'queue' kobj). > > q->sysfs_lock serializes __elevator_change() (via elv_iosched_store) > from racing with blk_unregister_queue(): > 1) By holding q->sysfs_lock first, __elevator_change() can complete > before a racing blk_unregister_queue(). > 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED > in case elv_iosched_store() loses the race with blk_unregister_queue(); > it needs a way to know the 'queue' kobj isn't there. > > Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is > held until after the 'queue' kobj is removed. This way will cause deadlock, see below. > > Also, blk_unregister_queue() should use q->queue_lock to protect against > any concurrent writes to q->queue_flags -- even though chances are the > queue is being cleaned up so no concurrent writes are likely. > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as > request_queue is being removed") > Signed-off-by: Mike Snitzer > --- > block/blk-sysfs.c | 13 ++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 870484eaed1f..9272452ff456 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk) > if (WARN_ON(!q)) > return; > > + /* > + * Protect against the 'queue' kobj being accessed > + * while/after it is removed. > + */ > mutex_lock(&q->sysfs_lock); > - queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); > - mutex_unlock(&q->sysfs_lock); > > - wbt_exit(q); > + spin_lock_irq(q->queue_lock); > + queue_flag_clear(QUEUE_FLAG_REGISTERED, q); > + spin_unlock_irq(q->queue_lock); > > + wbt_exit(q); > > if (q->mq_ops) > blk_mq_unregister_dev(disk_to_dev(disk), q); void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) { mutex_lock(&q->sysfs_lock); __blk_mq_unregister_dev(dev, q); mutex_unlock(&q->sysfs_lock); } > @@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk) > kobject_del(&q->kobj); > blk_trace_remove_sysfs(disk_to_dev(disk)); > kobject_put(&disk_to_dev(disk)->kobj); > + > + mutex_unlock(&q->sysfs_lock); > } Except for the above, I remember there is also a lockdep warning between sysfs_lock and the driver core's lock if the sysfs_lock is extended in this way (I tried it before, but forget the details now), so please just hold queue_lock inside the sysfs lock. Thanks, Ming
Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising
On 1/12/18 3:20 AM, Paolo Valente wrote: > > >> Il giorno 12 gen 2018, alle ore 11:15, Holger Hoffstätte >> ha scritto: >> >> On 01/12/18 06:58, Paolo Valente wrote: >>> >>> Il giorno 28 dic 2017, alle ore 15:00, Holger Hoffstätte ha scritto: On 12/28/17 12:19, Paolo Valente wrote: (snip half a tech report ;) So either this or the previous patch ("limit tags for writes and async I/O") can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't log into the affected system anymore I couldn't get any stack traces, blk-mq debug output etc. but there was nothing in dmesg/on the console, so it wasn't a BUG/OOPS. -h >>> >>> Hi Holger, >>> if, as I guess, this problem hasn't gone away for you, I have two >>> requests: >>> 1) could you share your exact test >>> 2) if nothing happens in my systems with your test, would you be >>> willing to retry with the dev version of bfq? It should be able to >>> tell us what leads to your hang. If you are willing to do this test, >>> I'll prepare a branch with everything already configured for you. >> >> Hi, >> >> thanks for following up but there's no need for any of that; it turned out >> to be something else since I got the same hang without those patches at >> least once (during a btrfs balance, even though it didn't look like btrfs' >> fault directly; more like block/mm/helpers). >> >> So on January 7 I posted to linux-block et al. where I said >> "So this turned out to be something else, sorry for the false alarm." >> but apparently that didn't make it through since it's not in the >> archives either. Sorry. >> >> Long story short, the good news is that I've been running with both patches >> since then without any issue. :) >> > > Wow, what a relief! :) > > So, Jens, with the only reported issue gone, can you please consider > queueing this patch and the other pending one [1]? They are both > critical for bfq performance. Please just resend them. -- Jens Axboe
[for-4.16 PATCH v5 1/4] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
device_add_disk() will only call bdi_register_owner() if !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call bdi_unregister() if !GENHD_FL_HIDDEN. Found with code inspection. bdi_unregister() won't do any harm if bdi_register_owner() wasn't used, but it is best to avoid the unnecessary call to bdi_unregister(). Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN") Signed-off-by: Mike Snitzer Reviewed-by: Ming Lei Reviewed-by: Hannes Reinecke --- block/genhd.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/genhd.c b/block/genhd.c index 96a66f671720..00620e01e043 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -725,7 +725,8 @@ void del_gendisk(struct gendisk *disk) * Unregister bdi before releasing device numbers (as they can * get reused and we'd get clashes in sysfs). */ - bdi_unregister(disk->queue->backing_dev_info); + if (!(disk->flags & GENHD_FL_HIDDEN)) + bdi_unregister(disk->queue->backing_dev_info); blk_unregister_queue(disk); } else { WARN_ON(1); -- 2.15.0
[for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue
The original commit e9a823fb34a8b (block: fix warning when I/O elevator is changed as request_queue is being removed) is pretty conflated. "conflated" because the resource being protected by q->sysfs_lock isn't the queue_flags (it is the 'queue' kobj). q->sysfs_lock serializes __elevator_change() (via elv_iosched_store) from racing with blk_unregister_queue(): 1) By holding q->sysfs_lock first, __elevator_change() can complete before a racing blk_unregister_queue(). 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED in case elv_iosched_store() loses the race with blk_unregister_queue(); it needs a way to know the 'queue' kobj isn't there. Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is held until after the 'queue' kobj is removed. Also, blk_unregister_queue() should use q->queue_lock to protect against any concurrent writes to q->queue_flags -- even though chances are the queue is being cleaned up so no concurrent writes are likely. Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed") Signed-off-by: Mike Snitzer --- block/blk-sysfs.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 870484eaed1f..9272452ff456 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk) if (WARN_ON(!q)) return; + /* + * Protect against the 'queue' kobj being accessed + * while/after it is removed. + */ mutex_lock(&q->sysfs_lock); - queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); - mutex_unlock(&q->sysfs_lock); - wbt_exit(q); + spin_lock_irq(q->queue_lock); + queue_flag_clear(QUEUE_FLAG_REGISTERED, q); + spin_unlock_irq(q->queue_lock); + wbt_exit(q); if (q->mq_ops) blk_mq_unregister_dev(disk_to_dev(disk), q); @@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk) kobject_del(&q->kobj); blk_trace_remove_sysfs(disk_to_dev(disk)); kobject_put(&disk_to_dev(disk)->kobj); + + mutex_unlock(&q->sysfs_lock); } -- 2.15.0
[for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
Hi Jens, I'm submitting this v5 with more feeling now ;) I've distilled the changes down to be quite minimal. Hopefully this will help you and others review. I've also done a dry-run of applying 4.16 block changes and then merging dm-4.16; no merge conflicts: https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16 https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16 https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16_dm-4.16 And it all tests well for me. Please consider for 4.16, thanks! Mike Mike Snitzer (4): block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN block: properly protect the 'queue' kobj in blk_unregister_queue block: allow gendisk's request_queue registration to be deferred dm: fix incomplete request_queue initialization block/blk-sysfs.c | 18 +++--- block/genhd.c | 23 +++ drivers/md/dm-rq.c | 9 - drivers/md/dm.c | 11 ++- include/linux/genhd.h | 5 + 5 files changed, 49 insertions(+), 17 deletions(-) -- 2.15.0
[for-4.16 PATCH v5 3/4] block: allow gendisk's request_queue registration to be deferred
For as long as I can remember DM has forced the block layer to allow the allocation and initialization of the request_queue to be distinct operations. The reason for this is that block/genhd.c:add_disk() requires that the request_queue (and associated bdi) be tied to the gendisk before add_disk() is called -- because add_disk() also deals with exposing the request_queue via blk_register_queue(). DM's dynamic creation of arbitrary device types (and associated request_queue types) requires the DM device's gendisk be available so that DM table loads can establish a master/slave relationship with subordinate devices that are referenced by loaded DM tables -- using bd_link_disk_holder(). But until these DM tables, and their associated subordinate devices, are known, DM cannot know what type of request_queue it needs -- nor what its queue_limits should be. This chicken and egg scenario has created all manner of problems for DM and, at times, the block layer. Summary of changes: - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variants that drivers may use to add a disk without also calling blk_register_queue(). The driver must call blk_register_queue() once its request_queue is fully initialized. - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED is not set. It won't be set if the driver used add_disk_no_queue_reg() but encountered an error and must del_gendisk() before ever calling blk_register_queue(). - Export blk_register_queue(). These changes allow DM to use add_disk_no_queue_reg() to anchor its gendisk as the "master" for the master/slave relationships DM must establish with subordinate devices referenced in DM tables that get loaded. Once all "slave" devices for a DM device are known, its request_queue can be properly initialized and then advertised via sysfs -- the important improvement being that no request_queue resource initialization performed by blk_register_queue() is missed for DM devices anymore.
Signed-off-by: Mike Snitzer Reviewed-by: Ming Lei --- block/blk-sysfs.c | 5 + block/genhd.c | 20 +--- include/linux/genhd.h | 5 + 3 files changed, 27 insertions(+), 3 deletions(-) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 9272452ff456..4a6a40ffd78e 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk) mutex_unlock(&q->sysfs_lock); return ret; } +EXPORT_SYMBOL_GPL(blk_register_queue); void blk_unregister_queue(struct gendisk *disk) { @@ -929,6 +930,10 @@ void blk_unregister_queue(struct gendisk *disk) if (WARN_ON(!q)) return; + /* Return early if disk->queue was never registered. */ + if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + return; + /* * Protect against the 'queue' kobj being accessed * while/after it is removed. diff --git a/block/genhd.c b/block/genhd.c index 00620e01e043..d4aaf0cae9ad 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -629,7 +629,7 @@ static void register_disk(struct device *parent, struct gendisk *disk) } /** - * device_add_disk - add partitioning information to kernel list + * device_add_disk_no_queue_reg - add partitioning information to kernel list * @parent: parent device for the disk * @disk: per-device partitioning information * * @@ -638,7 +638,7 @@ static void register_disk(struct device *parent, struct gendisk *disk) * * FIXME: error handling */ -void device_add_disk(struct device *parent, struct gendisk *disk) +void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk) { dev_t devt; int retval; @@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk) exact_match, exact_lock, disk); } register_disk(parent, disk); - blk_register_queue(disk); /* * Take an extra ref on queue which will be put on disk_release() @@ -693,6 +692,21 @@ void device_add_disk(struct device *parent, struct gendisk *disk) disk_add_events(disk); blk_integrity_add(disk); } +EXPORT_SYMBOL(device_add_disk_no_queue_reg); + +/** + * device_add_disk - add disk information to kernel list + * @parent: parent device for the disk + * @disk: per-device disk information + * + * Adds partitioning information and also registers the + * associated request_queue to @disk. + */ +void device_add_disk(struct device *parent, struct gendisk *disk) +{ + device_add_disk_no_queue_reg(parent, disk); + blk_register_queue(disk); +} EXPORT_SYMBOL(device_add_disk); void del_gendisk(struct gendisk *disk) diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 5144ebe046c9..5e3531027b51 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -395,6 +395,11 @@ static inline void add_disk(struct gendisk *disk) {
[for-4.16 PATCH v5 4/4] dm: fix incomplete request_queue initialization
DM is no longer prone to having its request_queue be improperly initialized. Summary of changes: - defer DM's blk_register_queue() from add_disk()-time until dm_setup_md_queue(), by using add_disk_no_queue_reg() in alloc_dev(). - dm_setup_md_queue() is updated to fully initialize DM's request_queue (_after_ all table loads have occurred and the request_queue's type, features and limits are known). A very welcome side-effect of these changes is that DM no longer needs to: 1) backfill the "mq" sysfs entry (because historically DM didn't initialize the request_queue to use blk-mq until _after_ blk_register_queue() was called via add_disk()). 2) call elv_register_queue() to get a .request_fn request-based DM device's "iosched" exposed in sysfs. In addition, blk-mq debugfs support is now made available because request-based DM's blk-mq request_queue is now properly initialized before dm_setup_md_queue() calls blk_register_queue(). These changes also stave off the need to introduce new DM-specific workarounds in block core, e.g. this proposal: https://patchwork.kernel.org/patch/10067961/ In the end DM devices should be less unicorn in nature (relative to initialization and availability of block core infrastructure provided by the request_queue).
Signed-off-by: Mike Snitzer Tested-by: Ming Lei --- drivers/md/dm-rq.c | 9 - drivers/md/dm.c | 11 ++- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 9d32f25489c2..c28357f5cb0e 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -713,8 +713,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t) return error; } - elv_register_queue(md->queue); - return 0; } @@ -812,15 +810,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t) } dm_init_md_queue(md); - /* backfill 'mq' sysfs registration normally done in blk_register_queue */ - err = blk_mq_register_dev(disk_to_dev(md->disk), q); - if (err) - goto out_cleanup_queue; - return 0; -out_cleanup_queue: - blk_cleanup_queue(q); out_tag_set: blk_mq_free_tag_set(md->tag_set); out_kfree_tag_set: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 7475739fee49..8c26bfc35335 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1761,7 +1761,7 @@ static struct mapped_device *alloc_dev(int minor) goto bad; md->dax_dev = dax_dev; - add_disk(md->disk); + add_disk_no_queue_reg(md->disk); format_dev_t(md->name, MKDEV(_major, minor)); md->wq = alloc_workqueue("kdmflush", WQ_MEM_RECLAIM, 0); @@ -2021,6 +2021,7 @@ EXPORT_SYMBOL_GPL(dm_get_queue_limits); int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) { int r; + struct queue_limits limits; enum dm_queue_mode type = dm_get_md_type(md); switch (type) { @@ -2057,6 +2058,14 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) break; } + r = dm_calculate_queue_limits(t, &limits); + if (r) { + DMERR("Cannot calculate initial queue limits"); + return r; + } + dm_table_set_restrictions(t, md->queue, &limits); + blk_register_queue(md->disk); + return 0; } -- 2.15.0
Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue
On Fri, Jan 12 2018 at 9:14am -0500, Ming Lei wrote: > On Fri, Jan 12, 2018 at 07:53:40AM -0500, Mike Snitzer wrote: > > On Fri, Jan 12 2018 at 2:09am -0500, > > Ming Lei wrote: > > > > > On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote: > > > > blk_unregister_queue() must protect against any modifications of > > > > q->queue_flags (not just those performed in blk-sysfs.c). Therefore > > > > q->queue_lock needs to be used rather than q->sysfs_lock. > > > > > > > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed > > > > as request_queue is being removed") > > > > Cc: sta...@vger.kernel.org # 4.14+ > > > > Reported-by: Bart Van Assche > > > > Signed-off-by: Mike Snitzer > > > > --- > > > > block/blk-sysfs.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > > > index 870484eaed1f..52f57539f1c7 100644 > > > > --- a/block/blk-sysfs.c > > > > +++ b/block/blk-sysfs.c > > > > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk) > > > > if (WARN_ON(!q)) > > > > return; > > > > > > > > - mutex_lock(&q->sysfs_lock); > > > > + spin_lock_irq(q->queue_lock); > > > > queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); > > > > - mutex_unlock(&q->sysfs_lock); > > > > + spin_unlock_irq(q->queue_lock); > > > > > > > > wbt_exit(q); > > > > > > Hi Mike, > > > > > > This change may not be correct, since at least e9a823fb34a8b depends > > > on q->sysfs_lock to sync between testing the flag in __elevator_change() > > > and clearing it here. > > > > The header for commit e9a823fb34a8b says: > > To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when > > changing the elevator and use the request_queue's sysfs_lock to > > serialize between clearing the flag and the elevator testing the flag. > > > > The reality is sysfs_lock isn't needed to serialize between > > blk_unregister_queue() clearing the flag and __elevator_change() testing > > the flag. > > Without holding sysfs_lock, __elevator_change() may just be called after > q->kobj is deleted from blk_unregister_queue(), then the warning of > 'WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 > kobject_add_internal+0x103/0x2d0' > can be triggered again. > > BTW, __elevator_change() is always run while holding sysfs_lock. Yes, I'm well aware. Please see v5's PATCH 02/04, which is inbound now. Thanks, Mike
Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue
On Fri, Jan 12, 2018 at 07:53:40AM -0500, Mike Snitzer wrote: > On Fri, Jan 12 2018 at 2:09am -0500, > Ming Lei wrote: > > > On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote: > > > blk_unregister_queue() must protect against any modifications of > > > q->queue_flags (not just those performed in blk-sysfs.c). Therefore > > > q->queue_lock needs to be used rather than q->sysfs_lock. > > > > > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as > > > request_queue is being removed") > > > Cc: sta...@vger.kernel.org # 4.14+ > > > Reported-by: Bart Van Assche > > > Signed-off-by: Mike Snitzer > > > --- > > > block/blk-sysfs.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > > index 870484eaed1f..52f57539f1c7 100644 > > > --- a/block/blk-sysfs.c > > > +++ b/block/blk-sysfs.c > > > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk) > > > if (WARN_ON(!q)) > > > return; > > > > > > - mutex_lock(&q->sysfs_lock); > > > + spin_lock_irq(q->queue_lock); > > > queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); > > > - mutex_unlock(&q->sysfs_lock); > > > + spin_unlock_irq(q->queue_lock); > > > > > > wbt_exit(q); > > > > Hi Mike, > > > > This change may not be correct, since at least e9a823fb34a8b depends > > on q->sysfs_lock to sync between testing the flag in __elevator_change() > > and clearing it here. > > The header for commit e9a823fb34a8b says: > To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when > changing the elevator and use the request_queue's sysfs_lock to > serialize between clearing the flag and the elevator testing the flag. > > The reality is sysfs_lock isn't needed to serialize between > blk_unregister_queue() clearing the flag and __elevator_change() testing > the flag. Without holding sysfs_lock, __elevator_change() may just be called after q->kobj is deleted from blk_unregister_queue(), then the warning of 'WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0' can be triggered again. BTW, __elevator_change() is always run while holding sysfs_lock. Thanks, Ming
Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue
On Fri, Jan 12 2018 at 2:09am -0500, Ming Lei wrote: > On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote: > > blk_unregister_queue() must protect against any modifications of > > q->queue_flags (not just those performed in blk-sysfs.c). Therefore > > q->queue_lock needs to be used rather than q->sysfs_lock. > > > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as > > request_queue is being removed") > > Cc: sta...@vger.kernel.org # 4.14+ > > Reported-by: Bart Van Assche > > Signed-off-by: Mike Snitzer > > --- > > block/blk-sysfs.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > index 870484eaed1f..52f57539f1c7 100644 > > --- a/block/blk-sysfs.c > > +++ b/block/blk-sysfs.c > > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk) > > if (WARN_ON(!q)) > > return; > > > > - mutex_lock(&q->sysfs_lock); > > + spin_lock_irq(q->queue_lock); > > queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); > > - mutex_unlock(&q->sysfs_lock); > > + spin_unlock_irq(q->queue_lock); > > > > wbt_exit(q); > > Hi Mike, > > This change may not be correct, since at least e9a823fb34a8b depends > on q->sysfs_lock to sync between testing the flag in __elevator_change() > and clearing it here. The header for commit e9a823fb34a8b says: To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when changing the elevator and use the request_queue's sysfs_lock to serialize between clearing the flag and the elevator testing the flag. The reality is sysfs_lock isn't needed to serialize between blk_unregister_queue() clearing the flag and __elevator_change() testing the flag. The original commit e9a823fb34a8b is pretty conflated. "conflated" because the resource being protected isn't the queue_flags (it is the 'queue' kobj). I'll respin v5 of this patchset to fix this up first, and then apply the changes I _really_ need to land (the DM queue initialization fix). And then I'm going to slowly step away from block core and _not_ allow myself to be tripped up, or baited, by historic block core issues for a while... ;) Thanks, Mike
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
On Fri, Jan 12, 2018 at 09:12:24AM +0100, Christian Borntraeger wrote: > I think we also need cc stable for this. The bug was introduced with > > commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. > ("blk-mq: Create hctx for each present CPU") > > and that was even backported into 4.12 stable. Yes, and Ming (or Jens?), can you please add a Fixes tag as well? It helps with identifying patches which fix backports. Thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising
> Il giorno 12 gen 2018, alle ore 11:15, Holger Hoffstätte > ha scritto: > > On 01/12/18 06:58, Paolo Valente wrote: >> >> >>> Il giorno 28 dic 2017, alle ore 15:00, Holger Hoffstätte >>> ha scritto: >>> >>> >>> On 12/28/17 12:19, Paolo Valente wrote: >>> (snip half a tech report ;) >>> >>> So either this or the previous patch ("limit tags for writes and async I/O") >>> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't >>> log into the affected system anymore I couldn't get any stack traces, blk-mq >>> debug output etc. but there was nothing in dmesg/on the console, so it >>> wasn't a BUG/OOPS. >>> >>> -h >> >> Hi Holger, >> if, as I guess, this problem hasn't gone away for you, I have two >> requests: >> 1) could you share your exact test >> 2) if nothing happens in my systems with your test, would you be >> willing to retry with the dev version of bfq? It should be able to >> tell us what leads to your hang. If you are willing to do this test, >> I'll prepare a branch with everything already configured for you. > > Hi, > > thanks for following up but there's no need for any of that; it turned out > to be something else since I got the same hang without those patches at > least once (during a btrfs balance, even though it didn't look like btrfs' > fault directly; more like block/mm/helpers). > > So on January 7 I posted to linux-block et al. where I said > "So this turned out to be something else, sorry for the false alarm." > but apparently that didn't make it through since it's not in the > archives either. Sorry. > > Long story short, the good news is that I've been running with both patches > since then without any issue. :) > Wow, what a relief! :) So, Jens, with the only reported issue gone, can you please consider queueing this patch and the other pending one [1]? They are both critical for bfq performance. Thanks, Paolo [1] https://www.spinics.net/lists/kernel/msg2684463.html > cheers > Holger
Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising
On 01/12/18 06:58, Paolo Valente wrote: > > >> Il giorno 28 dic 2017, alle ore 15:00, Holger Hoffstätte >> ha scritto: >> >> >> On 12/28/17 12:19, Paolo Valente wrote: >> (snip half a tech report ;) >> >> So either this or the previous patch ("limit tags for writes and async I/O") >> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't >> log into the affected system anymore I couldn't get any stack traces, blk-mq >> debug output etc. but there was nothing in dmesg/on the console, so it >> wasn't a BUG/OOPS. >> >> -h > > Hi Holger, > if, as I guess, this problem hasn't gone away for you, I have two > requests: > 1) could you share your exact test > 2) if nothing happens in my systems with your test, would you be > willing to retry with the dev version of bfq? It should be able to > tell us what leads to your hang. If you are willing to do this test, > I'll prepare a branch with everything already configured for you. Hi, thanks for following up but there's no need for any of that; it turned out to be something else since I got the same hang without those patches at least once (during a btrfs balance, even though it didn't look like btrfs' fault directly; more like block/mm/helpers). So on January 7 I posted to linux-block et al. where I said "So this turned out to be something else, sorry for the false alarm." but apparently that didn't make it through since it's not in the archives either. Sorry. Long story short, the good news is that I've been running with both patches since then without any issue. :) cheers Holger
Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436
> On 12 Jan 2018, at 05:18, Ming Lei wrote:
>
> On Thu, Jan 11, 2018 at 08:40:54AM -0700, Jens Axboe wrote:
>> On 1/11/18 2:41 AM, Paolo Valente wrote:
>>>
>>> On 11 Jan 2018, at 10:30, Paolo Valente wrote:
>
> On 10 Jan 2018, at 05:58, Ming Lei wrote:
>
> Hi Paolo,
>
> Looks like this one was introduced in a recent merge, and it is
> triggered by a test of I/O vs. removing the device on the latest
> for-next of the block tree:
>
> [  296.151615] BUG: unable to handle kernel NULL pointer dereference at 0436
> [  296.152302] IP: percpu_counter_add_batch+0x25/0x9d
> [  296.152698] PGD 0 P4D 0
> [  296.152916] Oops: [#1] PREEMPT SMP
> [  296.153233] Dumping ftrace buffer:
> [  296.153517]    (ftrace buffer empty)
> [  296.153817] Modules linked in: scsi_debug(E) null_blk(E) isofs(E)
> iTCO_wdt(E) iTCO_vendor_support(E) i2c_i801(E) i2c_core(E) lpc_ich(E)
> mfd_core(E) ip_tables(E) sr_mod(E) cdrom(E) sd_mod(E) usb_storage(E)
> ahci(E) libahci(E) libata(E) crc32c_intel(E) virtio_scsi(E)
> qemu_fw_cfg(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last
> unloaded: scsi_debug]
> [  296.156199] CPU: 2 PID: 2001 Comm: scsi_id Tainted: G E
> 4.15.0-rc7790529290ab3_my_v4.15-rc-block-for-next #4
> [  296.156961] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.9.3-1.fc25 04/01/2014
> [  296.157546] RIP: 0010:percpu_counter_add_batch+0x25/0x9d
> [  296.157920] RSP: 0018:c90001133b28 EFLAGS: 00010002
> [  296.158285] RAX: 0002 RBX: ff6e RCX: 0001
> [  296.158779] RDX: 3fff RSI: 81e95429 RDI: 81e5e64c
> [  296.159276] RBP: 0416 R08: 0001 R09: 0001
> [  296.159770] R10: c90001133aa0 R11: 88007e96caf0 R12: 3fff
> [  296.160273] R13: 0001 R14: 0001 R15: e8d0b180
> [  296.160780] FS: 7f635988af80() GS:88007cf0() knlGS:
> [  296.161325] CS: 0010 DS: ES: CR0: 80050033
> [  296.161744] CR2: 0436 CR3: 717f1005 CR4: 003606e0
> [  296.162258] DR0: DR1: DR2:
> [  296.162764] DR3: DR6: fffe0ff0 DR7: 0400
> [  296.163246] Call Trace:
> [  296.163445] bfqg_stats_update_io_add+0x35/0xdc
> [  296.163754] bfq_insert_requests+0xbb9/0xbf2
> [  296.164049] ? blk_rq_bio_prep+0x51/0x5d
> [  296.164353] ? blk_rq_append_bio+0x32/0x78
> [  296.164633] ? blk_rq_map_user_iov+0x11b/0x1a0
> [  296.164940] blk_mq_sched_insert_request+0xaf/0x12f
> [  296.165273] blk_execute_rq+0x4b/0x93
> [  296.165586] sg_io+0x236/0x38a
> [  296.165800] scsi_cmd_ioctl+0x1d4/0x381
> [  296.166065] ? seccomp_run_filters+0xee/0x12d
> [  296.166362] ? preempt_count_add+0x6d/0x8c
> [  296.166643] sd_ioctl+0xbb/0xde [sd_mod]
> [  296.166915] blkdev_ioctl+0x7f2/0x850
> [  296.167167] block_ioctl+0x39/0x3c
> [  296.167401] vfs_ioctl+0x1b/0x28
> [  296.167622] do_vfs_ioctl+0x4bc/0x542
> [  296.167877] ? syscall_trace_enter+0x164/0x261
> [  296.168180] SyS_ioctl+0x3e/0x5a
> [  296.168402] do_syscall_64+0x71/0x168
> [  296.168654] entry_SYSCALL64_slow_path+0x25/0x25
> [  296.168970] RIP: 0033:0x7f63593a1dc7
> [  296.169214] RSP: 002b:7fff87210878 EFLAGS: 0246 ORIG_RAX: 0010
> [  296.169823] RAX: ffda RBX: 7fff872108b0 RCX: 7f63593a1dc7
> [  296.170303] RDX: 7fff872108b0 RSI: 2285 RDI: 0004
> [  296.170783] RBP: 7fff87210f00 R08: 2006 R09: fe00
> [  296.171266] R10: 7fff87210910 R11: 0246 R12:
> [  296.171745] R13: 7fff872108b0 R14: 7fff872108ba R15: 7fff872112b0
> [  296.172228] Code: 5d 41 5c 41 5d c3 41 55 41 54 49 89 f5 55 53 48 89
> fd bf 01 00 00 00 41 89 d4 e8 8c 19 d4 ff 48 c7 c7 29 54 e9 81 e8 fd 3d
> ff ff <48> 8b 45 20 49 63 d4 65 8b 18 48 63 db 4c 01 eb 48 39 d3 7d 0b
> [  296.173594] RIP: percpu_counter_add_batch+0x25/0x9d RSP: c90001133b28
> [  296.174054] CR2: 0436
> [  296.174283] ---[ end trace 87b58f9235daac7e ]---
> [  296.174673] Kernel panic - not syncing: Fatal exception
> [  297.270432] Shutting down cpus with NMI
> [  297.271020] Dumping ftrace buffer:
> [  297.271262]    (ftrace buffer empty)
> [  297.271513] Kernel Offset: disabled
> [  297.271760] ---[ end Kernel panic - not syncing: Fatal exception
Re: [PATCH 0/2] blk-mq: support physical CPU hotplug
I think we also need a Cc: stable for this. The bug was introduced with
upstream commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 ("blk-mq: Create
hctx for each present CPU"), and that was even backported into 4.12 stable.

On 01/12/2018 03:53 AM, Ming Lei wrote:
> Hi,
>
> These two patches support physical CPU hotplug, so that we can make blk-mq
> scale well when a new physical CPU is added or removed; this use case
> is normal in the VM world.
>
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
>
> https://marc.info/?l=linux-block&m=151092973417143&w=2
>
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possible CPU
>
>  block/blk-mq.c        | 19 ---
>  kernel/irq/affinity.c | 30 +++---
>  2 files changed, 23 insertions(+), 26 deletions(-)
>
Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436
> On 11 Jan 2018, at 16:40, Jens Axboe wrote:
>
> On 1/11/18 2:41 AM, Paolo Valente wrote:
>>
>>
>>> On 11 Jan 2018, at 10:30, Paolo Valente wrote:
>>>
>>>
>>> On 10 Jan 2018, at 05:58, Ming Lei wrote:
>>>
>>> Hi Paolo,
>>>
>>> Looks like this one was introduced in a recent merge, and it is
>>> triggered by a test of I/O vs. removing the device on the latest
>>> for-next of the block tree:
>>>
>>> [  296.151615] BUG: unable to handle kernel NULL pointer dereference at 0436
>>> [... full oops trace snipped; identical to the one quoted in the
>>> report earlier in this thread ...]
>>>
>>> Well, then don't do it! :)
>>>
>>> Jokes aside, I've been trying to reproduce this