Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
Hello, Bart.

On Thu, Sep 24, 2015 at 03:54:18PM -0700, Bart Van Assche wrote:
> Sorry that I had not yet made this clear, but I agreed with the analysis
> in your two most recent e-mails. I think I have found the cause of the
> loop: for one or another reason the scsi_dh_alua driver was not loaded
> automatically. I think that caused the SCSI core to return a retryable
> error code for reads and writes sent over paths in the SCSI ALUA state
> "standby" instead of a non-retryable error code, and that in turn caused
> the dm-mpath driver to enter an infinite loop. Loading the scsi_dh_alua
> driver resolved the infinite loop. Anyway, thank you for the feedback.

Great! I think we still probably wanna fix the kill/reinit race tho.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On 09/24/2015 11:14 AM, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 11:09:33AM -0700, Bart Van Assche wrote:
>> On 09/24/2015 10:49 AM, Tejun Heo wrote:
>>> Again, that doesn't happen.
>>
>> In case anyone would be interested, the backtraces for the lockup I had
>> observed are as follows:
>
> If this is happening and it's not caused by a hung in-flight request,
> it's either percpu_ref being buggy or the aforementioned kill/reinit
> race screwing it up. percpu_ref_kill() is expected to disable
> tryget_live() in a finite amount of time regardless of concurrent
> tryget tries.

Hello Tejun,

Sorry that I had not yet made this clear, but I agreed with the analysis
in your two most recent e-mails. I think I have found the cause of the
loop: for one or another reason the scsi_dh_alua driver was not loaded
automatically. I think that caused the SCSI core to return a retryable
error code for reads and writes sent over paths in the SCSI ALUA state
"standby" instead of a non-retryable error code, and that in turn caused
the dm-mpath driver to enter an infinite loop. Loading the scsi_dh_alua
driver resolved the infinite loop. Anyway, thank you for the feedback.

Bart.
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On 09/23/2015 08:23 PM, Ming Lei wrote:
> One solution I thought of is the following patch, which depends on
> Akinobu's patch (blk-mq: fix freeze queue race
> http://marc.info/?l=linux-kernel&m=143723697010781&w=2).

Has that patch been tested against a debug kernel? The following call
trace is triggered by that patch:

WARNING: CPU: 3 PID: 200 at kernel/sched/core.c:7485 __might_sleep+0x77/0x80()
do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x63/0xf0
Modules linked in: ib_srp(O) scsi_transport_srp(O) netconsole configfs
 af_packet msr sg coretemp x86_pkg_temp_thermal crct10dif_pclmul
 crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
 ipmi_devintf ablk_helper cryptd tg3 lpc_ich microcode pcspkr libphy
 mfd_core ipmi_si ipmi_msghandler ib_ipoib rdma_ucm ib_ucm ib_uverbs
 ib_umad dm_multipath dm_mod rdma_cm ib_cm acpi_power_meter hwmon iw_cm
 processor button wmi ext4 crc16 mbcache jbd2 mlx4_ib mlx4_en ib_sa ptp
 pps_core ib_mad ib_core ib_addr sd_mod sr_mod cdrom hid_generic usbhid
 hid ahci libahci libata ehci_pci ehci_hcd mlx4_core usbcore usb_common
 scsi_mod autofs4 [last unloaded: brd]
CPU: 3 PID: 200 Comm: kworker/u16:9 Tainted: G O 4.3.0-rc2-debug+ #1
Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 [] dump_stack+0x4b/0x68
 [] warn_slowpath_common+0x86/0xc0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? prepare_to_wait_event+0x63/0xf0
 [] ? prepare_to_wait_event+0x63/0xf0
 [] __might_sleep+0x77/0x80
 [] mutex_lock_nested+0x33/0x3b0
 [] ? trace_hardirqs_on+0xd/0x10
 [] ? prepare_to_wait_event+0x95/0xf0
 [] blk_mq_queue_enter+0x17a/0x2d0
 [] ? blk_mq_queue_enter+0x33/0x2d0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_map_request+0x2e/0x330
 [] ? __lock_is_held+0x49/0x70
 [] blk_sq_make_request+0x91/0x430
 [ ... ]
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
Hello,

On Thu, Sep 24, 2015 at 11:09:33AM -0700, Bart Van Assche wrote:
> On 09/24/2015 10:49 AM, Tejun Heo wrote:
>> Again, that doesn't happen.
>
> Hello Tejun,
>
> In case anyone would be interested, the backtraces for the lockup I had
> observed are as follows:

If this is happening and it's not caused by a hung in-flight request,
it's either percpu_ref being buggy or the aforementioned kill/reinit
race screwing it up. percpu_ref_kill() is expected to disable
tryget_live() in a finite amount of time regardless of concurrent
tryget tries.

Thanks.

--
tejun
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On 09/24/2015 10:49 AM, Tejun Heo wrote:
> Again, that doesn't happen.

Hello Tejun,

In case anyone would be interested, the backtraces for the lockup I had
observed are as follows:

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
kworker/4:0     D 88045c5d5a00     0    29      2 0x
Workqueue: srp_remove srp_remove_work [ib_srp]
 88045c767c08 0086 810ba11d 88047fd15ad8
 88045c5d5a00 88045c768000 88045c768000 88041c737ab8
 880036f7bcf8 8804158b55e8 0100 88045c767c20
Call Trace:
 [] ? trace_hardirqs_on+0xd/0x10
 [] schedule+0x3a/0x90
 [] blk_mq_freeze_queue_wait+0x56/0xb0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_update_tag_set_depth+0x44/0xb0
 [] blk_mq_free_queue+0x50/0x110
 [] blk_cleanup_queue+0x148/0x240
 [] __scsi_remove_device+0x65/0xd0 [scsi_mod]
 [] scsi_forget_host+0x69/0x70 [scsi_mod]
 [] scsi_remove_host+0x82/0x130 [scsi_mod]
 [] srp_remove_work+0x90/0x1f0 [ib_srp]
 [] process_one_work+0x1d8/0x610
 [] ? process_one_work+0x14b/0x610
 [] worker_thread+0x114/0x460
 [] ? process_one_work+0x610/0x610
 [] kthread+0xf8/0x110
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200
kworker/1:2     D 88045c648000     0   264      2 0x
Workqueue: srp_remove srp_remove_work [ib_srp]
 880444d47c08 0086 810ba11d 88047fc55ad8
 88045c648000 880450a9da00 880444d48000 88042826c8d8
 880428969790 880413b27b50 0040 880444d47c20
Call Trace:
 [] ? trace_hardirqs_on+0xd/0x10
 [] schedule+0x3a/0x90
 [] blk_mq_freeze_queue_wait+0x56/0xb0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_update_tag_set_depth+0x44/0xb0
 [] blk_mq_free_queue+0x50/0x110
 [] blk_cleanup_queue+0x148/0x240
 [] __scsi_remove_device+0x65/0xd0 [scsi_mod]
 [] scsi_forget_host+0x69/0x70 [scsi_mod]
 [] scsi_remove_host+0x82/0x130 [scsi_mod]
 [] srp_remove_work+0x90/0x1f0 [ib_srp]
 [] process_one_work+0x1d8/0x610
 [] ? process_one_work+0x14b/0x610
 [] worker_thread+0x114/0x460
 [] ? process_one_work+0x610/0x610
 [] kthread+0xf8/0x110
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200
kworker/7:0     D 88045c675a00     0 27179      2 0x
Workqueue: srp_remove srp_remove_work [ib_srp]
 8803f7333c08 0086 810ba11d 88047fdd5ad8
 88045c675a00 8803f7139680 8803f7334000 880403d76e40
 88040c412408 8803fe493cf8 01c0 8803f7333c20
Call Trace:
 [] ? trace_hardirqs_on+0xd/0x10
 [] schedule+0x3a/0x90
 [] blk_mq_freeze_queue_wait+0x56/0xb0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_update_tag_set_depth+0x44/0xb0
 [] blk_mq_free_queue+0x50/0x110
 [] blk_cleanup_queue+0x148/0x240
 [] __scsi_remove_device+0x65/0xd0 [scsi_mod]
 [] scsi_forget_host+0x69/0x70 [scsi_mod]
 [] scsi_remove_host+0x82/0x130 [scsi_mod]
 [] srp_remove_work+0x90/0x1f0 [ib_srp]
 [] process_one_work+0x1d8/0x610
 [] ? process_one_work+0x14b/0x610
 [] worker_thread+0x114/0x460
 [] ? process_one_work+0x610/0x610
 [] kthread+0xf8/0x110
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200
kworker/3:0     D 88045c649680     0 21529      2 0x
Workqueue: srp_remove srp_remove_work [ib_srp]
 880392f2fc08 0086 810ba11d 88047fcd5ad8
 88045c649680 88039a72c380 880392f3 880420f2fab8
 880425896ed8 880416043cf8 00c0 880392f2fc20
Call Trace:
 [] ? trace_hardirqs_on+0xd/0x10
 [] schedule+0x3a/0x90
 [] blk_mq_freeze_queue_wait+0x56/0xb0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_update_tag_set_depth+0x44/0xb0
 [] blk_mq_free_queue+0x50/0x110
 [] blk_cleanup_queue+0x148/0x240
 [] __scsi_remove_device+0x65/0xd0 [scsi_mod]
 [] scsi_forget_host+0x69/0x70 [scsi_mod]
 [] scsi_remove_host+0x82/0x130 [scsi_mod]
 [] srp_remove_work+0x90/0x1f0 [ib_srp]
 [] process_one_work+0x1d8/0x610
 [] ? process_one_work+0x14b/0x610
 [] worker_thread+0x114/0x460
 [] ? process_one_work+0x610/0x610
 [] kthread+0xf8/0x110
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200
kworker/5:0     D 88045c64ad00     0 10950      2 0x
Workqueue: srp_remove srp_remove_work [ib_srp]
 88037111bc08 0086 810ba11d 88047fd55ad8
 88045c64ad00 880055991680 88037111c000 88041ff348d8
 880424463080 8804145155e8 0140 88037111bc20
Call Trace:
 [] ? trace_hardirqs_on+0xd/0x10
 [] schedule+0x3a/0x90
 [] blk_mq_freeze_queue_wait+0x56/0xb0
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] blk_mq_update_tag_set_depth+0x44/0xb0
 [] blk_mq_free_queue+0x50/0x110
 [] blk_cleanup_queue+0x148/0x240
 [] __scsi_remove_device+0x65/0xd0 [scsi_mod]
 [] scsi_forget_host+0x69/0x70 [scsi_mod]
 [] scsi_remove_host+0x82/0x
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
Hello, Bart.

On Thu, Sep 24, 2015 at 10:35:41AM -0700, Bart Van Assche wrote:
> My interpretation of the percpu_ref_tryget_live() implementation in
> is that the tryget operation will only fail if the refcount is in
> atomic mode and additionally the __PERCPU_REF_DEAD flag has been set.

Yeah, and percpu_ref_kill() does both.

>> Also, what do the barriers do in your patch?
>
> My intention was to guarantee that on architectures that do not provide
> the same ordering guarantees as x86 (e.g. PPC or ARM) the store and
> load operations on mq_freeze_depth and mq_usage_counter would not be
> reordered. However, it is probably safe to leave out the barrier I
> proposed to introduce in blk_mq_queue_enter() since it is acceptable
> that there is some delay in communicating mq_freeze_depth updates from
> the CPU that modified that counter to the CPU that reads that counter.

Hmmm... please don't use barriers this way. Use them only when there's
a clear requirement for an interlocking writer and reader pair. There
isn't one here. All it does is confuse people trying to read the code.

>> The only race condition that I can see there is if unfreeze and freeze
>> race each other and freeze tries to kill the ref which hasn't finished
>> reinit yet. We prolly want to put mutexes around freeze/unfreeze so
>> that they're serialized if something like that can happen (it isn't a
>> hot path to begin with).
>
> My concern is that the following could happen if mq_freeze_depth is not
> checked in the hot path of blk_mq_queue_enter():
> * mq_usage_counter >= 1 before blk_mq_freeze_queue() is called.
> * blk_mq_freeze_queue() keeps waiting forever if new requests are
>   queued faster than these requests complete.

Again, that doesn't happen.

Thanks.

--
tejun
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On 09/24/2015 09:54 AM, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 09:43:48AM -0700, Bart Van Assche wrote:
>> On 09/23/2015 08:23 PM, Ming Lei wrote:
>>> IMO, mq_freeze_depth should only be accessed in the slow path, and it
>>> looks like the race just happens during the small window between
>>> increasing 'mq_freeze_depth' and killing the percpu counter.
>>
>> Hello Ming,
>>
>> My concern is that *not* checking mq_freeze_depth in the hot path can
>> cause a livelock. If there is a software layer, e.g. multipathd, that
>> periodically submits new commands, and if these commands take time to
>> process, e.g. because the transport layer is unavailable, how can we
>> guarantee that freezing ever succeeds without checking mq_freeze_depth
>> in the hot path?
>
> I couldn't tell what the patch was trying to do from the patch
> description, so including the above prolly is a good idea. Isn't the
> above guaranteed by percpu_ref_kill() preventing new tryget_live()'s?

My interpretation of the percpu_ref_tryget_live() implementation in is
that the tryget operation will only fail if the refcount is in atomic
mode and additionally the __PERCPU_REF_DEAD flag has been set.

> Also, what do the barriers do in your patch?

My intention was to guarantee that on architectures that do not provide
the same ordering guarantees as x86 (e.g. PPC or ARM) the store and load
operations on mq_freeze_depth and mq_usage_counter would not be
reordered. However, it is probably safe to leave out the barrier I
proposed to introduce in blk_mq_queue_enter() since it is acceptable
that there is some delay in communicating mq_freeze_depth updates from
the CPU that modified that counter to the CPU that reads that counter.

> The only race condition that I can see there is if unfreeze and freeze
> race each other and freeze tries to kill the ref which hasn't finished
> reinit yet. We prolly want to put mutexes around freeze/unfreeze so
> that they're serialized if something like that can happen (it isn't a
> hot path to begin with).

My concern is that the following could happen if mq_freeze_depth is not
checked in the hot path of blk_mq_queue_enter():
* mq_usage_counter >= 1 before blk_mq_freeze_queue() is called.
* blk_mq_freeze_queue() keeps waiting forever if new requests are queued
  faster than these requests complete.

Bart.
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On Thu, Sep 24, 2015 at 09:43:48AM -0700, Bart Van Assche wrote:
> On 09/23/2015 08:23 PM, Ming Lei wrote:
>> IMO, mq_freeze_depth should only be accessed in the slow path, and it
>> looks like the race just happens during the small window between
>> increasing 'mq_freeze_depth' and killing the percpu counter.
>
> Hello Ming,
>
> My concern is that *not* checking mq_freeze_depth in the hot path can
> cause a livelock. If there is a software layer, e.g. multipathd, that
> periodically submits new commands, and if these commands take time to
> process, e.g. because the transport layer is unavailable, how can we
> guarantee that freezing ever succeeds without checking mq_freeze_depth
> in the hot path?

I couldn't tell what the patch was trying to do from the patch
description, so including the above prolly is a good idea. Isn't the
above guaranteed by percpu_ref_kill() preventing new tryget_live()'s?

Also, what do the barriers do in your patch?

The only race condition that I can see there is if unfreeze and freeze
race each other and freeze tries to kill the ref which hasn't finished
reinit yet. We prolly want to put mutexes around freeze/unfreeze so
that they're serialized if something like that can happen (it isn't a
hot path to begin with).

Thanks.

--
tejun
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On 09/23/2015 08:23 PM, Ming Lei wrote:
> IMO, mq_freeze_depth should only be accessed in the slow path, and it
> looks like the race just happens during the small window between
> increasing 'mq_freeze_depth' and killing the percpu counter.

Hello Ming,

My concern is that *not* checking mq_freeze_depth in the hot path can
cause a livelock. If there is a software layer, e.g. multipathd, that
periodically submits new commands, and if these commands take time to
process, e.g. because the transport layer is unavailable, how can we
guarantee that freezing ever succeeds without checking mq_freeze_depth
in the hot path?

Thanks,

Bart.
Re: [PATCH 3/3] blk-mq: Fix the queue freezing mechanism
On Wed, 23 Sep 2015 15:14:10 -0700
Bart Van Assche wrote:

> Ensure that blk_mq_queue_enter() waits if mq_freeze_depth is not
> zero. Ensure that the update of mq_freeze_depth by blk_mq_freeze_queue()
> is visible to all CPU cores before that function waits on
> mq_usage_counter.
>
> It is unfortunate that this patch introduces an smp_mb() in the
> hot path (blk_mq_queue_enter()) but I have not yet found a way to
> avoid this.
>
> I came across this code while analyzing a lockup triggered by
> deleting a SCSI host created by the SRP initiator immediately
> followed by a relogin.
>
> Signed-off-by: Bart Van Assche
> Cc: Christoph Hellwig
> Cc: Tejun Heo
> Cc:
> ---
>  block/blk-mq.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2077f0d..e3ad411 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -83,8 +83,13 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
>  	while (true) {
>  		int ret;
>
> -		if (percpu_ref_tryget_live(&q->mq_usage_counter))
> -			return 0;
> +		if (percpu_ref_tryget_live(&q->mq_usage_counter)) {
> +			/* Order mq_usage_counter and mq_freeze_depth accesses */
> +			smp_mb();
> +			if (!atomic_read(&q->mq_freeze_depth))
> +				return 0;
> +			percpu_ref_put(&q->mq_usage_counter);
> +		}

IMO, mq_freeze_depth should only be accessed in the slow path, and it
looks like the race just happens during the small window between
increasing 'mq_freeze_depth' and killing the percpu counter.

One solution I thought of is the following patch, which depends on
Akinobu's patch (blk-mq: fix freeze queue race
http://marc.info/?l=linux-kernel&m=143723697010781&w=2).

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f774f67..1c71c04 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,6 +77,17 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
+static inline int blk_mq_read_freeze_depth(struct request_queue *q)
+{
+	int depth;
+
+	mutex_lock(&q->mq_freeze_lock);
+	depth = q->mq_freeze_depth;
+	mutex_unlock(&q->mq_freeze_lock);
+
+	return depth;
+}
+
 static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
 {
 	while (true) {
@@ -89,7 +100,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
+				!blk_mq_read_freeze_depth(q) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
@@ -113,12 +124,9 @@ static void blk_mq_usage_counter_release(struct percpu_ref *ref)
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
-	int freeze_depth;
-
 	mutex_lock(&q->mq_freeze_lock);
-	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
-	if (freeze_depth == 1) {
+	if (!q->mq_freeze_depth++) {
 		percpu_ref_kill(&q->mq_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
@@ -149,7 +157,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	mutex_lock(&q->mq_freeze_lock);
-	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
+	freeze_depth = --q->mq_freeze_depth;
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->mq_usage_counter);
@@ -2084,7 +2092,7 @@ void blk_mq_free_queue(struct request_queue *q)
 /* Basically redo blk_mq_init_queue with queue frozen */
 static void blk_mq_queue_reinit(struct request_queue *q)
 {
-	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+	WARN_ON_ONCE(!ACCESS_ONCE(q->mq_freeze_depth));
 
 	blk_mq_sysfs_unregister(q);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6cdf2b7..86fedcc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -436,7 +436,7 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
-	atomic_t		mq_freeze_depth;
+	int			mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;

>
>  	if (!(gfp & __GFP_WAIT))
>  		return -EBUSY;
> @@ -136,6 +141,11 @@ static void blk_mq_freeze_queue_wait(struct request_queue *q)
>
>  void blk_mq_freeze_queue(struct request_queue *q)
>  {
>  	blk_mq_freeze_queue_start(q);
> +	/*
> +	 * Ensure that the mq_freeze_depth update is visible before
> +	 * mq_usage_counter is read.
> +	 */
> +	smp_mb();
>  	blk_mq_freeze_queue_wait(q);
>  }