Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
I've observed a very nasty NULL pointer dereference with NVME drives on hot-unplug [1]. While I haven't gone in the details of this change, it does fix the NULL dereference I've been seeing. Alex [1] http://lists.infradead.org/pipermail/linux-nvme/2018-April/016623.html
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
Hello, Bart. On Mon, Apr 09, 2018 at 09:30:27PM +, Bart Van Assche wrote: > On Mon, 2018-04-09 at 11:56 -0700, t...@kernel.org wrote: > > On Mon, Apr 09, 2018 at 05:03:05PM +, Bart Van Assche wrote: > > > exist today in the blk-mq timeout handling code cannot be fixed completely > > > using RCU only. > > > > I really don't think that is that complicated. Let's first confirm > > the race fix and get to narrowing / closing that window. > > Two months ago it was reported for the first time that commit 1d9bd5161ba3 > ("blk-mq: replace timeout synchronization with a RCU and generation based > scheme") introduces a regression. Since that report nobody has posted a > patch that fixes all races related to blk-mq timeout handling and that only The two patches using RCU were posted a long time ago. It was just that the repro that only you had at the time didn't work anymore so we couldn't confirm the fix. If we now have a different repro, awesome. Let's see whether the fix works. > uses RCU. If you want to continue working on this that's fine with me. But > since my opinion is that it is impossible to fix these races using RCU only > I will continue working on an alternative approach. See also "[PATCH] > blk-mq: Fix a race between resetting the timer and completion handling" > (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html). ISTR discussing that patch earlier. Didn't the RCU based fix get posted after that discussion? Thanks. -- tejun
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On Mon, 2018-04-09 at 11:56 -0700, t...@kernel.org wrote: > On Mon, Apr 09, 2018 at 05:03:05PM +, Bart Van Assche wrote: > > exist today in the blk-mq timeout handling code cannot be fixed completely > > using RCU only. > > I really don't think that is that complicated. Let's first confirm > the race fix and get to narrowing / closing that window. Hello Tejun, Two months ago it was reported for the first time that commit 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme") introduces a regression. Since that report nobody has posted a patch that fixes all races related to blk-mq timeout handling and that only uses RCU. If you want to continue working on this that's fine with me. But since my opinion is that it is impossible to fix these races using RCU only I will continue working on an alternative approach. See also "[PATCH] blk-mq: Fix a race between resetting the timer and completion handling" (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html). Thanks, Bart.
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
Hello, On Mon, Apr 09, 2018 at 05:03:05PM +, Bart Van Assche wrote: > My opinion is not only that the two patches that you posted recently do not > fix all the races that are fixed by this patch but also that the races that The race was with the path where the ownership of a timed out request is passed back to normal completion path and those two patches fix that, right? > exist today in the blk-mq timeout handling code cannot be fixed completely > using RCU only. I really don't think that is that complicated. Let's first confirm the race fix and get to narrowing / closing that window. > That race window did not exist in the legacy block layer. I don't think IIRC, it did. It was there when I first started working on libata EH years ago. Thanks. -- tejun
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On Mon, 2018-04-09 at 09:47 -0700, Tejun Heo wrote: > On Sun, Apr 08, 2018 at 10:20:38PM -0700, Bart Van Assche wrote: > > If a completion occurs after blk_mq_rq_timed_out() has reset > > rq->aborted_gstate and the request is again in flight when the timeout > > expires then a request will be completed twice: a first time by the > > timeout handler and a second time when the regular completion occurs. > > Are we still talking about the same BLK_EH_RESET_TIMER case? This can > be solved by the two patches which rcu-synchronizes the hand-over to > normal completion path, right? Hello Tejun, My opinion is not only that the two patches that you posted recently do not fix all the races that are fixed by this patch but also that the races that exist today in the blk-mq timeout handling code cannot be fixed completely using RCU only. > > Additionally, the blk-mq timeout handling code ignores completions that > > occur after blk_mq_check_expired() has been called and before > > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver > > timeout handler always returns BLK_EH_RESET_TIMER then the result will > > be that the request never terminates. > > And this is the same race window which was always there, right? I > really don't think reducing or closing this window requires full > synchronization. That race window did not exist in the legacy block layer. I don't think we should tolerate such a race window in the blk-mq code either. If a request does not get completed that leads to unkillable processes or hanging kernel code. Such issues are hard to identify when reported by users. I think we should fix these races before these cause more trouble to Linux users. Thanks, Bart.
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
Hello, Sagi. On Mon, Apr 09, 2018 at 11:37:15AM +0300, Sagi Grimberg wrote: > > >If a completion occurs after blk_mq_rq_timed_out() has reset > >rq->aborted_gstate and the request is again in flight when the timeout > >expires then a request will be completed twice: a first time by the > >timeout handler and a second time when the regular completion occurs. > > > >Additionally, the blk-mq timeout handling code ignores completions that > >occur after blk_mq_check_expired() has been called and before > >blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver > >timeout handler always returns BLK_EH_RESET_TIMER then the result will > >be that the request never terminates. > > OK, now I understand how we can complete twice. Israel, can you verify > this patch solves your double completion problem? > > Given that it is, the change log of your patches should be modified to > the original bug report it solves. > > Thread starts here: > http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html Can you please see whether the following two patches fix the problem you've been seeing? http://lkml.kernel.org/r/20180402190053.gc388...@devbig577.frc2.facebook.com http://lkml.kernel.org/r/20180402190120.gd388...@devbig577.frc2.facebook.com Thanks. -- tejun
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
Hey, Bart. On Sun, Apr 08, 2018 at 10:20:38PM -0700, Bart Van Assche wrote: > If a completion occurs after blk_mq_rq_timed_out() has reset > rq->aborted_gstate and the request is again in flight when the timeout > expires then a request will be completed twice: a first time by the > timeout handler and a second time when the regular completion occurs. Are we still talking about the same BLK_EH_RESET_TIMER case? This can be solved by the two patches which rcu-synchronizes the hand-over to normal completion path, right? > Additionally, the blk-mq timeout handling code ignores completions that > occur after blk_mq_check_expired() has been called and before > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver > timeout handler always returns BLK_EH_RESET_TIMER then the result will > be that the request never terminates. And this is the same race window which was always there, right? I really don't think reducing or closing this window requires full synchronization. Thanks. -- tejun
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On 4/9/2018 11:37 AM, Sagi Grimberg wrote: If a completion occurs after blk_mq_rq_timed_out() has reset rq->aborted_gstate and the request is again in flight when the timeout expires then a request will be completed twice: a first time by the timeout handler and a second time when the regular completion occurs. Additionally, the blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver timeout handler always returns BLK_EH_RESET_TIMER then the result will be that the request never terminates. OK, now I understand how we can complete twice. Israel, can you verify this patch solves your double completion problem? I just verified that this commit fix the double completion problem. We still need my patches to fix the original bug report. Given that it is, the change log of your patches should be modified to the original bug report it solves. Thread starts here: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On 4/9/18 8:58 AM, Bart Van Assche wrote: > On Mon, 2018-04-09 at 11:37 +0200, Christoph Hellwig wrote: >> This looks sensible, but I'm worried about taking a whole spinlock >> for every request completion, including irq disabling. However it seems >> like your new updated pattern would fit use of cmpxchg() very nicely. > > Hello Christoph, > > Thanks for the review. I had a look at the spin lock implementation on > x86 and apparently on x86 spin locks are implemented as qspinlocks > (include/asm-generic/qspinlock.h). queued_spin_lock() already uses > atomic_cmpxchg_acquire(). Are you sure that replacing the spin lock > by cmpxchg() will yield a performance improvement? It's definitely worth a shot, especially since this looks like a clear cut case for cmpxchg(). Additionally, it also kills the local irq disable. Conversion should be trivial and we can do some perf testing around that. Neither solution really makes me happy though, the prospect of either kind of synchronization cost isn't appealing at all. I'll have to ponder this a lot more, but it would be ideal if we could shift that cost to the extremely unlikely case of a timeout triggering rather than add the cost upfront to EVERY request. -- Jens Axboe
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On Mon, 2018-04-09 at 11:37 +0200, Christoph Hellwig wrote: > This looks sensible, but I'm worried about taking a whole spinlock > for every request completion, including irq disabling. However it seems > like your new updated pattern would fit use of cmpxchg() very nicely. Hello Christoph, Thanks for the review. I had a look at the spin lock implementation on x86 and apparently on x86 spin locks are implemented as qspinlocks (include/asm-generic/qspinlock.h). queued_spin_lock() already uses atomic_cmpxchg_acquire(). Are you sure that replacing the spin lock by cmpxchg() will yield a performance improvement? Thanks, Bart.
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
On Mon, 2018-04-09 at 11:37 +0300, Sagi Grimberg wrote: > > If a completion occurs after blk_mq_rq_timed_out() has reset > > rq->aborted_gstate and the request is again in flight when the timeout > > expires then a request will be completed twice: a first time by the > > timeout handler and a second time when the regular completion occurs. > > > > Additionally, the blk-mq timeout handling code ignores completions that > > occur after blk_mq_check_expired() has been called and before > > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver > > timeout handler always returns BLK_EH_RESET_TIMER then the result will > > be that the request never terminates. > > OK, now I understand how we can complete twice. Israel, can you verify > this patch solves your double completion problem? > > Given that it is, the change log of your patches should be modified to > the original bug report it solves. > > Thread starts here: > http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html Hello Sagi, I will wait until it has been verified whether or not this patch fixes the "nvme-rdma corrupts memory upon timeout" issue before adding a reference to that issue in the description of this patch. Thanks, Bart.
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
This looks sensible, but I'm worried about taking a whole spinlock for every request completion, including irq disabling. However it seems like your new updated pattern would fit use of cmpxchg() very nicely.
Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code
If a completion occurs after blk_mq_rq_timed_out() has reset rq->aborted_gstate and the request is again in flight when the timeout expires then a request will be completed twice: a first time by the timeout handler and a second time when the regular completion occurs. Additionally, the blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver timeout handler always returns BLK_EH_RESET_TIMER then the result will be that the request never terminates. OK, now I understand how we can complete twice. Israel, can you verify this patch solves your double completion problem? Given that it is, the change log of your patches should be modified to the original bug report it solves. Thread starts here: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html
[PATCH] blk-mq: Fix recently introduced races in the timeout handling code
If a completion occurs after blk_mq_rq_timed_out() has reset rq->aborted_gstate and the request is again in flight when the timeout expires then a request will be completed twice: a first time by the timeout handler and a second time when the regular completion occurs. Additionally, the blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver timeout handler always returns BLK_EH_RESET_TIMER then the result will be that the request never terminates. Since the request state can be updated from two different contexts, namely regular completion and request timeout, this race cannot be fixed with RCU synchronization only. Fix this race as follows: - Introduce a spinlock to protect the request state and deadline changes. - Use the deadline instead of the request generation to detect whether or not a request timer fired after reinitialization of a request. - Store the request state in the lowest two bits of the deadline instead of the lowest two bits of 'gstate'. - Remove all request member variables that became superfluous due to this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync. - Remove the request state information that became superfluous due to this patch, namely RQF_MQ_TIMEOUT_EXPIRED. - Remove the code that became superfluous due to this change, namely the RCU lock and unlock statements in blk_mq_complete_request() and also the synchronize_rcu() call in the timeout handler. This patch fixes the following kernel crash: BUG: unable to handle kernel NULL pointer dereference at (null) Oops: [#1] PREEMPT SMP CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: GW4.15.0-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Workqueue: kblockd blk_mq_timeout_work RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod] Call Trace: blk_mq_terminate_expired+0x42/0x80 bt_iter+0x3d/0x50 blk_mq_queue_tag_busy_iter+0xe9/0x200 blk_mq_timeout_work+0x181/0x2e0 process_one_work+0x21c/0x6d0 worker_thread+0x35/0x380 kthread+0x117/0x130 ret_from_fork+0x24/0x30 Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme") Signed-off-by: Bart Van Assche Cc: Tejun Heo Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy Cc: # v4.16 --- block/blk-core.c | 3 +- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 178 +++-- block/blk-mq.h | 25 ++- block/blk-timeout.c| 1 - block/blk.h| 4 +- include/linux/blkdev.h | 28 ++-- 7 files changed, 53 insertions(+), 187 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2623e609db4a..83c7a58e4fb3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -200,8 +200,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->start_time = jiffies; set_start_time_ns(rq); rq->part = NULL; - seqcount_init(&rq->gstate_seq); - u64_stats_init(&rq->aborted_gstate_sync); + spin_lock_init(&rq->state_lock); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 6f72413b6cab..80c7c585769f 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -345,7 +345,6 @@ static const char *const rqf_name[] = { RQF_NAME(STATS), RQF_NAME(SPECIAL_PAYLOAD), RQF_NAME(ZONE_WRITE_LOCKED), - RQF_NAME(MQ_TIMEOUT_EXPIRED), RQF_NAME(MQ_POLL_SLEPT), }; #undef RQF_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 7816d28b7219..1da16d5e5cf1 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->special = NULL; /* tag was already set */ rq->extra_len = 0; - rq->__deadline = 0; INIT_LIST_HEAD(&rq->timeout_list); rq->timeout = 0; @@ -527,8 +526,7 @@ static void __blk_mq_complete_request(struct request *rq) bool shared = false; int cpu; - WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT); - blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE); + WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE); if (rq->internal_tag != -1) blk_mq_sched_completed_request(rq); @@ -577,34 +575,26 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) *srcu_idx = srcu_read_lock(hctx->srcu); } -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate) +/** + * blk_mq_change_rq_state - atomically test and set request state + * @rq: Request pointer. + * @old: Old request state. + * @new: New request state. + */ +static bool blk_mq_change_rq_state(struct request *rq, enum mq_rq_state old, + enum mq_rq_stat