On Wed, 2018-02-07 at 12:09 -0800, t...@kernel.org wrote:
> Hello,
> 
> On Wed, Feb 07, 2018 at 07:03:56PM +0000, Bart Van Assche wrote:
> > I tried the above patch but already during the first iteration of the test I
> > noticed that the test hung, probably due to the following request that got 
> > stuck:
> > 
> > $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> > 00000000a98cff60 {.op=SCSI_IN, .cmd_flags=,
> >  .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
> >  .state=idle, .tag=22, .internal_tag=-1,
> >  .cmd=Synchronize Cache(10) 35 00 00 00 00 00, .retries=0,
> >  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}
> 
> I wonder how this happened. We can lose a completion when it
> races against BLK_EH_RESET_TIMER; however, the command should time out
> later because the timer is running again by then.  Maybe we actually had
> the memory barrier race that you pointed out in the other message?

Hello Tejun,

The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
I will see whether I can extract more state information through debugfs.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef4f6df0f1df..8eb2105d82b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
                __blk_mq_complete_request(req);
                break;
        case BLK_EH_RESET_TIMER:
-               /*
-                * As nothing prevents from completion happening while
-                * ->aborted_gstate is set, this may lead to ignored
-                * completions and further spurious timeouts.
-                */
-               blk_mq_rq_update_aborted_gstate(req, 0);
                blk_add_timer(req);
+               smp_wmb();
+               blk_mq_rq_update_aborted_gstate(req, 0);
                break;
        case BLK_EH_NOT_HANDLED:
                break;
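
To make the intended ordering in the hunk above explicit, here is a minimal
userspace sketch using C11 atomics. It is not kernel code: struct fake_rq,
reset_timer() and observe() are hypothetical names made up for illustration,
the release fence only stands in for smp_wmb(), and the acquire fence on the
reader side is an assumption about what a matching reader would need (a write
barrier orders nothing unless the reader also orders its loads). Whether the
real completion path has such a pairing is exactly what is in question here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct request; not the kernel's layout. */
struct fake_rq {
	_Atomic uint64_t deadline;       /* models the timer armed by blk_add_timer() */
	_Atomic uint64_t aborted_gstate; /* nonzero while the timeout path owns the request */
};

/* Timeout side: re-arm the timer first, then give up ownership. */
static void reset_timer(struct fake_rq *rq, uint64_t new_deadline)
{
	atomic_store_explicit(&rq->deadline, new_deadline, memory_order_relaxed);
	/* Stands in for smp_wmb(): orders the store above before the store below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&rq->aborted_gstate, 0, memory_order_relaxed);
}

/*
 * Observer side (assumed): returns false while the timeout path still owns
 * the request; otherwise reads the deadline, which is then guaranteed to be
 * the re-armed value.
 */
static bool observe(struct fake_rq *rq, uint64_t *deadline_out)
{
	if (atomic_load_explicit(&rq->aborted_gstate, memory_order_relaxed) != 0)
		return false;
	/* Pairs with the release fence in reset_timer(); plays the role of smp_rmb(). */
	atomic_thread_fence(memory_order_acquire);
	*deadline_out = atomic_load_explicit(&rq->deadline, memory_order_relaxed);
	return true;
}
```

With this ordering, any observer that sees aborted_gstate cleared also sees the
re-armed deadline, so a completion that is ignored before the clear is still
covered by a running timer. With the two stores in the opposite order there is
a window in which the request is no longer marked aborted but its timer has not
been re-armed yet.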


