On Wed, Aug 09, 2017 at 05:10:01PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-09 at 12:43 -0400, Laurence Oberman wrote:
> > Your latest patch on stock upstream without Ming's latest patches is 
> > behaving for me.
> > 
> > As already mentioned, the requeue -11 and clone failure messages are 
> > gone and I am not actually seeing any soft lockups or hard lockups.
> > 
> > When Ming gets back I will work with him on his patch set and the lockups.
> > 
> > Running 10 parallel writes which easily trips into soft lockups on 
> > Ming's kernel (even with your patch) has been stable here on 4.13-RC3 
> > with your patch.
> > 
> > I will leave it running for a while now but the patch is good.
> > 
> > If it survives 4 hours I will add a Tested-by to your latest patch.
> 
> Hello Laurence,
> 
> I'm working on an additional patch that should reduce unnecessary requeuing
> even further. I will let you know when it's ready.
> 
> Additionally, please trim e-mails when replying such that e-mails do not get
> too long.

A soft lockup can still be observed easily with patch d4acf3650c7c
("block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a
quiet time"), but no hard lockup.

With the 'blk-mq-sched: improve SCSI-MQ performance' patchset applied,
a hard lockup can be observed, preceded by the following failure log:

        [  269.277653] device-mapper: multipath: blk_get_request() returned -11 - requeuing
        [  269.321244] device-mapper: multipath: blk_get_request() returned -11 - requeuing
        ...
        [  273.421688] scsi host2: SRP abort called
        [  273.444577] scsi host2: Sending SRP abort for tag 0x6007e
        [  273.673871] scsi host2: Null scmnd for RSP w/tag 0x0000000006007e received on ch 6 / QP 0x30
        ...
        [  274.372110] device-mapper: multipath: blk_get_request() returned -11 - requeuing
        [  278.658671] scsi host2: SRP abort called
        [  278.690630] scsi host2: SRP abort called
        [  278.717634] scsi host2: SRP abort called
        [  278.745629] scsi host2: SRP abort called
        [  279.083227] multipath_clone_and_map: 1092 callbacks suppressed
        ....
        [  296.210503] scsi host2: SRP reset_device called
        ....
        [  303.784287] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10

The tricky thing is that both the hard lockup and the soft lockup
share the same stack trace.

Another question: I don't understand why the request is allocated with
GFP_ATOMIC in multipath_clone_and_map(). It looks like that shouldn't
be necessary.
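
To illustrate the concern, below is a minimal userspace sketch, not
kernel code. It assumes that the -11 in the log above (-EAGAIN on most
Linux architectures) comes from a non-sleeping allocation finding no
free request. The names tag_pool, get_tag() and count_requeues() are
made up for this example, and the blocking case is only simulated by
pretending that a request completes while the caller waits.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fixed pool of request tags. */
struct tag_pool {
	int free_tags;
};

/*
 * Hypothetical allocator. With may_block == false it mimics an atomic
 * (non-sleeping) allocation and fails at once with -EAGAIN when the
 * pool is empty. With may_block == true the wait is only simulated:
 * we pretend one outstanding request completed and freed its tag.
 */
static int get_tag(struct tag_pool *pool, bool may_block)
{
	if (pool->free_tags == 0) {
		if (!may_block)
			return -EAGAIN;
		pool->free_tags++;	/* simulated completion */
	}
	pool->free_tags--;
	return 0;
}

/* Count how many of the attempts fail and would have to be requeued. */
static int count_requeues(int tags, int attempts, bool may_block)
{
	struct tag_pool pool = { .free_tags = tags };
	int requeues = 0;

	for (int i = 0; i < attempts; i++)
		if (get_tag(&pool, may_block) < 0)
			requeues++;
	return requeues;
}
```

In this toy model, every attempt made after the pool is exhausted is
requeued in the non-blocking case, while the blocking case requeues
nothing. Whether multipath_clone_and_map() is actually allowed to sleep
in that context is exactly the open question.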


--
Ming
