Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-06 Thread Holger Hoffstätte
On 12/05/17 08:52, Ming Lei wrote:
> Before commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget
> for blk-mq"), we run queue after 3ms if queue is idle and SCSI device
> queue isn't ready, which is done in handling BLK_STS_RESOURCE. After
> commit 0df21c86bdbf is introduced, queue won't be run any more under
> this situation.
> 
> IO hang is observed when timeout happened, and this patch fixes the IO
> hang issue by running queue after delay in scsi_dev_queue_ready, just like
> non-mq. This issue can be triggered by the following script[1].
> 
> There is another issue which can be covered by running idle queue:
> when .get_budget() is called on request coming from hctx->dispatch_list,
> if one request just completes during .get_budget(), we can't depend on
> SCSI's restart to make progress any more. This patch fixes the race too.
> 
> With this patch, we basically recover to previous behaviour(before commit
> 0df21c86bdbf) of handling idle queue when running out of resource.
> 
> [1] script for test/verify SCSI timeout
> rmmod scsi_debug
> modprobe scsi_debug max_queue=1
> 
> DEVICE=`ls -d 
> /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 
> | xargs basename`
> DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`
> 
> echo "using scsi device $DEVICE"
> echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
> echo "temporary write through" >$DISK_DIR/cache_type
> echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
> echo none > /sys/block/$DEVICE/queue/scheduler
> dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
> sleep 5
> echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
> wait
> echo "SUCCESS"
> 
> Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index db9556662e27..1816dd8259b3 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx 
> *hctx)
>  out_put_device:
>   put_device(>sdev_gendev);
>  out:
> + if (atomic_read(>device_busy) == 0 && !scsi_device_blocked(sdev))
> + blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>   return false;
>  }

So just to follow up on this: with this patch I haven't encountered any
new hangs with blk-mq, regardless of medium (SSD/rotating disk) or scheduler.
I cannot speak for other hangs that may be reproducible by other means,
but for now here's my:

Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>

cheers,
Holger


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-05 Thread Holger Hoffstätte
On 12/05/17 06:16, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 11:48:07PM +0000, Holger Hoffstätte wrote:
>> On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:
>>
>>> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
>>>> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
>>>>> Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
>>>>> blk-mq")
>>>>
>>>> It might be safer to revert commit 0df21c86bdbf instead of trying to fix 
>>>> all
>>>> issues introduced by that commit for kernel version v4.15 ...
>>>
>>> What are all issues in v4.15-rc? Up to now, it is the only issue reported,
>>> and can be fixed by this simple patch, which one can be thought as cleanup
>>> too.
>>
>> Even with this patch I've encountered at least one hang that
>> seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
>> the hang in question was on a rotating disk. It could be solved by activating
>> a different scheduler on the hanging device; all hanging sync/df processes 
>> got
>> unstuck and all was fine again, which leads me to believe that there is at 
>> least
>> one more rare condition where delaying requests (as done in the budget patch)
>> leads to a hang.
>>
>> This happened with mq-deadline which I was testing specifically to avoid
>> any BFQ-related side effects.
> 
> OK, this looks a new report.
> 
> Without any log, we can't make any progress, and even we can't guess
> what the issue is related with.

Considering that you just had an idea about a corner case and posted v2
of the patch, it's safe to say that we actually can...which is why I
described the situation exactly the way I did. :)

I did try to get stack traces but all the hanging processes were
simply stuck on the device mutex (deep inside btrfs), so nothing too
helpful.

> Could you post your dmesg log(include the hang process stack trace)? And
> dump the debugfs log by the following script when this hang happens?
> 
>   http://people.redhat.com/minlei/tests/tools/dump-blk-info
> 
> BTW, you just need to pass the disk name to the script, such as: /dev/sda.

Thanks for the script. I'm now running with the new patch and will see what
happens.

cheers,
Holger


Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-05 Thread Holger Hoffstätte
On Tue, 05 Dec 2017 06:45:08 +0800, Ming Lei wrote:

> On Mon, Dec 04, 2017 at 03:09:20PM +, Bart Van Assche wrote:
>> On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
>> > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for 
>> > blk-mq")
>> 
>> It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
>> issues introduced by that commit for kernel version v4.15 ...
> 
> What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> and can be fixed by this simple patch, which one can be thought as cleanup
> too.

Even with this patch I've encountered at least one hang that
seemed related. I'm using most of block/scsi-4.15 on top of 4.14 and
the hang in question was on a rotating disk. It could be solved by activating
a different scheduler on the hanging device; all hanging sync/df processes got
unstuck and all was fine again, which leads me to believe that there is at least
one more rare condition where delaying requests (as done in the budget patch)
leads to a hang.

This happened with mq-deadline which I was testing specifically to avoid
any BFQ-related side effects.

I didn't do anything specific to trigger the hang and have not been able
to reproduce it during regular usage.

-h