Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Ming Lei
On Wed, Nov 08, 2017 at 10:57:23AM -0700, Jens Axboe wrote:
> On 11/08/2017 09:41 AM, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> >> At this point, I have no idea what Bart's setup looks like. Bart, it
> >> would be REALLY helpful if you could tell us how you are reproducing
> >> your hang. I don't know why this has to be dragged out.
> > 
> > Hello Jens,
> > 
> > It is a disappointment to me that you have allowed Ming to evaluate other
> > approaches than reverting "blk-mq: don't handle TAG_SHARED in restart". That
> > patch namely replaces an algorithm that is trusted by the community with an
> > algorithm of which even Ming acknowledged that it is racy. A quote from [1]:
> > "IO hang may be caused if all requests are completed just before the current
> > SCSI device is added to shost->starved_list". I don't know of any way to fix
> > that race other than serializing request submission and completion by adding
> > locking around these actions, which is something we don't want. Hence my
> > request to revert that patch.
> 
> I was reluctant to revert it, in case we could work out a better way of
> doing it. As I mentioned in the other replies, it's not exactly the
> prettiest or most efficient. However, since we currently don't have
> a good solution for the issue, I'm fine with reverting that patch.
> 
> > Regarding the test I run, here is a summary of what I mentioned in previous
> > e-mails:
> > * I modified the SRP initiator such that the SCSI target queue depth is
> >   reduced to one by setting starget->can_queue to 1 from inside
> >   scsi_host_template.target_alloc.
> > * With that modified SRP initiator I run the srp-test software as follows
> >   until something breaks:
> >   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
> 
> What kernel options are needed? Where do I download everything I need?
> 
> In other words, would it be possible to do a fuller guide for getting
> this setup and running?
> 
> I'll run my simple test case as well, since it's currently breaking
> basically everywhere.
> 
> > Today a system with at least one InfiniBand HCA is required to run that 
> > test.
> > When I have the time I will post the SRP initiator and target patches on the
> > linux-rdma mailing list that make it possible to run that test against the
> > SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
> > use that driver is an Ethernet adapter.
> 
> OK, I guess I can't run it then... I'll have to rely on your testing.

Even if we can't run it ourselves, just posting the log from 'tags' or
'sched_tags' when this IO hang happens should tell us a lot more.

Unfortunately we still haven't seen such logs up to now, :-(

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Ming Lei
On Wed, Nov 08, 2017 at 04:41:35PM +0000, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> > At this point, I have no idea what Bart's setup looks like. Bart, it
> > would be REALLY helpful if you could tell us how you are reproducing
> > your hang. I don't know why this has to be dragged out.
> 
> Hello Jens,
> 
> It is a disappointment to me that you have allowed Ming to evaluate other
> approaches than reverting "blk-mq: don't handle TAG_SHARED in restart". That

I have already mentioned in another email to Jens that I agree to revert that
patch, because of the TAG_WAITING issue in Jens's test case.

> patch namely replaces an algorithm that is trusted by the community with an
> algorithm of which even Ming acknowledged that it is racy. A quote from [1]:
> "IO hang may be caused if all requests are completed just before the current
> SCSI device is added to shost->starved_list". I don't know of any way to fix
> that race other than serializing request submission and completion by adding
> locking around these actions, which is something we don't want. Hence my
> request to revert that patch.

That can't be the reason for this revert.

This issue[1] is fixed by '[PATCH] SCSI: don't get target/host busy_count in
scsi_mq_get_budget()', which follows this idea (a rough sketch is given after
the list):

- we add the sdev to shost->starved_list in scsi_target_queue_ready(),
  and BLK_STS_RESOURCE is set as the return value

- atomic_read(&sdev->device_busy) is checked to see if there are any
  pending requests; the queue is run if it is zero, otherwise we depend
  on scsi_end_request() of a pending request to restart the queue

- you may point out that sdev->device_busy can become 0 just after that
  check, but in that case the completion still sees the sdev in
  shost->starved_list and does the restart, so there is no IO hang
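
To make the bullets above concrete, here is the promised rough sketch of the
dispatch-side flow, written in the style of drivers/scsi/scsi_lib.c. It is
only an illustration of the idea, not the actual patch; the function name and
the target_and_host_ready() helper are made up, standing in for the real
scsi_target_queue_ready()/scsi_host_queue_ready() checks:

	/* Illustrative sketch only, not the real patch. */
	static blk_status_t sdev_dispatch_sketch(struct scsi_device *sdev)
	{
		struct Scsi_Host *shost = sdev->host;
		unsigned long flags;

		if (target_and_host_ready(shost, sdev))	/* hypothetical helper */
			return BLK_STS_OK;		/* dispatch proceeds */

		/* Busy: put this device on the host's starved list first. */
		spin_lock_irqsave(shost->host_lock, flags);
		if (list_empty(&sdev->starved_entry))
			list_add_tail(&sdev->starved_entry, &shost->starved_list);
		spin_unlock_irqrestore(shost->host_lock, flags);

		/*
		 * If nothing is in flight on this device, no completion will
		 * come along to restart the queue, so run it ourselves.  If a
		 * request is still in flight, its scsi_end_request() will scan
		 * shost->starved_list (where this device is now listed) and
		 * restart the queue, even if device_busy drops to zero right
		 * after this check.
		 */
		if (!atomic_read(&sdev->device_busy))
			blk_mq_run_hw_queues(sdev->request_queue, true);

		return BLK_STS_RESOURCE;
	}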

If you think something above is wrong, please comment on it directly.
Without this patch, and without any out-of-tree patch, an IO hang can be
triggered in test 01 of srp-test. After this patch is applied on
v4.14-rc4, no IO hang can be observed any more.

> 
> Regarding the test I run, here is a summary of what I mentioned in previous
> e-mails:
> * I modified the SRP initiator such that the SCSI target queue depth is
>   reduced to one by setting starget->can_queue to 1 from inside
>   scsi_host_template.target_alloc.
> * With that modified SRP initiator I run the srp-test software as follows
>   until something breaks:
>   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
> 
> Today a system with at least one InfiniBand HCA is required to run that test.
> When I have the time I will post the SRP initiator and target patches on the
> linux-rdma mailing list that make it possible to run that test against the
> SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
> use that driver is an Ethernet adapter.

The thing is that we still don't know the root cause of your issue, and
keeping the restart for TAG_SHARED can only be thought of as a workaround.
Maybe it is the same issue Jens hit, maybe something else; we don't know,
especially since no log, such as sched_tags or tags, has been provided.

It is easy to see a >20% IOPS drop with the TAG_SHARED restart in an
8-LUN scsi_debug test.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Jens Axboe
On 11/08/2017 11:22 AM, Laurence Oberman wrote:
> On Wed, 2017-11-08 at 10:57 -0700, Jens Axboe wrote:
>> On 11/08/2017 09:41 AM, Bart Van Assche wrote:
>>> On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
 At this point, I have no idea what Bart's setup looks like. Bart,
 it
 would be REALLY helpful if you could tell us how you are
 reproducing
 your hang. I don't know why this has to be dragged out.
>>>
>>> Hello Jens,
>>>
>>> It is a disappointment to me that you have allowed Ming to evaluate
>>> other
>>> approaches than reverting "blk-mq: don't handle TAG_SHARED in
>>> restart". That
>>> patch namely replaces an algorithm that is trusted by the community
>>> with an
>>> algorithm of which even Ming acknowledged that it is racy. A quote
>>> from [1]:
>>> "IO hang may be caused if all requests are completed just before
>>> the current
>>> SCSI device is added to shost->starved_list". I don't know of any
>>> way to fix
>>> that race other than serializing request submission and completion
>>> by adding
>>> locking around these actions, which is something we don't want.
>>> Hence my
>>> request to revert that patch.
>>
>> I was reluctant to revert it, in case we could work out a better way
>> of
>> doing it. As I mentioned in the other replies, it's not exactly the
>> prettiest or most efficient. However, since we currently don't have
>> a good solution for the issue, I'm fine with reverting that patch.
>>
>>> Regarding the test I run, here is a summary of what I mentioned in
>>> previous
>>> e-mails:
>>> * I modified the SRP initiator such that the SCSI target queue
>>> depth is
>>>   reduced to one by setting starget->can_queue to 1 from inside
>>>   scsi_host_template.target_alloc.
>>> * With that modified SRP initiator I run the srp-test software as
>>> follows
>>>   until something breaks:
>>>   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
>>
>> What kernel options are needed? Where do I download everything I
>> need?
>>
>> In other words, would it be possible to do a fuller guide for getting
>> this setup and running?
>>
>> I'll run my simple test case as well, since it's currently breaking
>> basically everywhere.
>>
>>> Today a system with at least one InfiniBand HCA is required to run
>>> that test.
>>> When I have the time I will post the SRP initiator and target
>>> patches on the
>>> linux-rdma mailing list that make it possible to run that test
>>> against the
>>> SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware
>>> required to
>>> use that driver is an Ethernet adapter.
>>
>> OK, I guess I can't run it then... I'll have to rely on your testing.
> 
> Hello 
> 
> I agree with Bart in this case, we should revert this.
> My test-bed is tied up and I have not been able to give it back to Ming
> so he could follow up on Bart's last update.
> 
> Right now its safer to revert.

I had already reverted it when sending out that email, so we should be
all set (hopefully).

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Laurence Oberman
On Wed, 2017-11-08 at 10:57 -0700, Jens Axboe wrote:
> On 11/08/2017 09:41 AM, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> > > At this point, I have no idea what Bart's setup looks like. Bart,
> > > it
> > > would be REALLY helpful if you could tell us how you are
> > > reproducing
> > > your hang. I don't know why this has to be dragged out.
> > 
> > Hello Jens,
> > 
> > It is a disappointment to me that you have allowed Ming to evaluate
> > other
> > approaches than reverting "blk-mq: don't handle TAG_SHARED in
> > restart". That
> > patch namely replaces an algorithm that is trusted by the community
> > with an
> > algorithm of which even Ming acknowledged that it is racy. A quote
> > from [1]:
> > "IO hang may be caused if all requests are completed just before
> > the current
> > SCSI device is added to shost->starved_list". I don't know of any
> > way to fix
> > that race other than serializing request submission and completion
> > by adding
> > locking around these actions, which is something we don't want.
> > Hence my
> > request to revert that patch.
> 
> I was reluctant to revert it, in case we could work out a better way
> of
> doing it. As I mentioned in the other replies, it's not exactly the
> prettiest or most efficient. However, since we currently don't have
> a good solution for the issue, I'm fine with reverting that patch.
> 
> > Regarding the test I run, here is a summary of what I mentioned in
> > previous
> > e-mails:
> > * I modified the SRP initiator such that the SCSI target queue
> > depth is
> >   reduced to one by setting starget->can_queue to 1 from inside
> >   scsi_host_template.target_alloc.
> > * With that modified SRP initiator I run the srp-test software as
> > follows
> >   until something breaks:
> >   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
> 
> What kernel options are needed? Where do I download everything I
> need?
> 
> In other words, would it be possible to do a fuller guide for getting
> this setup and running?
> 
> I'll run my simple test case as well, since it's currently breaking
> basically everywhere.
> 
> > Today a system with at least one InfiniBand HCA is required to run
> > that test.
> > When I have the time I will post the SRP initiator and target
> > patches on the
> > linux-rdma mailing list that make it possible to run that test
> > against the
> > SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware
> > required to
> > use that driver is an Ethernet adapter.
> 
> OK, I guess I can't run it then... I'll have to rely on your testing.

Hello 

I agree with Bart in this case; we should revert this.
My test-bed is tied up and I have not been able to give it back to Ming
so he could follow up on Bart's last update.

Right now it's safer to revert.

Thanks
Laurence


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Jens Axboe
On 11/08/2017 08:59 AM, Ming Lei wrote:
> On Wed, Nov 08, 2017 at 02:20:41PM +0800, Ming Lei wrote:
>> On Tue, Nov 07, 2017 at 08:17:59PM -0700, Jens Axboe wrote:
>>> On 11/07/2017 08:12 PM, Ming Lei wrote:
 On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
> On 11/07/2017 06:03 PM, Ming Lei wrote:
>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
>>> On 11/07/2017 10:36 AM, Jens Axboe wrote:
 On 11/07/2017 10:10 AM, Jens Axboe wrote:
> On 11/07/2017 09:29 AM, Jens Axboe wrote:
>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
 If you can reproduce, please provide me at least the following log
 first:

find /sys/kernel/debug/block -name tags | xargs cat | grep busy

 If any pending requests arn't completed, please provide the related
 info in dbgfs about where is the request.
>>>
>>> Every time I ran the above or a similar command its output was 
>>> empty. I
>>> assume that's because the hang usually occurs in a phase where 
>>> these debugfs
>>> attributes either have not yet been created or have already 
>>> disappeared.
>>
>> Bart, do you still see a hang with the patch that fixes the tag leak 
>> when
>> we fail to get a dispatch budget?
>>
>> https://marc.info/?l=linux-block&m=151004881411480&w=2
>>
>> I've run a lot of stability testing here, and I haven't run into any
>> issues. This is with shared tags as well. So if you still see the 
>> failure
>> case with the current tree AND the above patch, then I'll try and get
>> a test case setup that hits it too so we can get to the bottom of 
>> this.
>
> Ming, I managed to reproduce the hang using null_blk. Note this is
> WITHOUT the patch mentioned above, running with that now.
>
> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
> submit_queues=1 hw_queue_depth=1
>
> and using this fio file:
>
> [global]
> bs=4k
> rw=randread
> norandommap
> direct=1
> ioengine=libaio
> iodepth=4
>
> [nullb0]
> filename=/dev/nullb0
> [nullb1]
> filename=/dev/nullb1
> [nullb2]
> filename=/dev/nullb2
> [nullb3]
> filename=/dev/nullb3
>
> it seemed to keep running, but it hung when exiting. The troublesome
> device was nullb1:
>
> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
> [  492.520782]   Not tainted 4.14.0-rc7+ #499
> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [  492.535904] fio D13208  3263   3211 0x
> [  492.542535] Call Trace:
> [  492.545764]  __schedule+0x279/0x720
> [  492.550151]  schedule+0x33/0x90
> [  492.554145]  io_schedule+0x16/0x40
> [  492.558435]  blk_mq_get_tag+0x148/0x250
> [  492.563211]  ? finish_wait+0x90/0x90
> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
> [  492.572760]  blk_mq_make_request+0xe2/0x690
> [  492.577913]  generic_make_request+0xfc/0x2f0
> [  492.583177]  submit_bio+0x64/0x120
> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
> [  492.592736]  ? submit_bio+0x64/0x120
> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
> [  492.623227]  generic_file_read_iter+0xb3/0xa00
> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
> [  492.634334]  ? security_file_permission+0x9b/0xc0
> [  492.640114]  blkdev_read_iter+0x35/0x40
> [  492.644877]  aio_read+0xc5/0x120
> [  492.648973]  ? aio_read_events+0x24c/0x340
> [  492.654124]  ? __might_sleep+0x4a/0x80
> [  492.658800]  do_io_submit+0x47c/0x5e0
> [  492.663373]  ? do_io_submit+0x47c/0x5e0
> [  492.668234]  SyS_io_submit+0x10/0x20
> [  492.672715]  ? SyS_io_submit+0x10/0x20
> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
> [  492.683039] RIP: 0033:0x7f83d1871717
> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
> 00d1
> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
> 7f83d1871717
> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
> 7f83e4d36000
> [  492.713889] RBP: 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Jens Axboe
On 11/08/2017 09:41 AM, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
>> At this point, I have no idea what Bart's setup looks like. Bart, it
>> would be REALLY helpful if you could tell us how you are reproducing
>> your hang. I don't know why this has to be dragged out.
> 
> Hello Jens,
> 
> It is a disappointment to me that you have allowed Ming to evaluate other
> approaches than reverting "blk-mq: don't handle TAG_SHARED in restart". That
> patch namely replaces an algorithm that is trusted by the community with an
> algorithm of which even Ming acknowledged that it is racy. A quote from [1]:
> "IO hang may be caused if all requests are completed just before the current
> SCSI device is added to shost->starved_list". I don't know of any way to fix
> that race other than serializing request submission and completion by adding
> locking around these actions, which is something we don't want. Hence my
> request to revert that patch.

I was reluctant to revert it, in case we could work out a better way of
doing it. As I mentioned in the other replies, it's not exactly the
prettiest or most efficient. However, since we currently don't have
a good solution for the issue, I'm fine with reverting that patch.

> Regarding the test I run, here is a summary of what I mentioned in previous
> e-mails:
> * I modified the SRP initiator such that the SCSI target queue depth is
>   reduced to one by setting starget->can_queue to 1 from inside
>   scsi_host_template.target_alloc.
> * With that modified SRP initiator I run the srp-test software as follows
>   until something breaks:
>   while ./run_tests -f xfs -d -e deadline -r 60; do :; done

What kernel options are needed? Where do I download everything I need?

In other words, would it be possible to do a fuller guide for getting
this setup and running?

I'll run my simple test case as well, since it's currently breaking
basically everywhere.

> Today a system with at least one InfiniBand HCA is required to run that test.
> When I have the time I will post the SRP initiator and target patches on the
> linux-rdma mailing list that make it possible to run that test against the
> SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
> use that driver is an Ethernet adapter.

OK, I guess I can't run it then... I'll have to rely on your testing.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Bart Van Assche
On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> At this point, I have no idea what Bart's setup looks like. Bart, it
> would be REALLY helpful if you could tell us how you are reproducing
> your hang. I don't know why this has to be dragged out.

Hello Jens,

It is a disappointment to me that you have allowed Ming to evaluate other
approaches than reverting "blk-mq: don't handle TAG_SHARED in restart". That
patch namely replaces an algorithm that is trusted by the community with an
algorithm of which even Ming acknowledged that it is racy. A quote from [1]:
"IO hang may be caused if all requests are completed just before the current
SCSI device is added to shost->starved_list". I don't know of any way to fix
that race other than serializing request submission and completion by adding
locking around these actions, which is something we don't want. Hence my
request to revert that patch.

Regarding the test I run, here is a summary of what I mentioned in previous
e-mails:
* I modified the SRP initiator such that the SCSI target queue depth is
  reduced to one by setting starget->can_queue to 1 from inside
  scsi_host_template.target_alloc (see the sketch after this list).
* With that modified SRP initiator I run the srp-test software as follows
  until something breaks:
  while ./run_tests -f xfs -d -e deadline -r 60; do :; done
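
For reference, the first change above amounts to something like the sketch
below. This is not Bart's actual patch, only an illustration: the function
and template names are made up, standing in for the SRP initiator's real
scsi_host_template hook-up.

	/* Illustrative sketch of the out-of-tree test change described above. */
	static int srp_target_alloc_sketch(struct scsi_target *starget)
	{
		/*
		 * Force a queue depth of one per SCSI target so that the
		 * BLK_STS_RESOURCE and queue-restart paths get exercised as
		 * heavily as possible.
		 */
		starget->can_queue = 1;
		return 0;
	}

	static struct scsi_host_template srp_sht_sketch = {
		/* ... other fields as in the real driver ... */
		.target_alloc	= srp_target_alloc_sketch,
	};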

Today a system with at least one InfiniBand HCA is required to run that test.
When I have the time I will post the SRP initiator and target patches on the
linux-rdma mailing list that make it possible to run that test against the
SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
use that driver is an Ethernet adapter.

Bart.

[1] [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg15263.html).

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Ming Lei
On Wed, Nov 08, 2017 at 02:20:41PM +0800, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 08:17:59PM -0700, Jens Axboe wrote:
> > On 11/07/2017 08:12 PM, Ming Lei wrote:
> > > On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
> > >> On 11/07/2017 06:03 PM, Ming Lei wrote:
> > >>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
> >  On 11/07/2017 10:36 AM, Jens Axboe wrote:
> > > On 11/07/2017 10:10 AM, Jens Axboe wrote:
> > >> On 11/07/2017 09:29 AM, Jens Axboe wrote:
> > >>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> >  On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > > If you can reproduce, please provide me at least the following log
> > > first:
> > >
> > >   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> > >
> > > If any pending requests arn't completed, please provide the 
> > > related
> > > info in dbgfs about where is the request.
> > 
> >  Every time I ran the above or a similar command its output was 
> >  empty. I
> >  assume that's because the hang usually occurs in a phase where 
> >  these debugfs
> >  attributes either have not yet been created or have already 
> >  disappeared.
> > >>>
> > >>> Bart, do you still see a hang with the patch that fixes the tag 
> > >>> leak when
> > >>> we fail to get a dispatch budget?
> > >>>
> > >>> https://marc.info/?l=linux-block&m=151004881411480&w=2
> > >>>
> > >>> I've run a lot of stability testing here, and I haven't run into any
> > >>> issues. This is with shared tags as well. So if you still see the 
> > >>> failure
> > >>> case with the current tree AND the above patch, then I'll try and 
> > >>> get
> > >>> a test case setup that hits it too so we can get to the bottom of 
> > >>> this.
> > >>
> > >> Ming, I managed to reproduce the hang using null_blk. Note this is
> > >> WITHOUT the patch mentioned above, running with that now.
> > >>
> > >> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
> > >> submit_queues=1 hw_queue_depth=1
> > >>
> > >> and using this fio file:
> > >>
> > >> [global]
> > >> bs=4k
> > >> rw=randread
> > >> norandommap
> > >> direct=1
> > >> ioengine=libaio
> > >> iodepth=4
> > >>
> > >> [nullb0]
> > >> filename=/dev/nullb0
> > >> [nullb1]
> > >> filename=/dev/nullb1
> > >> [nullb2]
> > >> filename=/dev/nullb2
> > >> [nullb3]
> > >> filename=/dev/nullb3
> > >>
> > >> it seemed to keep running, but it hung when exiting. The troublesome
> > >> device was nullb1:
> > >>
> > >> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
> > >> [  492.520782]   Not tainted 4.14.0-rc7+ #499
> > >> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > >> disables this message.
> > >> [  492.535904] fio D13208  3263   3211 0x
> > >> [  492.542535] Call Trace:
> > >> [  492.545764]  __schedule+0x279/0x720
> > >> [  492.550151]  schedule+0x33/0x90
> > >> [  492.554145]  io_schedule+0x16/0x40
> > >> [  492.558435]  blk_mq_get_tag+0x148/0x250
> > >> [  492.563211]  ? finish_wait+0x90/0x90
> > >> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
> > >> [  492.572760]  blk_mq_make_request+0xe2/0x690
> > >> [  492.577913]  generic_make_request+0xfc/0x2f0
> > >> [  492.583177]  submit_bio+0x64/0x120
> > >> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
> > >> [  492.592736]  ? submit_bio+0x64/0x120
> > >> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
> > >> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
> > >> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
> > >> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
> > >> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
> > >> [  492.623227]  generic_file_read_iter+0xb3/0xa00
> > >> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
> > >> [  492.634334]  ? security_file_permission+0x9b/0xc0
> > >> [  492.640114]  blkdev_read_iter+0x35/0x40
> > >> [  492.644877]  aio_read+0xc5/0x120
> > >> [  492.648973]  ? aio_read_events+0x24c/0x340
> > >> [  492.654124]  ? __might_sleep+0x4a/0x80
> > >> [  492.658800]  do_io_submit+0x47c/0x5e0
> > >> [  492.663373]  ? do_io_submit+0x47c/0x5e0
> > >> [  492.668234]  SyS_io_submit+0x10/0x20
> > >> [  492.672715]  ? SyS_io_submit+0x10/0x20
> > >> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
> > >> [  492.683039] RIP: 0033:0x7f83d1871717
> > >> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
> > >> 00d1
> > >> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
> > >> 7f83d1871717
> > >> [  492.705423] RDX: 01f41418 RSI: 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 08:17:59PM -0700, Jens Axboe wrote:
> On 11/07/2017 08:12 PM, Ming Lei wrote:
> > On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
> >> On 11/07/2017 06:03 PM, Ming Lei wrote:
> >>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
>  On 11/07/2017 10:36 AM, Jens Axboe wrote:
> > On 11/07/2017 10:10 AM, Jens Axboe wrote:
> >> On 11/07/2017 09:29 AM, Jens Axboe wrote:
> >>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>  On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > If you can reproduce, please provide me at least the following log
> > first:
> >
> > find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> >
> > If any pending requests arn't completed, please provide the related
> > info in dbgfs about where is the request.
> 
>  Every time I ran the above or a similar command its output was 
>  empty. I
>  assume that's because the hang usually occurs in a phase where these 
>  debugfs
>  attributes either have not yet been created or have already 
>  disappeared.
> >>>
> >>> Bart, do you still see a hang with the patch that fixes the tag leak 
> >>> when
> >>> we fail to get a dispatch budget?
> >>>
> >>> https://marc.info/?l=linux-block&m=151004881411480&w=2
> >>>
> >>> I've run a lot of stability testing here, and I haven't run into any
> >>> issues. This is with shared tags as well. So if you still see the 
> >>> failure
> >>> case with the current tree AND the above patch, then I'll try and get
> >>> a test case setup that hits it too so we can get to the bottom of 
> >>> this.
> >>
> >> Ming, I managed to reproduce the hang using null_blk. Note this is
> >> WITHOUT the patch mentioned above, running with that now.
> >>
> >> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
> >> submit_queues=1 hw_queue_depth=1
> >>
> >> and using this fio file:
> >>
> >> [global]
> >> bs=4k
> >> rw=randread
> >> norandommap
> >> direct=1
> >> ioengine=libaio
> >> iodepth=4
> >>
> >> [nullb0]
> >> filename=/dev/nullb0
> >> [nullb1]
> >> filename=/dev/nullb1
> >> [nullb2]
> >> filename=/dev/nullb2
> >> [nullb3]
> >> filename=/dev/nullb3
> >>
> >> it seemed to keep running, but it hung when exiting. The troublesome
> >> device was nullb1:
> >>
> >> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
> >> [  492.520782]   Not tainted 4.14.0-rc7+ #499
> >> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> >> disables this message.
> >> [  492.535904] fio D13208  3263   3211 0x
> >> [  492.542535] Call Trace:
> >> [  492.545764]  __schedule+0x279/0x720
> >> [  492.550151]  schedule+0x33/0x90
> >> [  492.554145]  io_schedule+0x16/0x40
> >> [  492.558435]  blk_mq_get_tag+0x148/0x250
> >> [  492.563211]  ? finish_wait+0x90/0x90
> >> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
> >> [  492.572760]  blk_mq_make_request+0xe2/0x690
> >> [  492.577913]  generic_make_request+0xfc/0x2f0
> >> [  492.583177]  submit_bio+0x64/0x120
> >> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
> >> [  492.592736]  ? submit_bio+0x64/0x120
> >> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
> >> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
> >> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
> >> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
> >> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
> >> [  492.623227]  generic_file_read_iter+0xb3/0xa00
> >> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
> >> [  492.634334]  ? security_file_permission+0x9b/0xc0
> >> [  492.640114]  blkdev_read_iter+0x35/0x40
> >> [  492.644877]  aio_read+0xc5/0x120
> >> [  492.648973]  ? aio_read_events+0x24c/0x340
> >> [  492.654124]  ? __might_sleep+0x4a/0x80
> >> [  492.658800]  do_io_submit+0x47c/0x5e0
> >> [  492.663373]  ? do_io_submit+0x47c/0x5e0
> >> [  492.668234]  SyS_io_submit+0x10/0x20
> >> [  492.672715]  ? SyS_io_submit+0x10/0x20
> >> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
> >> [  492.683039] RIP: 0033:0x7f83d1871717
> >> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
> >> 00d1
> >> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
> >> 7f83d1871717
> >> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
> >> 7f83e4d36000
> >> [  492.713889] RBP: 0001 R08: 0001 R09: 
> >> 01f3f2e0
> >> [  492.722352] R10: 1000 R11: 0202 R12: 
> >> 7ffd38fe5be0
> >> [  492.730827] R13: 7f83b6972b01 R14: 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 08:12 PM, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
>> On 11/07/2017 06:03 PM, Ming Lei wrote:
>>> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
 On 11/07/2017 10:36 AM, Jens Axboe wrote:
> On 11/07/2017 10:10 AM, Jens Axboe wrote:
>> On 11/07/2017 09:29 AM, Jens Axboe wrote:
>>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
 On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> If you can reproduce, please provide me at least the following log
> first:
>
>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>
> If any pending requests arn't completed, please provide the related
> info in dbgfs about where is the request.

 Every time I ran the above or a similar command its output was empty. I
 assume that's because the hang usually occurs in a phase where these 
 debugfs
 attributes either have not yet been created or have already 
 disappeared.
>>>
>>> Bart, do you still see a hang with the patch that fixes the tag leak 
>>> when
>>> we fail to get a dispatch budget?
>>>
>>> https://marc.info/?l=linux-block&m=151004881411480&w=2
>>>
>>> I've run a lot of stability testing here, and I haven't run into any
>>> issues. This is with shared tags as well. So if you still see the 
>>> failure
>>> case with the current tree AND the above patch, then I'll try and get
>>> a test case setup that hits it too so we can get to the bottom of this.
>>
>> Ming, I managed to reproduce the hang using null_blk. Note this is
>> WITHOUT the patch mentioned above, running with that now.
>>
>> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
>> submit_queues=1 hw_queue_depth=1
>>
>> and using this fio file:
>>
>> [global]
>> bs=4k
>> rw=randread
>> norandommap
>> direct=1
>> ioengine=libaio
>> iodepth=4
>>
>> [nullb0]
>> filename=/dev/nullb0
>> [nullb1]
>> filename=/dev/nullb1
>> [nullb2]
>> filename=/dev/nullb2
>> [nullb3]
>> filename=/dev/nullb3
>>
>> it seemed to keep running, but it hung when exiting. The troublesome
>> device was nullb1:
>>
>> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
>> [  492.520782]   Not tainted 4.14.0-rc7+ #499
>> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>> disables this message.
>> [  492.535904] fio D13208  3263   3211 0x
>> [  492.542535] Call Trace:
>> [  492.545764]  __schedule+0x279/0x720
>> [  492.550151]  schedule+0x33/0x90
>> [  492.554145]  io_schedule+0x16/0x40
>> [  492.558435]  blk_mq_get_tag+0x148/0x250
>> [  492.563211]  ? finish_wait+0x90/0x90
>> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
>> [  492.572760]  blk_mq_make_request+0xe2/0x690
>> [  492.577913]  generic_make_request+0xfc/0x2f0
>> [  492.583177]  submit_bio+0x64/0x120
>> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
>> [  492.592736]  ? submit_bio+0x64/0x120
>> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
>> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
>> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
>> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
>> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
>> [  492.623227]  generic_file_read_iter+0xb3/0xa00
>> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
>> [  492.634334]  ? security_file_permission+0x9b/0xc0
>> [  492.640114]  blkdev_read_iter+0x35/0x40
>> [  492.644877]  aio_read+0xc5/0x120
>> [  492.648973]  ? aio_read_events+0x24c/0x340
>> [  492.654124]  ? __might_sleep+0x4a/0x80
>> [  492.658800]  do_io_submit+0x47c/0x5e0
>> [  492.663373]  ? do_io_submit+0x47c/0x5e0
>> [  492.668234]  SyS_io_submit+0x10/0x20
>> [  492.672715]  ? SyS_io_submit+0x10/0x20
>> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
>> [  492.683039] RIP: 0033:0x7f83d1871717
>> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
>> 00d1
>> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
>> 7f83d1871717
>> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
>> 7f83e4d36000
>> [  492.713889] RBP: 0001 R08: 0001 R09: 
>> 01f3f2e0
>> [  492.722352] R10: 1000 R11: 0202 R12: 
>> 7ffd38fe5be0
>> [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
>> 7f83b6982368
>>
>> and if we look at the debug entries, it's waiting on a scheduler tag:
>>
>> sched_tags:nr_tags=2
>> sched_tags:nr_reserved_tags=0
>> sched_tags:active_queues=0
>> sched_tags:bitmap_tags:
>> sched_tags:depth=2
>> 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 09:17 AM, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 18:15 +0800, Ming Lei wrote:
>> Last time, you didn't mention the target patch for setting its
>> can_queue as 1, so I think you can't reproduce the issue on upstream
>> kernel without out-of-tree patch. Then looks it is another issue,
>> and we are making progress actually.
> 
> If I don't trust a patch I introduce additional tests. The fact that I
> modified the SRP initiator before this hang occurred does not mean that the
> approach of your patch is fine. What this means is that all your patch does
> is to reduce the race window and that there is still a race window.

I agree, reducing the depth to test that specific case is perfectly
valid; it doesn't make the test invalid.
other use cases out there have exactly that configuration. My null_blk
test case is basically the same. But it would be nice if all of this was
out in the open, so everybody is clear on exactly what is being
run/tested.

However, where my trace is hung off getting a scheduler tag, yours seems
to be waiting on IO completion. So not the same fingerprint, which is
worrisome.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 08:01:48PM -0700, Jens Axboe wrote:
> On 11/07/2017 06:03 PM, Ming Lei wrote:
> > On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
> >> On 11/07/2017 10:36 AM, Jens Axboe wrote:
> >>> On 11/07/2017 10:10 AM, Jens Axboe wrote:
>  On 11/07/2017 09:29 AM, Jens Axboe wrote:
> > On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> >> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> >>> If you can reproduce, please provide me at least the following log
> >>> first:
> >>>
> >>>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> >>>
> >>> If any pending requests arn't completed, please provide the related
> >>> info in dbgfs about where is the request.
> >>
> >> Every time I ran the above or a similar command its output was empty. I
> >> assume that's because the hang usually occurs in a phase where these 
> >> debugfs
> >> attributes either have not yet been created or have already 
> >> disappeared.
> >
> > Bart, do you still see a hang with the patch that fixes the tag leak 
> > when
> > we fail to get a dispatch budget?
> >
> > https://marc.info/?l=linux-block&m=151004881411480&w=2
> >
> > I've run a lot of stability testing here, and I haven't run into any
> > issues. This is with shared tags as well. So if you still see the 
> > failure
> > case with the current tree AND the above patch, then I'll try and get
> > a test case setup that hits it too so we can get to the bottom of this.
> 
>  Ming, I managed to reproduce the hang using null_blk. Note this is
>  WITHOUT the patch mentioned above, running with that now.
> 
>  # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
>  submit_queues=1 hw_queue_depth=1
> 
>  and using this fio file:
> 
>  [global]
>  bs=4k
>  rw=randread
>  norandommap
>  direct=1
>  ioengine=libaio
>  iodepth=4
> 
>  [nullb0]
>  filename=/dev/nullb0
>  [nullb1]
>  filename=/dev/nullb1
>  [nullb2]
>  filename=/dev/nullb2
>  [nullb3]
>  filename=/dev/nullb3
> 
>  it seemed to keep running, but it hung when exiting. The troublesome
>  device was nullb1:
> 
>  [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
>  [  492.520782]   Not tainted 4.14.0-rc7+ #499
>  [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>  disables this message.
>  [  492.535904] fio D13208  3263   3211 0x
>  [  492.542535] Call Trace:
>  [  492.545764]  __schedule+0x279/0x720
>  [  492.550151]  schedule+0x33/0x90
>  [  492.554145]  io_schedule+0x16/0x40
>  [  492.558435]  blk_mq_get_tag+0x148/0x250
>  [  492.563211]  ? finish_wait+0x90/0x90
>  [  492.567693]  blk_mq_get_request+0xf0/0x3e0
>  [  492.572760]  blk_mq_make_request+0xe2/0x690
>  [  492.577913]  generic_make_request+0xfc/0x2f0
>  [  492.583177]  submit_bio+0x64/0x120
>  [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
>  [  492.592736]  ? submit_bio+0x64/0x120
>  [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
>  [  492.602570]  blkdev_direct_IO+0x388/0x3c0
>  [  492.607546]  ? free_ioctx_users+0xe0/0xe0
>  [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
>  [  492.618353]  ? _raw_spin_unlock+0xe/0x20
>  [  492.623227]  generic_file_read_iter+0xb3/0xa00
>  [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
>  [  492.634334]  ? security_file_permission+0x9b/0xc0
>  [  492.640114]  blkdev_read_iter+0x35/0x40
>  [  492.644877]  aio_read+0xc5/0x120
>  [  492.648973]  ? aio_read_events+0x24c/0x340
>  [  492.654124]  ? __might_sleep+0x4a/0x80
>  [  492.658800]  do_io_submit+0x47c/0x5e0
>  [  492.663373]  ? do_io_submit+0x47c/0x5e0
>  [  492.668234]  SyS_io_submit+0x10/0x20
>  [  492.672715]  ? SyS_io_submit+0x10/0x20
>  [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
>  [  492.683039] RIP: 0033:0x7f83d1871717
>  [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
>  00d1
>  [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
>  7f83d1871717
>  [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
>  7f83e4d36000
>  [  492.713889] RBP: 0001 R08: 0001 R09: 
>  01f3f2e0
>  [  492.722352] R10: 1000 R11: 0202 R12: 
>  7ffd38fe5be0
>  [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
>  7f83b6982368
> 
>  and if we look at the debug entries, it's waiting on a scheduler tag:
> 
>  sched_tags:nr_tags=2
>  sched_tags:nr_reserved_tags=0
>  sched_tags:active_queues=0
>  sched_tags:bitmap_tags:
>  sched_tags:depth=2
>  sched_tags:busy=2
>  

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 07:58 PM, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 07:55:32PM -0700, Jens Axboe wrote:
>> On 11/07/2017 05:39 PM, Ming Lei wrote:
>>> On Tue, Nov 07, 2017 at 04:20:08PM +0000, Bart Van Assche wrote:
 On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> If you can reproduce, please provide me at least the following log
> first:
>
>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>
> If any pending requests arn't completed, please provide the related
> info in dbgfs about where is the request.

 Every time I ran the above or a similar command its output was empty. I
 assume that's because the hang usually occurs in a phase where these 
 debugfs
 attributes either have not yet been created or have already disappeared.
>>>
>>> Could you dump all tags? Then you can see whether this attribute has disappeared.
>>>
>>> If that output is empty, it often means there are no pending requests left
>>> uncompleted. So if that is true, your hang is _not_ related to RESTART.
>>
>> You need to check sched_tags as well. It could still be a restart race
>> or problem, if tags is empty but sched_tags has busy bits.
> 
> Yeah, I didn't mention it because SRP is MQ hardware, and the default
> scheduler there is none, but if Bart changes that, then sched_tags needs
> to be checked first.

At this point, I have no idea what Bart's setup looks like. Bart, it
would be REALLY helpful if you could tell us how you are reproducing
your hang. I don't know why this has to be dragged out.

Ming/Bart - there seems to be an increasing amount of tension between
you two, for reasons that are unknown to me. I suggest you put that
aside in the pursuit of fixing the current issue, and then we can
discuss how to best resolve these going forward. But right now the top
priority is getting to the bottom of this. There's a chance that the
issue I can reproduce is the same that Bart is seeing, in which case we
might be fixing both in one fell swoop. But if that isn't the case, then
we have some work to do this week.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 06:03 PM, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
>> On 11/07/2017 10:36 AM, Jens Axboe wrote:
>>> On 11/07/2017 10:10 AM, Jens Axboe wrote:
 On 11/07/2017 09:29 AM, Jens Axboe wrote:
> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
>>> If you can reproduce, please provide me at least the following log
>>> first:
>>>
>>> find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>>>
>>> If any pending requests arn't completed, please provide the related
>>> info in dbgfs about where is the request.
>>
>> Every time I ran the above or a similar command its output was empty. I
>> assume that's because the hang usually occurs in a phase where these 
>> debugfs
>> attributes either have not yet been created or have already disappeared.
>
> Bart, do you still see a hang with the patch that fixes the tag leak when
> we fail to get a dispatch budget?
>
> https://marc.info/?l=linux-block&m=151004881411480&w=2
>
> I've run a lot of stability testing here, and I haven't run into any
> issues. This is with shared tags as well. So if you still see the failure
> case with the current tree AND the above patch, then I'll try and get
> a test case setup that hits it too so we can get to the bottom of this.

 Ming, I managed to reproduce the hang using null_blk. Note this is
 WITHOUT the patch mentioned above, running with that now.

 # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
 submit_queues=1 hw_queue_depth=1

 and using this fio file:

 [global]
 bs=4k
 rw=randread
 norandommap
 direct=1
 ioengine=libaio
 iodepth=4

 [nullb0]
 filename=/dev/nullb0
 [nullb1]
 filename=/dev/nullb1
 [nullb2]
 filename=/dev/nullb2
 [nullb3]
 filename=/dev/nullb3

 it seemed to keep running, but it hung when exiting. The troublesome
 device was nullb1:

 [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
 [  492.520782]   Not tainted 4.14.0-rc7+ #499
 [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
 this message.
 [  492.535904] fio D13208  3263   3211 0x
 [  492.542535] Call Trace:
 [  492.545764]  __schedule+0x279/0x720
 [  492.550151]  schedule+0x33/0x90
 [  492.554145]  io_schedule+0x16/0x40
 [  492.558435]  blk_mq_get_tag+0x148/0x250
 [  492.563211]  ? finish_wait+0x90/0x90
 [  492.567693]  blk_mq_get_request+0xf0/0x3e0
 [  492.572760]  blk_mq_make_request+0xe2/0x690
 [  492.577913]  generic_make_request+0xfc/0x2f0
 [  492.583177]  submit_bio+0x64/0x120
 [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
 [  492.592736]  ? submit_bio+0x64/0x120
 [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
 [  492.602570]  blkdev_direct_IO+0x388/0x3c0
 [  492.607546]  ? free_ioctx_users+0xe0/0xe0
 [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
 [  492.618353]  ? _raw_spin_unlock+0xe/0x20
 [  492.623227]  generic_file_read_iter+0xb3/0xa00
 [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
 [  492.634334]  ? security_file_permission+0x9b/0xc0
 [  492.640114]  blkdev_read_iter+0x35/0x40
 [  492.644877]  aio_read+0xc5/0x120
 [  492.648973]  ? aio_read_events+0x24c/0x340
 [  492.654124]  ? __might_sleep+0x4a/0x80
 [  492.658800]  do_io_submit+0x47c/0x5e0
 [  492.663373]  ? do_io_submit+0x47c/0x5e0
 [  492.668234]  SyS_io_submit+0x10/0x20
 [  492.672715]  ? SyS_io_submit+0x10/0x20
 [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
 [  492.683039] RIP: 0033:0x7f83d1871717
 [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
 00d1
 [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
 7f83d1871717
 [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
 7f83e4d36000
 [  492.713889] RBP: 0001 R08: 0001 R09: 
 01f3f2e0
 [  492.722352] R10: 1000 R11: 0202 R12: 
 7ffd38fe5be0
 [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
 7f83b6982368

 and if we look at the debug entries, it's waiting on a scheduler tag:

 sched_tags:nr_tags=2
 sched_tags:nr_reserved_tags=0
 sched_tags:active_queues=0
 sched_tags:bitmap_tags:
 sched_tags:depth=2
 sched_tags:busy=2
 sched_tags:bits_per_word=64
 sched_tags:map_nr=1
 sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 
 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 
 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
 sched_tags:wake_batch=1
 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 07:55:32PM -0700, Jens Axboe wrote:
> On 11/07/2017 05:39 PM, Ming Lei wrote:
> > On Tue, Nov 07, 2017 at 04:20:08PM +0000, Bart Van Assche wrote:
> >> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> >>> If you can reproduce, please provide me at least the following log
> >>> first:
> >>>
> >>>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> >>>
> >>> If any pending requests arn't completed, please provide the related
> >>> info in dbgfs about where is the request.
> >>
> >> Every time I ran the above or a similar command its output was empty. I
> >> assume that's because the hang usually occurs in a phase where these 
> >> debugfs
> >> attributes either have not yet been created or have already disappeared.
> > 
> > Could you dump all tags? Then you can see whether this attribute has disappeared.
> > 
> > If that output is empty, it often means there are no pending requests left
> > uncompleted. So if that is true, your hang is _not_ related to RESTART.
> 
> You need to check sched_tags as well. It could still be a restart race
> or problem, if tags is empty but sched_tags has busy bits.

Yeah, I didn't mention it because SRP is MQ hardware, and the default
scheduler there is none, but if Bart changes that, then sched_tags needs
to be checked first.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 05:39 PM, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 04:20:08PM +0000, Bart Van Assche wrote:
>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
>>> If you can reproduce, please provide me at least the following log
>>> first:
>>>
>>> find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>>>
>>> If any pending requests arn't completed, please provide the related
>>> info in dbgfs about where is the request.
>>
>> Every time I ran the above or a similar command its output was empty. I
>> assume that's because the hang usually occurs in a phase where these debugfs
>> attributes either have not yet been created or have already disappeared.
> 
> Could you dump all tags? Then you can see whether this attribute has disappeared.
> 
> If that output is empty, it often means there are no pending requests left
> uncompleted. So if that is true, your hang is _not_ related to RESTART.

You need to check sched_tags as well. It could still be a restart race
or problem, if tags is empty but sched_tags has busy bits.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Wed, Nov 08, 2017 at 08:53:38AM +0800, Ming Lei wrote:
> On Tue, Nov 07, 2017 at 05:34:38PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 09:29 -0700, Jens Axboe wrote:
> > > On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> > > > On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > > > > If you can reproduce, please provide me at least the following log
> > > > > first:
> > > > > 
> > > > >   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> > > > > 
> > > > > If any pending requests arn't completed, please provide the related
> > > > > info in dbgfs about where is the request.
> > > > 
> > > > Every time I ran the above or a similar command its output was empty. I
> > > > assume that's because the hang usually occurs in a phase where these 
> > > > debugfs
> > > > attributes either have not yet been created or have already disappeared.
> > > 
> > > Bart, do you still see a hang with the patch that fixes the tag leak when
> > > we fail to get a dispatch budget?
> > > 
> > > https://marc.info/?l=linux-block&m=151004881411480&w=2
> > > 
> > > I've run a lot of stability testing here, and I haven't run into any
> > > issues. This is with shared tags as well. So if you still see the failure
> > > case with the current tree AND the above patch, then I'll try and get
> > > a test case setup that hits it too so we can get to the bottom of this.
> > 
> > It took a little longer than expected but I just ran into the following
> > lockup with your for-next branch of this morning (commit e8fa44bb8af9) and
> > Ming's patch "blk-mq: put driver tag if dispatch budget can't be got"
> > applied on top of it:
> > 
> > [ 2575.324678] sysrq: SysRq : Show Blocked State
> > [ 2575.332336]   task PC stack   pid father
> > [ 2575.345239] systemd-udevd   D0 47577518 0x0106
> > [ 2575.353821] Call Trace:
> > [ 2575.358805]  __schedule+0x28b/0x890
> > [ 2575.364906]  schedule+0x36/0x80
> > [ 2575.370436]  io_schedule+0x16/0x40
> > [ 2575.376287]  __lock_page+0xfc/0x140
> > [ 2575.382061]  ? page_cache_tree_insert+0xc0/0xc0
> > [ 2575.388943]  truncate_inode_pages_range+0x5e8/0x830
> > [ 2575.396083]  truncate_inode_pages+0x15/0x20
> > [ 2575.402398]  kill_bdev+0x2f/0x40
> > [ 2575.407538]  __blkdev_put+0x74/0x1f0
> > [ 2575.413010]  ? kmem_cache_free+0x197/0x1c0
> > [ 2575.418986]  blkdev_put+0x4c/0xd0
> > [ 2575.424040]  blkdev_close+0x34/0x70
> > [ 2575.429216]  __fput+0xe7/0x220
> > [ 2575.433863]  fput+0xe/0x10
> > [ 2575.438490]  task_work_run+0x76/0x90
> > [ 2575.443619]  do_exit+0x2e0/0xaf0
> > [ 2575.448311]  do_group_exit+0x43/0xb0
> > [ 2575.453386]  get_signal+0x299/0x5e0
> > [ 2575.458303]  do_signal+0x37/0x740
> > [ 2575.462976]  ? blkdev_read_iter+0x35/0x40
> > [ 2575.468425]  ? new_sync_read+0xde/0x130
> > [ 2575.473620]  ? vfs_read+0x115/0x130
> > [ 2575.478388]  exit_to_usermode_loop+0x80/0xd0
> > [ 2575.484002]  do_syscall_64+0xb3/0xc0
> > [ 2575.488813]  entry_SYSCALL64_slow_path+0x25/0x25
> > [ 2575.494759] RIP: 0033:0x7efd829cbd11
> > [ 2575.499506] RSP: 002b:7984f978 EFLAGS: 0246 ORIG_RAX: 
> > 
> > [ 2575.508741] RAX: 00022000 RBX: 55f19f902ca0 RCX: 
> > 7efd829cbd11
> > [ 2575.517455] RDX: 0004 RSI: 55f19f902cc8 RDI: 
> > 0007
> > [ 2575.526163] RBP: 55f19f7fb9d0 R08:  R09: 
> > 55f19f902ca0
> > [ 2575.534860] R10: 55f19f902cb8 R11: 0246 R12: 
> > 
> > [ 2575.544250] R13: 0004 R14: 55f19f7fba20 R15: 
> > 0004
> 
> Again, please show us the output of 'tags' so we can see if there are
> pending requests that were not completed.
> 
> Please run this test on Linus' tree (v4.14-rc7) to see if the same issue
> can be reproduced.
> 
> Also, if possible, please tell us how to reproduce it.

BTW, please apply the following patch before your next test:

https://marc.info/?l=linux-block&m=150988386406050&w=2

Since you don't see any busy tag in 'tags', the queue may have been frozen,
and an in-progress dispatch after the queue is DEAD might corrupt memory.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 03:06:31PM -0700, Jens Axboe wrote:
> On 11/07/2017 10:36 AM, Jens Axboe wrote:
> > On 11/07/2017 10:10 AM, Jens Axboe wrote:
> >> On 11/07/2017 09:29 AM, Jens Axboe wrote:
> >>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>  On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > If you can reproduce, please provide me at least the following log
> > first:
> >
> > find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> >
> > If any pending requests arn't completed, please provide the related
> > info in dbgfs about where is the request.
> 
>  Every time I ran the above or a similar command its output was empty. I
>  assume that's because the hang usually occurs in a phase where these 
>  debugfs
>  attributes either have not yet been created or have already disappeared.
> >>>
> >>> Bart, do you still see a hang with the patch that fixes the tag leak when
> >>> we fail to get a dispatch budget?
> >>>
> >>> https://marc.info/?l=linux-block&m=151004881411480&w=2
> >>>
> >>> I've run a lot of stability testing here, and I haven't run into any
> >>> issues. This is with shared tags as well. So if you still see the failure
> >>> case with the current tree AND the above patch, then I'll try and get
> >>> a test case setup that hits it too so we can get to the bottom of this.
> >>
> >> Ming, I managed to reproduce the hang using null_blk. Note this is
> >> WITHOUT the patch mentioned above, running with that now.
> >>
> >> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
> >> submit_queues=1 hw_queue_depth=1
> >>
> >> and using this fio file:
> >>
> >> [global]
> >> bs=4k
> >> rw=randread
> >> norandommap
> >> direct=1
> >> ioengine=libaio
> >> iodepth=4
> >>
> >> [nullb0]
> >> filename=/dev/nullb0
> >> [nullb1]
> >> filename=/dev/nullb1
> >> [nullb2]
> >> filename=/dev/nullb2
> >> [nullb3]
> >> filename=/dev/nullb3
> >>
> >> it seemed to keep running, but it hung when exiting. The troublesome
> >> device was nullb1:
> >>
> >> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
> >> [  492.520782]   Not tainted 4.14.0-rc7+ #499
> >> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> >> this message.
> >> [  492.535904] fio D13208  3263   3211 0x
> >> [  492.542535] Call Trace:
> >> [  492.545764]  __schedule+0x279/0x720
> >> [  492.550151]  schedule+0x33/0x90
> >> [  492.554145]  io_schedule+0x16/0x40
> >> [  492.558435]  blk_mq_get_tag+0x148/0x250
> >> [  492.563211]  ? finish_wait+0x90/0x90
> >> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
> >> [  492.572760]  blk_mq_make_request+0xe2/0x690
> >> [  492.577913]  generic_make_request+0xfc/0x2f0
> >> [  492.583177]  submit_bio+0x64/0x120
> >> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
> >> [  492.592736]  ? submit_bio+0x64/0x120
> >> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
> >> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
> >> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
> >> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
> >> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
> >> [  492.623227]  generic_file_read_iter+0xb3/0xa00
> >> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
> >> [  492.634334]  ? security_file_permission+0x9b/0xc0
> >> [  492.640114]  blkdev_read_iter+0x35/0x40
> >> [  492.644877]  aio_read+0xc5/0x120
> >> [  492.648973]  ? aio_read_events+0x24c/0x340
> >> [  492.654124]  ? __might_sleep+0x4a/0x80
> >> [  492.658800]  do_io_submit+0x47c/0x5e0
> >> [  492.663373]  ? do_io_submit+0x47c/0x5e0
> >> [  492.668234]  SyS_io_submit+0x10/0x20
> >> [  492.672715]  ? SyS_io_submit+0x10/0x20
> >> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
> >> [  492.683039] RIP: 0033:0x7f83d1871717
> >> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
> >> 00d1
> >> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
> >> 7f83d1871717
> >> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
> >> 7f83e4d36000
> >> [  492.713889] RBP: 0001 R08: 0001 R09: 
> >> 01f3f2e0
> >> [  492.722352] R10: 1000 R11: 0202 R12: 
> >> 7ffd38fe5be0
> >> [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
> >> 7f83b6982368
> >>
> >> and if we look at the debug entries, it's waiting on a scheduler tag:
> >>
> >> sched_tags:nr_tags=2
> >> sched_tags:nr_reserved_tags=0
> >> sched_tags:active_queues=0
> >> sched_tags:bitmap_tags:
> >> sched_tags:depth=2
> >> sched_tags:busy=2
> >> sched_tags:bits_per_word=64
> >> sched_tags:map_nr=1
> >> sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 
> >> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 
> >> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
> >> sched_tags:wake_batch=1
> >> sched_tags:wake_index=4
> >> sched_tags:ws={
> >> 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 05:34:38PM +, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 09:29 -0700, Jens Axboe wrote:
> > On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> > > On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > > > If you can reproduce, please provide me at least the following log
> > > > first:
> > > > 
> > > > find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> > > > 
> > > > If any pending requests aren't completed, please provide the related
> > > > info in debugfs about where the request is.
> > > 
> > > Every time I ran the above or a similar command its output was empty. I
> > > assume that's because the hang usually occurs in a phase where these 
> > > debugfs
> > > attributes either have not yet been created or have already disappeared.
> > 
> > Bart, do you still see a hang with the patch that fixes the tag leak when
> > we fail to get a dispatch budget?
> > 
> > https://marc.info/?l=linux-block=151004881411480=2
> > 
> > I've run a lot of stability testing here, and I haven't run into any
> > issues. This is with shared tags as well. So if you still see the failure
> > case with the current tree AND the above patch, then I'll try and get
> > a test case setup that hits it too so we can get to the bottom of this.
> 
> It took a little longer than expected but I just ran into the following
> lockup with your for-next branch of this morning (commit e8fa44bb8af9) and
> Ming's patch "blk-mq: put driver tag if dispatch budget can't be got"
> applied on top of it:
> 
> [ 2575.324678] sysrq: SysRq : Show Blocked State
> [ 2575.332336]   taskPC stack   pid father
> [ 2575.345239] systemd-udevd   D0 47577518 0x0106
> [ 2575.353821] Call Trace:
> [ 2575.358805]  __schedule+0x28b/0x890
> [ 2575.364906]  schedule+0x36/0x80
> [ 2575.370436]  io_schedule+0x16/0x40
> [ 2575.376287]  __lock_page+0xfc/0x140
> [ 2575.382061]  ? page_cache_tree_insert+0xc0/0xc0
> [ 2575.388943]  truncate_inode_pages_range+0x5e8/0x830
> [ 2575.396083]  truncate_inode_pages+0x15/0x20
> [ 2575.402398]  kill_bdev+0x2f/0x40
> [ 2575.407538]  __blkdev_put+0x74/0x1f0
> [ 2575.413010]  ? kmem_cache_free+0x197/0x1c0
> [ 2575.418986]  blkdev_put+0x4c/0xd0
> [ 2575.424040]  blkdev_close+0x34/0x70
> [ 2575.429216]  __fput+0xe7/0x220
> [ 2575.433863]  fput+0xe/0x10
> [ 2575.438490]  task_work_run+0x76/0x90
> [ 2575.443619]  do_exit+0x2e0/0xaf0
> [ 2575.448311]  do_group_exit+0x43/0xb0
> [ 2575.453386]  get_signal+0x299/0x5e0
> [ 2575.458303]  do_signal+0x37/0x740
> [ 2575.462976]  ? blkdev_read_iter+0x35/0x40
> [ 2575.468425]  ? new_sync_read+0xde/0x130
> [ 2575.473620]  ? vfs_read+0x115/0x130
> [ 2575.478388]  exit_to_usermode_loop+0x80/0xd0
> [ 2575.484002]  do_syscall_64+0xb3/0xc0
> [ 2575.488813]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 2575.494759] RIP: 0033:0x7efd829cbd11
> [ 2575.499506] RSP: 002b:7984f978 EFLAGS: 0246 ORIG_RAX: 
> 
> [ 2575.508741] RAX: 00022000 RBX: 55f19f902ca0 RCX: 
> 7efd829cbd11
> [ 2575.517455] RDX: 0004 RSI: 55f19f902cc8 RDI: 
> 0007
> [ 2575.526163] RBP: 55f19f7fb9d0 R08:  R09: 
> 55f19f902ca0
> [ 2575.534860] R10: 55f19f902cb8 R11: 0246 R12: 
> 
> [ 2575.544250] R13: 0004 R14: 55f19f7fba20 R15: 
> 0004

Again, please show us the output of 'tags' so we can see whether there are
pending requests that have not completed.

Please run this test on Linus' tree (v4.14-rc7) to see whether the same issue
can be reproduced there.

Also, if possible, please tell us how to reproduce the issue.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 10:34:35PM +, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 15:06 -0700, Jens Axboe wrote:
> > Just to keep everyone in the loop, this bug is not new to
> > for-4.15/block, nor is it new to the current 4.14-rc or 4.13. So it's
> > probably different to what Bart is hitting, but it's a bug none the
> > less...
> 
> Hello Jens,
> 
> There are several reasons why I think that patch "blk-mq: don't handle
> TAG_SHARED in restart" really should be reverted:
> * That patch is based on the assumption that only the SCSI driver uses shared
>   tags. That assumption is not correct. null_blk and nvme also use shared 
> tags.

No, both null_blk and nvme should be handled by BLK_MQ_S_TAG_WAITING, so there
is no need to waste CPU checking all shared tags.

> * As my tests have shown, the algorithm for restarting queues based on the

Your test doesn't show that it is related to RESTART, since there are no
pending requests in the output of 'tags'.

>   SCSI starved list is flawed. So using that mechanism instead of the blk-mq
>   shared queue restarting algorithm is wrong.

The starved-list based algorithm has been used by SCSI for many years;
I don't think it is that badly flawed.

> * We are close to the merge window. It is too late for trying to fix the
>   "blk-mq: don't handle TAG_SHARED in restart" patch.

If you can tell us how to reproduce the issue, there is enough time to fix it
before the v4.15 release.

> 
> My proposal is to make sure that what will be sent to Linus during the v4.15
> merge window works reliably. That means using the v4.13/v4.14 algorithm for
> queue restarting which is an algorithm that is trusted by the community. If
> Roman Penyaev's patch could get applied that would be even better.

Frankly speaking, blk-mq's restart algorithm won't be used by SCSI at all,
because scsi_end_request() restarts the queue before the TAG_SHARED restart
runs.

For NVMe and null_blk it is basically the same, since we cover that via
BLK_MQ_S_TAG_WAITING.

So Nak your proposal.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Tue, Nov 07, 2017 at 04:20:08PM +, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > If you can reproduce, please provide me at least the following log
> > first:
> > 
> > find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> > 
> > If any pending requests aren't completed, please provide the related
> > info in debugfs about where the request is.
> 
> Every time I ran the above or a similar command its output was empty. I
> assume that's because the hang usually occurs in a phase where these debugfs
> attributes either have not yet been created or have already disappeared.

Could you dump all of the 'tags' attributes? Then you can see whether this attribute has disappeared.

If that output is empty, it usually means there are no pending uncompleted
requests. If that is true, your hang is _not_ related to RESTART.
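
Something along these lines (an untested sketch) would dump every blk-mq
debugfs attribute for every queue, so nothing gets lost:

# Untested sketch: print every attribute under every queue's blk-mq
# debugfs directory (tags, sched_tags, state, dispatch, ...).
for q in /sys/kernel/debug/block/*; do
	echo "==== ${q##*/} ===="
	find "$q" -type f | while read -r f; do
		echo "--- ${f#$q/}"
		cat "$f" 2>/dev/null
	done
done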

Thanks,
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 03:34 PM, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 15:06 -0700, Jens Axboe wrote:
>> Just to keep everyone in the loop, this bug is not new to
>> for-4.15/block, nor is it new to the current 4.14-rc or 4.13. So it's
>> probably different to what Bart is hitting, but it's a bug none the
>> less...
> 
> Hello Jens,
> 
> There are several reasons why I think that patch "blk-mq: don't handle
> TAG_SHARED in restart" really should be reverted:
> * That patch is based on the assumption that only the SCSI driver uses shared
>   tags. That assumption is not correct. null_blk and nvme also use shared 
> tags.
> * As my tests have shown, the algorithm for restarting queues based on the
>   SCSI starved list is flawed. So using that mechanism instead of the blk-mq
>   shared queue restarting algorithm is wrong.
> * We are close to the merge window. It is too late for trying to fix the
>   "blk-mq: don't handle TAG_SHARED in restart" patch.
> 
> My proposal is to make sure that what will be sent to Linus during the v4.15
> merge window works reliably. That means using the v4.13/v4.14 algorithm for
> queue restarting which is an algorithm that is trusted by the community. If
> Roman Penyaev's patch could get applied that would be even better.

I'm fine with reverting a single patch, that's still a far cry from the
giant list. I'd rather get a fix in though, if at all possible. The code
it removed wasn't exactly the fastest or prettiest solution. But the
most important part is that we have something that works. If you have a
test case that is runnable AND reproduces the problem, I'd love to have
it. I've seen comments to that effect spread over several messages,
would be nice to have it summarized and readily available for all that
want to work on it.

The issue above is NOT a new bug, I ran into it by accident trying to
reproduce your case. Debugging that one right now, hopefully we'll have
some closure on that tomorrow and we can move forward on the shared tag
restart/loop. It smells like a TAG_WAITING race, or a restart race.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Bart Van Assche
On Tue, 2017-11-07 at 15:06 -0700, Jens Axboe wrote:
> Just to keep everyone in the loop, this bug is not new to
> for-4.15/block, nor is it new to the current 4.14-rc or 4.13. So it's
> probably different to what Bart is hitting, but it's a bug none the
> less...

Hello Jens,

There are several reasons why I think that patch "blk-mq: don't handle
TAG_SHARED in restart" really should be reverted:
* That patch is based on the assumption that only the SCSI driver uses shared
  tags. That assumption is not correct. null_blk and nvme also use shared tags.
* As my tests have shown, the algorithm for restarting queues based on the
  SCSI starved list is flawed. So using that mechanism instead of the blk-mq
  shared queue restarting algorithm is wrong.
* We are close to the merge window. It is too late for trying to fix the
  "blk-mq: don't handle TAG_SHARED in restart" patch.

My proposal is to make sure that what will be sent to Linus during the v4.15
merge window works reliably. That means using the v4.13/v4.14 algorithm for
queue restarting which is an algorithm that is trusted by the community. If
Roman Penyaev's patch could get applied that would be even better.

Bart.

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 10:36 AM, Jens Axboe wrote:
> On 11/07/2017 10:10 AM, Jens Axboe wrote:
>> On 11/07/2017 09:29 AM, Jens Axboe wrote:
>>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
 On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> If you can reproduce, please provide me at least the following log
> first:
>
>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>
> If any pending requests aren't completed, please provide the related
> info in debugfs about where the request is.

 Every time I ran the above or a similar command its output was empty. I
 assume that's because the hang usually occurs in a phase where these 
 debugfs
 attributes either have not yet been created or have already disappeared.
>>>
>>> Bart, do you still see a hang with the patch that fixes the tag leak when
>>> we fail to get a dispatch budget?
>>>
>>> https://marc.info/?l=linux-block=151004881411480=2
>>>
>>> I've run a lot of stability testing here, and I haven't run into any
>>> issues. This is with shared tags as well. So if you still see the failure
>>> case with the current tree AND the above patch, then I'll try and get
>>> a test case setup that hits it too so we can get to the bottom of this.
>>
>> Ming, I managed to reproduce the hang using null_blk. Note this is
>> WITHOUT the patch mentioned above, running with that now.
>>
>> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
>> hw_queue_depth=1
>>
>> and using this fio file:
>>
>> [global]
>> bs=4k
>> rw=randread
>> norandommap
>> direct=1
>> ioengine=libaio
>> iodepth=4
>>
>> [nullb0]
>> filename=/dev/nullb0
>> [nullb1]
>> filename=/dev/nullb1
>> [nullb2]
>> filename=/dev/nullb2
>> [nullb3]
>> filename=/dev/nullb3
>>
>> it seemed to keep running, but it hung when exiting. The troublesome
>> device was nullb1:
>>
>> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
>> [  492.520782]   Not tainted 4.14.0-rc7+ #499
>> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  492.535904] fio D13208  3263   3211 0x
>> [  492.542535] Call Trace:
>> [  492.545764]  __schedule+0x279/0x720
>> [  492.550151]  schedule+0x33/0x90
>> [  492.554145]  io_schedule+0x16/0x40
>> [  492.558435]  blk_mq_get_tag+0x148/0x250
>> [  492.563211]  ? finish_wait+0x90/0x90
>> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
>> [  492.572760]  blk_mq_make_request+0xe2/0x690
>> [  492.577913]  generic_make_request+0xfc/0x2f0
>> [  492.583177]  submit_bio+0x64/0x120
>> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
>> [  492.592736]  ? submit_bio+0x64/0x120
>> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
>> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
>> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
>> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
>> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
>> [  492.623227]  generic_file_read_iter+0xb3/0xa00
>> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
>> [  492.634334]  ? security_file_permission+0x9b/0xc0
>> [  492.640114]  blkdev_read_iter+0x35/0x40
>> [  492.644877]  aio_read+0xc5/0x120
>> [  492.648973]  ? aio_read_events+0x24c/0x340
>> [  492.654124]  ? __might_sleep+0x4a/0x80
>> [  492.658800]  do_io_submit+0x47c/0x5e0
>> [  492.663373]  ? do_io_submit+0x47c/0x5e0
>> [  492.668234]  SyS_io_submit+0x10/0x20
>> [  492.672715]  ? SyS_io_submit+0x10/0x20
>> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
>> [  492.683039] RIP: 0033:0x7f83d1871717
>> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
>> 00d1
>> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
>> 7f83d1871717
>> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
>> 7f83e4d36000
>> [  492.713889] RBP: 0001 R08: 0001 R09: 
>> 01f3f2e0
>> [  492.722352] R10: 1000 R11: 0202 R12: 
>> 7ffd38fe5be0
>> [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
>> 7f83b6982368
>>
>> and if we look at the debug entries, it's waiting on a scheduler tag:
>>
>> sched_tags:nr_tags=2
>> sched_tags:nr_reserved_tags=0
>> sched_tags:active_queues=0
>> sched_tags:bitmap_tags:
>> sched_tags:depth=2
>> sched_tags:busy=2
>> sched_tags:bits_per_word=64
>> sched_tags:map_nr=1
>> sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 
>> 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 
>> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
>> sched_tags:wake_batch=1
>> sched_tags:wake_index=4
>> sched_tags:ws={
>> sched_tags:  {.wait_cnt=-102, .wait=inactive},
>> sched_tags:  {.wait_cnt=-126, .wait=inactive},
>> sched_tags:  {.wait_cnt=-72, .wait=inactive},
>> sched_tags:  {.wait_cnt=-96, .wait=inactive},
>> sched_tags:  {.wait_cnt=-134, .wait=inactive},
>> sched_tags:  {.wait_cnt=-116, .wait=inactive},
>> sched_tags:  

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 10:10 AM, Jens Axboe wrote:
> On 11/07/2017 09:29 AM, Jens Axboe wrote:
>> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
 If you can reproduce, please provide me at least the following log
 first:

find /sys/kernel/debug/block -name tags | xargs cat | grep busy

 If any pending requests aren't completed, please provide the related
 info in debugfs about where the request is.
>>>
>>> Every time I ran the above or a similar command its output was empty. I
>>> assume that's because the hang usually occurs in a phase where these debugfs
>>> attributes either have not yet been created or have already disappeared.
>>
>> Bart, do you still see a hang with the patch that fixes the tag leak when
>> we fail to get a dispatch budget?
>>
>> https://marc.info/?l=linux-block=151004881411480=2
>>
>> I've run a lot of stability testing here, and I haven't run into any
>> issues. This is with shared tags as well. So if you still see the failure
>> case with the current tree AND the above patch, then I'll try and get
>> a test case setup that hits it too so we can get to the bottom of this.
> 
> Ming, I managed to reproduce the hang using null_blk. Note this is
> WITHOUT the patch mentioned above, running with that now.
> 
> # modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
> hw_queue_depth=1
> 
> and using this fio file:
> 
> [global]
> bs=4k
> rw=randread
> norandommap
> direct=1
> ioengine=libaio
> iodepth=4
> 
> [nullb0]
> filename=/dev/nullb0
> [nullb1]
> filename=/dev/nullb1
> [nullb2]
> filename=/dev/nullb2
> [nullb3]
> filename=/dev/nullb3
> 
> it seemed to keep running, but it hung when exiting. The troublesome
> device was nullb1:
> 
> [  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
> [  492.520782]   Not tainted 4.14.0-rc7+ #499
> [  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  492.535904] fio D13208  3263   3211 0x
> [  492.542535] Call Trace:
> [  492.545764]  __schedule+0x279/0x720
> [  492.550151]  schedule+0x33/0x90
> [  492.554145]  io_schedule+0x16/0x40
> [  492.558435]  blk_mq_get_tag+0x148/0x250
> [  492.563211]  ? finish_wait+0x90/0x90
> [  492.567693]  blk_mq_get_request+0xf0/0x3e0
> [  492.572760]  blk_mq_make_request+0xe2/0x690
> [  492.577913]  generic_make_request+0xfc/0x2f0
> [  492.583177]  submit_bio+0x64/0x120
> [  492.587475]  ? set_page_dirty_lock+0x4b/0x60
> [  492.592736]  ? submit_bio+0x64/0x120
> [  492.597309]  ? bio_set_pages_dirty+0x55/0x60
> [  492.602570]  blkdev_direct_IO+0x388/0x3c0
> [  492.607546]  ? free_ioctx_users+0xe0/0xe0
> [  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
> [  492.618353]  ? _raw_spin_unlock+0xe/0x20
> [  492.623227]  generic_file_read_iter+0xb3/0xa00
> [  492.628682]  ? generic_file_read_iter+0xb3/0xa00
> [  492.634334]  ? security_file_permission+0x9b/0xc0
> [  492.640114]  blkdev_read_iter+0x35/0x40
> [  492.644877]  aio_read+0xc5/0x120
> [  492.648973]  ? aio_read_events+0x24c/0x340
> [  492.654124]  ? __might_sleep+0x4a/0x80
> [  492.658800]  do_io_submit+0x47c/0x5e0
> [  492.663373]  ? do_io_submit+0x47c/0x5e0
> [  492.668234]  SyS_io_submit+0x10/0x20
> [  492.672715]  ? SyS_io_submit+0x10/0x20
> [  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
> [  492.683039] RIP: 0033:0x7f83d1871717
> [  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
> 00d1
> [  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 
> 7f83d1871717
> [  492.705423] RDX: 01f41418 RSI: 0001 RDI: 
> 7f83e4d36000
> [  492.713889] RBP: 0001 R08: 0001 R09: 
> 01f3f2e0
> [  492.722352] R10: 1000 R11: 0202 R12: 
> 7ffd38fe5be0
> [  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 
> 7f83b6982368
> 
> and if we look at the debug entries, it's waiting on a scheduler tag:
> 
> sched_tags:nr_tags=2
> sched_tags:nr_reserved_tags=0
> sched_tags:active_queues=0
> sched_tags:bitmap_tags:
> sched_tags:depth=2
> sched_tags:busy=2
> sched_tags:bits_per_word=64
> sched_tags:map_nr=1
> sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 
> 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 
> 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
> sched_tags:wake_batch=1
> sched_tags:wake_index=4
> sched_tags:ws={
> sched_tags:   {.wait_cnt=-102, .wait=inactive},
> sched_tags:   {.wait_cnt=-126, .wait=inactive},
> sched_tags:   {.wait_cnt=-72, .wait=inactive},
> sched_tags:   {.wait_cnt=-96, .wait=inactive},
> sched_tags:   {.wait_cnt=-134, .wait=inactive},
> sched_tags:   {.wait_cnt=-116, .wait=inactive},
> sched_tags:   {.wait_cnt=-90, .wait=inactive},
> sched_tags:   {.wait_cnt=-115, .wait=active},
> sched_tags:}
> sched_tags:round_robin=0
> sched_tags_bitmap:: 03
> 
> 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Bart Van Assche
On Tue, 2017-11-07 at 09:29 -0700, Jens Axboe wrote:
> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> > > If you can reproduce, please provide me at least the following log
> > > first:
> > > 
> > >   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> > > 
> > > If any pending requests aren't completed, please provide the related
> > > info in debugfs about where the request is.
> > 
> > Every time I ran the above or a similar command its output was empty. I
> > assume that's because the hang usually occurs in a phase where these debugfs
> > attributes either have not yet been created or have already disappeared.
> 
> Bart, do you still see a hang with the patch that fixes the tag leak when
> we fail to get a dispatch budget?
> 
> https://marc.info/?l=linux-block=151004881411480=2
> 
> I've run a lot of stability testing here, and I haven't run into any
> issues. This is with shared tags as well. So if you still see the failure
> case with the current tree AND the above patch, then I'll try and get
> a test case setup that hits it too so we can get to the bottom of this.

It took a little longer than expected but I just ran into the following
lockup with your for-next branch of this morning (commit e8fa44bb8af9) and
Ming's patch "blk-mq: put driver tag if dispatch budget can't be got"
applied on top of it:

[ 2575.324678] sysrq: SysRq : Show Blocked State
[ 2575.332336]   taskPC stack   pid father
[ 2575.345239] systemd-udevd   D0 47577518 0x0106
[ 2575.353821] Call Trace:
[ 2575.358805]  __schedule+0x28b/0x890
[ 2575.364906]  schedule+0x36/0x80
[ 2575.370436]  io_schedule+0x16/0x40
[ 2575.376287]  __lock_page+0xfc/0x140
[ 2575.382061]  ? page_cache_tree_insert+0xc0/0xc0
[ 2575.388943]  truncate_inode_pages_range+0x5e8/0x830
[ 2575.396083]  truncate_inode_pages+0x15/0x20
[ 2575.402398]  kill_bdev+0x2f/0x40
[ 2575.407538]  __blkdev_put+0x74/0x1f0
[ 2575.413010]  ? kmem_cache_free+0x197/0x1c0
[ 2575.418986]  blkdev_put+0x4c/0xd0
[ 2575.424040]  blkdev_close+0x34/0x70
[ 2575.429216]  __fput+0xe7/0x220
[ 2575.433863]  fput+0xe/0x10
[ 2575.438490]  task_work_run+0x76/0x90
[ 2575.443619]  do_exit+0x2e0/0xaf0
[ 2575.448311]  do_group_exit+0x43/0xb0
[ 2575.453386]  get_signal+0x299/0x5e0
[ 2575.458303]  do_signal+0x37/0x740
[ 2575.462976]  ? blkdev_read_iter+0x35/0x40
[ 2575.468425]  ? new_sync_read+0xde/0x130
[ 2575.473620]  ? vfs_read+0x115/0x130
[ 2575.478388]  exit_to_usermode_loop+0x80/0xd0
[ 2575.484002]  do_syscall_64+0xb3/0xc0
[ 2575.488813]  entry_SYSCALL64_slow_path+0x25/0x25
[ 2575.494759] RIP: 0033:0x7efd829cbd11
[ 2575.499506] RSP: 002b:7984f978 EFLAGS: 0246 ORIG_RAX: 

[ 2575.508741] RAX: 00022000 RBX: 55f19f902ca0 RCX: 7efd829cbd11
[ 2575.517455] RDX: 0004 RSI: 55f19f902cc8 RDI: 0007
[ 2575.526163] RBP: 55f19f7fb9d0 R08:  R09: 55f19f902ca0
[ 2575.534860] R10: 55f19f902cb8 R11: 0246 R12: 
[ 2575.544250] R13: 0004 R14: 55f19f7fba20 R15: 0004

Bart.

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 09:29 AM, Jens Axboe wrote:
> On 11/07/2017 09:20 AM, Bart Van Assche wrote:
>> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
>>> If you can reproduce, please provide me at least the following log
>>> first:
>>>
>>> find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>>>
>>> If any pending requests aren't completed, please provide the related
>>> info in debugfs about where the request is.
>>
>> Every time I ran the above or a similar command its output was empty. I
>> assume that's because the hang usually occurs in a phase where these debugfs
>> attributes either have not yet been created or have already disappeared.
> 
> Bart, do you still see a hang with the patch that fixes the tag leak when
> we fail to get a dispatch budget?
> 
> https://marc.info/?l=linux-block=151004881411480=2
> 
> I've run a lot of stability testing here, and I haven't run into any
> issues. This is with shared tags as well. So if you still see the failure
> case with the current tree AND the above patch, then I'll try and get
> a test case setup that hits it too so we can get to the bottom of this.

Ming, I managed to reproduce the hang using null_blk. Note this is
WITHOUT the patch mentioned above, running with that now.

# modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1

and using this fio file:

[global]
bs=4k
rw=randread
norandommap
direct=1
ioengine=libaio
iodepth=4

[nullb0]
filename=/dev/nullb0
[nullb1]
filename=/dev/nullb1
[nullb2]
filename=/dev/nullb2
[nullb3]
filename=/dev/nullb3

it seemed to keep running, but it hung when exiting. The troublesome
device was nullb1:

[  492.513374] INFO: task fio:3263 blocked for more than 120 seconds.
[  492.520782]   Not tainted 4.14.0-rc7+ #499
[  492.526247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  492.535904] fio D13208  3263   3211 0x
[  492.542535] Call Trace:
[  492.545764]  __schedule+0x279/0x720
[  492.550151]  schedule+0x33/0x90
[  492.554145]  io_schedule+0x16/0x40
[  492.558435]  blk_mq_get_tag+0x148/0x250
[  492.563211]  ? finish_wait+0x90/0x90
[  492.567693]  blk_mq_get_request+0xf0/0x3e0
[  492.572760]  blk_mq_make_request+0xe2/0x690
[  492.577913]  generic_make_request+0xfc/0x2f0
[  492.583177]  submit_bio+0x64/0x120
[  492.587475]  ? set_page_dirty_lock+0x4b/0x60
[  492.592736]  ? submit_bio+0x64/0x120
[  492.597309]  ? bio_set_pages_dirty+0x55/0x60
[  492.602570]  blkdev_direct_IO+0x388/0x3c0
[  492.607546]  ? free_ioctx_users+0xe0/0xe0
[  492.612511]  ? blk_mq_dispatch_rq_list+0x238/0x3a0
[  492.618353]  ? _raw_spin_unlock+0xe/0x20
[  492.623227]  generic_file_read_iter+0xb3/0xa00
[  492.628682]  ? generic_file_read_iter+0xb3/0xa00
[  492.634334]  ? security_file_permission+0x9b/0xc0
[  492.640114]  blkdev_read_iter+0x35/0x40
[  492.644877]  aio_read+0xc5/0x120
[  492.648973]  ? aio_read_events+0x24c/0x340
[  492.654124]  ? __might_sleep+0x4a/0x80
[  492.658800]  do_io_submit+0x47c/0x5e0
[  492.663373]  ? do_io_submit+0x47c/0x5e0
[  492.668234]  SyS_io_submit+0x10/0x20
[  492.672715]  ? SyS_io_submit+0x10/0x20
[  492.677394]  entry_SYSCALL_64_fastpath+0x13/0x94
[  492.683039] RIP: 0033:0x7f83d1871717
[  492.687521] RSP: 002b:7ffd38fe5a88 EFLAGS: 0202 ORIG_RAX: 
00d1
[  492.696969] RAX: ffda RBX: 7f83b6972b50 RCX: 7f83d1871717
[  492.705423] RDX: 01f41418 RSI: 0001 RDI: 7f83e4d36000
[  492.713889] RBP: 0001 R08: 0001 R09: 01f3f2e0
[  492.722352] R10: 1000 R11: 0202 R12: 7ffd38fe5be0
[  492.730827] R13: 7f83b6972b01 R14: 7f83b69824b8 R15: 7f83b6982368

and if we look at the debug entries, it's waiting on a scheduler tag:

sched_tags:nr_tags=2
sched_tags:nr_reserved_tags=0
sched_tags:active_queues=0
sched_tags:bitmap_tags:
sched_tags:depth=2
sched_tags:busy=2
sched_tags:bits_per_word=64
sched_tags:map_nr=1
sched_tags:alloc_hint={0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 
1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 
0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0}
sched_tags:wake_batch=1
sched_tags:wake_index=4
sched_tags:ws={
sched_tags: {.wait_cnt=-102, .wait=inactive},
sched_tags: {.wait_cnt=-126, .wait=inactive},
sched_tags: {.wait_cnt=-72, .wait=inactive},
sched_tags: {.wait_cnt=-96, .wait=inactive},
sched_tags: {.wait_cnt=-134, .wait=inactive},
sched_tags: {.wait_cnt=-116, .wait=inactive},
sched_tags: {.wait_cnt=-90, .wait=inactive},
sched_tags: {.wait_cnt=-115, .wait=active},
sched_tags:}
sched_tags:round_robin=0
sched_tags_bitmap:: 03

with SCHED_RESTART being set:

state:SCHED_RESTART

and with the driver tags being idle:

tags:nr_tags=1
tags:nr_reserved_tags=0
tags:active_queues=0
tags:bitmap_tags:
tags:depth=1
tags:busy=0
tags:bits_per_word=64
tags:map_nr=1
tags:alloc_hint={0, 0, 0, 0, 0, 0, 

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Jens Axboe
On 11/07/2017 09:20 AM, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
>> If you can reproduce, please provide me at least the following log
>> first:
>>
>>  find /sys/kernel/debug/block -name tags | xargs cat | grep busy
>>
>> If any pending requests aren't completed, please provide the related
>> info in debugfs about where the request is.
> 
> Every time I ran the above or a similar command its output was empty. I
> assume that's because the hang usually occurs in a phase where these debugfs
> attributes either have not yet been created or have already disappeared.

Bart, do you still see a hang with the patch that fixes the tag leak when
we fail to get a dispatch budget?

https://marc.info/?l=linux-block=151004881411480=2

I've run a lot of stability testing here, and I haven't run into any
issues. This is with shared tags as well. So if you still see the failure
case with the current tree AND the above patch, then I'll try and get
a test case setup that hits it too so we can get to the bottom of this.

-- 
Jens Axboe



Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Bart Van Assche
On Tue, 2017-11-07 at 10:11 +0800, Ming Lei wrote:
> If you can reproduce, please provide me at least the following log
> first:
> 
>   find /sys/kernel/debug/block -name tags | xargs cat | grep busy
> 
> If any pending requests aren't completed, please provide the related
> info in debugfs about where the request is.

Every time I ran the above or a similar command its output was empty. I
assume that's because the hang usually occurs in a phase where these debugfs
attributes either have not yet been created or have already disappeared.
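
A rough workaround (untested sketch) would be to snapshot the debugfs tree
once per second while the test runs, so that the last state captured before
the queues disappear is preserved:

mkdir -p /tmp/blkmq-snap
i=0
while sleep 1; do
	out=/tmp/blkmq-snap/$(printf '%05d' "$i").txt
	find /sys/kernel/debug/block -type f 2>/dev/null | while read -r f; do
		printf '=== %s ===\n' "$f"
		cat "$f" 2>/dev/null
	done > "$out"
	i=$((i + 1))
done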

Bart.

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Bart Van Assche
On Tue, 2017-11-07 at 18:15 +0800, Ming Lei wrote:
> Last time you didn't mention the target patch that sets can_queue to 1, so I
> assume you can't reproduce the issue on an upstream kernel without that
> out-of-tree patch. So this looks like a different issue, and we are actually
> making progress.

If I don't trust a patch I add additional tests. The fact that I modified the
SRP initiator before this hang occurred does not mean that the approach of
your patch is fine. What it means is that all your patch does is reduce the
race window; a race window still remains.

Bart.

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-07 Thread Ming Lei
On Mon, Nov 06, 2017 at 07:45:23PM +, Bart Van Assche wrote:
> On Sat, 2017-11-04 at 08:19 -0600, Jens Axboe wrote:
> > On 11/03/2017 07:55 PM, Ming Lei wrote:
> > > It is very expensive to atomic_inc/atomic_dec the host wide counter of
> > > host->busy_count, and it should have been avoided via blk-mq's mechanism
> > > of getting driver tag, which uses the more efficient way of sbitmap queue.
> > > 
> > > Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
> > > and don't run queue if the counter becomes zero, so IO hang may be caused
> > > if all requests are completed just before the current SCSI device
> > > is added to shost->starved_list.
> > 
> > This looks like an improvement. I have added it for 4.15.
> > 
> > Bart, does this fix your hang?
> 
> No, it doesn't. After I had reduced starget->can_queue in the SRP initiator I
> ran into the following hang while running the srp-test software:
> 
> sysrq: SysRq : Show Blocked State
>   taskPC stack   pid father
> systemd-udevd   D0 19882467 0x8106
> Call Trace:
>  __schedule+0x2fa/0xbb0
>  schedule+0x36/0x90
>  io_schedule+0x16/0x40
>  __lock_page+0x10a/0x140
>  truncate_inode_pages_range+0x4ff/0x800
>  truncate_inode_pages+0x15/0x20
>  kill_bdev+0x35/0x40
>  __blkdev_put+0x6d/0x1f0
>  blkdev_put+0x4e/0x130
>  blkdev_close+0x25/0x30
>  __fput+0xed/0x1f0
>  fput+0xe/0x10
>  task_work_run+0x8b/0xc0
>  do_exit+0x38d/0xc70
>  do_group_exit+0x50/0xd0
>  get_signal+0x2ad/0x8c0
>  do_signal+0x28/0x680
>  exit_to_usermode_loop+0x5a/0xa0
>  do_syscall_64+0x12e/0x170
>  entry_SYSCALL64_slow_path+0x25/0x25
> 
> The SRP initiator driver was modified as follows for this test:
> 
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
> b/drivers/infiniband/ulp/srp/ib_srp.c
> index a6664467651e..9d24a871cc2e 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> 
> @@ -2835,6 +2839,13 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
>   return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
>  }
>  
> +static int srp_target_alloc(struct scsi_target *starget)
> +{
> + starget->can_queue = 1;
> + return 0;
> +}
> +
>  static int srp_slave_alloc(struct scsi_device *sdev)
>  {
>   struct Scsi_Host *shost = sdev->host;
> @@ -3039,6 +3050,7 @@ static struct scsi_host_template srp_template = {
>   .module = THIS_MODULE,
>   .name   = "InfiniBand SRP initiator",
>   .proc_name  = DRV_NAME,
> + .target_alloc   = srp_target_alloc,
>   .slave_alloc= srp_slave_alloc,
>   .slave_configure= srp_slave_configure,
>   .info   = srp_target_info,

Last time you didn't mention the target patch that sets can_queue to 1, so I
assume you can't reproduce the issue on an upstream kernel without that
out-of-tree patch. So this looks like a different issue, and we are actually
making progress.

I just posted a one-line patch that should address the small queue depth
issue; please let us know whether it fixes your issue:

https://marc.info/?l=linux-block=151004881411480=2

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-06 Thread Martin K. Petersen

Ming,

>> Since this is a SCSI patch the SCSI maintainer, Martin Petersen,
>> should have been Cc-ed for this patch. Additionally, I think that
>> this patch should not have been queued by Jens before Martin had
>> approved this patch.
>
> This patch has been CCed to SCSI list.

I don't need a personal CC; linux-scsi is sufficient. However, as a general
rule, changes to any file in drivers/scsi need to go through the SCSI tree.

We sometimes make exceptions to facilitate the merge process. In those
cases an Acked-by: will be provided to indicate that it is OK that the
patch goes through a different tree.
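
For reference, scripts/get_maintainer.pl run against the touched file (or the
patch itself) lists the tree and the addresses that should be Cc'd; a quick
sketch, run from the top of a kernel source tree (the patch file name below
is only illustrative):

./scripts/get_maintainer.pl --scm -f drivers/scsi/scsi_lib.c
./scripts/get_maintainer.pl --scm 0001-SCSI-don-t-get-target-host-busy_count.patch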

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-06 Thread Ming Lei
On Mon, Nov 06, 2017 at 07:45:23PM +, Bart Van Assche wrote:
> On Sat, 2017-11-04 at 08:19 -0600, Jens Axboe wrote:
> > On 11/03/2017 07:55 PM, Ming Lei wrote:
> > > It is very expensive to atomic_inc/atomic_dec the host wide counter of
> > > host->busy_count, and it should have been avoided via blk-mq's mechanism
> > > of getting driver tag, which uses the more efficient way of sbitmap queue.
> > > 
> > > Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
> > > and don't run queue if the counter becomes zero, so IO hang may be caused
> > > if all requests are completed just before the current SCSI device
> > > is added to shost->starved_list.
> > 
> > This looks like an improvement. I have added it for 4.15.
> > 
> > Bart, does this fix your hang?
> 
> No, it doesn't. After I had reduced starget->can_queue in the SRP initiator I
> ran into the following hang while running the srp-test software:
> 
> sysrq: SysRq : Show Blocked State
>   taskPC stack   pid father
> systemd-udevd   D0 19882467 0x8106
> Call Trace:
>  __schedule+0x2fa/0xbb0
>  schedule+0x36/0x90
>  io_schedule+0x16/0x40
>  __lock_page+0x10a/0x140
>  truncate_inode_pages_range+0x4ff/0x800
>  truncate_inode_pages+0x15/0x20
>  kill_bdev+0x35/0x40
>  __blkdev_put+0x6d/0x1f0
>  blkdev_put+0x4e/0x130
>  blkdev_close+0x25/0x30
>  __fput+0xed/0x1f0
>  fput+0xe/0x10
>  task_work_run+0x8b/0xc0
>  do_exit+0x38d/0xc70
>  do_group_exit+0x50/0xd0
>  get_signal+0x2ad/0x8c0
>  do_signal+0x28/0x680
>  exit_to_usermode_loop+0x5a/0xa0
>  do_syscall_64+0x12e/0x170
>  entry_SYSCALL64_slow_path+0x25/0x25

I can no longer reproduce your issue on IB/SRP against v4.14-rc4 with the
following patches, and there is no hang after running your 6 srp-test:

88022d7201e9 blk-mq: don't handle failure in .get_budget
826a70a08b12 SCSI: don't get target/host busy_count in scsi_mq_get_budget()
1f460b63d4b3 blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
358a3a6bccb7 blk-mq: don't handle TAG_SHARED in restart
0df21c86bdbf scsi: implement .get_budget and .put_budget for blk-mq
aeec77629a4a scsi: allow passing in null rq to scsi_prep_state_check()
b347689ffbca blk-mq-sched: improve dispatching from sw queue
de1482974080 blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
63ba8e31c3ac block: kyber: check if there are requests in ctx in kyber_has_work()
7930d0a00ff5 sbitmap: introduce __sbitmap_for_each_set()
caf8eb0d604a blk-mq-sched: move actual dispatching into one helper
5e3d02bbafad blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch
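
(A quick way to check that a local tree contains all of the commits above is
the following sketch, run inside the kernel git tree:)

# Verify that every commit listed above is reachable from HEAD.
for c in 88022d7201e9 826a70a08b12 1f460b63d4b3 358a3a6bccb7 0df21c86bdbf \
         aeec77629a4a b347689ffbca de1482974080 63ba8e31c3ac 7930d0a00ff5 \
         caf8eb0d604a 5e3d02bbafad; do
	if git merge-base --is-ancestor "$c" HEAD; then
		echo "ok      $c"
	else
		echo "MISSING $c"
	fi
done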


If you can reproduce, please provide me at least the following log
first:

find /sys/kernel/debug/block -name tags | xargs cat | grep busy

If any pending requests aren't completed, please provide the related
info in debugfs about where the request is.

-- 
Ming


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-06 Thread Bart Van Assche
On Sat, 2017-11-04 at 08:19 -0600, Jens Axboe wrote:
> On 11/03/2017 07:55 PM, Ming Lei wrote:
> > It is very expensive to atomic_inc/atomic_dec the host wide counter of
> > host->busy_count, and it should have been avoided via blk-mq's mechanism
> > of getting driver tag, which uses the more efficient way of sbitmap queue.
> > 
> > Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
> > and don't run queue if the counter becomes zero, so IO hang may be caused
> > if all requests are completed just before the current SCSI device
> > is added to shost->starved_list.
> 
> This looks like an improvement. I have added it for 4.15.
> 
> Bart, does this fix your hang?

No, it doesn't. After I had reduced starget->can_queue in the SRP initiator I
ran into the following hang while running the srp-test software:

sysrq: SysRq : Show Blocked State
  taskPC stack   pid father
systemd-udevd   D0 19882467 0x8106
Call Trace:
 __schedule+0x2fa/0xbb0
 schedule+0x36/0x90
 io_schedule+0x16/0x40
 __lock_page+0x10a/0x140
 truncate_inode_pages_range+0x4ff/0x800
 truncate_inode_pages+0x15/0x20
 kill_bdev+0x35/0x40
 __blkdev_put+0x6d/0x1f0
 blkdev_put+0x4e/0x130
 blkdev_close+0x25/0x30
 __fput+0xed/0x1f0
 fput+0xe/0x10
 task_work_run+0x8b/0xc0
 do_exit+0x38d/0xc70
 do_group_exit+0x50/0xd0
 get_signal+0x2ad/0x8c0
 do_signal+0x28/0x680
 exit_to_usermode_loop+0x5a/0xa0
 do_syscall_64+0x12e/0x170
 entry_SYSCALL64_slow_path+0x25/0x25

The SRP initiator driver was modified as follows for this test:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index a6664467651e..9d24a871cc2e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c

@@ -2835,6 +2839,13 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
 	return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
 }
 
+static int srp_target_alloc(struct scsi_target *starget)
+{
+	starget->can_queue = 1;
+	return 0;
+}
+
 static int srp_slave_alloc(struct scsi_device *sdev)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -3039,6 +3050,7 @@ static struct scsi_host_template srp_template = {
 	.module			= THIS_MODULE,
 	.name			= "InfiniBand SRP initiator",
 	.proc_name		= DRV_NAME,
+	.target_alloc		= srp_target_alloc,
 	.slave_alloc		= srp_slave_alloc,
 	.slave_configure	= srp_slave_configure,
 	.info			= srp_target_info,

Bart.


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-06 Thread Bart Van Assche
On Sat, 2017-11-04 at 09:55 +0800, Ming Lei wrote:
> It is very expensive to atomic_inc/atomic_dec the host wide counter of
> host->busy_count, and it should have been avoided via blk-mq's mechanism
> of getting driver tag, which uses the more efficient way of sbitmap queue.

Did you perhaps mean the host->host_busy counter? Unrelated to this patch:
I think that commit 64d513ac31bd ("scsi: use host wide tags by default") made
that counter superfluous.

> Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
> and don't run queue if the counter becomes zero, so IO hang may be caused
> if all requests are completed just before the current SCSI device
> is added to shost->starved_list.

What is the relationship between the above description and the code changes
below? The above description does not explain whether the
scsi_mq_get/put_budget() changes below prevent all outstanding SCSI requests
from completing before another queue is added to the starved list.

Is this perhaps an attempt to move the code that can add a request queue to
"starved_list" from inside scsi_mq_get_budget() into scsi_queue_rq()? Does
this patch do more than reduce the probability of hitting the race in which a
queue is added to "starved_list" after all requests have finished?

> Fixes: 0df21c86bdbf (scsi: implement .get_budget and .put_budget for blk-mq)
> Reported-by: Bart Van Assche 
> Signed-off-by: Ming Lei 

Since this is a SCSI patch the SCSI maintainer, Martin Petersen, should have
been Cc-ed for this patch. Additionally, I think that this patch should not
have been queued by Jens before Martin had approved this patch.

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index a098af3070a1..7f218ef61900 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1944,11 +1944,7 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct scsi_device *sdev = q->queuedata;
> -	struct Scsi_Host *shost = sdev->host;
>  
> -	atomic_dec(&shost->host_busy);
> -	if (scsi_target(sdev)->can_queue > 0)
> -		atomic_dec(&scsi_target(sdev)->target_busy);
>  	atomic_dec(&sdev->device_busy);
>  	put_device(&sdev->sdev_gendev);
>  }

scsi_mq_get/put_budget() were introduced to improve sequential I/O
performance. Does removing the SCSI target busy check have a negative
performance impact on sequential I/O for e.g. the iSER initiator driver? The
iSCSI over TCP initiator driver is not appropriate for testing performance
regressions because it limits the iSCSI parameter MaxOutstandingR2T to one.

Bart.

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-04 Thread Jens Axboe
On 11/03/2017 07:55 PM, Ming Lei wrote:
> It is very expensive to atomic_inc/atomic_dec the host wide counter of
> host->busy_count, and it should have been avoided via blk-mq's mechanism
> of getting driver tag, which uses the more efficient way of sbitmap queue.
> 
> Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
> and don't run queue if the counter becomes zero, so IO hang may be caused
> if all requests are completed just before the current SCSI device
> is added to shost->starved_list.

This looks like an improvement. I have added it for 4.15.

Bart, does this fix your hang?

-- 
Jens Axboe