On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> At this point, I have no idea what Bart's setup looks like. Bart, it
> would be REALLY helpful if you could tell us how you are reproducing
> your hang. I don't know why this has to be dragged out.

Hello Jens,

I am disappointed that you have allowed Ming to evaluate approaches other than
reverting "blk-mq: don't handle TAG_SHARED in restart". That patch replaces an
algorithm that is trusted by the community with an algorithm that even Ming
has acknowledged is racy. A quote from [1]:
"IO hang may be caused if all requests are completed just before the current
SCSI device is added to shost->starved_list". I don't know of any way to fix
that race other than serializing request submission and completion by adding
locking around these actions, which is something we don't want. Hence my
request to revert that patch.

Regarding the test I run, here is a summary of what I mentioned in previous
e-mails:
* I modified the SRP initiator such that the SCSI target queue depth is
  reduced to one by setting starget->can_queue to 1 from inside
  scsi_host_template.target_alloc.
* With that modified SRP initiator I run the srp-test software as follows
  until something breaks:
  while ./run_tests -f xfs -d -e deadline -r 60; do :; done
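
As a rough illustration of the first step above, here is a minimal userspace
sketch. It is not the actual SRP initiator change: struct scsi_target below is
a mock that contains only the one field discussed here, and the names
srp_target_alloc() and target_may_queue() are hypothetical. The only details
taken from the description above are that the callback is installed as
scsi_host_template.target_alloc and that it sets starget->can_queue to 1.

```c
/*
 * Userspace mock, NOT kernel code. The real struct scsi_target lives in
 * <scsi/scsi_device.h>; this stand-in only carries the field used here.
 */
struct scsi_target {
	unsigned int can_queue;	/* per-target queue depth; 0 is assumed to mean "no limit" */
};

/*
 * Hypothetical callback with the shape of scsi_host_template.target_alloc:
 * it receives the newly allocated target and returns 0 on success.
 */
static int srp_target_alloc(struct scsi_target *starget)
{
	starget->can_queue = 1;	/* at most one command in flight per target */
	return 0;
}

/*
 * Hypothetical admission check: may another command be queued when 'busy'
 * commands are already outstanding on this target?
 */
static int target_may_queue(const struct scsi_target *starget,
			    unsigned int busy)
{
	return starget->can_queue == 0 || busy < starget->can_queue;
}
```

With a depth of one, every submission that arrives while a command is
outstanding is refused and has to be restarted later, which is why this
setting exercises the restart path so heavily.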

Today a system with at least one InfiniBand HCA is required to run that test.
When I have the time I will post the SRP initiator and target patches on the
linux-rdma mailing list that make it possible to run that test against the
SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
use that driver is an Ethernet adapter.

Bart.

[1] [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg15263.html).
