On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> At this point, I have no idea what Bart's setup looks like. Bart, it
> would be REALLY helpful if you could tell us how you are reproducing
> your hang. I don't know why this has to be dragged out.

Hello Jens,

It is a disappointment to me that you have allowed Ming to evaluate
approaches other than reverting "blk-mq: don't handle TAG_SHARED in
restart". That patch replaces an algorithm that the community trusts with
one that even Ming has acknowledged is racy. A quote from [1]: "IO hang may
be caused if all requests are completed just before the current SCSI device
is added to shost->starved_list". I don't know of any way to fix that race
other than serializing request submission and completion by adding locking
around these actions, which is something we don't want. Hence my request to
revert that patch.

Regarding the test I run, here is a summary of what I mentioned in previous
e-mails:
* I modified the SRP initiator such that the SCSI target queue depth is
  reduced to one by setting starget->can_queue to 1 from inside
  scsi_host_template.target_alloc.
* With that modified SRP initiator I run the srp-test software as follows
  until something breaks:

    while ./run_tests -f xfs -d -e deadline -r 60; do :; done

Currently, a system with at least one InfiniBand HCA is required to run
that test. When I have the time I will post the SRP initiator and target
patches on the linux-rdma mailing list that make it possible to run that
test against the SoftRoCE driver (drivers/infiniband/sw/rxe). The only
hardware required to use that driver is an Ethernet adapter.

Bart.

[1] [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg15263.html).