NVMe's error handler follows the typical steps for tearing down hardware:

1) stop the blk-mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
   blk_mq_tagset_busy_iter(tags, cancel_request, ...);
   cancel_request() marks the request as aborted and calls
   blk_mq_complete_request(req)
4) destroy the real hw queues

However, there may be a race between #3 and #4, because
blk_mq_complete_request() actually completes the request
asynchronously. This patch introduces blk_mq_complete_request_sync()
to fix the above race.

> Other block drivers wait until outstanding requests have completed
> by calling blk_cleanup_queue() before hardware queues are destroyed.
> Why can't the NVMe driver follow that approach?

The controller teardown may be done in the error handler, at which
point the request queues may not have been cleaned up yet. Almost all
kinds of NVMe controllers' error handling follows the above steps, for
example:

nvme_rdma_error_recovery_work()
  ->nvme_rdma_teardown_io_queues()
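
To make the race between #3 and #4 concrete, below is a hedged sketch
of the cancel path and the proposed helper. nvme_cancel_request() is
abbreviated from drivers/nvme/host/core.c; the body of
blk_mq_complete_request_sync() is an assumption about the patch's
approach (run the completion in the caller's context rather than
deferring it via IPI/softirq), not necessarily the exact merged code.

#include <linux/blk-mq.h>
#include "nvme.h"	/* driver-local header providing nvme_req() */

/*
 * Step #3's callback, abbreviated: mark the request as aborted and
 * complete it. With plain blk_mq_complete_request() the driver's
 * ->complete() work may be deferred to another CPU, so it can still
 * be running when step #4 destroys the real hw queue.
 */
static bool nvme_cancel_request(struct request *req, void *data,
		bool reserved)
{
	nvme_req(req)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(req);
	return true;
}

/*
 * Assumed shape of the new helper: invoke the driver's ->complete()
 * callback synchronously, so that when blk_mq_tagset_busy_iter()
 * returns, every cancelled request has fully completed and step #4
 * can safely destroy the hw queues.
 */
void blk_mq_complete_request_sync(struct request *rq)
{
	rq->q->mq_ops->complete(rq);
}

The fix would then switch the cancel path over to the synchronous
variant so the busy-iter pass cannot return with completions still in
flight.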
Clarification: this happens in its own dedicated context, not in the
timeout or error handler. But I still don't understand the issue here:
what is the exact race you are referring to? That we abort/cancel a
request and then complete it again when we destroy the hw queue? If
so, that is not the case (at least for rdma/tcp), because
nvme_rdma_teardown_io_queues() first flushes the hw queue and then
aborts inflight I/O.
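
For reference, the ordering being described looks roughly like the
following (simplified from nvme_rdma_teardown_io_queues() in
drivers/nvme/host/rdma.c; details vary by kernel version, so treat
this as a sketch rather than the exact source):

/*
 * The blk-mq queues are quiesced and the real RDMA queues stopped
 * (flushed) before inflight requests are aborted, so no new
 * completions arrive from the fabric after the abort pass. The open
 * question in this thread is whether the abort pass's own
 * blk_mq_complete_request() calls can still be running when the
 * queues are destroyed.
 */
static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
		bool remove)
{
	if (ctrl->ctrl.queue_count > 1) {
		nvme_stop_queues(&ctrl->ctrl);	/* quiesce blk-mq hw queues */
		nvme_rdma_stop_io_queues(ctrl);	/* stop the real hw queues */
		blk_mq_tagset_busy_iter(&ctrl->tag_set,
				nvme_cancel_request, &ctrl->ctrl);
		nvme_rdma_destroy_io_queues(ctrl, remove);
	}
}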
