Calling napi_disable() on an already disabled NAPI can cause a
deadlock. In commit 4bc12818b363 ("virtio-net: disable delayed refill
when pausing rx"), to avoid the deadlock, we disable and cancel the
delayed refill work when pausing RX in virtnet_rx_pause[_all]().
However, in virtnet_rx_resume_all(), we enable the delayed refill
work too early, before enabling all the receive queue NAPIs.

The deadlock can be reproduced by running
selftests/drivers/net/hw/xsk_reconfig.py with a multiqueue virtio-net
device and inserting a cond_resched() inside the for loop in
virtnet_rx_resume_all() to increase the success rate. Because the worker
processing the delayed refill work runs on the same CPU as
virtnet_rx_resume_all(), a reschedule is needed to cause the deadlock.
In a real scenario, contention on netdev_lock can cause the reschedule.

In this series, we make the refill work a per receive queue work instead
so that we can manage them separately and avoid further mistakes.

- Patch 1 makes the refill work a per receive queue work. It fixes the
  deadlock in the reproducer because we now only need to ensure that the
  refill work is scheduled after the NAPI of its own receive queue is
  enabled, not after all NAPIs of all queues. After this patch,
  enable_delayed_refill is still called before napi_enable in
  virtnet_rx_resume[_all], but I don't see how the work can be scheduled
  in that window.
- Patch 2 moves enable_delayed_refill after napi_enable and fixes the
  deadlock variant in virtnet_open.
- Patch 3 fixes the issue that arises when enable_delayed_refill is
  moved after napi_enable. The issue is that a refill work might need to
  be scheduled in virtnet_receive but cannot be, because the refill work
  is disabled. This can leave the receive side stuck. So we set a
  pending bit, and the work is scheduled later, when the refill work is
  enabled.

All 3 patches need to be applied to fix the issue, so does that mean I
need to add Fixes and Cc stable to all 3?

Link to the previous approach and discussion:
https://lore.kernel.org/netdev/[email protected]/

Reported-by: Paolo Abeni <[email protected]>
Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/drv-hw-dbg/results/400961/3-xdp-py/stderr

Thanks,
Quang Minh.

Bui Quang Minh (3):
  virtio-net: make refill work a per receive queue work
  virtio-net: ensure rx NAPI is enabled before enabling refill work
  virtio-net: schedule the pending refill work after being enabled

 drivers/net/virtio_net.c | 173 ++++++++++++++++++++-------------------
 1 file changed, 91 insertions(+), 82 deletions(-)

-- 
2.43.0

