Hi Niklas,

> From: Niklas Schnelle <schne...@linux.ibm.com>
> Sent: Monday, June 15, 2020 3:32 PM
>
> Hello Saeed,
>
> On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> > On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> >> Hello Parav, Hello Saeed,
> >>
> ... snip ...
> >>
> >> So without really knowing anything about these functions, I would
> >> guess that with the device still registered, the drained queue does
> >> not remain empty as new entries are added.
> >> Does that sound plausible to you?
> >>
> >
> > I don't think it is related; maybe this is similar to some issues
> > addressed lately by Shay's patches:
> >
> > https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-sae...@mellanox.com/
> > https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-sae...@mellanox.com/
> >
> > net/mlx5: drain health workqueue in case of driver load error
> > net/mlx5: Fix fatal error handling during device load
>
> I agree with your similarity assessment, especially for the first commit.
> These do not fix the issue, though: with mainline v5.8-rc1, which has
> both, I am still getting a hang over 50% of the time with the following
> detach sequence on z/VM:
>
> vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle
>
> Since commit 41798df9bfca ("net/mlx5: Drain wq first during PCI device
> removal") no longer reverts cleanly, I used the following diff to move
> the mlx5_drain_health_wq(dev) call after the device unregistration done
> in mlx5_unload_one():
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 8b658908f044..63a196fd8e68 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)
>
>  	devlink_reload_disable(devlink);
>  	mlx5_crdump_disable(dev);
> -	mlx5_drain_health_wq(dev);
>  	mlx5_unload_one(dev, true);
> +	mlx5_drain_health_wq(dev);
>  	mlx5_pci_close(dev);
>  	mlx5_mdev_uninit(dev);
>  	mlx5_devlink_free(devlink);
>
> Note that this changed order also matches the call order in
> mlx5_pci_err_detected().
> With that change I have now done over two dozen detachments, varying the
> time between attach and detach so that the driver is at different stages
> of initialization when it is removed. All of them worked without a hitch.
>
> Best regards,
> Niklas Schnelle
>
Sorry for my late response.

Yes, this looks good, and I found the same in my analysis. With the
latest code, mlx5_pci_close() already does mlx5_drain_health_wq(), so the
additional call in remove_one() is redundant and should just be removed.
If you can verify the below hunk in your setup, it would be really
helpful. You still need patch 42ea9f1b5c6 in your tree.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..ebec2318dbc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)

 	devlink_reload_disable(devlink);
 	mlx5_crdump_disable(dev);
-	mlx5_drain_health_wq(dev);
 	mlx5_unload_one(dev, true);
 	mlx5_pci_close(dev);
 	mlx5_mdev_uninit(dev);
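
For reference, with patch 42ea9f1b5c6 applied the drain happens inside
mlx5_pci_close() itself, after the device has been unloaded but before
the PCI BARs are released, since the health work needs the BAR to read
the NIC state. A rough sketch of the relevant function as I remember it
from the current tree; please double-check against your exact version:

static void mlx5_pci_close(struct mlx5_core_dev *dev)
{
	/* Health work might still be active, and it needs the PCI BAR
	 * in order to read the NIC state. Therefore, drain the health
	 * WQ before removing the PCI BARs.
	 */
	mlx5_drain_health_wq(dev);
	iounmap(dev->iseg);
	pci_clear_master(dev->pdev);
	release_bar(dev->pdev);
	mlx5_pci_disable_device(dev);
}

So with the hunk above, remove_one() ends up with the same
unload-then-drain order that your diff produced by hand, and it also
matches the call order in mlx5_pci_err_detected().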