** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

https://bugs.launchpad.net/bugs/1764982

Title:
  [bionic] machine stuck and bonding not working well when nvmet_rdma
  module is loaded

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Committed

Bug description:

  == SRU Justification ==
  This bug causes the machine to get stuck and bonding to not work when
  the nvmet_rdma module is loaded.

  Both of these commits are in mainline as of v4.17-rc1.

  == Fixes ==
  a3dd7d0022c3 ("nvmet-rdma: Don't flush system_wq by default during remove_one")
  9bad0404ecd7 ("nvme-rdma: Don't flush delete_wq by default during remove_one")

  == Regression Potential ==
  Low.  Limited to nvme driver and tested by Mellanox.

  == Test Case ==
  A test kernel was built with these patches and tested by the original bug reporter.
  The bug reporter states the test kernel resolved the bug.


  
  == Original Bug Description ==

  Hi
  The machine gets stuck after unregistering the bonding interface when the
  nvmet_rdma module is loaded.

  scenario:

   # modprobe nvmet_rdma
   # modprobe -r bonding
   # modprobe bonding  -v mode=1 miimon=100 fail_over_mac=0
   # ifdown eth4
   # ifdown eth5
   # ip addr add 15.209.12.173/8 dev bond0
   # ip link set bond0 up
   # echo +eth5 > /sys/class/net/bond0/bonding/slaves
   # echo +eth4 > /sys/class/net/bond0/bonding/slaves
   # echo -eth4 > /sys/class/net/bond0/bonding/slaves
   # echo -eth5 > /sys/class/net/bond0/bonding/slaves
   # echo -bond0 > /sys/class/net/bonding_masters

  dmesg:

  kernel: [78348.225556] bond0 (unregistering): Released all slaves
  kernel: [78358.339631] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78368.419621] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78378.499615] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78388.579625] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78398.659613] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78408.739655] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78418.819634] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78428.899642] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78438.979614] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78449.059619] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78459.139626] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78469.219623] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78479.299619] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78489.379620] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78499.459623] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78509.539631] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
  kernel: [78519.619629] unregister_netdevice: waiting for bond0 to become free. Usage count = 2
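
When checking a test kernel against this bug, it can help to pick the
"waiting for ... to become free" messages out of a captured log
automatically. The following is only a minimal sketch: parse_netdev_wait
and struct netdev_wait are hypothetical names, not part of the kernel or
of any Ubuntu tool, and the code assumes each message sits on a single
line, as the kernel emits it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type for this sketch only. */
struct netdev_wait {
        char dev[16];     /* interface name; IFNAMSIZ is 16 on Linux */
        int usage_count;  /* value printed after "Usage count =" */
};

/*
 * Look for the unregister_netdevice wait message anywhere in one log line.
 * Returns 1 and fills *out when the message is present, 0 otherwise.
 */
static int parse_netdev_wait(const char *line, struct netdev_wait *out)
{
        static const char marker[] = "unregister_netdevice: waiting for ";
        const char *p;

        assert(line != NULL && out != NULL);

        p = strstr(line, marker);
        if (p == NULL)
                return 0;
        p += sizeof(marker) - 1;  /* skip past the marker text */

        /* %15s leaves room for the terminating NUL in dev[16]. */
        if (sscanf(p, "%15s to become free. Usage count = %d",
                   out->dev, &out->usage_count) != 2)
                return 0;
        return 1;
}
```

Feeding each line of the log to parse_netdev_wait and counting the hits
per device gives a quick yes/no signal: on a fixed kernel the scenario
above should produce no hits for bond0.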

  The following upstream commits fix this issue:

  commit a3dd7d0022c347207ae931c753a6dc3e6e8fcbc1
  Author: Max Gurtovoy <m...@mellanox.com>
  Date:   Wed Feb 28 13:12:38 2018 +0200

      nvmet-rdma: Don't flush system_wq by default during remove_one

      The .remove_one function is called for any ib_device removal.
      In case the removed device has no reference in our driver, there
      is no need to flush the system work queue.

      Reviewed-by: Israel Rukshin <isra...@mellanox.com>
      Signed-off-by: Max Gurtovoy <m...@mellanox.com>
      Reviewed-by: Sagi Grimberg <s...@grimberg.me>
      Signed-off-by: Keith Busch <keith.bu...@intel.com>
      Signed-off-by: Jens Axboe <ax...@kernel.dk>

  diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
  index aa8068f..a59263d 100644
  --- a/drivers/nvme/target/rdma.c
  +++ b/drivers/nvme/target/rdma.c
  @@ -1469,8 +1469,25 @@ static struct nvmet_fabrics_ops nvmet_rdma_ops = {
   static void nvmet_rdma_remove_one(struct ib_device *ib_device, void *client_data)
   {
          struct nvmet_rdma_queue *queue, *tmp;
  +        struct nvmet_rdma_device *ndev;
  +        bool found = false;
  +
  +        mutex_lock(&device_list_mutex);
  +        list_for_each_entry(ndev, &device_list, entry) {
  +                if (ndev->device == ib_device) {
  +                        found = true;
  +                        break;
  +                }
  +        }
  +        mutex_unlock(&device_list_mutex);
  +
  +        if (!found)
  +                return;

  -        /* Device is being removed, delete all queues using this device */
  +        /*
  +         * IB Device that is used by nvmet controllers is being removed,
  +         * delete all queues using this device.
  +         */
          mutex_lock(&nvmet_rdma_queue_mutex);
          list_for_each_entry_safe(queue, tmp, &nvmet_rdma_queue_list,
                                   queue_list) {

  commit 9bad0404ecd7594265cef04e176adeaa4ffbca4a
  Author: Max Gurtovoy <m...@mellanox.com>
  Date:   Wed Feb 28 13:12:39 2018 +0200

      nvme-rdma: Don't flush delete_wq by default during remove_one

      The .remove_one function is called for any ib_device removal.
      In case the removed device has no reference in our driver, there
      is no need to flush the work queue.

      Reviewed-by: Israel Rukshin <isra...@mellanox.com>
      Signed-off-by: Max Gurtovoy <m...@mellanox.com>
      Reviewed-by: Sagi Grimberg <s...@grimberg.me>
      Signed-off-by: Keith Busch <keith.bu...@intel.com>
      Signed-off-by: Jens Axboe <ax...@kernel.dk>

  diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
  index f5f460b..250b277 100644
  --- a/drivers/nvme/host/rdma.c
  +++ b/drivers/nvme/host/rdma.c
  @@ -2024,6 +2024,20 @@ static struct nvmf_transport_ops nvme_rdma_transport = {
   static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
   {
          struct nvme_rdma_ctrl *ctrl;
  +        struct nvme_rdma_device *ndev;
  +        bool found = false;
  +
  +        mutex_lock(&device_list_mutex);
  +        list_for_each_entry(ndev, &device_list, entry) {
  +                if (ndev->dev == ib_device) {
  +                        found = true;
  +                        break;
  +                }
  +        }
  +        mutex_unlock(&device_list_mutex);
  +
  +        if (!found)
  +                return;

          /* Delete all controllers using this device */
          mutex_lock(&nvme_rdma_ctrl_mutex);
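
Both patches add the same guard: look the removed ib_device up in the
driver's own device list under device_list_mutex, and return early when
it is not there, so that the queue/controller teardown and the workqueue
flush are skipped for devices the driver never used. The following
userspace sketch models only that control flow. It is not kernel code:
struct fake_ib_device, struct tracked_device, track_device and
flush_count are hypothetical stand-ins, a pthread mutex replaces the
kernel mutex, and a counter replaces the real teardown and flush.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_device. */
struct fake_ib_device {
        int id;
};

/* Hypothetical stand-in for the driver's per-device list entry. */
struct tracked_device {
        struct fake_ib_device *device;
        struct tracked_device *next;
};

static pthread_mutex_t device_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct tracked_device *device_list;  /* devices this "driver" uses */
static int flush_count;                     /* counts simulated flushes */

/* Record that the driver holds a reference to dev. */
static void track_device(struct tracked_device *entry,
                         struct fake_ib_device *dev)
{
        assert(entry != NULL && dev != NULL);

        pthread_mutex_lock(&device_list_mutex);
        entry->device = dev;
        entry->next = device_list;
        device_list = entry;
        pthread_mutex_unlock(&device_list_mutex);
}

/* Model of the patched remove_one callback. */
static void remove_one(struct fake_ib_device *dev)
{
        struct tracked_device *ndev;
        bool found = false;

        pthread_mutex_lock(&device_list_mutex);
        for (ndev = device_list; ndev != NULL; ndev = ndev->next) {
                if (ndev->device == dev) {
                        found = true;
                        break;
                }
        }
        pthread_mutex_unlock(&device_list_mutex);

        if (!found)
                return;  /* device was never used here: nothing to flush */

        /* Stand-in for deleting queues/controllers and flushing work. */
        flush_count++;
}
```

In this model, calling remove_one for a device that was never passed to
track_device leaves flush_count unchanged, which corresponds to the
patched drivers skipping the flush for an unrelated ib_device removal.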

