Public bug reported:

[Impact]

During our AWS testing we experienced deadlocks on hibernation across all
Xen instance types.
The trace showed that the system was stuck in xennet_remove():

[  358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[  358.115102] modprobe        D    0  4892   4833 0x00004004
[  358.115104] Call Trace:
[  358.115112]  __schedule+0x2a8/0x670
[  358.115115]  schedule+0x33/0xa0
[  358.115118]  xennet_remove+0x1f0/0x230 [xen_netfront]
[  358.115121]  ? wait_woken+0x80/0x80
[  358.115124]  xenbus_dev_remove+0x51/0xa0
[  358.115126]  device_release_driver_internal+0xe0/0x1b0
[  358.115127]  driver_detach+0x49/0x90
[  358.115129]  bus_remove_driver+0x59/0xd0
[  358.115131]  driver_unregister+0x2c/0x40
[  358.115132]  xenbus_unregister_driver+0x12/0x20
[  358.115134]  netif_exit+0x10/0x7aa [xen_netfront]
[  358.115137]  __x64_sys_delete_module+0x146/0x290
[  358.115140]  do_syscall_64+0x5a/0x130
[  358.115142]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This prevented hibernation from completing.

The cause of this problem is a race condition in xennet_remove(): the
driver reads the current state of the bus, requests a change of the
state to "Closing", and then waits for the state to become "Closing".
However, if the state becomes "Closed" between reading the state and
requesting the state change, the wait never completes, because the
state will never change from "Closed" back to "Closing".

[Test case]

Create any Xen-based instance in AWS and hibernate/resume it multiple
times. Sometimes the system gets stuck (hung task timeout).

[Fix]

Prevent the deadlock by changing the wait condition to also check for
state == Closed.

[Regression potential]

Minimal: this change affects only Xen, and more precisely only the
xen-netfront driver.

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux-aws (Ubuntu Eoan)
     Importance: Undecided
         Status: New

** Affects: linux-aws (Ubuntu Focal)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1888510

Title:
  xen-netfront: potential deadlock in xennet_remove()

Status in linux-aws package in Ubuntu:
  New
Status in linux-aws source package in Eoan:
  New
Status in linux-aws source package in Focal:
  New


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net