** Description changed:

- This is impacting us for ubuntu autopkgtests. Eventually the whole
- region ends up dying because each worker is hit by this bug in turn and
- backs off until the next reset (6 hourly).
+ == SRU Justification ==
+ 
+ The fix to bug 1672819 can cause a lockup because it can spin
+ indefinitely waiting for a child to exit.
+ 
+ [FIX]
+ Add a sauce patch to the original fix to insert a reasonably small delay and an upper bound on the number of retries that are made before bailing out with an error. This avoids the lockup and is also less aggressive in the retry loop.
+ 
+ [TEST]
+ Without the fix the machine hangs. With the fix, the lockup no longer occurs.
+ 
+ [REGRESSION POTENTIAL]
+ The interruptible sleep could have some unforeseen impact on racy userspace code that expects the system call to return quickly when the race condition occurs, and instead gets delayed by a few milliseconds while the retry loop spins. However, code that relies on the timing of fork/exec inside pthreads, where this particular code path could bite, is generally non-POSIX-conforming racy code anyhow.
+ 
+ -----------------------------------
+ 
+ 
+ This is impacting Ubuntu autopkgtests for us. Eventually the whole region ends up dying because each worker is hit by this bug in turn and backs off until the next reset (every 6 hours).
  
  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console
  
    [  OK  ] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
  
  One host that exhibited this behaviour was:
  
    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux
  
  The guest was running:
  
    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)
  
  The affected cloud region is running the xenial/Ocata cloud archive, so
  the version of qemu-kvm in there may also be relevant.
  
  Here's how I reproduced it in lcy01:
  
    $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
    $ <ssh to each instance> sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done
  
  On bad instances you'll see the "soft lockup" message - on good ones
  it'll reboot as normal.
  
  We've seen good and bad instances on multiple compute hosts - it doesn't
  feel to me like a host problem but rather a race condition somewhere
  that's somehow either triggered, or triggered much more often, by
  whatever lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on the n>1th boot. (But if it's a race
  then that might not mean much.)
  
  I'll attach a bad and a good console-log for reference.
  
  If you're at Canonical then see internal rt #107135 for some other
  details.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause a lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonably small
  delay and an upper bound on the number of retries that are made before
  bailing out with an error. This avoids the lockup and is also less
  aggressive in the retry loop.
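
  For illustration only, the bounded-retry idea amounts to something like
  the standalone userspace sketch below. The names (condition_met,
  MAX_RETRIES, RETRY_DELAY_MS) are invented here and are not from the
  actual SAUCE patch; the kernel version would use something like
  msleep_interruptible() rather than nanosleep().

    /* Hypothetical mock-up of the bounded-retry-with-delay pattern. */
    #define _POSIX_C_SOURCE 199309L
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define MAX_RETRIES    50   /* upper bound instead of spinning forever */
    #define RETRY_DELAY_MS 10   /* small delay between attempts */

    /* Stand-in for the condition the original loop was spinning on. */
    static bool condition_met(void)
    {
            static int calls;
            return ++calls > 5;   /* pretend it becomes true after a few checks */
    }

    static int wait_with_bounded_retries(void)
    {
            struct timespec delay = {
                    .tv_sec  = 0,
                    .tv_nsec = RETRY_DELAY_MS * 1000000L,
            };

            for (int i = 0; i < MAX_RETRIES; i++) {
                    if (condition_met())
                            return 0;            /* condition satisfied */
                    nanosleep(&delay, NULL);     /* back off instead of busy-spinning */
            }
            return -EAGAIN;                      /* give up with an error */
    }

    int main(void)
    {
            int ret = wait_with_bounded_retries();

            printf("wait_with_bounded_retries() returned %d\n", ret);
            return ret ? 1 : 0;
    }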

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  The interruptible sleep could have some unforeseen impact on racy
  userspace code that expects the system call to return quickly when the
  race condition occurs, and instead gets delayed by a few milliseconds
  while the retry loop spins. However, code that relies on the timing of
  fork/exec inside pthreads, where this particular code path could bite,
  is generally non-POSIX-conforming racy code anyhow.
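
  To make that concern concrete, the kind of userspace pattern that could
  notice the extra delay looks roughly like the hypothetical example below
  (invented for illustration; it forks from a thread and then assumes the
  child will be done within a fixed couple of milliseconds, a guarantee
  POSIX never gave in the first place):

    /* Hypothetical racy timing assumption around fork/exec in a thread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *spawn_and_check(void *arg)
    {
            (void)arg;
            pid_t pid = fork();

            if (pid < 0)
                    return NULL;                 /* fork failed; not the point here */
            if (pid == 0) {                      /* child: exec a trivial program */
                    execlp("true", "true", (char *)NULL);
                    _exit(127);
            }

            usleep(2000);                        /* "2 ms should be enough" - racy assumption */

            int status;
            if (waitpid(pid, &status, WNOHANG) == 0) {
                    fprintf(stderr, "child not done within 2 ms - assumption lost\n");
                    waitpid(pid, &status, 0);    /* reap it properly anyway */
            } else {
                    printf("child finished within the assumed window this time\n");
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, spawn_and_check, NULL);   /* build with -pthread */
            pthread_join(t, NULL);
            return 0;
    }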

  -----------------------------------

  
  This is impacting Ubuntu autopkgtests for us. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (every 6 hours).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [  OK  ] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibited this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  The guest was running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
    $ <ssh to each instance> sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done

  On bad instances you'll see the "soft lockup" message - on good ones
  it'll reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered, or triggered much more
  often, by whatever lcy01 is running. I always saw this on the first
  reboot - never on first boot, and never on the n>1th boot. (But if
  it's a race then that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
