subject:"\[Kernel\-packages\] \[Bug 1730717\] Re\: Some VMs fail to reboot with \"watchdog\: BUG\: soft lockup \- CPU#0 stuck for 22s\! \[systemd\:1\]\""

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-04-03 Thread Seth Forshee

Bionic should have been fixed for a while now, updating the status.

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Released
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  Won't Fix
Status in linux source package in Artful:
  Fix Released
Status in qemu-kvm source package in Artful:
  Won't Fix
Status in linux source package in Bionic:
  Fix Released
Status in qemu-kvm source package in Bionic:
  Won't Fix

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-04-03 Thread ChristianEhrhardt

Also since the change was identified to be in the kernel, set qemu to
Won't Fix

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  Won't Fix
Status in linux source package in Artful:
  Fix Released
Status in qemu-kvm source package in Artful:
  Won't Fix
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Won't Fix

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-04-03 Thread Launchpad Bug Tracker

This bug was fixed in the package linux - 4.13.0-38.43

---
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
- i40e: Fix memory leak related filter programming status
- i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
- platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
- kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
- ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
- ALSA: hda - Fix headset mic detection problem for two Dell machines
- ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
- CIFS: make IPC a regular tcon
- CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
- CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
- i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
- scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
entries with KNL SNC2/SNC4 mode) (LP: #1743856)
- EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
- drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
- vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
being charged (LP: #1661876) // AC adapter status not detected on Asus
ZenBook UX410UAK (LP: #1745032)
- ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
- ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
- perf pmu: Extract function to get JSON alias map
- perf pmu: Pass pmu as a parameter to get_cpuid_str()
- perf tools arm64: Add support for get_cpuid_str function.
- perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
- perf vendor events arm64: Add ThunderX2 implementation defined pmu core
  events
- perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
- scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
- powerpc/mm/book3s64: Make KERN_IO_START a variable
- powerpc/mm/slb: Move comment next to the code it's referring to
- powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
- net: hns: add ACPI mode support for ethtool -p

  * CVE-2017-17807
- KEYS: add missing permission check for request_key() destination

  * [Artful SRU] Fix capsule update regression (LP: #1746019)
- efi/capsule-loader: Reinstate virtual capsule mapping

  * [Artful/Bionic] [Config] enable EDAC_GHES for ARM64 (LP: #1747746)
- Ubuntu: [Config] enable EDAC_GHES for ARM64

  * linux-tools: perf incorrectly linking libbfd (LP: #1748922)
- SAUCE: tools -- add ability to disable libbfd
- [Packaging] correct disablement of libbfd

  * Cherry pick c96f5471ce7d for delayacct fix (LP: #1747769)
- delayacct: Account blkio completion on the correct task

  * Error in CPU frequency reporting when nominal and min pstates are same
(cpufreq) (LP: #1746174)
- cpufreq: powernv: Dont assume distinct pstate values for nominal and pmin

  * retpoline abi files are empty on i386 (LP: #1751021)
- [Packaging] retpoline-extract -- instantiate retpoline files for i386
- [Packaging] final-checks -- sanity checking ABI contents
- [Packaging] final-checks -- check for empty retpoline files

  * [P9,Power NV][WSP][Ubuntu 1804] : "Kernel access of bad area " when grouping
different pmu events using perf fuzzer . (perf:) (LP: #1746225)
- powerpc/perf: Fix oops when grouping different pmu events

  * bnx2x_attn_int_deasserted3:4323 MC assert! (LP: #1715519) //
CVE-2018-126
- net: create skb_gso_validate_mac_len()
- bnx2x: disable GSO where gso_size is too big for hardware

  * Ubuntu16.04.03: ISAv3 initialize MMU registers before setting partition
table (LP: #1736145)
- powerpc/64s: Initialize ISAv3 MMU registers before setting partition table

  * powerpc/powernv: Flush console before platform error reboot (LP: #1735159)
- po

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-04-03 Thread ChristianEhrhardt

Thanks that this gets into releases now.
I wonder about Bionic's status - any update on that?

** Changed in: qemu-kvm (Ubuntu Zesty)
   Status: New => Won't Fix

** Changed in: qemu-kvm (Ubuntu Bionic)
   Status: Confirmed => Won't Fix

** Changed in: qemu-kvm (Ubuntu Artful)
   Status: Confirmed => Won't Fix

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  Won't Fix
Status in linux source package in Artful:
  Fix Released
Status in qemu-kvm source package in Artful:
  Won't Fix
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Won't Fix

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-03-19 Thread Stefan Bader

This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
artful' to 'verification-done-artful'. If the problem still exists,
change the tag 'verification-needed-artful' to 'verification-failed-
artful'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: verification-needed-artful

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  Fix Committed
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packa

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-02-16 Thread Glyn M Burton

Hello Everyone,

Google led me here with a search for a "soft lockup CPU#" error I am
experiencing.

I run Ubuntu Server on Citrix XenServer and have never had this issue
with Ubuntu Server 12.04, 14.04 or 16.04.  I am here because 18.04 has
this issue upon using 'reboot', although 'poweroff' does as intended.

If I be of any assistance to yourselves in retrieving log files, testing
or what-ever to remedy this bug for the good of everyone then please let
me know.  Keen to improve Ubuntu for all.

Thank you and best regards,

Modiford.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  Fix Committed
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.l

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-02-01 Thread Khaled El Mously

** Changed in: linux (Ubuntu Artful)
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  Fix Committed
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2018-01-23 Thread Stefan Bader

** Changed in: linux (Ubuntu Zesty)
   Status: Incomplete => Won't Fix

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Won't Fix
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-12-07 Thread Colin Ian King

** Description changed:

- This is impacting us for ubuntu autopkgtests. Eventually the whole
- region ends up dying because each worker is hit by this bug in turn and
- backs off until the next reset (6 hourly).
+ == SRU Justification ==
+ 
+ The fix to bug 1672819 can cause an lockup because it can spin
+ indefinitely waiting for a child to exit.
+ 
+ [FIX]
+ Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.
+ 
+ [TEST]
+ Without the fix the machine hangs. With the fix, the lockup no longer occurs.
+ 
+ [REGRESSION POTENTIAL]
+ There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.
+ 
+ ---
+ 
+ 
+ This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).
  
  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console
  
    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
  
  One host that exhibits this behaviour was:
  
    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux
  
  guest running:
  
    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)
  
  The affected cloud region is running the xenial/Ocata cloud archive, so
  the version of qemu-kvm in there may also be relevant.
  
  Here's how I reproduced it in lcy01:
  
    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done
  
  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.
  
  We've seen good and bad instances on multiple compute hosts - it doesn't
  feel to me like a host problem but rather a race condition somewhere
  that's somehow either triggered or triggered much more often by what
  lcy01 is running. I always saw this on the first reboot - never on first
  boot, and never on n>1th boot. (But if it's a race then that might not
  mean much.)
  
  I'll attach a bad and a good console-log for reference.
  
  If you're at Canonical then see internal rt #107135 for some other
  details.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace rac

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-12-07 Thread Seth Forshee

We've traced the problem to "UBUNTU: SAUCE: exec: ensure file system
accounting in check_unsafe_exec is correct." cking has a fix which will
be used for zesty and artful. I've reverted the patch in bionic since
there's a fix available for golang and we do not want Ubuntu userspace
to become reliant on the non-standard kernel behavior in the patch.

** Changed in: linux (Ubuntu Bionic)
   Status: In Progress => Fix Committed

** Changed in: linux (Ubuntu Bionic)
 Assignee: Joseph Salisbury (jsalisbury) => Seth Forshee (sforshee)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Fix Committed
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  == SRU Justification ==

  The fix to bug 1672819 can cause an lockup because it can spin
  indefinitely waiting for a child to exit.

  [FIX]
  Add a sauce patch to the original fix to insert a reasonable small delay and 
an upper bounds to the number of retries that are made before bailing out with 
an error. This avoids the lockup and also is less aggressive in the retry loop.

  [TEST]
  Without the fix the machine hangs. With the fix, the lockup no longer occurs.

  [REGRESSION POTENTIAL]
  There may be an issue with the interruptible sleep having some unforeseen 
impact with userspace racy code that expects the system call to return quickly 
when the race condition occurs and instead it gets delayed by a few 
milliseconds while the retry loop spins.  However, code that relies on timing 
on fork/exec inside pthreads where this particular code path could bite is 
generally non-POSIX conforming racy code anyhow.

  ---

  
  This is impacting us for ubuntu autopkgtests. Eventually the whole region 
ends up dying because each worker is hit by this bug in turn and backs off 
until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.l

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-12-04 Thread Joseph Salisbury

** Tags removed: kernel-key
** Tags added: kernel-da-key

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-12-04 Thread Andy Whitcroft

I would note that the kernel watchdog timeouts here are always at 20 odd
seconds.  They are not increasing so whatever is occuring is progressing
at least as far as the kernel is concerned.  If we assume the systemd
log is still working (and it was shortly before the event when it
reported reaching shutdown state) then we would expect it to be in the
process of attempting to deconstruct the system before calling reboot.
Most of the deconstructors it calls are reported before calling.  There
is one, cg_trim(), which is not announced.  Looking at the
implementation of that it is doing a hierachical remove of the
/sys/fs/cgroup hierachy.  On my system this is some 15000 files in 1300
directories.  If there was a performance issue in there we could easily
spend hours in this call with nothing logged.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-23 Thread Iain Lane

Just a heads up - this apparently became much harder for me to reproduce
at will. We're still seeing it in actual workloads but I'm having
trouble recreating manually.

My current strategy is to start stress-ng on a number of machines and
then constantly reboot them, with the idea that this will stress the
cloud and make the bug more likely, which relies on the assumption that
this bug is somehow to do with busyness on the underlying machines.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-18 Thread Joseph Salisbury

The fact that 4.13.8 doesn't reproduce the bug might be another
indicator that the bug was introduced by a SAUCE patch.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-17 Thread Iain Lane

OK a few days ago apw pointed me at http://kernel.ubuntu.com/~kernel-
ppa/mainline/v4.13.8/ which is the mainline kernel that the artful-
proposed one I identified as bad is based on.

I ran 35 instances and rebooted them 30 times - all successful. So I
think that says this kernel is good.

Will try another one later on.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-16 Thread Joseph Salisbury

The bug reporter in bug 1713751 has been unable to reproduce the bug
with the 4.13.0-16-generic kernel.  He's re-testing with the original
kernel that exhibited the bug to ensure he can reproduce it
consistently.  If he finds that 4.13.0-16-generic is really good he
really might be hitting a different bug.  I'll update this bug with that
answer.

laney, we could start a bisect in this bug at the same time if you want.
The first thing we would have to do is identify the last "Good" Ubuntu
kernel and first "Bad" Ubuntu kernel.  That would mean trying some of
the earlier 4.13 kernels, based on that, some 4.12 kernels, etc.  Bug
1713751 suggests that he is unable to reproduce the bug with 4.10.0-33,
so maybe you could try that one next?  It's available from:

https://launchpad.net/~canonical-kernel-
team/+archive/ubuntu/ppa/+build/13234886

Based on what you see with that, we can decide jump to the middle of the
versions and try a 4.11 or 4.12 based kernel?


** Changed in: linux (Ubuntu Artful)
 Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

** Changed in: linux (Ubuntu Artful)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Bionic)
   Status: Triaged => In Progress

** Also affects: linux (Ubuntu Zesty)
   Importance: Undecided
   Status: New

** Also affects: qemu-kvm (Ubuntu Zesty)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Zesty)
   Status: New => Incomplete

** Changed in: linux (Ubuntu Zesty)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Zesty)
 Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  In Progress
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Zesty:
  Incomplete
Status in qemu-kvm source package in Zesty:
  New
Status in linux source package in Artful:
  In Progress
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  In Progress
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-16 Thread Launchpad Bug Tracker

Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: qemu-kvm (Ubuntu Artful)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-16 Thread Launchpad Bug Tracker

Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: qemu-kvm (Ubuntu)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  Confirmed
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  Confirmed
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  Confirmed

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-09 Thread Adam Conrad

Note that if it *is* the same bug as #1713751, that reporter already
mentioned that using mainline kernels (and he was hitting this on 4.10)
fixed it for him, so it seems more plausible not that 4.14 has a fix,
but that Ubuntu's sauce has the breakage.  Of course, they may well not
be the same bug, cause the symptom itself is pretty generic.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-09 Thread Iain Lane

On Wed, Nov 08, 2017 at 06:57:46PM -, Joseph Salisbury wrote:
> Maybe give 4.13.0-16 a try:
> 
> https://launchpad.net/~canonical-kernel-security-
> team/+archive/ubuntu/ppa2/+build/13567624
> 
> It could also be the bug is being triggered by a Ubuntu specific SAUCE
> patch, so it won't happen with upstream kernels.

It looked to me like that's the same kernel that's in artful release, so
instead I tried with the kernel in artful-proposed (4.13.0-17.20) and
managed to reproduce the bug on 1/6 instances after a few reboot cycles.
So I think my method is okay to check candidate kernels. Feel free to
throw some more at me if you want to bisect.

(I think IS took some of the slower machines out of rotation so the
problem might become slightly harder to reproduce - definitely is still
happening though.)

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread Joseph Salisbury

Maybe give 4.13.0-16 a try:

https://launchpad.net/~canonical-kernel-security-
team/+archive/ubuntu/ppa2/+build/13567624

It could also be the bug is being triggered by a Ubuntu specific SAUCE
patch, so it won't happen with upstream kernels.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread Iain Lane

On Wed, Nov 08, 2017 at 03:16:29PM -, Joseph Salisbury wrote:
> @Laney, thanks for testing the mainline kernel.  It's promising that a
> fix might be in that kernel.  The time consuming part will be
> identifying what commit in that kernel is the actual fix.  We could
> perform a "Reverse" kernel bisect, which would required testing 12 or so
> test kernels.  However, it sounds like to set up a reproducer is time
> consuming as well.
> 
> The easiest thing to try next would be to test the latest upstream 4.13
> stable kernels.  It's possible the fix that is in 4.14-rc8 was also cc'd
> to upstream stable and made it's way into 4.13 thorough the normal
> stable update process.
> 
> If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is 
> available here:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

20 instances, all good.

Maybe I should try one that we think is 'bad' to confirm that the way
I'm doing this is actually capable of reproducing the issue? If you
think that's sensible, could you recommend me a kernel to try?

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread Joseph Salisbury

@Laney, thanks for testing the mainline kernel.  It's promising that a
fix might be in that kernel.  The time consuming part will be
identifying what commit in that kernel is the actual fix.  We could
perform a "Reverse" kernel bisect, which would required testing 12 or so
test kernels.  However, it sounds like to set up a reproducer is time
consuming as well.

The easiest thing to try next would be to test the latest upstream 4.13
stable kernels.  It's possible the fix that is in 4.14-rc8 was also cc'd
to upstream stable and made it's way into 4.13 thorough the normal
stable update process.

If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is 
available here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread Iain Lane

On Wed, Nov 08, 2017 at 11:56:02AM -, ChristianEhrhardt wrote:
> Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the 
> issues there before, so trying there as well.
> This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10
> 
> I thought about your case since you seem just to start a lot of them and 
> reboot,
> this shouldn't be so much different to:
> $ uvt-simplestreams-libvirt --verbose sync --source 
> http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
> $ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu 
> artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
> $ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
> $ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo 
> reboot"; done
> $ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

Sounds like it's similar, but maybe you have to put the system under
load - you might need more instances, or maybe start a whole bunch first
and get them to run something memory intensive before running that same
test again. In the cloud there will be buildds and tests running on the
compute nodes too, as well as these 'empty' instances that I use to
reproduce the problem.

> Waiting for your feedback if you can trigger the same issue on a non-
> busy openstack system (could after all be some openstack magic at work
> that makes it behave differently).

I don't have access to a non busy cloud I'm afraid.

ANYWAY! My results are in. I created an image by booting the stock
artful cloud image and installing the mainline kernel v4.14-rc8
(39dae59d66acd86d1de24294bd2f343fd5e7a625) packages, on lcy01 (the busy
cloud that exhibits this problem).

I started 34 (17 × 2 in two runs - that's all I could squeeze in before
I hit my quota) instances, and they were all good. This isn't definitive
proof, but it looks like that kernel might be good.

Cheers,

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread ChristianEhrhardt

Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the 
issues there before, so trying there as well.
This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10

I thought about your case since you seem just to start a lot of them and reboot,
this shouldn't be so much different to:
$ uvt-simplestreams-libvirt --verbose sync --source 
http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
$ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu 
artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
$ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
$ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo 
reboot"; done
$ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

But this works for me :-/

Waiting for your feedback if you can trigger the same issue on a non-
busy openstack system (could after all be some openstack magic at work
that makes it behave differently).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread Iain Lane

On Wed, Nov 08, 2017 at 11:08:05AM -, ChristianEhrhardt wrote:
> @Laney I found it interesting that you essentially only needed to 
> start+reboot.
> I assume on the host you had other workload goes on in the background (since 
> it is lcy01)?

I don't have visibility into what else the hosts are doing - I'm just a
client here. But I do know that *I* had other workloads running at the
same time. I assume that there were a lot of buildd jobs too. These
compute nodes are probably loaded most of the time.

> If you'd have any sort of non-busy but otherwise comparable system - could 
> you check to confirm the assumption we have so far that there all is fine?
> If yes - then the memory pressure theory gets more likely, if not we can 
> focus on simpler reproducers - so we can only win by that check.

Afraid not, sorry. I did try on my artful+artful host but didn't
reproduce the problem - it probably wasn't under enough stress anyway.

> Crossing fingers for jsalisbury's hope that 4.14 might already have a
> fix.

Right now I'm trying to build a new cloud image with this kernel that I
can try.

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-08 Thread ChristianEhrhardt

Out of the IRC discussions documenting potentially related issues:
- this bug: KVM: Host-Kernel: Xenial-GA, Qemu: Xenial-Ocata, Guest: Bionic
- bug 1722311 KVM: Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some 
relation to cache pressure
- bug 1713751 AWS: triggered by Xenial kernel update, supposed fixed but shown 
up again and again
- bug 1655842 Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some 
relation to cache pressure
These might after all just run into the same soft lockup symptom, but I thought 
it was worth to mention for thos enot reading the IRC log.

These cases seem to somewhat agree on:
- Recent guest kernel
- Xenial Host kernel
- some memory pressure

To get further I thought some sort of local reproducer for the kernel Team to 
work on easier than needing a full cloud.
But so far I failed at setting such a local case up 
(http://paste.ubuntu.com/25916781/).

Thanks Laney for the openstack based repro description.
@Laney I found it interesting that you essentially only needed to start+reboot.
I assume on the host you had other workload goes on in the background (since it 
is lcy01)?
If you'd have any sort of non-busy but otherwise comparable system - could you 
check to confirm the assumption we have so far that there all is fine?
If yes - then the memory pressure theory gets more likely, if not we can focus 
on simpler reproducers - so we can only win by that check.

Crossing fingers for jsalisbury's hope that 4.14 might already have a
fix.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Joseph Salisbury

Do you know if this bug is also happening with Zesty, or just Artful and
Bionic(>4.13)?

I'm going to working on bisecting bug 1713751 in case it's related.

Also, it would be good to know if this bug is already fixed in the
latest mainline kernel.  Do you have a way to reproduce this bug?  If
so, could you give 4.14-rc8 a try?  It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc8


** Changed in: linux (Ubuntu)
   Status: Incomplete => Triaged

** Also affects: linux (Ubuntu Bionic)
   Importance: High
   Status: Triaged

** Also affects: qemu-kvm (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Artful)
   Importance: Undecided
   Status: New

** Also affects: qemu-kvm (Ubuntu Artful)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Artful)
   Status: New => Triaged

** Changed in: linux (Ubuntu Artful)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Joseph Salisbury

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Tags added: kernel-key

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Iain Lane

I tried 28 (then my quota ran out) xenial guests BTW and none of those
failed.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  New
Status in qemu-kvm package in Ubuntu:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Iain Lane

** Attachment added: "bad run console-log"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+attachment/5005464/+files/laney-test25.log

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  New
Status in qemu-kvm package in Ubuntu:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Iain Lane

Oh also see https://bugs.launchpad.net/ubuntu/+source/linux-
hwe/+bug/1713751 which has some superficially similar symptoms (cpu
stuck on shutdown).

** Description changed:

  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn and
  backs off until the next reset (6 hourly).
  
- Guests are sometimes failing to reboot. When this happens, you see the
- following in the console
+ 17.10 (and bionic) guests are sometimes failing to reboot. When this
+ happens, you see the following in the console
  
-   [[0;32m  OK  [0m] Reached target Shutdown.
-   [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
-   [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
-   [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
-   [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
-   [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
-   [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
+   [[0;32m  OK  [0m] Reached target Shutdown.
+   [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
+   [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
+   [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
+   [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
+   [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
+   [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
  
  One host that exhibits this behaviour was:
  
-   Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
+   Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux
  
  guest running:
  
-   Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
+   Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)
  
  The affected cloud region is running the xenial/Ocata cloud archive, so
  the version of qemu-kvm in there may also be relevant.
  
  Here's how I reproduced it in lcy01:
  
-   $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
-   $  sudo reboot
-   # wait a minute or so for the instances to all reboot
-   $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done
+   $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
+   $  sudo reboot
+   # wait a minute or so for the instances to all reboot
+   $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done
  
  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.
  
  We've seen good and bad instances on multiple compute hosts - it doesn't
  feel to me like a host problem but rather a race condition somewhere
  that's somehow either triggered or triggered much more often by what
  lcy01 is running. I always saw this on the first reboot - never on first
  boot, and never on n>1th boot. (But if it's a race then that might not
  mean much.)
  
  I'll attach a bad and a good console-log for reference.
  
  If you're at Canonical then see internal rt #107135 for some other
  details.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  New
Status in qemu-kvm package in Ubuntu:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

2017-11-07 Thread Iain Lane

** Attachment added: "good run console-log"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+attachment/5005465/+files/laney-test14.log

** Also affects: qemu-kvm (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck
  for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  New
Status in qemu-kvm package in Ubuntu:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console

    [[0;32m  OK  [0m] Reached target Shutdown.
    [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
    [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
    [  226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
    [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 
(4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:

    Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC
  2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:

    Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0
  (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017
  (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

    $ for n in {1..30}; do nova boot --flavor m1.small --image 
ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name 
testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; 
done
    $  sudo reboot
    # wait a minute or so for the instances to all reboot
    $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} 
| tail; done

  On bad instances you'll see the "soft lockup" message - on good it'll
  reboot as normal.

  We've seen good and bad instances on multiple compute hosts - it
  doesn't feel to me like a host problem but rather a race condition
  somewhere that's somehow either triggered or triggered much more often
  by what lcy01 is running. I always saw this on the first reboot -
  never on first boot, and never on n>1th boot. (But if it's a race then
  that might not mean much.)

  I'll attach a bad and a good console-log for reference.

  If you're at Canonical then see internal rt #107135 for some other
  details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Re: [Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

[Kernel-packages] [Bug 1730717] Re: Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

33 matches

Site Navigation

Mail list logo

Footer information