[Kernel-packages] [Bug 1672521] Re: ThunderX: soft lockup on 4.8+ kernels
Just one addition: the log posted before contains dmesg output too. The task
that hung was systemd; it might be related to some VMs from the previous boot
record being restarted automatically, but that still doesn't explain the
crash. Rebooting the node again with 4.4 did not result in a kernel crash.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1672521

Title:
  ThunderX: soft lockup on 4.8+ kernels

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Yakkety:
  Triaged
Status in linux source package in Zesty:
  Triaged

Bug description:
  I have been trying to find an easy reproducer for this for days.

  We initially observed it in OPNFV Armband, when we tried to upgrade our
  Ubuntu Xenial installation kernel to linux-image-generic-hwe-16.04 (4.8).
  In our environment this was easily triggered on compute nodes when
  launching multiple VMs (we suspected OVS, QEMU etc.). However, in order
  to rule out our specifics, we looked for a simple way to reproduce it on
  all ThunderX nodes we have access to, and we finally found one:

  $ apt-get install stress-ng
  $ stress-ng --hdd 1024

  We tested different FW versions, provided by both the chip and board
  manufacturers, and with all of them the result is 100% reproducible,
  leading to a kernel oops [1]:

  [  726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
  [  726.077908]       Tainted: G        W I     4.8.0-41-generic #44~16.04.1-Ubuntu
  [  726.085850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [  726.094383] kworker/0:1     D 080861bc     0   312      2 0x
  [  726.094401] Workqueue: events vmstat_shepherd
  [  726.094404] Call trace:
  [  726.094411] [] __switch_to+0x94/0xa8
  [  726.094418] [] __schedule+0x224/0x718
  [  726.094421] [] schedule+0x38/0x98
  [  726.094425] [] schedule_preempt_disabled+0x14/0x20
  [  726.094428] [] __mutex_lock_slowpath+0xd4/0x168
  [  726.094431] [] mutex_lock+0x58/0x70
  [  726.094437] [] get_online_cpus+0x44/0x70
  [  726.094440] [] vmstat_shepherd+0x3c/0xe8
  [  726.094446] [] process_one_work+0x150/0x478
  [  726.094449] [] worker_thread+0x50/0x4b8
  [  726.094453] [] kthread+0xec/0x100
  [  726.094456] [] ret_from_fork+0x10/0x40

  Over the last few days I tested all 4.8-* kernels and 4.10 (the zesty
  backport); the soft lockup happens with each and every one of them. On
  the other hand, 4.4.0-45-generic seems to work perfectly fine under
  normal conditions (probably newer 4.4.0-* kernels too, but due to a
  regression in the ethernet drivers after 4.4.0-45 we can't easily test
  those), yet running stress-ng leads to the same oops.
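The trace above shows vmstat_shepherd blocked in get_online_cpus(), i.e.
waiting on the CPU-hotplug mutex, so the interesting question is which task
holds that lock. A minimal sketch for collecting that information (standard
Linux hung-task/sysrq interfaces, nothing ThunderX-specific; the sysrq dump
needs root):

```shell
# Check the hung-task timeout (the "120 seconds" in the message above);
# fall back to the documented default if the sysctl is unavailable here.
timeout=$(cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo 120)
echo "hung_task_timeout_secs=${timeout}"

# Dump backtraces of all uninterruptible (D-state) tasks to the kernel log;
# this usually reveals the task holding the mutex the others are stuck on
# (requires root and CONFIG_MAGIC_SYSRQ):
#   echo w > /proc/sysrq-trigger
#   dmesg | grep -B1 -A20 'blocked for more than'
```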
  [1] http://paste.ubuntu.com/24172516/

  ---
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Mar 13 19:27 seq
   crw-rw 1 root audio 116, 33 Mar 13 19:27 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.20.1-0ubuntu2.5
  Architecture: arm64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 16.04
  IwConfig: Error: [Errno 2] No such file or directory
  MachineType: GIGABYTE R120-T30
  Package: linux (not installed)
  PciMultimedia:
  ProcEnviron:
   TERM=vt220
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 astdrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.8.0-41-generic root=/dev/mapper/os-root ro console=tty0 console=ttyS0,115200 console=ttyAMA0,115200 net.ifnames=1 biosdevname=0 rootdelay=90 nomodeset quiet splash vt.handoff=7
  ProcVersionSignature: Ubuntu 4.8.0-41.44~16.04.1-generic 4.8.17
  RelatedPackageVersions:
   linux-restricted-modules-4.8.0-41-generic N/A
   linux-backports-modules-4.8.0-41-generic  N/A
   linux-firmware                            1.157.8
  RfKill: Error: [Errno 2] No such file or directory
  Tags: xenial
  Uname: Linux 4.8.0-41-generic aarch64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
  _MarkForUpload: True
  dmi.bios.date: 11/22/2016
  dmi.bios.vendor: GIGABYTE
  dmi.bios.version: T22
  dmi.board.asset.tag: 01234567890123456789AB
  dmi.board.name: MT30-GS0
  dmi.board.vendor: GIGABYTE
  dmi.board.version: 01234567
  dmi.chassis.asset.tag: 01234567890123456789AB
  dmi.chassis.type: 17
  dmi.chassis.vendor: GIGABYTE
  dmi.chassis.version: 01234567
  dmi.modalias: dmi:bvnGIGABYTE:bvrT22:bd11/22/2016:svnGIGABYTE:pnR120-T30:pvr0100:rvnGIGABYTE:rnMT30-GS0:rvr01234567:cvnGIGABYTE:ct17:cvr01234567:
  dmi.product.name: R120-T30
  dmi.product.version: 0100
  dmi.sys.vendor: GIGABYTE

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1672521/+subscriptions
[Kernel-packages] [Bug 1672521] Re: ThunderX: soft lockup on 4.8+ kernels
Hi,

The same bug happened again on a similar board with T27 firmware, but this
time running kernel 4.4.0-45-generic. I'm attaching a log captured over the
serial console (with debug info from the FW). I can't attach more because
the kernel hung. So far 4.4.0-45-generic had been stable in our lab; this
happened with no obvious trigger.

/ciprian

** Attachment added: "dmesg.log"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1672521/+attachment/4837761/+files/dmesg.log
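Should it reoccur, the hang is easier to capture over serial if the kernel
panics on hung-task detection. A sketch of the relevant knobs (standard
upstream kernel sysctls and boot parameters, not specific to this report),
shown as comments since applying them changes crash behavior:

```shell
# Hypothetical debug configuration for the next occurrence; all of these
# are standard upstream kernel knobs, to be applied deliberately.

# Panic (full backtrace on the serial console) as soon as a hung task is found:
#   sysctl -w kernel.hung_task_panic=1
# Allow sysrq so blocked tasks can be dumped manually (echo w > /proc/sysrq-trigger):
#   sysctl -w kernel.sysrq=1
# Boot-time equivalents / extra log capacity on the kernel command line:
#   hung_task_panic=1 ignore_loglevel log_buf_len=4M
```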