------- Comment From lakun...@in.ibm.com 2017-08-24 14:38 EDT-------
As there was not much information in the bug, I tested the kernels mentioned in comment 57 on Ubuntu 16.04 (LTS, 4.4.0-92-generic). With IO DLPAR (Austin, hydepar EN, Shiner, Glacierpar) and CPU DLPAR I did not see any issue on either kernel. But while removing memory (after adding the memory to the LPAR), the LPAR dropped into xmon.

console logs
===========
root@roselp2:~# [15615.010274] kernel BUG at mm/memory_hotplug.c:1908!
cpu 0x2a: Vector: 700 (Program Check) at [c00000024524b780]
    pc: c00000000035d720: remove_memory+0x100/0x120
    lr: c00000000035d6ac: remove_memory+0x8c/0x120
    sp: c00000024524ba00
   msr: 8000000000029033
  current = 0xc0000001b2f05c00
  paca    = 0xc00000000fadb900   softe: 0   irq_happened: 0x01
    pid   = 48226, comm = drmgr
kernel BUG at mm/memory_hotplug.c:1908!
Linux version 4.13.0-rc5-next-20170817 (bauermann@u1604le) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4)) #2 SMP Mon Aug 21 19:07:20 BRT 2017
enter ? for help
[c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140
[c00000024524ba90] c0000000000c37dc pseries_memory_notifier+0x22c/0x280
[c00000024524bad0] c00000000012d1f0 notifier_call_chain+0xa0/0x110
[c00000024524bb20] c00000000012d7f8 __blocking_notifier_call_chain+0x78/0xb0
[c00000024524bb70] c000000000a304f8 of_property_notify+0x58/0xc0
[c00000024524bbb0] c000000000a2b320 of_update_property+0x110/0x150
[c00000024524bc10] c0000000000b7340 ofdt_write+0x1d0/0x6d0
[c00000024524bcd0] c00000000043db7c proc_reg_write+0x8c/0xd0
[c00000024524bd00] c000000000391938 __vfs_write+0x48/0x200
[c00000024524bd90] c0000000003936f4 vfs_write+0xd4/0x270
[c00000024524bde0] c00000000039558c SyS_write+0x6c/0x110
[c00000024524be30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00002000001ce09c
SP (7fffe1582790) is in userspace
2a:mon> c
cpus stopped: 0x0-0x9f
2a:mon> t
[c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140
[c00000024524ba90] c0000000000c37dc pseries_memory_notifier+0x22c/0x280
[c00000024524bad0] c00000000012d1f0 notifier_call_chain+0xa0/0x110
[c00000024524bb20] c00000000012d7f8 __blocking_notifier_call_chain+0x78/0xb0
[c00000024524bb70] c000000000a304f8 of_property_notify+0x58/0xc0
[c00000024524bbb0] c000000000a2b320 of_update_property+0x110/0x150
[c00000024524bc10] c0000000000b7340 ofdt_write+0x1d0/0x6d0
[c00000024524bcd0] c00000000043db7c proc_reg_write+0x8c/0xd0
[c00000024524bd00] c000000000391938 __vfs_write+0x48/0x200
[c00000024524bd90] c0000000003936f4 vfs_write+0xd4/0x270
[c00000024524bde0] c00000000039558c SyS_write+0x6c/0x110
[c00000024524be30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00002000001ce09c
SP (7fffe1582790) is in userspace
2a:mon> r
R00 = c00000000035d6ac   R16 = 0000000000000000
R01 = c00000024524ba00   R17 = 0000000000000000
R02 = c0000000015bfb00   R18 = 0000000000000000
R03 = 0000000000000001   R19 = 0000000000000000
R04 = c00000027668ade8   R20 = 0000000000000000
R05 = c0000002766a1fe8   R21 = 0000000000000000
R06 = 0000000000015f60   R22 = 0000000000000000
R07 = 0000000000000001   R23 = 00000100111ecca8
R08 = 0000000000000007   R24 = 00000100112106b0
R09 = 0000000000000002   R25 = c00000027e089810
R10 = c00000023aa97c80   R26 = 0000000000000001
R11 = 303078302d303030   R27 = 0000000010000000
R12 = 0000000000002200   R28 = 0000000000000000
R13 = c00000000fadb900   R29 = c000000001494548
R14 = 0000000000000000   R30 = 0000000001000000
R15 = 0000000000000000   R31 = 0000000280000000
pc  = c00000000035d720 remove_memory+0x100/0x120
cfar= c00000000035d6b0 remove_memory+0x90/0x120
lr  = c00000000035d6ac remove_memory+0x8c/0x120
msr = 8000000000029033   cr  = 22002424
ctr = 0000000000000000   xer = 0000000020000000   trap = 700
2a:mon> d
0000000000000000 **************** **************** | |
2a:mon>

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1661684

Title:
  ISST-LTE:pVM:roselp4:ubuntu 16.04.2: drop in xmon when running dlpar tests under stress

Status in The Ubuntu-power-systems project:
  Opinion
Status in linux package in Ubuntu:
  Incomplete

Bug description:
== Comment: #0 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-26 21:59:52 ==
---Problem Description---
When testing DLPAR, include slot/cpu/mem, under stress on roselp4, system dropped into xmon:

roselp4 login: [ 95.511790] sysrq: SysRq : Changing Loglevel
[ 95.511816] sysrq: Loglevel set to 9
[ 289.363833] mlx4_en 0292:60:00.0: removed PHC
[ 293.123896] iommu: Removing device 0292:60:00.0 from group 3
[ 303.173744] pci_bus 0292:60: busn_res: [bus 60-ff] is released
[ 303.173865] rpadlpar_io: slot PHB 658 removed
[ 335.853779] iommu: Removing device 0021:01:00.0 from group 0
[ 345.893764] pci_bus 0021:01: busn_res: [bus 01-ff] is released
[ 345.893869] rpadlpar_io: slot PHB 33 removed
[ 382.204003] min_free_kbytes is not updated to 16885 because user defined value 551564 is preferred
[ 446.143648] cpu 152 (hwid 152) Ready to die...
[ 446.464057] cpu 153 (hwid 153) Ready to die...
[ 446.473525] cpu 154 (hwid 154) Ready to die...
[ 446.474077] cpu 155 (hwid 155) Ready to die...
[ 446.483529] cpu 156 (hwid 156) Ready to die...
[ 446.493532] cpu 157 (hwid 157) Ready to die...
[ 446.494078] cpu 158 (hwid 158) Ready to die...
[ 446.503527] cpu 159 (hwid 159) Ready to die...
[ 446.664534] cpu 144 (hwid 144) Ready to die...
[ 446.964113] cpu 145 (hwid 145) Ready to die...
[ 446.973525] cpu 146 (hwid 146) Ready to die...
[ 446.974094] cpu 147 (hwid 147) Ready to die...
[ 446.983944] cpu 148 (hwid 148) Ready to die...
[ 446.984062] cpu 149 (hwid 149) Ready to die...
[ 446.993518] cpu 150 (hwid 150) Ready to die...
[ 446.993543] Querying DEAD? cpu 150 (150) shows 2
[ 446.994098] cpu 151 (hwid 151) Ready to die...
[ 447.133726] cpu 136 (hwid 136) Ready to die...
[ 447.403532] cpu 137 (hwid 137) Ready to die...
[ 447.403772] cpu 138 (hwid 138) Ready to die...
[ 447.403839] cpu 139 (hwid 139) Ready to die...
[ 447.403887] cpu 140 (hwid 140) Ready to die...
[ 447.403937] cpu 141 (hwid 141) Ready to die...
[ 447.403979] cpu 142 (hwid 142) Ready to die...
[ 447.404038] cpu 143 (hwid 143) Ready to die...
[ 447.513546] cpu 128 (hwid 128) Ready to die...
[ 447.693533] cpu 129 (hwid 129) Ready to die...
[ 447.693999] cpu 130 (hwid 130) Ready to die...
[ 447.703530] cpu 131 (hwid 131) Ready to die...
[ 447.704087] Querying DEAD? cpu 132 (132) shows 2
[ 447.704102] cpu 132 (hwid 132) Ready to die...
[ 447.713534] cpu 133 (hwid 133) Ready to die...
[ 447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6bd0) is in userspace
86:mon>
86:mon> t
SP (1faf6bd0) is in userspace
86:mon> r
R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
R01 = 000000001faf6bd0   R17 = c000000474c9c080
R02 = 000000001ed1be80   R18 = c000000474c9c000
R03 = 000000001faf6c80   R19 = c0000000013fdf08
R04 = 0000000000000018   R20 = c000000474c9c080
R05 = 00000000000000e0   R21 = c0000000013e8ad0
R06 = 0000000000009e04   R22 = c000000474c9c000
R07 = 000000001faf6d30   R23 = c00000047a9a1c40
R08 = 000000001faf6d28   R24 = 0000000000000002
R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
R10 = 000000001ec1b118   R26 = c000000000fd4e6c
R11 = 000000001ee7e040   R27 = c0000000014daae0
R12 = 000000000163c1d8   R28 = 0000000000000000
R13 = c000000007b6b600   R29 = 0000000000000086
R14 = c0000000014defb0   R30 = c000000000fd4e68
R15 = 0000000000000001   R31 = 000000001faf6bd0
pc  = 000000001ec3072c
cfar= 000000001ec2fedc
lr  = 000000001ec2fee0
msr = 8000000102801000   cr  = 42000000
ctr = 000000001ec48788   xer = 0000000000000020   trap = 300
dar = 000212d6c1a2a20c   dsisr = 42000000
86:mon>

Contact Information = Ping Tian Han/pt...@cn.ibm.com

---uname output---
Linux roselp4 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = lpar

---Debugger Data---
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6bd0) is in userspace
86:mon> e
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
86:mon> t
SP (1faf6bd0) is in userspace
86:mon> r
R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
R01 = 000000001faf6bd0   R17 = c000000474c9c080
R02 = 000000001ed1be80   R18 = c000000474c9c000
R03 = 000000001faf6c80   R19 = c0000000013fdf08
R04 = 0000000000000018   R20 = c000000474c9c080
R05 = 00000000000000e0   R21 = c0000000013e8ad0
R06 = 0000000000009e04   R22 = c000000474c9c000
R07 = 000000001faf6d30   R23 = c00000047a9a1c40
R08 = 000000001faf6d28   R24 = 0000000000000002
R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
R10 = 000000001ec1b118   R26 = c000000000fd4e6c
R11 = 000000001ee7e040   R27 = c0000000014daae0
R12 = 000000000163c1d8   R28 = 0000000000000000
R13 = c000000007b6b600   R29 = 0000000000000086
R14 = c0000000014defb0   R30 = c000000000fd4e68
R15 = 0000000000000001   R31 = 000000001faf6bd0
pc  = 000000001ec3072c
cfar= 000000001ec2fedc
lr  = 000000001ec2fee0
msr = 8000000102801000   cr  = 42000000
ctr = 000000001ec48788   xer = 0000000000000020   trap = 300
dar = 000212d6c1a2a20c   dsisr = 42000000
86:mon>

---System Hang---
drop into xmon

---Steps to Reproduce---
1. run IO stress tests on roselp4
2. run slot/cpu/mem dlpar tests on roselp4

Stack trace output:
no

Oops output:
no

System Dump Info:
The system was configured to capture a dump, however a dump was not produced.

*Additional Instructions for Ping Tian Han/pt...@cn.ibm.com:
-Post a private note with access information to the machine that is currently in the debugger.
-Attach sysctl -a output to the bug.

== Comment: #4 - PAWAN K. SINGH <pawak...@in.ibm.com> - 2016-12-27 02:19:58 ==

== Comment: #7 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:53:59 ==

== Comment: #8 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:59:04 ==

== Comment: #14 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28 03:17:50 ==
FYI. With default min_free_kbytes, roselp4 still drops into xmon:

Ubuntu 16.04.1 LTS roselp4 hvc0

roselp4 login: [ 260.094141] sysrq: SysRq : Changing Loglevel
[ 260.094161] sysrq: Loglevel set to 9
[ 266.614273] cpu 152 (hwid 152) Ready to die...
[ 266.794136] cpu 153 (hwid 153) Ready to die...
[ 266.794694] cpu 154 (hwid 154) Ready to die...
[ 266.804248] cpu 155 (hwid 155) Ready to die...
[ 266.804302] cpu 156 (hwid 156) Ready to die...
[ 266.804354] cpu 157 (hwid 157) Ready to die...
[ 266.804410] cpu 158 (hwid 158) Ready to die...
[ 266.804465] cpu 159 (hwid 159) Ready to die...
[ 266.935065] cpu 144 (hwid 144) Ready to die...
[ 267.144140] cpu 145 (hwid 145) Ready to die...
[ 267.144683] cpu 146 (hwid 146) Ready to die...
[ 267.154692] cpu 147 (hwid 147) Ready to die...
[ 267.164134] cpu 148 (hwid 148) Ready to die...
[ 267.164702] cpu 149 (hwid 149) Ready to die...
[ 267.174819] cpu 150 (hwid 150) Ready to die...
[ 267.184684] cpu 151 (hwid 151) Ready to die...
[ 267.324831] cpu 136 (hwid 136) Ready to die...
[ 267.614138] cpu 137 (hwid 137) Ready to die...
[ 267.614745] cpu 138 (hwid 138) Ready to die...
[ 267.624135] cpu 139 (hwid 139) Ready to die...
[ 267.624716] cpu 140 (hwid 140) Ready to die...
[ 267.634637] Querying DEAD? cpu 141 (141) shows 2
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
    pc: 000000001ec26be0
    lr: 000000001ec26ab4
    sp: 1faf6920
   msr: 8000000102801000
   dar: fffffe801faf6bc0
 dsisr: 40000000
  current = 0xc000000474c51e00
  paca    = 0xc000000007b6f500   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/141
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6920) is in userspace
8d:mon> cpu 0x8e: Vector: 300 (Data Access) at [c000000007acfd40]
    pc: 000000001ec22614
    lr: 000000001ec22d5c
    sp: 1faf6b00
   msr: 8000000102801000
   dar: 20000000
 dsisr: 40000000
  current = 0xc000000474c7c800
  paca    = 0xc000000007b6fe00   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/142
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
8d:mon> Unrecognized command: \x1be (type ? for help)
8d:mon> e
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
    pc: 000000001ec26be0
    lr: 000000001ec26ab4
    sp: 1faf6920
   msr: 8000000102801000
   dar: fffffe801faf6bc0
 dsisr: 40000000
  current = 0xc000000474c51e00
  paca    = 0xc000000007b6f500   softe: 0   irq_happened: 0x01
    pid   = 0, comm = swapper/141
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
8d:mon> t
SP (1faf6920) is in userspace
8d:mon>

== Comment: #15 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28 03:22:22 ==

== Comment: #19 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-29 00:13:39 ==

== Comment: #20 - Kevin W. Rudd <ru...@us.ibm.com> - 2016-12-29 12:49:15 ==
Nathan or Laurent,

In the dmesg output, I'm seeing similar behavior to the problem reported in Bug 146931. The following error and completely bogus NIP/LR values appear to be the same scenario:

pseries-hotplug-cpu: Failed to release drc (10000098) for CPU PowerPC,POWER8, rc: -17

The NIP and LR values appear to be completely bogus, so I'm not sure what about the Bug 146931 scenario matched the issue being tracked in Bug 146776. This looks to be a side issue of doing hotplugging on the CPUs.

Please review and provide your thoughts on this observed behavior.

Thanks.

== Comment: #25 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2017-01-13 13:32:51 ==
My first thought in looking at this is that it appears that the swapper thread for a cpu is scheduled to run on a cpu that has been removed. This may explain the bogus pc and lr values. There have been a lot of updates to the generic kernel cpu hotplug code recently, perhaps some update there could be causing this. It would be interesting to see if this occurs on older kernels.

As for the rtas set-indicator call returning -17, I don't know how that is possible. A return value of -17 is not even a defined return value in the PAPR. This could be a side effect of what is causing the crash though, so that should be resolved first and then see if this still occurs.

== Comment: #31 - Fernando Seiti Furusato <ferse...@br.ibm.com> - 2017-02-02 11:37:37 ==
Mirroring so Canonical is aware of this bug.
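
For anyone triaging similar dumps, the "bogus NIP/LR" observation in comments #20 and #25 can be checked mechanically. The following is only a rough sketch, not part of the original report: it pulls pc, lr and sp out of an xmon exception banner and tests whether each value lies in the kernel's address range. The helper names (`parse_xmon_banner`, `in_kernel`) are made up for illustration, and the check assumes the usual 64-bit PowerPC Linux layout in which kernel addresses start at 0xc000000000000000. It says nothing about what a non-kernel address actually belongs to.

```python
import re

# Assumption: on 64-bit PowerPC Linux, kernel virtual addresses start here.
KERNEL_BASE = 0xC000000000000000


def parse_xmon_banner(text):
    """Extract pc, lr and sp (as integers) from an xmon exception banner."""
    fields = {}
    for name in ("pc", "lr", "sp"):
        # Matches e.g. "pc: c00000000035d720" and stops at the first non-hex char.
        match = re.search(r"\b%s:\s*([0-9a-fA-F]+)" % name, text)
        if match:
            fields[name] = int(match.group(1), 16)
    return fields


def in_kernel(addr):
    """True if addr is at or above the assumed kernel base address."""
    return addr >= KERNEL_BASE


# Banner from comment #0 (CPU hotplug case) and from the 2017-08-24 comment
# (memory remove case), both copied from this bug.
cpu_case = ("cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40] "
            "pc: 000000001ec3072c lr: 000000001ec2fee0 sp: 1faf6bd0 "
            "msr: 8000000102801000")
mem_case = ("cpu 0x2a: Vector: 700 (Program Check) at [c00000024524b780] "
            "pc: c00000000035d720: remove_memory+0x100/0x120 "
            "lr: c00000000035d6ac: remove_memory+0x8c/0x120 "
            "sp: c00000024524ba00 msr: 8000000000029033")

for label, banner in (("cpu dlpar", cpu_case), ("mem dlpar", mem_case)):
    regs = parse_xmon_banner(banner)
    summary = ", ".join(
        "%s=%#x (%s)" % (name, value, "kernel" if in_kernel(value) else "not kernel")
        for name, value in regs.items()
    )
    print("%s: %s" % (label, summary))
```

Run against the two banners above, this separates the two failures: the memory-remove crash has pc and lr inside the kernel (remove_memory), while the CPU hotplug crash has pc, lr and sp all outside the kernel range, which is consistent with the two being different problems.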