------- Comment From lakun...@in.ibm.com 2017-08-24 14:38 EDT-------
As there was not much information in the bug, I tested the kernels (mentioned
in comment 57) on Ubuntu 16.04 (LTS, 4.4.0-92-generic). When I tried IO
(Austin, hydepar EN, Shiner, Glacierpar) and CPU DLPAR on both kernels, I
didn't see any issue. But while removing memory (after adding the memory to
the LPAR), the LPAR dropped into xmon.

console logs
===========
root@roselp2:~# [15615.010274] kernel BUG at mm/memory_hotplug.c:1908!
cpu 0x2a: Vector: 700 (Program Check) at [c00000024524b780]
pc: c00000000035d720: remove_memory+0x100/0x120
lr: c00000000035d6ac: remove_memory+0x8c/0x120
sp: c00000024524ba00
msr: 8000000000029033
current = 0xc0000001b2f05c00
paca    = 0xc00000000fadb900   softe: 0        irq_happened: 0x01
pid   = 48226, comm = drmgr
kernel BUG at mm/memory_hotplug.c:1908!
Linux version 4.13.0-rc5-next-20170817 (bauermann@u1604le) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4)) #2 SMP Mon Aug 21 19:07:20 BRT 
2017
enter ? for help
[c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140
[c00000024524ba90] c0000000000c37dc pseries_memory_notifier+0x22c/0x280
[c00000024524bad0] c00000000012d1f0 notifier_call_chain+0xa0/0x110
[c00000024524bb20] c00000000012d7f8 __blocking_notifier_call_chain+0x78/0xb0
[c00000024524bb70] c000000000a304f8 of_property_notify+0x58/0xc0
[c00000024524bbb0] c000000000a2b320 of_update_property+0x110/0x150
[c00000024524bc10] c0000000000b7340 ofdt_write+0x1d0/0x6d0
[c00000024524bcd0] c00000000043db7c proc_reg_write+0x8c/0xd0
[c00000024524bd00] c000000000391938 __vfs_write+0x48/0x200
[c00000024524bd90] c0000000003936f4 vfs_write+0xd4/0x270
[c00000024524bde0] c00000000039558c SyS_write+0x6c/0x110
[c00000024524be30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00002000001ce09c
SP (7fffe1582790) is in userspace
2a:mon> c
cpus stopped: 0x0-0x9f
2a:mon> t
[c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140
[c00000024524ba90] c0000000000c37dc pseries_memory_notifier+0x22c/0x280
[c00000024524bad0] c00000000012d1f0 notifier_call_chain+0xa0/0x110
[c00000024524bb20] c00000000012d7f8 __blocking_notifier_call_chain+0x78/0xb0
[c00000024524bb70] c000000000a304f8 of_property_notify+0x58/0xc0
[c00000024524bbb0] c000000000a2b320 of_update_property+0x110/0x150
[c00000024524bc10] c0000000000b7340 ofdt_write+0x1d0/0x6d0
[c00000024524bcd0] c00000000043db7c proc_reg_write+0x8c/0xd0
[c00000024524bd00] c000000000391938 __vfs_write+0x48/0x200
[c00000024524bd90] c0000000003936f4 vfs_write+0xd4/0x270
[c00000024524bde0] c00000000039558c SyS_write+0x6c/0x110
[c00000024524be30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00002000001ce09c
SP (7fffe1582790) is in userspace
2a:mon> r
R00 = c00000000035d6ac   R16 = 0000000000000000
R01 = c00000024524ba00   R17 = 0000000000000000
R02 = c0000000015bfb00   R18 = 0000000000000000
R03 = 0000000000000001   R19 = 0000000000000000
R04 = c00000027668ade8   R20 = 0000000000000000
R05 = c0000002766a1fe8   R21 = 0000000000000000
R06 = 0000000000015f60   R22 = 0000000000000000
R07 = 0000000000000001   R23 = 00000100111ecca8
R08 = 0000000000000007   R24 = 00000100112106b0
R09 = 0000000000000002   R25 = c00000027e089810
R10 = c00000023aa97c80   R26 = 0000000000000001
R11 = 303078302d303030   R27 = 0000000010000000
R12 = 0000000000002200   R28 = 0000000000000000
R13 = c00000000fadb900   R29 = c000000001494548
R14 = 0000000000000000   R30 = 0000000001000000
R15 = 0000000000000000   R31 = 0000000280000000
pc  = c00000000035d720 remove_memory+0x100/0x120
cfar= c00000000035d6b0 remove_memory+0x90/0x120
lr  = c00000000035d6ac remove_memory+0x8c/0x120
msr = 8000000000029033   cr  = 22002424
ctr = 0000000000000000   xer = 0000000020000000   trap =  700
2a:mon> d
0000000000000000 **************** ****************  |                |
2a:mon>
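
For anyone comparing this trace against other dumps, here is a minimal sketch that pulls the symbol, offset and function size out of xmon backtrace lines like the ones above. The regular expression and the `parse_frames` helper are illustrative assumptions based only on the line layout shown in this log; they are not part of xmon or of any kernel tooling.

```python
import re

# Matches xmon backtrace frames of the form seen in this log, e.g.
# [c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\]\s+(?P<addr>[0-9a-f]{16})\s+"
    r"(?P<symbol>[A-Za-z_.][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text, in order."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append(
            {
                "sp": int(match.group("sp"), 16),
                "addr": int(match.group("addr"), 16),
                "symbol": match.group("symbol"),
                "offset": int(match.group("offset"), 16),
                "size": int(match.group("size"), 16),
            }
        )
    return frames


if __name__ == "__main__":
    sample = (
        "[c00000024524ba40] c0000000000c3554 pseries_remove_memblock+0xe4/0x140\n"
        "[c00000024524ba90] c0000000000c37dc pseries_memory_notifier+0x22c/0x280\n"
    )
    for frame in parse_frames(sample):
        print(frame["symbol"], hex(frame["offset"]), hex(frame["size"]))
```

Lines that do not have the bracketed stack pointer (for example the `pc:` and `lr:` lines) are ignored by this pattern on purpose.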

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1661684

Title:
  ISST-LTE:pVM:roselp4:ubuntu 16.04.2: drop in xmon when running dlpar
  tests under stress

Status in The Ubuntu-power-systems project:
  Opinion
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  == Comment: #0 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-26 21:59:52 ==
  ---Problem Description---
  When testing DLPAR (including slot/cpu/mem) under stress on roselp4, the
system dropped into xmon:

  roselp4 login: [   95.511790] sysrq: SysRq : Changing Loglevel
  [   95.511816] sysrq: Loglevel set to 9
  [  289.363833] mlx4_en 0292:60:00.0: removed PHC
  [  293.123896] iommu: Removing device 0292:60:00.0 from group 3
  [  303.173744] pci_bus 0292:60: busn_res: [bus 60-ff] is released
  [  303.173865] rpadlpar_io: slot PHB 658 removed
  [  335.853779] iommu: Removing device 0021:01:00.0 from group 0
  [  345.893764] pci_bus 0021:01: busn_res: [bus 01-ff] is released
  [  345.893869] rpadlpar_io: slot PHB 33 removed
  [  382.204003] min_free_kbytes is not updated to 16885 because user defined 
value 551564 is preferred
  [  446.143648] cpu 152 (hwid 152) Ready to die...
  [  446.464057] cpu 153 (hwid 153) Ready to die...
  [  446.473525] cpu 154 (hwid 154) Ready to die...
  [  446.474077] cpu 155 (hwid 155) Ready to die...
  [  446.483529] cpu 156 (hwid 156) Ready to die...
  [  446.493532] cpu 157 (hwid 157) Ready to die...
  [  446.494078] cpu 158 (hwid 158) Ready to die...
  [  446.503527] cpu 159 (hwid 159) Ready to die...
  [  446.664534] cpu 144 (hwid 144) Ready to die...
  [  446.964113] cpu 145 (hwid 145) Ready to die...
  [  446.973525] cpu 146 (hwid 146) Ready to die...
  [  446.974094] cpu 147 (hwid 147) Ready to die...
  [  446.983944] cpu 148 (hwid 148) Ready to die...
  [  446.984062] cpu 149 (hwid 149) Ready to die...
  [  446.993518] cpu 150 (hwid 150) Ready to die...
  [  446.993543] Querying DEAD? cpu 150 (150) shows 2
  [  446.994098] cpu 151 (hwid 151) Ready to die...
  [  447.133726] cpu 136 (hwid 136) Ready to die...
  [  447.403532] cpu 137 (hwid 137) Ready to die...
  [  447.403772] cpu 138 (hwid 138) Ready to die...
  [  447.403839] cpu 139 (hwid 139) Ready to die...
  [  447.403887] cpu 140 (hwid 140) Ready to die...
  [  447.403937] cpu 141 (hwid 141) Ready to die...
  [  447.403979] cpu 142 (hwid 142) Ready to die...
  [  447.404038] cpu 143 (hwid 143) Ready to die...
  [  447.513546] cpu 128 (hwid 128) Ready to die...
  [  447.693533] cpu 129 (hwid 129) Ready to die...
  [  447.693999] cpu 130 (hwid 130) Ready to die...
  [  447.703530] cpu 131 (hwid 131) Ready to die...
  [  447.704087] Querying DEAD? cpu 132 (132) shows 2
  [  447.704102] cpu 132 (hwid 132) Ready to die...
  [  447.713534] cpu 133 (hwid 133) Ready to die...
  [  447.714064] Querying DEAD? cpu 134 (134) shows 2
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6bd0) is in userspace
  86:mon> 
  86:mon> t
  SP (1faf6bd0) is in userspace
  86:mon> r
  R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
  R01 = 000000001faf6bd0   R17 = c000000474c9c080
  R02 = 000000001ed1be80   R18 = c000000474c9c000
  R03 = 000000001faf6c80   R19 = c0000000013fdf08
  R04 = 0000000000000018   R20 = c000000474c9c080
  R05 = 00000000000000e0   R21 = c0000000013e8ad0
  R06 = 0000000000009e04   R22 = c000000474c9c000
  R07 = 000000001faf6d30   R23 = c00000047a9a1c40
  R08 = 000000001faf6d28   R24 = 0000000000000002
  R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
  R10 = 000000001ec1b118   R26 = c000000000fd4e6c
  R11 = 000000001ee7e040   R27 = c0000000014daae0
  R12 = 000000000163c1d8   R28 = 0000000000000000
  R13 = c000000007b6b600   R29 = 0000000000000086
  R14 = c0000000014defb0   R30 = c000000000fd4e68
  R15 = 0000000000000001   R31 = 000000001faf6bd0
  pc  = 000000001ec3072c
  cfar= 000000001ec2fedc
  lr  = 000000001ec2fee0
  msr = 8000000102801000   cr  = 42000000
  ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
  dar = 000212d6c1a2a20c   dsisr = 42000000
  86:mon> 
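
As a reading aid for the `dsisr` value in the dump above, the following sketch decodes two of its bits. The mask values are assumed to mirror the `DSISR_NOHPTE` and `DSISR_ISSTORE` definitions in the Linux powerpc headers; the `decode_dsisr` helper is hypothetical and deliberately ignores every other bit.

```python
# Assumed to match the Linux powerpc definitions of the same names.
DSISR_NOHPTE = 0x40000000   # no translation found for the address
DSISR_ISSTORE = 0x02000000  # the faulting access was a store


def decode_dsisr(value):
    """Describe the two DSISR bits this sketch knows about."""
    flags = []
    if value & DSISR_NOHPTE:
        flags.append("no translation found")
    if value & DSISR_ISSTORE:
        flags.append("store access")
    else:
        flags.append("non-store access")
    return flags


if __name__ == "__main__":
    # Value taken from the xmon dump above.
    print(decode_dsisr(0x42000000))
```

This only names the bits; it does not say anything about why the access faulted.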

  
   
  Contact Information = Ping Tian Han/pt...@cn.ibm.com 
   
  ---uname output---
  Linux roselp4 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 
2016 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = lpar 
   
  ---Debugger Data---
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6bd0) is in userspace
  86:mon> e
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  86:mon> t
  SP (1faf6bd0) is in userspace
  86:mon> r
  R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
  R01 = 000000001faf6bd0   R17 = c000000474c9c080
  R02 = 000000001ed1be80   R18 = c000000474c9c000
  R03 = 000000001faf6c80   R19 = c0000000013fdf08
  R04 = 0000000000000018   R20 = c000000474c9c080
  R05 = 00000000000000e0   R21 = c0000000013e8ad0
  R06 = 0000000000009e04   R22 = c000000474c9c000
  R07 = 000000001faf6d30   R23 = c00000047a9a1c40
  R08 = 000000001faf6d28   R24 = 0000000000000002
  R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
  R10 = 000000001ec1b118   R26 = c000000000fd4e6c
  R11 = 000000001ee7e040   R27 = c0000000014daae0
  R12 = 000000000163c1d8   R28 = 0000000000000000
  R13 = c000000007b6b600   R29 = 0000000000000086
  R14 = c0000000014defb0   R30 = c000000000fd4e68
  R15 = 0000000000000001   R31 = 000000001faf6bd0
  pc  = 000000001ec3072c
  cfar= 000000001ec2fedc
  lr  = 000000001ec2fee0
  msr = 8000000102801000   cr  = 42000000
  ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
  dar = 000212d6c1a2a20c   dsisr = 42000000
  86:mon>  
   
  ---System Hang---
   drop into xmon
   
  ---Steps to Reproduce---
   1. run IO stress tests on roselp4
  2. run slot/cpu/mem dlpar tests on roselp4
   
  Stack trace output:
   no
   
  Oops output:
   no
   
  System Dump Info:
    The system was configured to capture a dump, however a dump was not 
produced.
   
  *Additional Instructions for Ping Tian Han/pt...@cn.ibm.com: 
  -Post a private note with access information to the machine that is currently 
in the debugger. 
  -Attach sysctl -a output to the bug.

  == Comment: #4 - PAWAN K. SINGH <pawak...@in.ibm.com> - 2016-12-27
  02:19:58 ==

  
  == Comment: #7 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:53:59 ==

  
  == Comment: #8 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:59:04 ==

  
  == Comment: #14 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28 03:17:50 ==
  FYI. With default min_free_kbytes, roselp4 still drops into xmon:

  Ubuntu 16.04.1 LTS roselp4 hvc0

  roselp4 login: [  260.094141] sysrq: SysRq : Changing Loglevel
  [  260.094161] sysrq: Loglevel set to 9
  [  266.614273] cpu 152 (hwid 152) Ready to die...
  [  266.794136] cpu 153 (hwid 153) Ready to die...
  [  266.794694] cpu 154 (hwid 154) Ready to die...
  [  266.804248] cpu 155 (hwid 155) Ready to die...
  [  266.804302] cpu 156 (hwid 156) Ready to die...
  [  266.804354] cpu 157 (hwid 157) Ready to die...
  [  266.804410] cpu 158 (hwid 158) Ready to die...
  [  266.804465] cpu 159 (hwid 159) Ready to die...
  [  266.935065] cpu 144 (hwid 144) Ready to die...
  [  267.144140] cpu 145 (hwid 145) Ready to die...
  [  267.144683] cpu 146 (hwid 146) Ready to die...
  [  267.154692] cpu 147 (hwid 147) Ready to die...
  [  267.164134] cpu 148 (hwid 148) Ready to die...
  [  267.164702] cpu 149 (hwid 149) Ready to die...
  [  267.174819] cpu 150 (hwid 150) Ready to die...
  [  267.184684] cpu 151 (hwid 151) Ready to die...
  [  267.324831] cpu 136 (hwid 136) Ready to die...
  [  267.614138] cpu 137 (hwid 137) Ready to die...
  [  267.614745] cpu 138 (hwid 138) Ready to die...
  [  267.624135] cpu 139 (hwid 139) Ready to die...
  [  267.624716] cpu 140 (hwid 140) Ready to die...
  [  267.634637] Querying DEAD? cpu 141 (141) shows 2
  cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
      pc: 000000001ec26be0
      lr: 000000001ec26ab4
      sp: 1faf6920
     msr: 8000000102801000
     dar: fffffe801faf6bc0
   dsisr: 40000000
    current = 0xc000000474c51e00
    paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/141
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6920) is in userspace
  8d:mon> cpu 0x8e: Vector: 300 (Data Access) at [c000000007acfd40]
      pc: 000000001ec22614
      lr: 000000001ec22d5c
      sp: 1faf6b00
     msr: 8000000102801000
     dar: 20000000
   dsisr: 40000000
    current = 0xc000000474c7c800
    paca    = 0xc000000007b6fe00   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/142
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue

  8d:mon> 
  Unrecognized command: \x1be (type ? for help)
  8d:mon> e
  cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
      pc: 000000001ec26be0
      lr: 000000001ec26ab4
      sp: 1faf6920
     msr: 8000000102801000
     dar: fffffe801faf6bc0
   dsisr: 40000000
    current = 0xc000000474c51e00
    paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/141
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  8d:mon> t
  SP (1faf6920) is in userspace
  8d:mon>

  == Comment: #15 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28
  03:22:22 ==

  
  == Comment: #19 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-29 00:13:39 ==

  
  == Comment: #20 - Kevin W. Rudd <ru...@us.ibm.com> - 2016-12-29 12:49:15 ==
  Nathan or Laurent,

  In the dmesg output, I'm seeing similar behavior to the problem
  reported in Bug 146931.  The following error and completely bogus
  NIP/LR values appear to be the same scenario:

  pseries-hotplug-cpu: Failed to release drc (10000098) for CPU
  PowerPC,POWER8, rc: -17

  The NIP and LR values appear to be completely bogus, so I'm not sure
  what about the Bug 146931 scenario matched the issue being tracked in
  Bug 146776.

  This looks to be a side issue of doing hotplugging on the CPUs.

  Please review and provide your thoughts on this observed behavior.

  Thanks.


  == Comment: #25 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2017-01-13 
13:32:51 ==
  My first thought in looking at this is that it appears that the swapper 
thread for a cpu is scheduled to run on a cpu that has been removed. This may 
explain the bogus pc and lr values. There have been a lot of updates to the 
generic kernel cpu hotplug code recently, perhaps some update there could be 
causing this. It would be interesting to see if this occurs on older kernels.

  As for the rtas set-indicator call returning -17, I don't know how
  that is possible. A return value of -17 is not even a defined return
  value in the PAPR. This could be a side effect of what is causing the
  crash though so that should be resolved first and then see if this
  still occurs.
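
One thing that can be checked independently of the PAPR is what 17 means as a Linux errno value. The sketch below decodes it with the Python standard library. Whether the -17 in the log actually started life as an errno rather than an RTAS status is an assumption and has not been established in this bug; `describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(rc):
    """Map a negative kernel-style return code to its errno name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    # rc value taken from the "Failed to release drc" message quoted above.
    print(describe_errno(-17))
```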

  == Comment: #31 - Fernando Seiti Furusato <ferse...@br.ibm.com> - 2017-02-02 
11:37:37 ==
  Mirroring so Canonical is aware of this bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1661684/+subscriptions

