** Tags added: cscc

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1661684

Title:
  ISST-LTE:pVM:roselp4:ubuntu 16.04.2: drop in xmon when running dlpar
  tests under stress

Status in The Ubuntu-power-systems project:
  Opinion
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  == Comment: #0 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-26 21:59:52 ==
  ---Problem Description---
  When running DLPAR tests (slot, CPU, and memory) under stress on roselp4, the
  system dropped into xmon:

  roselp4 login: [   95.511790] sysrq: SysRq : Changing Loglevel
  [   95.511816] sysrq: Loglevel set to 9
  [  289.363833] mlx4_en 0292:60:00.0: removed PHC
  [  293.123896] iommu: Removing device 0292:60:00.0 from group 3
  [  303.173744] pci_bus 0292:60: busn_res: [bus 60-ff] is released
  [  303.173865] rpadlpar_io: slot PHB 658 removed
  [  335.853779] iommu: Removing device 0021:01:00.0 from group 0
  [  345.893764] pci_bus 0021:01: busn_res: [bus 01-ff] is released
  [  345.893869] rpadlpar_io: slot PHB 33 removed
  [  382.204003] min_free_kbytes is not updated to 16885 because user defined value 551564 is preferred
  [  446.143648] cpu 152 (hwid 152) Ready to die...
  [  446.464057] cpu 153 (hwid 153) Ready to die...
  [  446.473525] cpu 154 (hwid 154) Ready to die...
  [  446.474077] cpu 155 (hwid 155) Ready to die...
  [  446.483529] cpu 156 (hwid 156) Ready to die...
  [  446.493532] cpu 157 (hwid 157) Ready to die...
  [  446.494078] cpu 158 (hwid 158) Ready to die...
  [  446.503527] cpu 159 (hwid 159) Ready to die...
  [  446.664534] cpu 144 (hwid 144) Ready to die...
  [  446.964113] cpu 145 (hwid 145) Ready to die...
  [  446.973525] cpu 146 (hwid 146) Ready to die...
  [  446.974094] cpu 147 (hwid 147) Ready to die...
  [  446.983944] cpu 148 (hwid 148) Ready to die...
  [  446.984062] cpu 149 (hwid 149) Ready to die...
  [  446.993518] cpu 150 (hwid 150) Ready to die...
  [  446.993543] Querying DEAD? cpu 150 (150) shows 2
  [  446.994098] cpu 151 (hwid 151) Ready to die...
  [  447.133726] cpu 136 (hwid 136) Ready to die...
  [  447.403532] cpu 137 (hwid 137) Ready to die...
  [  447.403772] cpu 138 (hwid 138) Ready to die...
  [  447.403839] cpu 139 (hwid 139) Ready to die...
  [  447.403887] cpu 140 (hwid 140) Ready to die...
  [  447.403937] cpu 141 (hwid 141) Ready to die...
  [  447.403979] cpu 142 (hwid 142) Ready to die...
  [  447.404038] cpu 143 (hwid 143) Ready to die...
  [  447.513546] cpu 128 (hwid 128) Ready to die...
  [  447.693533] cpu 129 (hwid 129) Ready to die...
  [  447.693999] cpu 130 (hwid 130) Ready to die...
  [  447.703530] cpu 131 (hwid 131) Ready to die...
  [  447.704087] Querying DEAD? cpu 132 (132) shows 2
  [  447.704102] cpu 132 (hwid 132) Ready to die...
  [  447.713534] cpu 133 (hwid 133) Ready to die...
  [  447.714064] Querying DEAD? cpu 134 (134) shows 2
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6bd0) is in userspace
  86:mon> 
  86:mon> t
  SP (1faf6bd0) is in userspace
  86:mon> r
  R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
  R01 = 000000001faf6bd0   R17 = c000000474c9c080
  R02 = 000000001ed1be80   R18 = c000000474c9c000
  R03 = 000000001faf6c80   R19 = c0000000013fdf08
  R04 = 0000000000000018   R20 = c000000474c9c080
  R05 = 00000000000000e0   R21 = c0000000013e8ad0
  R06 = 0000000000009e04   R22 = c000000474c9c000
  R07 = 000000001faf6d30   R23 = c00000047a9a1c40
  R08 = 000000001faf6d28   R24 = 0000000000000002
  R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
  R10 = 000000001ec1b118   R26 = c000000000fd4e6c
  R11 = 000000001ee7e040   R27 = c0000000014daae0
  R12 = 000000000163c1d8   R28 = 0000000000000000
  R13 = c000000007b6b600   R29 = 0000000000000086
  R14 = c0000000014defb0   R30 = c000000000fd4e68
  R15 = 0000000000000001   R31 = 000000001faf6bd0
  pc  = 000000001ec3072c
  cfar= 000000001ec2fedc
  lr  = 000000001ec2fee0
  msr = 8000000102801000   cr  = 42000000
  ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
  dar = 000212d6c1a2a20c   dsisr = 42000000
  86:mon> 

  
   
  Contact Information = Ping Tian Han/pt...@cn.ibm.com 
   
  ---uname output---
  Linux roselp4 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = lpar 
   
  ---Debugger Data---
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6bd0) is in userspace
  86:mon> e
  cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
      pc: 000000001ec3072c
      lr: 000000001ec2fee0
      sp: 1faf6bd0
     msr: 8000000102801000
     dar: 212d6c1a2a20c
   dsisr: 42000000
    current = 0xc000000474c6d600
    paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/134
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  86:mon> t
  SP (1faf6bd0) is in userspace
  86:mon> r
  R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
  R01 = 000000001faf6bd0   R17 = c000000474c9c080
  R02 = 000000001ed1be80   R18 = c000000474c9c000
  R03 = 000000001faf6c80   R19 = c0000000013fdf08
  R04 = 0000000000000018   R20 = c000000474c9c080
  R05 = 00000000000000e0   R21 = c0000000013e8ad0
  R06 = 0000000000009e04   R22 = c000000474c9c000
  R07 = 000000001faf6d30   R23 = c00000047a9a1c40
  R08 = 000000001faf6d28   R24 = 0000000000000002
  R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
  R10 = 000000001ec1b118   R26 = c000000000fd4e6c
  R11 = 000000001ee7e040   R27 = c0000000014daae0
  R12 = 000000000163c1d8   R28 = 0000000000000000
  R13 = c000000007b6b600   R29 = 0000000000000086
  R14 = c0000000014defb0   R30 = c000000000fd4e68
  R15 = 0000000000000001   R31 = 000000001faf6bd0
  pc  = 000000001ec3072c
  cfar= 000000001ec2fedc
  lr  = 000000001ec2fee0
  msr = 8000000102801000   cr  = 42000000
  ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
  dar = 000212d6c1a2a20c   dsisr = 42000000
  86:mon>  
   
  ---System Hang---
   drop into xmon
   
  ---Steps to Reproduce---
   1. Run I/O stress tests on roselp4.
   2. Run slot/CPU/memory DLPAR tests on roselp4.
   
  Stack trace output:
   no
   
  Oops output:
   no
   
  System Dump Info:
    The system was configured to capture a dump, but no dump was produced.
   
  *Additional Instructions for Ping Tian Han/pt...@cn.ibm.com:
  - Post a private note with access information to the machine that is currently
    in the debugger.
  - Attach sysctl -a output to the bug.

  == Comment: #4 - PAWAN K. SINGH <pawak...@in.ibm.com> - 2016-12-27 02:19:58 ==

  
  == Comment: #7 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:53:59 ==

  
  == Comment: #8 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-27 20:59:04 ==

  
  == Comment: #14 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28 03:17:50 ==
  FYI. With default min_free_kbytes, roselp4 still drops into xmon:

  Ubuntu 16.04.1 LTS roselp4 hvc0

  roselp4 login: [  260.094141] sysrq: SysRq : Changing Loglevel
  [  260.094161] sysrq: Loglevel set to 9
  [  266.614273] cpu 152 (hwid 152) Ready to die...
  [  266.794136] cpu 153 (hwid 153) Ready to die...
  [  266.794694] cpu 154 (hwid 154) Ready to die...
  [  266.804248] cpu 155 (hwid 155) Ready to die...
  [  266.804302] cpu 156 (hwid 156) Ready to die...
  [  266.804354] cpu 157 (hwid 157) Ready to die...
  [  266.804410] cpu 158 (hwid 158) Ready to die...
  [  266.804465] cpu 159 (hwid 159) Ready to die...
  [  266.935065] cpu 144 (hwid 144) Ready to die...
  [  267.144140] cpu 145 (hwid 145) Ready to die...
  [  267.144683] cpu 146 (hwid 146) Ready to die...
  [  267.154692] cpu 147 (hwid 147) Ready to die...
  [  267.164134] cpu 148 (hwid 148) Ready to die...
  [  267.164702] cpu 149 (hwid 149) Ready to die...
  [  267.174819] cpu 150 (hwid 150) Ready to die...
  [  267.184684] cpu 151 (hwid 151) Ready to die...
  [  267.324831] cpu 136 (hwid 136) Ready to die...
  [  267.614138] cpu 137 (hwid 137) Ready to die...
  [  267.614745] cpu 138 (hwid 138) Ready to die...
  [  267.624135] cpu 139 (hwid 139) Ready to die...
  [  267.624716] cpu 140 (hwid 140) Ready to die...
  [  267.634637] Querying DEAD? cpu 141 (141) shows 2
  cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
      pc: 000000001ec26be0
      lr: 000000001ec26ab4
      sp: 1faf6920
     msr: 8000000102801000
     dar: fffffe801faf6bc0
   dsisr: 40000000
    current = 0xc000000474c51e00
    paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/141
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  SP (1faf6920) is in userspace
  8d:mon> cpu 0x8e: Vector: 300 (Data Access) at [c000000007acfd40]
      pc: 000000001ec22614
      lr: 000000001ec22d5c
      sp: 1faf6b00
     msr: 8000000102801000
     dar: 20000000
   dsisr: 40000000
    current = 0xc000000474c7c800
    paca    = 0xc000000007b6fe00   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/142
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  WARNING: exception is not recoverable, can't continue

  8d:mon> 
  Unrecognized command: \x1be (type ? for help)
  8d:mon> e
  cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
      pc: 000000001ec26be0
      lr: 000000001ec26ab4
      sp: 1faf6920
     msr: 8000000102801000
     dar: fffffe801faf6bc0
   dsisr: 40000000
    current = 0xc000000474c51e00
    paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
      pid   = 0, comm = swapper/141
  Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
  8d:mon> t
  SP (1faf6920) is in userspace
  8d:mon>
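
  One pattern is visible in both console logs: the last "Querying DEAD?"
  message before the fault names the same CPU that then takes the Data
  Access exception (xmon prints the CPU number in hex). The sketch below is
  a hypothetical helper, not part of the original report. The two excerpts
  are copied from the logs in this bug, and the names correlate, QUERY_RE,
  and XMON_RE are illustrative assumptions.

```python
import re

# Excerpts copied from the console logs in comment #0 and comment #14.
LOG_COMMENT_0 = """\
[  447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
"""
LOG_COMMENT_14 = """\
[  267.634637] Querying DEAD? cpu 141 (141) shows 2
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
"""

# Kernel message printed while waiting for an offlined CPU to stop.
QUERY_RE = re.compile(r"Querying DEAD\? cpu (\d+) \(\d+\) shows (\d+)")
# xmon exception banner; the CPU number is hexadecimal.
XMON_RE = re.compile(r"^\s*cpu 0x([0-9a-f]+): Vector: ([0-9a-f]+)", re.MULTILINE)


def correlate(log):
    """Return (last queried CPU, first faulting CPU) as decimal integers.

    Either element is None when the corresponding message is absent.
    """
    queried = [int(m.group(1)) for m in QUERY_RE.finditer(log)]
    fault = XMON_RE.search(log)
    last_queried = queried[-1] if queried else None
    faulting = int(fault.group(1), 16) if fault else None
    return last_queried, faulting


if __name__ == "__main__":
    for name, log in (("comment #0", LOG_COMMENT_0), ("comment #14", LOG_COMMENT_14)):
        print(name, correlate(log))
```

  In both excerpts the two numbers agree (0x86 is 134 and 0x8d is 141), and
  they also match the swapper/134 and swapper/141 task names in the xmon
  output.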

  == Comment: #15 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-28 03:22:22 ==

  
  == Comment: #19 - Ping Tian Han <pt...@cn.ibm.com> - 2016-12-29 00:13:39 ==

  
  == Comment: #20 - Kevin W. Rudd <ru...@us.ibm.com> - 2016-12-29 12:49:15 ==
  Nathan or Laurent,

  In the dmesg output, I'm seeing similar behavior to the problem
  reported in Bug 146931.  The following error and completely bogus
  NIP/LR values appear to be the same scenario:

  pseries-hotplug-cpu: Failed to release drc (10000098) for CPU
  PowerPC,POWER8, rc: -17

  The NIP and LR values appear to be completely bogus, so I'm not sure
  what about the Bug 146931 scenario matched the issue being tracked in
  Bug 146776.

  This looks to be a side issue of hotplugging the CPUs.

  Please review and provide your thoughts on this observed behavior.

  Thanks.


  == Comment: #25 - Nathan D. Fontenot <nfont...@us.ibm.com> - 2017-01-13 13:32:51 ==
  My first thought in looking at this is that it appears that the swapper
  thread for a cpu is scheduled to run on a cpu that has been removed. This
  may explain the bogus pc and lr values. There have been a lot of updates to
  the generic kernel cpu hotplug code recently, perhaps some update there
  could be causing this. It would be interesting to see if this occurs on
  older kernels.

  As for the rtas set-indicator call returning -17, I don't know how
  that is possible. A return value of -17 is not even a defined return
  value in the PAPR. This could be a side effect of whatever is causing
  the crash, though, so the crash should be resolved first; we can then
  see whether this still occurs.
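
  As a rough cross-check of the pc and lr values discussed above, the
  register dump from comment #0 can be inspected mechanically. The sketch
  below is illustrative only. It assumes that the kernel's linear mapping on
  64-bit powerpc Linux starts at 0xc000000000000000 and that the MSR bit
  positions are the ones believed to be defined in the Linux powerpc
  register headers; KERNEL_BASE, MSR_FLAGS, classify, and decode_msr are
  hypothetical names.

```python
# Assumption: kernel virtual addresses on 64-bit powerpc Linux start here.
KERNEL_BASE = 0xC000000000000000

# Assumed MSR bit masks (a subset; believed to match the Linux powerpc headers).
MSR_FLAGS = {
    "SF": 1 << 63,  # 64-bit mode
    "EE": 1 << 15,  # external interrupts enabled
    "PR": 1 << 14,  # problem (user) state
    "ME": 1 << 12,  # machine check enabled
    "IR": 1 << 5,   # instruction address translation on
    "DR": 1 << 4,   # data address translation on
    "RI": 1 << 1,   # recoverable interrupt
    "LE": 1 << 0,   # little-endian mode
}

# Values copied from the xmon output for cpu 0x86 in comment #0.
CRASH_CPU_0X86 = {
    "pc": 0x000000001EC3072C,
    "lr": 0x000000001EC2FEE0,
    "sp": 0x000000001FAF6BD0,
    "dar": 0x000212D6C1A2A20C,
    "current": 0xC000000474C6D600,
    "paca": 0xC000000007B6B600,
}
CRASH_MSR = 0x8000000102801000


def classify(addr):
    """Label an address as inside or below the assumed kernel range."""
    return "kernel" if addr >= KERNEL_BASE else "below kernel range"


def decode_msr(msr):
    """Return the names of the assumed MSR flags that are set in msr."""
    return [name for name, mask in MSR_FLAGS.items() if msr & mask]


if __name__ == "__main__":
    for name, value in CRASH_CPU_0X86.items():
        print(f"{name:8s} {value:#018x}  {classify(value)}")
    print("MSR flags set:", decode_msr(CRASH_MSR))
```

  Under these assumptions, pc, lr, and sp all fall below the kernel range
  while current and paca are kernel addresses, which is consistent with
  xmon's "SP (1faf6bd0) is in userspace" message. The RI bit is clear in the
  MSR, which would be consistent with the "exception is not recoverable"
  warning, and IR and DR are clear as well.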

  == Comment: #31 - Fernando Seiti Furusato <ferse...@br.ibm.com> - 2017-02-02 11:37:37 ==
  Mirroring so Canonical is aware of this bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1661684/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
