Also, note that I have opened bug LP#1730660 ("Set PANIC_TIMEOUT=10 on
Power Systems") to track the PANIC_TIMEOUT option. I have submitted
patches to set that option for Xenial, Zesty, Artful, and later releases.
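
For reference, the knob in question is the kernel's build-time Kconfig
option; the intent of those patches is a one-line config change along the
following lines (a sketch of the intent, not the literal submitted patch):

    # Reboot 10 seconds after a panic instead of looping forever
    # (the default of 0 means "never reboot", which permits the hang below)
    CONFIG_PANIC_TIMEOUT=10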

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1680349

Title:
  Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine
  crashes while running stress-ng.

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - PAVITHRA R. PRAKASH <> - 2017-03-10 02:43:10 ==
  ---Problem Description---

  Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine
  crashes while running stress-ng. Machine hangs.

  ---Steps to Reproduce---

  1. Configure kdump (see the sketch after these steps).
  2. Install stress-ng
  # apt-get install stress-ng
  3. Run stress-ng
  # stress-ng -a 0
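
  For step 1, a typical kdump-tools setup on Ubuntu looks roughly like the
  following (a sketch assuming the linux-crashdump meta-package; output
  abbreviated):

      $ sudo apt-get install linux-crashdump   # pulls in kdump-tools
      $ sudo reboot                            # picks up the crashkernel= reservation
      $ kdump-config show                      # expect "current state: ready to kdump"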

  
  Logs:
  ========
  root@ltc-firep3:~# kdump-config load
  Modified cmdline:root=UUID=8b0d5b99-6087-4f40-82ea-375c83a4c139 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155200K
   * loaded kdump kernel
  root@ltc-firep3:~# kdump-config show
  DUMP_MODE:        kdump
  USE_KDUMP:        1
  KDUMP_SYSCTL:     kernel.panic_on_oops=1
  KDUMP_COREDIR:    /var/crash
  crashkernel addr: 
     /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.10.0-11-generic
  kdump initrd: 
     /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.10.0-11-generic
  current state:    ready to kdump

  kexec command:
    /sbin/kexec -p --command-line="root=UUID=8b0d5b99-6087-4f40-82ea-375c83a4c139 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
  root@ltc-firep3:~# stress-ng -a 0
  stress-ng: info:  [3900] defaulting to a 86400 second run per stressor
  stress-ng: info:  [3900] dispatching hogs: 160 af-alg, 160 affinity, 160 aio, 
160 aiol, 160 apparmor, 160 atomic, 160 bigheap, 160 brk, 160 bsearch, 160 
cache, 160 cap, 160 chdir, 160 chmod, 160 chown, 160 chroot, 160 clock, 160 
clone, 160 context, 160 copy-file, 160 cpu, 160 cpu-online, 160 crypt, 160 
daemon, 160 dccp, 160 dentry, 160 dir, 160 dirdeep, 160 dnotify, 160 dup, 160 
epoll, 160 eventfd, 160 exec, 160 fallocate, 160 fanotify, 160 fault, 160 
fcntl, 160 fiemap, 160 fifo, 160 filename, 160 flock, 160 fork, 160 fp-error, 
160 fstat, 160 full, 160 futex, 160 get, 160 getdent, 160 getrandom, 160 
handle, 160 hdd, 160 heapsort, 160 hsearch, 160 icache, 160 icmp-flood, 160 
inotify, 160 io, 160 iomix, 160 ioprio, 160 itimer, 160 kcmp, 160 key, 160 
kill, 160 klog, 160 lease, 160 link, 160 locka, 160 lockbus, 160 lockf, 160 
lockofd, 160 longjmp, 160 lsearch, 160 madvise, 160 malloc, 160 matrix, 160 
membarrier, 160 memcpy, 160 memfd, 160 mergesort, 160 mincore, 160 mknod, 160 
mlock, 160 mmap, 160 mmapfork, 160 mmapmany, 160 mq, 160 mremap, 160 msg, 160 
msync, 160 netlink-proc, 160 nice, 160 nop, 160 null, 160 numa, 160 oom-pipe, 
160 opcode, 160 open, 160 personality, 160 pipe, 160 poll, 160 procfs, 160 
pthread, 160 ptrace, 160 pty, 160 qsort, 160 quota, 160 rdrand, 160 readahead, 
160 remap, 160 rename, 160 resources, 160 rlimit, 160 rmap, 160 rtc, 160 
schedpolicy, 160 sctp, 160 seal, 160 seccomp, 160 seek, 160 sem, 160 sem-sysv, 
160 sendfile, 160 shm, 160 shm-sysv, 160 sigfd, 160 sigfpe, 160 sigpending, 160 
sigq, 160 sigsegv, 160 sigsuspend, 160 sleep, 160 sock, 160 sockfd, 160 
sockpair, 160 spawn, 160 splice, 160 stack, 160 stackmmap, 160 str, 160 stream, 
160 switch, 160 symlink, 160 sync-file, 160 sysfs, 160 sysinfo, 160 tee, 160 
timer, 160 timerfd, 160 tlb-shootdown, 160 tmpfs, 160 tsc, 160 tsearch, 160 
udp, 160 udp-flood, 160 unshare, 160 urandom, 160 userfaultfd, 160 utime, 160 
vecmath, 160 vfork, 160 vforkmany, 160 vm, 160 vm-rw, 160 vm-splice, 160 wait, 
160 wcs, 160 xattr, 160 yield, 160 zero, 160 zlib, 160 zombie
  stress-ng: info:  [3900] cache allocate: using built-in defaults as unable to 
determine cache details
  stress-ng: info:  [3900] cache allocate: default cache size: 2048K
  stress-ng: info:  [3907] stress-ng-atomic: this stressor is not implemented 
on this system: ppc64le Linux 4.10.0-11-generic
  stress-ng: info:  [3955] stress-ng-exec: running as root, won't run test.
  stress-ng: info:  [3999] stress-ng-icache: this stressor is not implemented 
on this system: ppc64le Linux 4.10.0-11-generic
  stress-ng: info:  [4040] stress-ng-lockbus: this stressor is not implemented 
on this system: ppc64le Linux 4.10.0-11-generic
  stress-ng: info:  [4313] stress-ng-numa: system has 2 of a maximum 256 memory 
NUMA nodes
  stress-ng: info:  [4455] stress-ng-rdrand: this stressor is not implemented 
on this system: ppc64le Linux 4.10.0-11-generic
  stress-ng: fail:  [4558] stress-ng-rtc: ioctl RTC_ALRM_READ failed, errno=22 
(Invalid argument)
  stress-ng: fail:  [4017] stress-ng-key: keyctl KEYCTL_DESCRIBE failed, 
errno=127 (Key has expired)
  stress-ng: fail:  [4017] stress-ng-key: keyctl KEYCTL_UPDATE failed, 
errno=127 (Key has expired)
  stress-ng: fail:  [4017] stress-ng-key: keyctl KEYCTL_READ failed, errno=127 
(Key has expired)
  stress-ng: fail:  [4017] stress-ng-key: request_key failed, errno=126 
(Required key not available)
  stress-ng: fail:  [4017] stress-ng-key: keyctl KEYCTL_DESCRIBE failed, 
errno=127 (Key has expired)
  info: 5 failures reached, aborting stress process
  [  170.733680] Memory failure: 0xceda8: recovery action for dirty LRU page: 
Recovered
  [  171.036660] Memory failure: 0xce8e9: recovery action for dirty LRU page: 
Recovered
  [  171.161610] Memory failure: 0xce4fb: recovery action for dirty LRU page: 
Recovered
  [  171.170348] AppArmor DFA next/check upper bounds error
  [  171.204790] Memory failure: 0xd2146: recovery action for dirty LRU page: 
Recovered
  [  171.232026] Memory failure: 0xcefe6: recovery action for dirty LRU page: 
Recovered
  [  171.232899] Memory failure: 0xce578: recovery action for dirty LRU page: 
Recovered
  [  171.236850] Memory failure: 0xcfdfb: recovery action for dirty LRU page: 
Recovered
  [  171.336249] Memory failure: 0xcd715: recovery action for dirty LRU page: 
Recovered
  [  171.337550] Memory failure: 0xfb86c: recovery action for dirty LRU page: 
Recovered
  [  171.367483] Memory failure: 0xce92c: recovery action for dirty LRU page: 
Recovered
  [  171.369980] Memory failure: 0xceabe: recovery action for dirty LRU page: 
Recovered
  [  171.372534] Memory failure: 0xbcf3a: recovery action for dirty LRU page: 
Recovered
  [  171.375318] Memory failure: 0xceef9: recovery action for dirty LRU page: 
Recovered
  [  171.377701] Memory failure: 0xce722: recovery action for dirty LRU page: 
Recovered
  [  171.384725] Memory failure: 0xcedef: recovery action for dirty LRU page: 
Recovered
  [  171.398538] Memory failure: 0xcf927: recovery action for dirty LRU page: 
Recovered
  [  171.401492] Memory failure: 0xce881: recovery action for dirty LRU page: 
Recovered
  [  171.403476] Memory failure: 0xce2d4: recovery action for dirty LRU page: 
Recovered
  [  171.404104] Memory failure: 0xce17a: recovery action for dirty LRU page: 
Recovered
  [  171.404682] Memory failure: 0xd9f0b: recovery action for dirty LRU page: 
Recovered
  stress-ng: info:  [4865] stress-ng-spawn: running as root, won't run test.
  [  171.406159] Memory failure: 0xdaae0: recovery action for dirty LRU page: 
Recovered
  [  171.415810] Memory failure: 0xb5355: recovery action for dirty LRU page: 
Recovered
  [  171.434513] Memory failure: 0xb5576: recovery action for dirty LRU page: 
Recovered
  [  171.435161] Memory failure: 0xbd0fd: recovery action for dirty LRU page: 
Recovered
  [  171.436046] Memory failure: 0xceec0: recovery action for dirty LRU page: 
Recovered
  [  171.449215] Memory failure: 0xcecda: recovery action for dirty LRU page: 
Recovered
  [  171.453705] Memory failure: 0xcf005: recovery action for dirty LRU page: 
Recovered
  [  171.491202] Memory failure: 0xfb99e: recovery action for dirty LRU page: 
Recovered
  [  171.493054] Memory failure: 0xb2dbe: recovery action for dirty LRU page: 
Recovered
  [  171.503540] Memory failure: 0xced0f: recovery action for dirty LRU page: 
Recovered
  [  171.504809] Memory failure: 0xb2dad: recovery action for clean LRU page: 
Recovered
  [  171.506327] Memory failure: 0xb3268: recovery action for dirty LRU page: 
Recovered
  [  171.523449] Memory failure: 0xb3238: recovery action for dirty LRU page: 
Recovered
  [  171.524558] Memory failure: 0xcea57: recovery action for dirty LRU page: 
Recovered
  [  171.525611] Memory failure: 0xce6c8: recovery action for dirty LRU page: 
Recovered
  [  171.526501] Memory failure: 0xbd0d0: recovery action for dirty LRU page: 
Recovered
  [  171.528740] Memory failure: 0xcea27: recovery action for dirty LRU page: 
Recovered
  [  171.536166] Memory failure: 0xce469: recovery action for dirty LRU page: 
Recovered
  [  171.537409] Memory failure: 0xcec3f: recovery action for dirty LRU page: 
Recovered
  [  171.538991] Memory failure: 0xcec80: recovery action for dirty LRU page: 
Recovered
  [  171.540183] Memory failure: 0xb0283: recovery action for dirty LRU page: 
Recovered
  [  171.568190] Memory failure: 0xb0165: recovery action for dirty LRU page: 
Recovered
  [  171.569451] Memory failure: 0xda648: recovery action for dirty LRU page: 
Recovered
  [  171.669472] Memory failure: 0xb2d6a: recovery action for dirty LRU page: 
Recovered
  stress-ng: info:  [4929] stress-ng-stream: using built-in defaults as unable 
to determine cache details
  stress-ng: info:  [4929] stress-ng-stream: stressor loosely based on a 
variant of the STREAM benchmark code
  stress-ng: info:  [4929] stress-ng-stream: do NOT submit any of these results 
to the STREAM benchmark results
  stress-ng: info:  [4929] stress-ng-stream: Using CPU cache size of 2048K
  [  171.722081] Memory failure: 0xcf20d: recovery action for dirty LRU page: 
Recovered
  [  171.723615] Memory failure: 0xa975f: recovery action for dirty LRU page: 
Recovered
  [  171.745730] Memory failure: 0xb2d85: recovery action for clean LRU page: 
Recovered
  stress-ng: info:  [4986] stress-ng-sysfs: running as root, just traversing 
/sys and not read/writing to /sys files.
  [  172.043162] Memory failure: 0xaa6aa: recovery action for dirty LRU page: 
Recovered
  [  172.048888] Memory failure: 0xb02d0: recovery action for dirty LRU page: 
Recovered
  [  172.103892] Memory failure: 0xcd8db: recovery action for dirty LRU page: 
Recovered
  [  172.105545] Memory failure: 0xb2d9e: recovery action for dirty LRU page: 
Recovered
  [  172.106053] Memory failure: 0xcf2f4: recovery action for dirty LRU page: 
Recovered
  [  172.106224] Memory failure: 0xa9758: recovery action for clean LRU page: 
Recovered
  [  172.146851] Memory failure: 0xce5e8: recovery action for clean LRU page: 
Recovered
  [  172.234564] Memory failure: 0x9e8b2: recovery action for dirty LRU page: 
Recovered
  [  172.236835] Memory failure: 0xac4f8: recovery action for dirty LRU page: 
Recovered
  [  172.238363] Memory failure: 0xcebb2: recovery action for dirty LRU page: 
Recovered
  stress-ng: info:  [5105] stress-ng-tsc: this stressor is not implemented on 
this system: ppc64le Linux 4.10.0-11-generic
  [  172.494650] Memory failure: 0xcecb6: recovery action for clean LRU page: 
Recovered
  [  172.495944] Memory failure: 0xa9710: recovery action for dirty LRU page: 
Recovered
  [  172.496511] Memory failure: 0xb55d7: recovery action for dirty LRU page: 
Recovered
  [  172.496932] Memory failure: 0x9e8cb: recovery action for dirty LRU page: 
Recovered
  [  172.716658] Memory failure: 0x9e628: recovery action for dirty LRU page: 
Recovered
  [  172.780960] Memory failure: 0xcf3ac: recovery action for dirty LRU page: 
Recovered
  [  172.781447] Memory failure: 0xceaac: recovery action for dirty LRU page: 
Recovered
  [  172.781891] Memory failure: 0xb55a1: recovery action for dirty LRU page: 
Recovered
  [  172.845268] Memory failure: 0x84318: recovery action for dirty LRU page: 
Recovered
  [  172.846308] Memory failure: 0x84322: recovery action for dirty LRU page: 
Recovered
  [  172.860021] Memory failure: 0xbd067: recovery action for dirty LRU page: 
Recovered
  [  172.924176] Memory failure: 0xce68e: recovery action for dirty LRU page: 
Recovered
  [  172.926255] Memory failure: 0x92ee8: recovery action for dirty LRU page: 
Recovered
  [  172.926720] Memory failure: 0xda136: recovery action for dirty LRU page: 
Recovered
  [  172.927534] Memory failure: 0xb2d75: recovery action for dirty LRU page: 
Recovered
  [  173.008909] Memory failure: 0xac4e6: recovery action for dirty LRU page: 
Recovered
  [  173.042161] Memory failure: 0xcea49: recovery action for dirty LRU page: 
Recovered
  [  173.076591] Memory failure: 0x9e8fb: recovery action for dirty LRU page: 
Recovered
  [  173.124359] Memory failure: 0x8434b: recovery action for dirty LRU page: 
Recovered
  [  173.288102] Memory failure: 0xcf5e7: recovery action for dirty LRU page: 
Recovered
  [  173.440243] Memory failure: 0xb012d: recovery action for dirty LRU page: 
Recovered
  [  173.565679] Memory failure: 0x1cc382: recovery action for clean LRU page: 
Recovered
  [  173.620166] Memory failure: 0x84334: recovery action for dirty LRU page: 
Recovered
  [  173.635189] Memory failure: 0xb02bf: recovery action for dirty LRU page: 
Recovered
  [  173.636070] Memory failure: 0x9e8f0: recovery action for dirty LRU page: 
Recovered
  [  173.638929] Memory failure: 0x84362: recovery action for dirty LRU page: 
Recovered
  [  173.643249] Memory failure: 0xcda1c: recovery action for dirty LRU page: 
Recovered
  [  173.648607] Memory failure: 0x9a27a: recovery action for dirty LRU page: 
Recovered
  [  173.651927] Memory failure: 0xced46: recovery action for dirty LRU page: 
Recovered
  [  173.711413] Memory failure: 0x9a270: recovery action for dirty LRU page: 
Recovered
  [  173.733759] Memory failure: 0xb55b1: recovery action for dirty LRU page: 
Recovered
  [  173.738553] Memory failure: 0x840d1: recovery action for dirty LRU page: 
Recovered
  [  173.740023] Memory failure: 0xb01ae: recovery action for dirty LRU page: 
Recovered
  [  173.740992] Memory failure: 0x1c9ca8: recovery action for dirty LRU page: 
Recovered
  [  173.742282] Memory failure: 0xa97fa: recovery action for dirty LRU page: 
Recovered
  [  173.783778] Memory failure: 0xc763c: recovery action for dirty LRU page: 
Recovered
  [  173.785593] Memory failure: 0xb02b6: recovery action for dirty LRU page: 
Recovered
  [  173.788206] AppArmor DFA next/check upper bounds error
  [  173.788390] Memory failure: 0x1ca066: dirty LRU page still referenced by 1 
users
  [  173.788395] Memory failure: 0x1ca066: recovery action for dirty LRU page: 
Failed
  [  174.403722] Memory failure: 0x1c979a: recovery action for dirty LRU page: 
Recovered
  [  174.428211] Memory failure: 0xb02db: recovery action for dirty LRU page: 
Recovered
  stress-ng: info:  [5591] stress-ng-yield: limiting to 160 yielders (instance 
0)
  stress-ng: info:  [5689] stress-ng-atomic: this stressor is not implemented 
on this system: ppc64le Linux 4.10.0-11-generic
  [  174.643022] Memory failure: 0x1ca7c1: recovery action for dirty LRU page: 
Recovered
  stress-ng: info:  [6033] stress-ng-exec: running as root, won't run test.
  [  174.691794] Unable to handle kernel paging request for data at address 
0x000002f4
  [  174.692217] Faulting instruction address: 0xd000000014bc0a90
  [  174.692484] Oops: Kernel access of bad area, sig: 11 [#1]
  [  174.692780] SMP NR_CPUS=2048 
  [  174.692788] NUMA 
  [  174.693003] PowerNV
  [  174.693269] Modules linked in: btrfs xor raid6_pq cuse wp512 kvm_hv kvm_pr 
sctp(+) rmd320 libcrc32c dccp_ipv4(+) kvm rmd256 rmd160 rmd128 md4 binfmt_misc 
algif_hash dccp af_alg ofpart cmdlinepart ipmi_powernv ipmi_devintf 
powernv_flash ipmi_msghandler mtd opal_prd ibmpowernv powernv_rng joydev 
input_leds mac_hid at24 nvmem_core uio_pdrv_genirq uio ip_tables x_tables 
autofs4 hid_generic usbhid hid uas usb_storage ast crc32c_vpmsum i2c_algo_bit 
ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci 
libahci tg3
  [  174.694545] CPU: 98 PID: 6170 Comm: stress-ng-dccp Not tainted 
4.10.0-11-generic #13-Ubuntu
  [  174.694645] task: c000001e42c50800 task.stack: c000001e42cf8000
  [  174.694758] NIP: d000000014bc0a90 LR: d000000014bc21cc CTR: 
c000000000a3b340
  [  174.694872] REGS: c000001fff4476b0 TRAP: 0300   Not tainted  
(4.10.0-11-generic)
  [  174.694974] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
  [  174.695059]   CR: 24002242  XER: 20000000
  [  174.695217] CFAR: c000000000008850 DAR: 00000000000002f4 DSISR: 40000000 
SOFTE: 1 
  [  174.695217] GPR00: d000000014bc21cc c000001fff447930 d000000014bcb670 
0000000000000001 
  [  174.695217] GPR04: c000001e3b602700 c000001df14a0c60 0000000000000474 
c000001df14a0c74 
  [  174.695217] GPR08: c000001df14a0800 0000000000000000 c000001e3b60c400 
0000000000000000 
  [  174.695217] GPR12: 0000000000002200 c000000007b87200 c000001fff444000 
0000000000000000 
  [  174.695217] GPR16: c000000000fc2800 0000000000000000 0000000000000040 
0000000000002711 
  [  174.695217] GPR20: 000000000000a4d2 000000000100007f 000000000100007f 
c0000000013d2880 
  [  174.695217] GPR24: 0000000000000001 0000000000000001 0000000000000000 
0000000000000004 
  [  174.695217] GPR28: c0000000013d2880 c000001df14a0c74 0000000000000000 
c000001e3b602700 
  [  174.696147] NIP [d000000014bc0a90] dccp_v4_ctl_send_reset+0xa8/0x2f0 
[dccp_ipv4]
  [  174.696238] LR [d000000014bc21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
  [  174.696312] Call Trace:
  [  174.696345] [c000001fff447930] [c000001fff4479c0] 0xc000001fff4479c0 
(unreliable)
  [  174.696978] [c000001fff4479c0] [d000000

  ----------------------------- MACHINE HANGS -----------------------------

  == Comment: #29 - Kevin W. Rudd <> - 2017-03-20 12:50:22 ==
  Hari,

  I was able to get access to the system for a quick set of validation
  tests. With the default kdump settings, kdump completed OK and correctly
  saved a vmcore. When stress-ng is run first, kdump hangs both with the
  default settings and when the settings are modified to use "maxcpus=1"
  and "noirqdistrib".

  The following message was printed prior to the stress-ng-induced hangs,
  but not when kdump completed without stress-ng running:

  "Ignoring boot flags, incorrect version 0x0"

  I will attach console logs from each test for review.

  == Comment: #30 - Kevin W. Rudd <> - 2017-03-20 12:51:54 ==
  The default boot options had "quiet splash", so I trimmed out the useless 
"ubuntu" splash messages from the log.

  == Comment: #36 - Hari Krishna Bathini <> - 2017-04-05 13:31:23 ==
  If panic timeout is set to zero and any secondary CPUs don't respond to IPI,
  kdump waits forever for a system reset, to try again to get ALL secondary
  CPUs to respond to IPI. The hang here is because panic timeout is set to
  zero and a few secondary CPUs didn't respond to IPI. System reset support
  is still work in progress for Open Power machines. Meantime, to workaround
  the hang issue, panic timeout value can be set to a non-zero value with

          $ echo 10 > /proc/sys/kernel/panic

  I did try it, but kdump didn't take off; instead, the system just rebooted
  (better than a hang, I guess :) ). As for why kdump didn't take off, I am
  debugging it.
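
  To make that workaround persist across reboots, the same value can be set
  through sysctl configuration; a minimal sketch, assuming the stock procps
  tooling and an illustrative file name:

          # /etc/sysctl.d/10-panic-timeout.conf
          kernel.panic = 10

          # Apply immediately, without a reboot:
          $ sudo sysctl -p /etc/sysctl.d/10-panic-timeout.conf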

  Hi Canonical,

  I think it would be better to ship a non-zero default value for the panic
  timeout (CONFIG_PANIC_TIMEOUT).
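
  The default a given kernel ships with can be checked directly; a sketch:

          # 0 means "loop forever after a panic", which is what hangs here
          $ grep CONFIG_PANIC_TIMEOUT /boot/config-$(uname -r)
          $ cat /proc/sys/kernel/panic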

  Thanks
  Hari

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1680349/+subscriptions
