Default Comment by Bridge

** Attachment added: "Kdump console logs"
   
https://bugs.launchpad.net/bugs/1751835/+attachment/5063610/+files/kdump_failure_on_proposed_4.15.txt

** Changed in: ubuntu
     Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)

** Package changed: ubuntu => kexec-tools (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to kexec-tools in Ubuntu.
https://bugs.launchpad.net/bugs/1751835

Title:
  Ubuntu 18.04 kdump does not work on bionic-proposed 4.15 ppc64el with
  AC922

Status in The Ubuntu-power-systems project:
  New
Status in kexec-tools package in Ubuntu:
  New

Bug description:
  No guest was running; only the Ubuntu 18.04 host was installed, and kdump
  was tried on the host only.

  ===========
  Tried the Ubuntu-released proposed kernel 4.15.0-10-generic. The kdump
  process was able to proceed (it did not dump a complete vmcore, though) but
  then hit an issue; please check the console log below.
  ========

  root@ltciofvtr-spoon4:~# dmesg | grep -i crash 
  [    0.000000] Reserving 6144MB of memory at 128MB for crashkernel (System RAM: 524288MB)
  [    0.000000] Kernel command line: root=UUID=749429f9-83b9-4776-9a3e-147e297cdf99 ro quiet splash crashkernel=6144M
  root@ltciofvtr-spoon4:~# dmesg | grep -i reserv
  [    0.000000] Reserving 6144MB of memory at 128MB for crashkernel (System RAM: 524288MB)
  [    0.000000] cma: Reserved 26224 MiB at 0x0000203995000000
  [    0.000000]   DMA zone: 0 pages reserved
  [    0.000000]   DMA zone: 0 pages reserved
  [    0.000000] Memory: 502952192K/536870912K available (13376K kernel code, 2048K rwdata, 3648K rodata, 4800K init, 3037K bss, 7065344K reserved, 26853376K cma-reserved)
  root@ltciofvtr-spoon4:~# uname -a
  Linux ltciofvtr-spoon4 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
  root@ltciofvtr-spoon4:~# kdump
  kdump         kdump-config  
  root@ltciofvtr-spoon4:~# kdump
  kdump         kdump-config  
  root@ltciofvtr-spoon4:~# kdump-config 
  Usage: /usr/sbin/kdump-config {help|test|show|status|load|unload|savecore|propagate|symlinks kernel-version}
  root@ltciofvtr-spoon4:~# kdump-config  status
  current state   : ready to kdump
  root@ltciofvtr-spoon4:~# echo 1 > /proc/sys/kernel/sysrq 
  root@ltciofvtr-spoon4:~# echo c > /proc/sysrq-trigger 
  [  164.604204] sysrq: SysRq : Trigger a crash
  [  164.604264] Unable to handle kernel paging request for data at address 
0x00000000
  [  164.604395] Faulting instruction address: 0xc0000000007ea268
  [  164.604481] Oops: Kernel access of bad area, sig: 11 [#1]
  [  164.604559] LE SMP NR_CPUS=2048 NUMA PowerNV
  [  164.604628] Modules linked in: idt_89hpesx ofpart ipmi_powernv 
ipmi_devintf ipmi_msghandler ibmpowernv cmdlinepart vmx_crypto powernv_flash 
at24 uio_pdrv_genirq uio mtd opal_prd crct10dif_vpmsum sch_fq_codel ip_tables 
x_tables autofs4 mlx5_ib ib_core mlx5_core bnx2x ast i2c_algo_bit ttm 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mdio ahci mlxfw 
libcrc32c crc32c_vpmsum drm tg3 libahci devlink
  [  164.605176] CPU: 72 PID: 3444 Comm: bash Not tainted 4.15.0-10-generic 
#11-Ubuntu
  [  164.605305] NIP:  c0000000007ea268 LR: c0000000007eb1a8 CTR: 
c0000000007ea240
  [  164.605413] REGS: c0002038f8c2f9f0 TRAP: 0300   Not tainted  
(4.15.0-10-generic)
  [  164.605496] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28222222 
 XER: 20040000
  [  164.605641] CFAR: c0000000007eb1a4 DAR: 0000000000000000 DSISR: 42000000 
SOFTE: 1 
  [  164.605641] GPR00: c0000000007eb1a8 c0002038f8c2fc70 c0000000016ea600 
0000000000000063 
  [  164.605641] GPR04: c000203993d0ce18 c000203993d24368 9000000000009033 
0000000000000808 
  [  164.605641] GPR08: 0000000000000007 0000000000000001 0000000000000000 
9000000000001003 
  [  164.605641] GPR12: c0000000007ea240 c000000007a51800 000008f7fbb8e008 
0000000000000000 
  [  164.605641] GPR16: 000008f7fb19e9f0 000008f7fb231998 000008f7fb2319d0 
000008f7fb268204 
  [  164.605641] GPR20: 0000000000000000 0000000000000001 0000000000000000 
00007fffed1e1b54 
  [  164.605641] GPR24: 00007fffed1e1b50 000008f7fb26afc4 c0000000015e9930 
0000000000000002 
  [  164.605641] GPR28: 0000000000000063 0000000000000004 c000000001572a9c 
c0000000015e9cd0 
  [  164.606714] NIP [c0000000007ea268] sysrq_handle_crash+0x28/0x30
  [  164.606796] LR [c0000000007eb1a8] __handle_sysrq+0xf8/0x2c0
  [  164.606865] Call Trace:
  [  164.606919] [c0002038f8c2fc70] [c0000000007eb188] 
__handle_sysrq+0xd8/0x2c0 (unreliable)
  [  164.607043] [c0002038f8c2fd10] [c0000000007eb9b4] 
write_sysrq_trigger+0x64/0x90
  [  164.607152] [c0002038f8c2fd40] [c00000000047c548] proc_reg_write+0x88/0xd0
  [  164.607252] [c0002038f8c2fd70] [c0000000003cf8dc] __vfs_write+0x3c/0x70
  [  164.607347] [c0002038f8c2fd90] [c0000000003cfb38] vfs_write+0xd8/0x220
  [  164.607445] [c0002038f8c2fde0] [c0000000003cfe58] SyS_write+0x68/0x110
  [  164.607545] [c0002038f8c2fe30] [c00000000000b184] system_call+0x58/0x6c
  [  164.607640] Instruction dump:
  [  164.607689] 4bfff9f1 4bfffe50 3c4c00f0 384203c0 7c0802a6 60000000 39200001 
3d42001c 
  [  164.607805] 394a76b0 912a0000 7c0004ac 39400000 <992a0000> 4e800020 
3c4c00f0 38420390 
  [  164.607925] ---[ end trace d9ded5212faa751b ]---
  [  165.612963] 
  [  165.613070] Sending IPI to other CPUs
  [  165.623168] IPI complete
  [  165.627790] kexec: waiting for cpu 1 (physical 1) to enter OPAL
  [  165.629875] kexec: waiting for cpu 5 (physical 5) to enter OPAL
  [  419.363342082,5] OPAL: Switch to big-endian OS
  [  423.007803159,5] OPAL: Switch to little-endian OS
  [  165.635949] kexec: waiting for cpu 7 (physical 7) to enter OPAL
  [  167.322686] kexec: Starting switchover sequence.
  [    1.245751] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
  [    1.245755] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
  [    1.314527] vio vio: uevent: failed to send synthetic uevent
  /dev/sda2: recovering journal
  /dev/sda2: clean, 134971/61054976 files, 5545559/244188416 blocks
  [    6.725980] vio vio: uevent: failed to send synthetic uevent
  [  OK  ] Started Show Plymouth Boot Screen.
  plymouth-start.service
  [  OK  ] Reached target Local Encrypted Volumes.
  [  OK  ] Started Forward Password Requests to Plymouth Directory Watch.
  [  OK  ] Started Network Time Synchronization.
  systemd-timesyncd.service
  [  OK  ] Reached target System Time Synchronized.
  [  OK  ] Started AppArmor initialization.
  apparmor.service
  [  OK  ] Reached target System Initialization.
  [  OK  ] Started Wait for Network to be Configured.
  systemd-networkd-wait-online.service
  [  OK  ] Reached target Network is Online.
           Starting Kernel crash dump capture service...
  [   24.498357] kdump-tools[1424]: Starting kdump-tools:  * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201802160010/dump-incomplete
  [   41.042895] Fatal Hypervisor Maintenance interrupt [Not recovered]
  [   41.042962]  Error detail: Malfunction Alert
  [   41.043003]        HMER: 8040000000000000
  [   41.043060]        Unknown Malfunction Alert of type 0
  [   41.043113] opal: Hardware platform error: Unrecoverable HMI exception
  [   41.849182] WARNING: CPU: 2 PID: 55 at 
/build/linux-jWa1Fv/linux-4.15.0/kernel/sched/core.c:1188 
set_task_cpu+0x240/0x250
  [   41.849297] Modules linked in: mlx5_ib ibmpowernv vmx_crypto ipmi_powernv 
ofpart crct10dif_vpmsum ipmi_devintf ipmi_msghandler cmdlinepart powernv_flash 
opal_prd mtd idt_89hpesx ib_core bnx2x at24 mdio mlx5_core mlxfw libcrc32c 
devlink tg3 sch_fq_codel ip_tables x_tables autofs4 crc32c_vpmsum ast 
i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
ahci uio_pdrv_genirq drm uio libahci
  [   41.849810] CPU: 2 PID: 55 Comm: kworker/2:1 Not tainted 4.15.0-10-generic 
#11-Ubuntu
  [   41.849945] Workqueue: events hmi_event_handler
  [   41.850025] NIP:  c000000008149aa0 LR: c00000000814a6cc CTR: 
c000000008156600
  [   41.850154] REGS: c00000016feeeeb0 TRAP: 0700   Not tainted  
(4.15.0-10-generic)
  [   41.850270] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  
CR: 28002244  XER: 00000000
  [   41.850408] CFAR: c00000000814990c SOFTE: 0 
  [   41.850408] GPR00: c00000000814a6cc c00000016feef130 c0000000096ea600 
c00000016f432b00 
  [   41.850408] GPR04: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 
  [   41.850408] GPR08: c000000009721ed8 0000000000000000 0000000000000000 
0000000000000002 
  [   41.850408] GPR12: 0000000028002244 c00000000fff1600 c000000008138be8 
c00000016f54cf00 
  [   41.850408] GPR16: 0000000000000000 0000000000000000 0000000000000000 
000000017e0a0000 
  [   41.850408] GPR20: c0000001872635c0 0000000000000000 0000000000000000 
0000000000000000 
  [   41.850408] GPR24: c00000016f432f28 c00000000971dd78 c0000000091d8580 
0000000000000000 
  [   41.850408] GPR28: 0000000000000004 0000000000000000 0000000000000000 
c00000016f432b00 
  [   41.851427] NIP [c000000008149aa0] set_task_cpu+0x240/0x250
  [   41.851507] LR [c00000000814a6cc] try_to_wake_up+0x1bc/0x660
  [   41.851605] Call Trace:
  [   41.851649] [c00000016feef130] [c0000000091d8580] runqueues+0x0/0xc00 
(unreliable)
  [   41.851791] [c00000016feef170] [c00000000814a6cc] 
try_to_wake_up+0x1bc/0x660
  [   41.851921] [c00000016feef1f0] [c00000000816deb0] 
__wake_up_common+0xd0/0x200
  [   41.852053] [c00000016feef260] [c00000000843f020] 
ep_poll_callback+0xe0/0x2f0
  [   41.852159] [c00000016feef2c0] [c00000000816deb0] 
__wake_up_common+0xd0/0x200
  [   41.852271] [c00000016feef330] [c00000000816e09c] 
__wake_up_common_lock+0xbc/0x110
  [   41.852379] [c00000016feef3c0] [c00000000818ace0] 
wake_up_klogd_work_func+0x60/0xc0
  [   41.852492] [c00000016feef3f0] [c000000008291ef0] 
irq_work_run_list+0xb0/0x100
  [   41.852597] [c00000016feef430] [c0000000081b34e0] 
update_process_times+0x60/0x90
  [   41.852705] [c00000016feef460] [c0000000081cb1d4] 
tick_sched_handle.isra.5+0x34/0xd0
  [   41.852817] [c00000016feef490] [c0000000081cb2d0] 
tick_sched_timer+0x60/0xe0
  [   41.852921] [c00000016feef4d0] [c0000000081b4034] 
__hrtimer_run_queues+0x144/0x370
  [   41.853029] [c00000016feef550] [c0000000081b4f9c] 
hrtimer_interrupt+0xfc/0x350
  [   41.853137] [c00000016feef620] [c000000008024950] 
__timer_interrupt+0x90/0x260
  [   41.853234] [c00000016feef670] [c000000008024d68] timer_interrupt+0x98/0xe0
  [   41.853331] [c00000016feef6a0] [c000000008009014] 
decrementer_common+0x114/0x120
  [   41.853446] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
  [   41.853446]     LR = arch_local_irq_restore+0x74/0x90
  [   41.853602] [c00000016feef990] [c00000016feef9d0] 0xc00000016feef9d0 
(unreliable)
  [   41.853711] [c00000016feef9b0] [c000000008d03d00] 
_raw_spin_unlock_irqrestore+0x40/0xa0
  [   41.853825] [c00000016feef9d0] [c00000000857923c] pstore_dump+0x31c/0x3e0
  [   41.853922] [c00000016feefb10] [c00000000818af04] kmsg_dump+0x134/0x1a0
  [   41.854017] [c00000016feefb70] [c0000000080a1214] 
pnv_platform_error_reboot+0x94/0x110
  [   41.854127] [c00000016feefbe0] [c0000000080a648c] 
hmi_event_handler+0x1bc/0x1c0
  [   41.854236] [c00000016feefc90] [c00000000812fdb8] 
process_one_work+0x298/0x5a0
  [   41.854341] [c00000016feefd20] [c000000008130158] worker_thread+0x98/0x630
  [   41.854438] [c00000016feefdc0] [c000000008138d88] kthread+0x1a8/0x1b0
  [   41.854533] [c00000016feefe30] [c00000000800b528] 
ret_from_kernel_thread+0x5c/0xb4
  [   41.854642] Instruction dump:
  [   41.854691] 7faa3670 7d4a0194 57a706be 7d4a07b4 794a1f24 7d28502a 7d293c36 
71290001 
  [   41.854797] 4082fe80 60000000 60000000 60420000 <0fe00000> 4bfffe6c 
60000000 60420000 
  [   41.854927] ---[ end trace c9f2cb34f3ff824e ]---
  [  497.382416495,0] OPAL: Reboot requested due to Platform error.
  [  497.382470331,3] OPAL: Reboot requested due to Platform error.

  --== Welcome to Hostboot hostboot-7050d0a/hbicore.bin ==--

    3.99470|secure|SecureROM valid - enabling functionality
    3.99474|secure|Booting in non-secure mode.
    5.18577|ERRL|Dumping errors reported prior to registration
    5.20139|================================================
    5.20139|Error reported by ipmi (0x2500) PLID 0x90033DC5
    5.20949|  Requested sensor is not present.
    5.20949|  ModuleId   0x03 IPMI::MOD_IPMISENSOR
    5.20949|  ReasonCode 0x2507 IPMI::RC_SENSOR_NOT_PRESENT
    5.20950|  UserData1  BMC IPMI Completion code. : 0x00000000000000cb
    5.20951|  UserData2  bytes [0-1]sensor name bytes [2-3]sensor number bytes 
[4-7]HUID of target. : 0xca22000b00010000
    5.20952|------------------------------------------------
    5.20952|  Callout type             : Procedure Callout
    5.20952|  Procedure                : EPUB_PRC_HB_CODE
    5.20953|  Priority                 : SRCI_PRIORITY_HIGH
    5.20953|------------------------------------------------
    5.20954|  Hostboot Build ID: hostboot-7050d0a/hbicore.bin
    5.20954|================================================
    6.20986|ISTEP  6. 5 - host_init_fsi
    6.34649|ISTEP  6. 6 - host_set_ipl_parms
    6.36715|ISTEP  6. 7 - host_discover_targets
    6.84607|HWAS|PRESENT> DIMM[03]=AAAA000000000000
    6.84608|HWAS|PRESENT> Proc[05]=8800000000000000
    6.84609|HWAS|PRESENT> Core[07]=CCFC3FFF03F30000
    6.86942|ISTEP  6. 8 - host_update_master_tpm

  
  == Comment: #9 - NAVEED A. UPPINANGADY SALIH <naveed...@in.ibm.com> - 2018-02-15 23:55:28 ==
  root@ltciofvtr-spoon4:/var/crash# ls -lrht 201802160038
  total 251M
  -rw------- 1 root root 448M Feb 16 00:38 dump-incomplete

  == Comment: #14 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2018-02-22 01:40:19 ==
  The following 4 patches are missing from the kexec-tools package, which leads to the HMIs:

    commit 69431282f075ab723c4886f20aa248976920aaae
    Author: Hari Bathini <hbath...@linux.vnet.ibm.com>
    Date:   Tue Aug 29 23:08:02 2017 +0530

      kexec-tools: ppc64: fix leak while checking for coherent device memory
      
      Signed-off-by: Hari Bathini <hbath...@linux.vnet.ibm.com>
      Signed-off-by: Simon Horman <ho...@verge.net.au>

    commit aec4d0f7a2502a13fc21e90ff32dc306b0ad1190
    Author: Hari Bathini <hbath...@linux.vnet.ibm.com>
    Date:   Thu Aug 17 18:01:51 2017 +0530

      kexec-tools: ppc64: avoid adding coherent memory regions to crash memory ranges
      
      Accelerator devices like GPU and FPGA cards contain onboard memory. This
      onboard memory is represented as a memory only NUMA node, integrating it
      with core memory subsystem. Since, the link through which these devices
      are integrated to core memory goes down after a system crash and they are
      meant for user workloads, avoid adding coherent device memory regions to
      crash memory ranges. Without this change, makedumpfile tool tries to save
      unaccessible coherent device memory regions, crashing the system.
      
      Signed-off-by: Hari Bathini <hbath...@linux.vnet.ibm.com>
      Tested-by: Pingfan Liu <pi...@redhat.com>
      Signed-off-by: Simon Horman <ho...@verge.net.au>
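
  As a rough illustration of the idea in the commit above (not the actual
  kexec-tools C code), the sketch below drops coherent device memory regions
  from a list of candidate crash memory ranges. The `MemRange` type, the
  `is_coherent_device` flag, and the sample addresses are hypothetical; how
  kexec-tools really detects such regions is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemRange:
    """Hypothetical description of one memory region."""
    start: int
    end: int  # exclusive end address
    is_coherent_device: bool = False  # assumed flag, e.g. GPU/FPGA onboard memory


def crash_memory_ranges(regions: List[MemRange]) -> List[MemRange]:
    """Keep only the regions that are not coherent device memory.

    Coherent device memory may become inaccessible after a crash, so it is
    left out of the ranges that the dump tool would later try to read.
    """
    return [r for r in regions if not r.is_coherent_device]


# Made-up layout: two system RAM regions and one device memory region.
regions = [
    MemRange(0x0, 0x1000_0000),
    MemRange(0x2000_0000_0000, 0x2000_4000_0000, is_coherent_device=True),
    MemRange(0x1000_0000, 0x2000_0000),
]
safe = crash_memory_ranges(regions)
for r in safe:
    print(f"{r.start:#x}-{r.end:#x}")
```

  With the device region filtered out, only the two system RAM ranges remain
  for the dump tool to read.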

    commit 21eb397a5fc9227cd95d23e8c74a49cf6a293e57
    Author: Hari Bathini <hbath...@linux.vnet.ibm.com>
    Date:   Wed Aug 9 23:47:42 2017 +0530

      kexec-tools: powerpc: fix command line overflow error
      
      Since kernel commit a5980d064fe2 ("powerpc: Bump COMMAND_LINE_SIZE
      to 2048"), powerpc bumped command line size to 2048 but the size
      used here is still the default value of 512. Bump it to 2048 to
      fix command line overflow errors observed when command line length
      is above 512 bytes. Also, get rid of the multiple definitions of
      COMMAND_LINE_SIZE macro in ppc architecture.
      
      Signed-off-by: Hari Bathini <hbath...@linux.vnet.ibm.com>
      Signed-off-by: Simon Horman <ho...@verge.net.au>
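
  To make the overflow in the commit above concrete, here is a minimal
  sketch that checks a command line against the old limit (512) and the new
  limit (2048). The `fits` helper, the sample command line, and the
  assumption that one byte is reserved for a terminating NUL are
  illustrative only and are not taken from kexec-tools.

```python
OLD_COMMAND_LINE_SIZE = 512    # previous default mentioned in the commit
NEW_COMMAND_LINE_SIZE = 2048   # value after the bump


def fits(cmdline: str, limit: int) -> bool:
    """Return True if cmdline plus one terminating NUL byte fits in `limit` bytes.

    The extra byte for the NUL terminator is an assumption of this sketch.
    """
    return len(cmdline.encode("utf-8")) + 1 <= limit


# Hypothetical command line that is longer than 512 bytes.
cmdline = "root=/dev/sda2 ro quiet " + "x" * 600
fits_old = fits(cmdline, OLD_COMMAND_LINE_SIZE)
fits_new = fits(cmdline, NEW_COMMAND_LINE_SIZE)
print(fits_old, fits_new)
```

  A command line of this length is rejected under the old limit but accepted
  under the new one, which is the class of error the commit describes.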

    commit 47478ea66d4301b12a07862aebc8447a2932f0ed
    Author: Hari Bathini <hbath...@linux.vnet.ibm.com>
    Date:   Wed Jul 26 22:49:41 2017 +0530

      kexec-tools: ppc64: fix how RMA top is deduced
      
      Hang was observed, in purgatory, on a machine configured with
      single LPAR. This was because one of the segments was loaded
      outside the actual Real Memory Area (RMA) due to wrongly
      deduced RMA top value.
      
      Currently, top of real memory area, which is crucial for loading
      kexec/kdump kernel, is obtained by iterating through mem nodes
      and setting its value based on the base and size values of the
      last mem node in the iteration. That can't always be correct as
      the order of iteration may not be same and RMA base & size are
      always based on the first memory property. Fix this by setting
      RMA top value based on the base and size values of the memory
      node that has the smallest base value (first memory property)
      among all the memory nodes.
      
      Also, correct the misnomers rmo_base and rmo_top to rma_base
      and rma_top respectively.
      
      While how RMA top is deduced was broken for sometime, the issue
      may not have been seen so far, for couple of possible reasons:
      
          1. Only one mem node was available.
          2. First memory property has been the last node in
             iteration when multiple mem nodes were present.
      
      Fixes: 02f4088ffded ("kexec fix ppc64 device-tree mem node")
      Reported-by: Ankit Kumar <an...@linux.vnet.ibm.com>
      Cc: Michael Ellerman <m...@ellerman.id.au>
      Cc: Geoff Levand <ge...@infradead.org>
      Signed-off-by: Hari Bathini <hbath...@linux.vnet.ibm.com>
      Signed-off-by: Simon Horman <ho...@verge.net.au>
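
  The difference between the old and the fixed way of deducing RMA top, as
  described in the commit above, can be sketched as follows. This is not the
  kexec-tools implementation: the `(base, size)` tuple representation, the
  function names, and the sample values are hypothetical, and any further
  limit the real code may apply to the result is omitted.

```python
from typing import List, Tuple

# Hypothetical representation of a memory node: (base, size).
MemNode = Tuple[int, int]


def rma_top_last_node(nodes: List[MemNode]) -> int:
    """Old behaviour per the commit text: use whichever node is iterated last."""
    base, size = nodes[-1]
    return base + size


def rma_top_lowest_base(nodes: List[MemNode]) -> int:
    """Fixed behaviour per the commit text: use the node with the smallest base."""
    base, size = min(nodes, key=lambda node: node[0])
    return base + size


# Made-up layout in which the first memory property is NOT iterated last.
nodes = [(0x0, 0x1000_0000), (0x2000_0000_0000, 0x4000_0000)]
print(hex(rma_top_last_node(nodes)))
print(hex(rma_top_lowest_base(nodes)))
```

  The old approach depends on iteration order, so it only gives the right
  answer when the lowest-base node happens to come last; the fixed approach
  returns the same value for any ordering of the nodes.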
  --

  I would recommend upgrading to the latest kexec-tools version, 2.0.16 (which
  includes these patches), instead of cherry-picking them.

  Thanks
  Hari

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1751835/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
