Comment #4 suggested that the panic only happened because
kernel.softlockup_panic was set to 1. Was that verified?
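
For reference, these knobs can be read with sysctl (a minimal sketch; none of
the values are readings from the affected host). Since the quoted trace panics
from khungtaskd ("hung_task: blocked tasks"), kernel.hung_task_panic is worth
checking alongside kernel.softlockup_panic:

  # 1 = the corresponding watchdog escalates a detection into a panic
  sysctl kernel.softlockup_panic
  sysctl kernel.hung_task_panic kernel.hung_task_timeout_secs
  # equivalent procfs reads
  cat /proc/sys/kernel/softlockup_panic /proc/sys/kernel/hung_task_panic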

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1846237

Title:
  Kernel Panic while virsh dump of Guest with 300G RAM is triggered.

Status in The Ubuntu-power-systems project:
  Triaged
Status in libvirt package in Ubuntu:
  Invalid
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - HARIHARAN T. SUNDARESH REDDY - 2019-04-01 12:23:14 ==
  ---Problem Description---
  Kernel panic occurred when `virsh dump` was triggered on a guest with 300G of RAM.

  ---uname output---
  Linux ltcgen6 4.15.0-1017.19-bz175922-ibm-gt #bz175922 SMP Thu Mar 21 
09:34:09 CDT 2019 ppc64le ppc64le ppc64le GNU/Linux

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
  1. Define a guest with 300G of RAM and start the guest.
  2. Run the following command:
     `virsh dump ubuntui_hsr /home/vm.core --memory-only --live`
  3. Wait for some time; a kernel panic will be seen on the console (a way to
     watch the host-side write-back pressure while waiting is sketched below).
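
  While waiting, the dirty/write-back pressure on the host can be watched from
  another terminal (a small sketch; the interval is arbitrary and the dump path
  matches step 2):

    # Dirty/Writeback show how much data the page cache is holding for the disk
    watch -n 5 'grep -E "^(Dirty|Writeback):" /proc/meminfo; ls -lh /home/vm.core'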

  
  ---Log---
  [ 1692.657251] INFO: task journal-offline:7763 blocked for more than 120 seconds.
  [ 1692.657754]       Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
  [ 1692.658220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1692.658839] Kernel panic - not syncing: hung_task: blocked tasks
  [ 1692.659238] CPU: 48 PID: 785 Comm: khungtaskd Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
  [ 1692.659835] Call Trace:
  [ 1692.660025] [c000008fd0eefbf8] [c000000000cea13c] dump_stack+0xb0/0xf4 (unreliable)
  [ 1692.660564] [c000008fd0eefc38] [c000000000110020] panic+0x148/0x328
  [ 1692.661004] [c000008fd0eefcd8] [c000000000233a08] watchdog+0x2c8/0x420
  [ 1692.661429] [c000008fd0eefdb8] [c000000000140068] kthread+0x1a8/0x1b0
  [ 1692.661881] [c000008fd0eefe28] [c00000000000b654] ret_from_kernel_thread+0x5c/0x88
  [ 1692.662439] Sending IPI to other CPUs
  [ 1693.971250] IPI complete
  [ 1694.122536] kexec: Starting switchover sequence.

  [ 1827.285354188,3] PHB#0003[0:3]: CRESET: Unexpected slot state 00000102, resetting...
  [ 1831.426449832,3] PHB#0030[8:0]: CRESET: Unexpected slot state 00000102, resetting...
  [ 1832.158844758,3] PHB#0033[8:3]: CRESET: Unexpected slot state 00000102, resetting...
  [ 1832.403055326,3] PHB#0034[8:4]: CRESET: Unexpected slot state 00000102, resetting...
  [    1.924644] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
  [    1.924647] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
  [    1.991351] vio vio: uevent: failed to send synthetic uevent
  [    2.524824] nouveau 0004:04:00.0: unknown chipset (140000a1)
  [    2.525314] nouveau 0004:05:00.0: unknown chipset (140000a1)
  [    2.525831] nouveau 0035:03:00.0: unknown chipset (140000a1)
  [    2.526315] nouveau 0035:04:00.0: unknown chipset (140000a1)
  /dev/sda2: recovering journal
  systemd-update-utmp.service
  systemd-timesyncd.service
  apparmor.service
  [   19.799936] kdump-tools[1192]: Starting kdump-tools:  * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201904010617/dump-incomplete
  Copying data                                      : [100.0 %] -           eta: 0s
  [   84.327105] kdump-tools[1192]: The kernel version is not supported.
  [   84.327190] kdump-tools[1192]: The makedumpfile operation may be incomplete.
  [   84.327269] kdump-tools[1192]: The dumpfile is saved to /var/crash/201904010617/dump-incomplete.
  [   84.327347] kdump-tools[1192]: makedumpfile Completed.
  [   84.357362] kdump-tools[1192]:  * kdump-tools: saved vmcore in /var/crash/201904010617
  [   86.168085] kdump-tools[1192]:  * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201904010617/dmesg.201904010617
  [   86.207356] kdump-tools[1192]: The kernel version is not supported.
  [   86.207492] kdump-tools[1192]: The makedumpfile operation may be incomplete.
  [   86.207564] kdump-tools[1192]: The dmesg log is saved to /var/crash/201904010617/dmesg.201904010617.
  [   86.207818] kdump-tools[1192]: makedumpfile Completed.
  [   86.211403] kdump-tools[1192]:  * kdump-tools: saved dmesg content in /var/crash/201904010617
  [   86.350667] kdump-tools[1192]: Mon, 01 Apr 2019 06:18:47 -0400
  [   86.531296] kdump-tools[1192]: Rebooting.

  == Comment: #4 - HARIHARAN T. SUNDARESH REDDY - 2019-04-01 23:43:05 ==
  After the kernel panic, the host triggers kdump and reboots.

  == Comment: #13 - Leonardo Bras Soares Passos - 2019-05-15 11:51:33 ==
  I have been doing a lot of tests lately, and found that this kind of bug
  usually happens when disk IO is so intense that no other application can
  write to the disk.

  I have studied libvirt and even proposed a patch on the mailing list that
  solves the problem by sleeping once in a while to make room for other
  applications to write to disk.
  Unfortunately, that was a mistake: this sort of problem should not be the
  responsibility of the application.
  It is the kernel's responsibility to schedule IO access so that nothing
  starves, and that is done by the Linux IO scheduler.

  The IO scheduler priority can be read and set with the ionice utility.
  libvirt_iohelper runs with ionice "none: prio 0".
  journald also has ionice "none: prio 0", yet at times it seems unable to
  preempt libvirt_iohelper, which leads to the panic.
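
  For illustration, a minimal sketch of how the class and priority can be
  inspected and changed with ionice (the PIDs and paths are placeholders;
  whether changing them actually avoids the starvation is still the open
  question):

    # show the IO scheduling class/priority of a running process
    ionice -p "$(pidof systemd-journald)"
    # start a heavy writer in the idle class, so it only gets otherwise-unused disk time
    ionice -c 3 dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024
    # or demote an already-running process to best-effort, lowest priority
    ionice -c 2 -n 7 -p <pid of libvirt_iohelper>

  Note that, as far as I know, these classes are honored by CFQ/BFQ but not by
  mq-deadline, which matters for the later comments.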

  I have also crafted a simple test that mimics the behavior of
  libvirt_iohelper and can also recreate this bug (attached).
  It shows that the problem can happen independently of libvirt.
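
  A rough sketch of the same idea (the attached test may differ; the paths and
  sizes here are placeholders):

    # flood the page cache with a large sequential write, roughly what the
    # virsh dump stream does through libvirt_iohelper
    dd if=/dev/zero of=/home/hugefile bs=1M count=300000 &
    # meanwhile, any small synchronous write (journald fsyncs its journal
    # files) takes far longer than it should
    time dd if=/dev/zero of=/home/smallfile bs=4k count=1 conv=fsync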

  For now, I am looking for a way to understand how the scheduler works.
  I believe it can be fixed just by adjusting ionice priorities.

  If that is not possible, I will try to debug the kernel and see if I can
  get any clue as to why this happens, and find a way to fix it.

  
  == Comment: #15 - Leonardo Bras Soares Passos - 2019-05-29 16:14:09 ==
  I was trying to better understand I/O priority, so I read a few documents and
  found out that the I/O scheduler on this host is CFQ.

  After reading a brief description of the Linux CFQ I/O scheduler, I
  understood it was supposed to prevent exactly that kind of starvation (as
  said before, both journald and libvirt_iohelper had the same priority,
  "none: prio 0"), since it creates queues per process and serves each queue
  according to its priority, allocating a disk timeslice to each.

  For now, I am trying to better understand the behavior of CFQ
  (block/cfq-iosched.c), to see if there is a fix that could solve the
  problem.

  While trying to reproduce the problem on newer kernels (to check whether the
  bug had already been fixed), I found that block/cfq-iosched.c was removed, as
  CFQ was deprecated in favor of BFQ. On the Ubuntu 5.0 kernel (hwe-18.04-edge)
  the default I/O scheduler was replaced by mq-deadline.
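
  For reference, the active scheduler can be checked and switched per block
  device at runtime (sda is an example; the available list depends on which
  scheduler modules are built or loaded):

    # the bracketed entry is the scheduler currently in use,
    # e.g. "noop deadline [cfq]" on 4.15 or "[mq-deadline] none" on 5.0
    cat /sys/block/sda/queue/scheduler
    # switch to another available scheduler to compare behavior
    echo bfq > /sys/block/sda/queue/scheduler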

  The bug seems to reproduce much more easily on mq-deadline. As it is simpler,
  I quickly studied its workings: the idea of a deadline scheduler is that it
  prioritizes requests whose maximum timeout has expired.

  Then I created a systemtap program to better understand why it starves
  journald (a rough alternative using the kernel's block tracepoints is
  sketched after this list):
  - The test program creates write requests much faster than the disk can
  absorb them: even after the program finishes executing, mq-deadline keeps
  working for a few minutes.
  - After a few seconds of writing, all requests appear to hit their deadline,
  i.e. they are only written after their timeout has expired.
  - This means that, due to the large amount of data to write, the scheduler
  degrades into a simple FIFO.
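
  The systemtap script is not reproduced here; a rough substitute that exposes
  the same insert-to-dispatch latency uses the kernel's block tracepoints via
  ftrace (the device numbers and output path are examples):

    cd /sys/kernel/debug/tracing
    echo 1 > events/block/block_rq_insert/enable
    echo 1 > events/block/block_rq_issue/enable
    echo 1 > events/block/block_rq_complete/enable
    # the gap between the insert and issue events for a request shows how long
    # it waited in mq-deadline; requests that only move once their deadline
    # expires match the FIFO-like behavior described above
    grep '8,0' trace_pipe > /tmp/blkevents.log   # 8,0 = sda's major:minor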

  About mq-deadline: it is important to further debug its behavior, because it
  looks like Ubuntu kernels from 4.20 onward will use it as the default I/O
  scheduler. There are a few more tests I want to try (like repeating the
  process with memory 90% full) in order to understand whether having less
  memory available for caching IO writes improves the behavior.
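
  One related knob to check before those tests is how much dirty page cache the
  kernel will accumulate before throttling writers (a sketch; the 10%/20%
  figures are the usual defaults, not values read from this host):

    # fraction of RAM that may be dirty before background / forced writeback
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # on a host with hundreds of GB of RAM even 10%/20% is tens of GB, so a
    # single writer can get very far ahead of the disk; lowering the limits
    # (or using vm.dirty_bytes) bounds that backlog
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=5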

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1846237/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
