Processed: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Processing control commands:

> reassign -1 xen-hypervisor-4.8-amd64
Bug #880554 [linux-image-4.9.0-4-amd64] xen domu freezes with kernel linux-image-4.9.0-4-amd64
Bug reassigned from package 'linux-image-4.9.0-4-amd64' to 'xen-hypervisor-4.8-amd64'.
No longer marked as found in versions linux/4.9.51-1.
Ignoring request to alter fixed versions of bug #880554 to the same values previously set

-- 
880554: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
control: reassign -1 xen-hypervisor-4.8-amd64

On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> > According to that link, the fix seems to be configuration rather than
> > code.
> > Does this mean this bug against the kernel should be closed?
>
> Yes, the problem seems to be in the Xen hypervisor and not the Linux
> kernel itself. The default value for the gnttab_max_frames parameter
> needs to be increased to avoid domU disk IO hangs, for example:
>
>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>
> So either close the bug or reassign it to xen-hypervisor package so
> they can increase the default value for this parameter in the
> hypervisor code.
>

Ok, I'll reassign and let the Xen maintainers handle that (maybe in a
stable update).

@Xen maintainers: see the complete bug log for more information, but
basically it seems that domU freezes happen with the “new” multi-queue
xen blk driver, and the fix is to increase a configuration value.
Valentin suggests making the higher value the default.

Regards,
-- 
Yves-Alexis
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> According to that link, the fix seems to be configuration rather than code.
> Does this mean this bug against the kernel should be closed?

Yes, the problem seems to be in the Xen hypervisor and not the Linux
kernel itself. The default value for the gnttab_max_frames parameter
needs to be increased to avoid domU disk IO hangs, for example:

  GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"

So either close the bug or reassign it to xen-hypervisor package so
they can increase the default value for this parameter in the
hypervisor code.

-- 
Valentin
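For anyone applying the workaround by hand: on Debian the GRUB_CMDLINE_XEN variable usually lives in /etc/default/grub (or in a file under /etc/default/grub.d/), and the change only takes effect after running update-grub and rebooting the dom0. The following is only a rough sketch of how the option could be merged into an existing value without duplicating it. The helper name set_xen_option is made up for this illustration and is not part of any Xen or GRUB tooling.

```python
def set_xen_option(cmdline: str, key: str, value: str) -> str:
    """Return cmdline with key=value set exactly once.

    Hypothetical helper: any existing "key=..." token is dropped and the
    new setting is appended at the end; all other tokens are kept in order.
    """
    tokens = [t for t in cmdline.split() if not t.startswith(key + "=")]
    tokens.append(f"{key}={value}")
    return " ".join(tokens)


if __name__ == "__main__":
    # Value taken from the example in this thread; adjust dom0_mem as needed.
    current = "dom0_mem=10240M"
    new = set_xen_option(current, "gnttab_max_frames", "256")
    print(f'GRUB_CMDLINE_XEN="{new}"')
```

Running it on the example value prints the same line that was suggested above; an already present gnttab_max_frames setting is replaced rather than repeated.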
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Fri, 2017-11-17 at 07:39 +0100, Valentin Vidic wrote:
> Hi,
>
> The problem seems to be caused by the new multi-queue xen blk driver
> and I was advised by the Xen devs to increase the gnttab_max_frames=256
> parameter for the hypervisor. This has solved the blocking issue
> for me and it has been running without problems for a few months now.

I'm not really fluent in Xen, but does this relate to the kernel in dom0
or one of the domU then?

>
> I/O to LUNs hang / stall under high load when using xen-blkfront
> https://www.novell.com/support/kb/doc.php?id=7018590

According to that link, the fix seems to be configuration rather than code.
Does this mean this bug against the kernel should be closed?

Regards,
-- 
Yves-Alexis
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi,

The problem seems to be caused by the new multi-queue xen blk driver
and I was advised by the Xen devs to increase the gnttab_max_frames=256
parameter for the hypervisor. This has solved the blocking issue
for me and it has been running without problems for a few months now.

I/O to LUNs hang / stall under high load when using xen-blkfront
https://www.novell.com/support/kb/doc.php?id=7018590

-- 
Valentin
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
We're having the same problem here. For some reason, only 2 domUs are
affected (the dom0 has a total of 22 domUs, 14 of those are running
Debian stretch, and 13 of those are running Linux 4.9.51-1).

The `xl console` output of the first domU (according to our monitoring
it has been hanging since yesterday 14:06):

[ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3746.780094] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3746.780228] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780304] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3746.780308] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612083] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3867.612091] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3867.612150] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612238] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3867.612242] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612287] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3867.612291] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444071] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3988.444080] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444154] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3988.444159] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.
[ 3988.444271] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The other domU had a similar error message before a coworker downgraded
the kernel to 3.16 to get it working again:

INFO: task jbd2/xvda1-8:191 blocked for more than 120 seconds.
[ 605.148107] Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 605.148111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first domU is a backup machine; it mainly uses rsync --link-dest to
pull backups from other machines, and is therefore rather IO intensive.
The other domU is a firewall/router and shouldn't be IO intensive at all.

-- 
Kind regards

Martin v. Wittich

IServ GmbH
Bültenweg 73
38106 Braunschweig

Phone: 0531-2243666-0
Fax: 0531-2243666-9
E-Mail: i...@iserv.eu
Internet: iserv.eu

VAT ID: DE265149425 | Amtsgericht Braunschweig | HRB 201822
Managing directors: Benjamin Heindl, Jörg Ludwig
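When comparing several affected domUs it can help to summarize which tasks keep showing up in the hung-task warnings. The snippet below is only a sketch that counts them in saved console output. The regular expression is inferred from the messages quoted above, and the function name blocked_tasks is made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the "blocked for more than N seconds" lines above.
# The task name may itself contain colons (e.g. kworker/2:0), so the PID is
# taken to be the last run of digits before " blocked".
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def blocked_tasks(log: str) -> Counter:
    """Count hung-task warnings per task name in a kernel log excerpt."""
    return Counter(m.group(1) for m in HUNG_RE.finditer(log))


if __name__ == "__main__":
    sample = (
        "[ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.\n"
        "[ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.\n"
        "[ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.\n"
        "[ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.\n"
    )
    for name, count in blocked_tasks(sample).most_common():
        print(f"{count:3d}  {name}")
```

Repeated jbd2 entries for an xvd* device would be consistent with the journal thread waiting on block IO that never completes, which fits the blkfront explanation discussed in this bug.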
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Update:

First of all: forget my observation about the 'system boot time'. I
mixed something up: the dom0 boot time had increased, but that was
probably caused by the LVM thin activation not being handled properly
during system boot.

One last thing I pulled from the domU with the original kernel
(4.9.51-1) was this top output:

top - 20:41:03 up 6:18, 2 users, load average: 17.03, 6.98, 2.62
Tasks: 231 total, 1 running, 230 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id,100.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 0.0 id, 99.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8212616 total, 1907568 free, 1485276 used, 4819772 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 6558984 avail Mem

At this point the system is more or less unusable; everything depending
on IO is dead.

Currently my production domU has been running for over a week with the
latest backports kernel (linux-image-4.13.0-0.bpo.1-amd64). The dom0 is
still on the current stretch kernel (4.9.51-1), and it seems stable for
now.

My guess would be some issue with the xen blkfront driver. Around the
end of last year I experienced something similar with jessie. After some
kernel updates those issues got better. They are not completely gone:
some jessie domUs need a reboot from time to time due to rising iowait
(wa), but the system is still responsive then, it just gets slower and
slower by the minute.
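To catch this state from monitoring before the domU becomes completely unresponsive, the per-CPU iowait figures could be extracted from batch-mode top output. This is just a sketch: the pattern is based on the %Cpu lines quoted above, the function name iowait_by_cpu is hypothetical, and the 90% threshold is an arbitrary assumption.

```python
import re

# Matches per-CPU summary lines such as:
#   %Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id,100.0 wa, ...
CPU_RE = re.compile(r"%Cpu(\d+)\s*:.*?([\d.]+)\s*wa")


def iowait_by_cpu(top_output: str) -> dict:
    """Map CPU number to its iowait percentage from top's summary lines."""
    return {int(m.group(1)): float(m.group(2)) for m in CPU_RE.finditer(top_output)}


if __name__ == "__main__":
    sample = (
        "%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id,100.0 wa, 0.0 hi, 0.0 si, 0.0 st\n"
        "%Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 0.0 id, 99.7 wa, 0.0 hi, 0.0 si, 0.0 st\n"
    )
    waits = iowait_by_cpu(sample)
    threshold = 90.0  # assumed alert level, not a recommended value
    if waits and all(w >= threshold for w in waits.values()):
        print("all CPUs stuck in iowait:", waits)
```

A load average that keeps climbing while every CPU sits near 100% wa, as in the output above, suggests processes piling up behind stalled block IO rather than real CPU load.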
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Update:

Sadly my production system froze again in the early afternoon today,
this time with the older kernel (4.9.30-2+deb9u5), so the downgrade is
not a workaround after all.

Paradoxically nothing showed up on the xl console (within a screen) at
dom0. No errors, nothing, the VM just stopped responding.

As I was monitoring the system, there were still two open shell
connections. Some basic stuff still worked, but as soon as I tried to
open a file, the shell became unresponsive. I tried a shutdown on the
other shell, but that didn't get very far.

Searching the net for that issue I found this post on the xen project
mailing list, which sounds similar:
https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
He got some traces out of it, but no answer on the mailing list.

Some information about my setup:

hardware:
  Xeon E5-2620 v4
  board Supermicro X10SRi-F
  32 GB ECC RAM
  two 10 TB server disks
  two I350 network adapters (onboard)

dom0:
  debian stretch (up to date), kernel 4.9.51-1,
  xen-hypervisor 4.8.1-1+deb9u3,
  the two network adapters as a bond in a bridge
  the disks: gpt, 4 partitions (1M, 256M esp, 256M md mirror with boot,
  rest as md mirror for lvm)

domu:
  memory: 8192, 2 vcpus
  uses a network interface on the bridge
  several (thin) lvm volumes as phys devices
  debian stretch (up to date)
  issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1

Some other domUs (wheezy, jessie and a Windows 7) seem to run fine.

Next I'll try some newer kernels for the domU, starting with the stretch
backport kernels.
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Package: linux-image-4.9.0-4-amd64
Version: 4.9.51-1
Severity: critical

As far as I can tell right now, the domU system simply freezes. The logs
simply end at some point and resume only with the messages from the next
reboot. Sometimes it's still possible to log on to the system, but
nothing really works. It is as if all IO to the virtual block devices
were suspended indefinitely. Until this happens, the system seems to
work without issues.

As the new kernel hasn't been out that long, I can't tell how often this
happens. The first time was the day before yesterday, and yesterday
afternoon it happened twice within two hours.

Something like 'ls' on a directory listed before still gets a result,
but everything 'new', e.g. 'vim somefile', will cause the shell to stall.

Sadly there is no visible error; services just fail to answer one by one
(maybe when they try to read/write something new to the disk, they then
simply wait for IO to happen).

For testing I installed the older kernel (the last
linux-image-4.9.0-3-amd64 from security, 4.9.30-2+deb9u5) and noticed
immediately that the system boots in a fraction of the time with the old
kernel compared to the new one. For the time being, I'm staying with
that on the production system.

To see if anything gets dumped on the console, I started one within a
screen on a test machine. Now I have to generate some activity and IO
and see if something happens there.

I haven't had the time to test the impact on the dom0 kernel yet; as far
as I have observed, the dom0 seems to be unaffected by the kernel update.