Processed: Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2018-01-06 Thread Debian Bug Tracking System
Processing control commands:

> reassign -1 xen-hypervisor-4.8-amd64
Bug #880554 [linux-image-4.9.0-4-amd64] xen domu freezes with kernel 
linux-image-4.9.0-4-amd64
Bug reassigned from package 'linux-image-4.9.0-4-amd64' to 
'xen-hypervisor-4.8-amd64'.
No longer marked as found in versions linux/4.9.51-1.
Ignoring request to alter fixed versions of bug #880554 to the same values 
previously set

-- 
880554: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2018-01-06 Thread Yves-Alexis Perez
control: reassign -1 xen-hypervisor-4.8-amd64

On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> > According to that link, the fix seems to be configuration rather than
> > code.
> > Does this mean this bug against the kernel should be closed?
> 
> Yes, the problem seems to be in the Xen hypervisor and not the Linux
> kernel itself.  The default value for the gnttab_max_frames parameter
> needs to be increased to avoid domU disk IO hangs, for example:
> 
>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
> 
> So either close the bug or reassign it to xen-hypervisor package so
> they can increase the default value for this parameter in the
> hypervisor code.
> 
Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable
update).

@Xen maintainers: see the complete bug log for more information, but basically
it seems that a domU freeze happens with the “new” multi-queue xen blk
driver, and the fix is to increase a configuration value. Valentin suggests
adding that to the default.

Regards,
-- 
Yves-Alexis



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2018-01-06 Thread Valentin Vidic
On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> According to that link, the fix seems to be configuration rather than code.
> Does this mean this bug against the kernel should be closed?

Yes, the problem seems to be in the Xen hypervisor and not the Linux
kernel itself.  The default value for the gnttab_max_frames parameter
needs to be increased to avoid domU disk IO hangs, for example:

  GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"

So either close the bug or reassign it to xen-hypervisor package so
they can increase the default value for this parameter in the
hypervisor code.
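
As a rough illustration of the workaround above, the option can be added to
an existing Xen command line without duplicating it. This is only a sketch:
the helper name set_xen_option is made up for this example, and it assumes
the command line is a plain space-separated list of key=value tokens as in
the GRUB_CMDLINE_XEN line above.

```python
def set_xen_option(cmdline: str, key: str, value: str) -> str:
    """Return cmdline with key=value set exactly once.

    Hypothetical helper for illustration; it only does string handling
    and never touches any GRUB file itself.
    """
    new_token = f"{key}={value}"
    result = []
    replaced = False
    for token in cmdline.split():
        if token == key or token.startswith(key + "="):
            # Replace the first occurrence and drop any later duplicates.
            if not replaced:
                result.append(new_token)
                replaced = True
        else:
            result.append(token)
    if not replaced:
        result.append(new_token)
    return " ".join(result)


print(set_xen_option("dom0_mem=10240M", "gnttab_max_frames", "256"))
```

The resulting string would go into GRUB_CMDLINE_XEN; after that the GRUB
configuration has to be regenerated (update-grub on Debian) and the host
rebooted, since this is a hypervisor boot parameter.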

-- 
Valentin



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2018-01-06 Thread Yves-Alexis Perez
On Fri, 2017-11-17 at 07:39 +0100, Valentin Vidic wrote:
> Hi,
> 
> The problem seems to be caused by the new multi-queue xen blk driver
> and I was advised by the Xen devs to increase the gnttab_max_frames=256
> parameter for the hypervisor.  This has solved the blocking issue
> for me and it has been running without problems for a few months now.

I'm not really fluent in Xen, but does this relate to the kernel in dom0 or
one of the domU then? 
> 
> I/O to LUNs hang / stall under high load when using xen-blkfront
> https://www.novell.com/support/kb/doc.php?id=7018590

According to that link, the fix seems to be configuration rather than code.
Does this mean this bug against the kernel should be closed?

Regards,
-- 
Yves-Alexis



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2017-11-16 Thread Valentin Vidic
Hi,

The problem seems to be caused by the new multi-queue xen blk driver
and I was advised by the Xen devs to increase the gnttab_max_frames=256
parameter for the hypervisor.  This has solved the blocking issue
for me and it has been running without problems for a few months now.

I/O to LUNs hang / stall under high load when using xen-blkfront
https://www.novell.com/support/kb/doc.php?id=7018590
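
To get a feeling for what gnttab_max_frames=256 means, here is a small
back-of-the-envelope calculation. It is a sketch under assumptions that are
not stated in this thread: 4 KiB pages, version-1 grant table entries of
8 bytes each, and 32 frames taken as the comparison value (assumed to be
the hypervisor default for this Xen version).

```python
PAGE_SIZE = 4096          # bytes per page (assumption: x86, 4 KiB pages)
GRANT_ENTRY_V1_SIZE = 8   # bytes per grant entry (assumption: v1 entries)


def grant_entries(frames: int) -> int:
    """Number of grant entries that fit into the given number of frames."""
    return frames * (PAGE_SIZE // GRANT_ENTRY_V1_SIZE)


# 32 is only a comparison value (assumed default), 256 is the value
# suggested in this thread.
for frames in (32, 256):
    print(f"{frames} frames -> {grant_entries(frames)} grant entries")
```

Under these assumptions the limit goes from 16384 to 131072 grant entries
per domain, which would explain why a multi-queue block frontend with
several disks runs out of grants much later.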

-- 
Valentin



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2017-11-14 Thread Martin von Wittich
We're having the same problem here. For some reason, only 2 domUs are 
affected (the dom0 has a total of 22 domUs, 14 of those are running 
Debian stretch, and 13 of those are running Linux 4.9.51-1).


The `xl console` output of the first domU (according to our monitoring, 
it has been hanging since yesterday 14:06):



[ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3746.780094]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3746.780228]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780304] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3746.780308]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612083] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3867.612091]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3867.612150]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612238] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3867.612242]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612287] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3867.612291]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444071] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3988.444080]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444154] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3988.444159]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.
[ 3988.444271]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.


The other domU had a similar error message before a coworker downgraded 
the kernel to 3.16 to get it working again:



INFO: task jbd2/xvda1-8:191 blocked for more than 120 seconds.
[  605.148107]   Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[  605.148111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
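
For anyone comparing logs from several domUs, the hung-task messages above
can be condensed with a few lines of Python. This is a minimal sketch; the
function name summarize_hung_tasks is invented for this example, and it
only assumes the "INFO: task NAME:PID blocked for more than N seconds"
format shown in the logs above.

```python
import re
from collections import Counter

# Matches the hung-task line; the task name itself may contain ':'
# (e.g. kworker/2:0), so the PID is taken from the last ':<digits>'.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than \d+ seconds"
)


def summarize_hung_tasks(log_text: str) -> Counter:
    """Count how often each (task name, pid) pair was reported as blocked."""
    counts = Counter()
    for match in HUNG_TASK_RE.finditer(log_text):
        counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts
```

Feeding it the saved console output, for example
summarize_hung_tasks(open("console.log").read()) with console.log being a
placeholder file name, shows at a glance whether it is always the same
tasks (jbd2, rsync, ...) that are stuck waiting for block IO.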


The first domU is a backup machine, it mainly uses rsync --link-dest to 
pull backups from other machines, and is therefore rather IO intensive. 
The other domU is a firewall/router and shouldn't be IO intensive at all.


--
Best regards
Martin v. Wittich

IServ GmbH
Bültenweg 73
38106 Braunschweig

Phone:     0531-2243666-0
Fax:       0531-2243666-9
Email:     i...@iserv.eu
Internet:  iserv.eu

VAT ID: DE265149425 | Amtsgericht Braunschweig | HRB 201822
Managing directors: Benjamin Heindl, Jörg Ludwig



Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2017-11-13 Thread Christian Schwamborn

Update:

First of all: Forget my observation about the 'system boot time'. I 
mixed something up: the dom0 boot time did increase, but this probably 
happened because the LVM thin activation is not handled properly 
during system boot.


One last thing I pulled from the domU with the original kernel (4.9.51-1) 
was this top output:


top - 20:41:03 up  6:18,  2 users,  load average: 17.03, 6.98, 2.62
Tasks: 231 total,   1 running, 230 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,  0.0 id,100.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni,  0.0 id, 99.7 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem :  8212616 total,  1907568 free,  1485276 used,  4819772 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  6558984 avail Mem

At this point, the system is more or less unusable; everything depending 
on IO is dead.
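
The 'wa' figure that top shows can also be derived directly from two
samples of a per-CPU line in /proc/stat, which is handy for monitoring.
The following is only a sketch: the function name iowait_percent is made
up here, and it assumes the usual field order of /proc/stat (user, nice,
system, idle, iowait, irq, softirq, steal).

```python
def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat cpu lines.

    Each argument is one complete line such as
    'cpu0 100 0 50 1000 200 0 0 0 0 0' (first token is the CPU label).
    """
    old = [int(x) for x in before.split()[1:]]
    new = [int(x) for x in after.split()[1:]]
    # Only the first eight counters are summed; guest time is already
    # accounted for in user/nice.
    deltas = [n - o for o, n in zip(old[:8], new[:8])]
    total = sum(deltas)
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total if total else 0.0
```

Reading the cpu0 line from /proc/stat twice, a few seconds apart, and
passing both lines to this function should give a value close to the
100.0 wa seen in the top output above while the domU is stalled.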


Currently my production domU has been running for over a week with the 
latest backports kernel (linux-image-4.13.0-0.bpo.1-amd64); the dom0 is 
still on the current stretch kernel (4.9.51-1), and it seems stable for now.

My guess would be some issue with the xen blkfront driver.
Around the end of last year I experienced something similar with jessie. 
After some kernel updates those issues got better. They are not 
completely gone: some jessie domUs need a reboot from time to time due 
to rising wa, but the system is still responsive then; it's just 
getting slower and slower by the minute.




Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2017-11-02 Thread Christian Schwamborn

Update:

Sadly, my production system froze again in the early afternoon today, 
with the older kernel as well (4.9.30-2+deb9u5), so that wasn't even a 
temporary workaround. Paradoxically, nothing showed up on the xl console 
(within a screen) at dom0. No errors, nothing, the VM just stopped 
responding. As I was monitoring the system, there were still two open 
shell connections. Some basic stuff still worked, but as soon as I tried 
to open a file, the shell became unresponsive. I tried a shutdown on the 
other shell, but that didn't get very far.


Searching the net for that issue I found this post at the xen project 
mailing list: 
https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html 
which sounds similar. He got some traces out of it, but no answer on the 
mailing list.


Some information about my setup:

hardware:
xeon E5-2620 v4
board supermicro X10SRi-F
32gb ecc ram
two 10tb server disks
two I350 network adapters (onboard)

dom0:
debian stretch (up to date), kernel 4.9.51-1, xen-hypervisor 
4.8.1-1+deb9u3,

the two network adapters as a bond in a bridge
the discs: gpt, 4 part (1M, 256M esp, 256M md mirror with boot, rest as 
md mirror for lvm)


domu:
memory: 8192, 2 vcpus
uses a network interface on the bridge
several (thin)lvm volumes as phys devices
debian stretch (up to date)
issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1

Some other domu's (wheezy, jessie and a windows 7) seem to run fine

Next I'll try some newer kernels for the domu, starting with the stretch 
backport kernels.




Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

2017-11-02 Thread Christian Schwamborn

Package: linux-image-4.9.0-4-amd64
Version: 4.9.51-1
Severity: critical

As far as I can tell right now, the domU system simply freezes. The logs 
simply end at some point until the new reboot stuff comes up. Sometimes 
it's still possible to log on to the system, but nothing really works. It 
is as if all IO to the virtual block devices is suspended indefinitely. 
Until this happens, the system seems to work without issues. As the new 
kernel hasn't been out that long, I can't tell how often this happens. The 
first time was the day before yesterday, and yesterday afternoon it 
happened twice within two hours.


Something like 'ls' on a directory listed before still gets a result, 
but everything 'new', e.g. 'vim somefile', will cause the shell to stall.
Sadly there is no visible error; services just fail to answer one by 
one (maybe when they try to read/write something new to the disk, then 
they simply wait for IO to happen).
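
One way to check the "waiting for IO" theory while a shell still responds
is to look for processes in uninterruptible sleep (state D). The sketch
below only parses the text of a /proc/<pid>/stat line; the function names
are invented for this example, and whether /proc can still be read on a
stalled domU is an assumption.

```python
from typing import Tuple


def parse_proc_stat(stat_line: str) -> Tuple[int, str, str]:
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so search for the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def is_uninterruptible(stat_line: str) -> bool:
    """True if the process is in state D (uninterruptible sleep)."""
    return parse_proc_stat(stat_line)[2] == "D"
```

Looping over /proc/[0-9]*/stat and printing every entry for which
is_uninterruptible() returns True would list the services that are
stuck, without having to touch the affected block devices.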


For testing I installed the older kernel (last linux-image-4.9.0-3-amd64 
from security - 4.9.30-2+deb9u5) and realized immediately that the 
system boot time with the old kernel is a fraction of that with the 
new one. For the time being, I'm staying with that on the production system.


To see if anything will be dumped on the console, I started one within a 
screen on a test machine. Now I have to generate some activity and IO 
and see if something happens there.


I haven't had the time to test the impact on the dom0 kernel yet; as far 
as I observed, the dom0 seems to be unaffected by the kernel update.