Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2020-11-22 Thread Hans van Kranenburg
Hi,

This bug was reported against Xen 4.8 (which is now out of support and
out of security support) and there has not been any activity for almost
two years.

I'm cleaning up old open bugs, and I will close the issue now.

If you found a solution to this problem, please let us know, so the
information can be added to the bug report for anyone else who might run
into the same situation.

If the problem still persists with Xen 4.11 in Debian stable, please
reply and reopen.

Thanks,
Hans



Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-02-25 Thread Roalt Zijlstra | webpower
Hi Hans,

We did have some crashes, but now more like once every 1 or 2 months. We
did have the noreboot option, but so far this did not show anything. We were
looking into doing syslog reporting, but I am not sure if we set that up on
all Xen servers. So maybe we should close the bug report, since with the
latest kernel updates things are a bit better.

There is one thing to mention, which might help others on this matter. We
are now running pretty stable only with:

   - the Dom0 running with kernel Debian 4.9.110-3+deb9u6 (2018-10-08) or
   Debian 4.9.130-2 (2018-10-27) with Xen 4.8.5-pre, and all DomU servers
   with a 4.9 kernel.
   - the Dom0 running with kernel Debian 4.9.65-3+deb9u1 (2017-12-23) with
   Xen 4.8.5-pre. This setup runs fine with mixed 3.16 and 4.9 kernels, but
   without spectre/meltdown fixes.

Other combinations, like running DomUs with 3.16 kernels on the
4.9.110-3+deb9u6 (2018-10-08) Dom0 kernel, are bound to crash quickly on
heavily used servers, be it Nginx SSL offloading or an active MySQL database.
We have not tested the latest Debian kernels with 3.16 kernels again.

Greetings,

Roalt


On Fri 22 Feb 2019 at 19:24, Hans van Kranenburg wrote:

> tags 912975 + moreinfo
> thanks
>
> Hi,
>
> On 11/7/18 3:43 PM, Roalt Zijlstra | webpower wrote:
> >
> > On Wed 7 Nov 2018 at 14:30, Hans van Kranenburg wrote:
> > >
> > > Well hopefully the 'noreboot' provided server crashes soon for some
> > > logs. I will check if we can do any serial console tricks.
> >
> > Oh and before I forget.. Thanks for all the feedback/help!
>
> Do you have any update here? Any debug logging?
>
> The current state of this bug does not really allow anyone other than
> yourself to cause it to progress.
>
> Hans
>


Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-02-22 Thread Hans van Kranenburg
Hi Alexander,

On 2/22/19 9:41 PM, Alexander Dahl wrote:
> 
> On Fri, Feb 22, 2019 at 07:24:11PM +0100, Hans van Kranenburg wrote:
>> The current state of this bug does not really allow anyone other than
>> yourself to cause it to progress.
> 
> FWIW, I also have problems with Xen and stretch on amd64. Since
> upgrading from jessie I get random crashes, which means the system
> hangs and I only can do a hard powercycle.

Ok. I'm gonna sound a bit strict/rigorous/stern here (I don't know which
of those is the right one, not a native speaker).

Please don't use an existing bug for an "I'm also having similar
problems" report. It might seem helpful to group similar symptoms
together, but often there seem to be different causes in the end, and
people from the package maintainer team will get confused about what the
real issue was and if it's fixed now or not, and you might end up with a
closed bug report while your issue was not dealt with, or the original
reporter might end up with a closed bug because your me-too problem was
fixed.

From your problem description, it seems you're experiencing dom0 or even
xen hypervisor problems, not domU (virtual machine) problems. Is that right?

> I'm currently reorganizing
> the partitions to get enough space for a debug kernel on the rootfs,
> otherwise the stacktraces are probably not of big help?

As long as you don't share any stack trace at all, they won't be of any
help no. :) You might be experiencing a known problem, or a problem that
we know already has been fixed upstream in later Xen or Linux, or a new
problem.

> (I would have upgraded to buster already, but pygrub is broken there,
> so maybe we get stretch fixed until then.)

The pygrub fixes are in the upload to unstable that was done today.

https://tracker.debian.org/news/1031793/accepted-xen-411126-g87f51bf366-2-all-amd64-source-into-unstable/

Hans

Off-the-record: 2018 was not a good year for the Linux kernel in
general, also thanks to all the spectre/meltdown things happening. My
own experience is that the Linux 4.9 LTS kernel is quite unusable (as
dom0 and domU) with Xen, and I jumped over it, towards Xen 4.11 and
Linux 4.19, which is a great success so far.

So, with my personal (and $dayjob) hat on, I can recommend leaving the
current situation behind and at least please run the Linux 4.19 kernel
from stretch-backports instead.

With my Debian Xen package maintainer hat on... Yes, I'd like to help
you, but please create an additional bug after you have been able to
collect more logs and stacktraces and explosions happening etc.

Thanks.



Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-02-22 Thread Alexander Dahl
Hei hei,

On Fri, Feb 22, 2019 at 07:24:11PM +0100, Hans van Kranenburg wrote:
> The current state of this bug does not really allow anyone other than
> yourself to cause it to progress.

FWIW, I also have problems with Xen and stretch on amd64. Since
upgrading from jessie I get random crashes, which means the system
hangs and I can only do a hard powercycle. I'm currently reorganizing
the partitions to get enough space for a debug kernel on the rootfs,
otherwise the stacktraces are probably not of big help?

(I would have upgraded to buster already, but pygrub is broken there,
so maybe we get stretch fixed until then.)

Greets
Alex

-- 
/"\ ASCII RIBBON | »With the first link, the chain is forged. The first
\ / CAMPAIGN | speech censured, the first thought forbidden, the
 X  AGAINST  | first freedom denied, chains us all irrevocably.«
/ \ HTML MAIL| (Jean-Luc Picard, quoting Judge Aaron Satie)




Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-02-22 Thread Hans van Kranenburg
tags 912975 + moreinfo
thanks

Hi,

On 11/7/18 3:43 PM, Roalt Zijlstra | webpower wrote:
> 
> On Wed 7 Nov 2018 at 14:30, Hans van Kranenburg wrote:
> >
> > Well hopefully the 'noreboot' provided server crashes soon for some
> > logs. I will check if we can do any serial console tricks.
> 
> Oh and before I forget.. Thanks for all the feedback/help!

Do you have any update here? Any debug logging?

The current state of this bug does not really allow anyone other than
yourself to cause it to progress.

Hans



Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-01-03 Thread Hans van Kranenburg
On 1/3/19 11:02 AM, Patrick Beckmann wrote:
> 
> this bug description sounds a lot like a problem we have with two Xen
> Dom0s, so I am replying here.

The original reporter for this bug didn't show errors yet to compare,
but at least it's a "something crashes" scenario. ;]

> One of our machines has been running stable on Debian 8 and was newly
> upgraded to Debian 9, another one is new hardware with a fresh
> installation. With the most recent Debian 9 they crash at a rate from
> every 3 days to 3 times a day, suspected to be depending on load.
> Versions are
> - Xen hypervisor: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> - Linux Kernel: 4.9.130-2
> 
> On Tue, 6 Nov 2018 18:54:53 +0100 Hans van Kranenburg wrote:
>> Are you able to configure and capture output from serial console?
> 
> We have been able to capture the output of our new machine crashing.
> Please find it attached to this e-mail. Unfortunately it lacks the lines
> during boot time. If you need them or any other information, please let
> me know.

That's the same error as discussed in this thread, and it looks like
it's not narrowed down to something reproducible yet.

https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00938.html

I don't think the Debian packaging people can be of great help here by
sitting in between upstream and you. This is an upstream bug, and I've
never encountered that one myself, nor do I know how to cause it to help
debugging.

Maybe you can join in on that discussion on the Xen mailing list to
provide more info about your situation?

Hans



Bug#912975: [Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2019-01-03 Thread Patrick Beckmann
Hi,

this bug description sounds a lot like a problem we have with two Xen
Dom0s, so I am replying here.

One of our machines had been running stable on Debian 8 and was recently
upgraded to Debian 9; the other is new hardware with a fresh
installation. With the most recent Debian 9 they crash at a rate ranging
from once every 3 days to 3 times a day, which we suspect depends on load.
Versions are
- Xen hypervisor: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
- Linux Kernel:   4.9.130-2

On Tue, 6 Nov 2018 18:54:53 +0100 Hans van Kranenburg wrote:
> Are you able to configure and capture output from serial console?

We have been able to capture the output of our new machine crashing.
Please find it attached to this e-mail. Unfortunately it lacks the lines
during boot time. If you need them or any other information, please let
me know.

> Can you confirm that this is the only change that you made between the
> before/after scenario? I mean, if you downgrade the packages, or you
> drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable again?

We would try this next with Xen version
4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.

Best Regards,
Patrick Beckmann
[SOL Session operational.  Use ~? for help]
[   99.992731] xen-blkback: backend/vbd/19/51712: prepare for reconnect
[  101.634684] xen-blkback: backend/vbd/20/51712: prepare for reconnect
[  103.653671] xen-blkback: backend/vbd/19/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
[  103.827314] vif vif-19-0 vif19.0: Guest Rx ready
[  103.827427] IPv6: ADDRCONF(NETDEV_CHANGE): vif19.0: link becomes ready
[  103.827534] br02: port 15(vif19.0) entered blocking state
[  103.827541] br02: port 15(vif19.0) entered forwarding state
[  104.476998] xen-blkback: backend/vbd/20/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
[  104.660889] vif vif-20-0 vif20.0: Guest Rx ready
[  104.661018] IPv6: ADDRCONF(NETDEV_CHANGE): vif20.0: link becomes ready
[  104.661168] br026: port 2(vif20.0) entered blocking state
[  104.661184] br026: port 2(vif20.0) entered forwarding state
(XEN) d8 L1TF-vulnerable L1e 01a23320 - Shadowing
(XEN) d8 L1TF-vulnerable L1e 01a23320 - Shadowing
(XEN) d8 L1TF-vulnerable L1e 01a23320 - Shadowing
(XEN) d11 L1TF-vulnerable L1e 020c3320 - Shadowing
(XEN) d13 L1TF-vulnerable L1e 01a3b320 - Shadowing
(XEN) d15 L1TF-vulnerable L1e 01a23320 - Shadowing

Debian GNU/Linux 9 caribou hvc0

caribou login: 

Debian GNU/Linux 9 caribou hvc0

caribou login: [ 4676.600094] br02: port 14(vif17.0) entered disabled state
[ 4676.744064] br02: port 14(vif17.0) entered disabled state
[ 4676.745573] device vif17.0 left promiscuous mode
[ 4676.745618] br02: port 14(vif17.0) entered disabled state
[ 4683.146619] br02: port 14(vif21.0) entered blocking state
[ 4683.146678] br02: port 14(vif21.0) entered disabled state
[ 4683.146921] device vif21.0 entered promiscuous mode
[ 4683.153997] IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
[ 4683.639331] xen-blkback: backend/vbd/21/51712: using 1 queues, protocol 1 (x86_64-abi)
[ 4684.544484] xen-blkback: backend/vbd/21/51712: prepare for reconnect
[ 4684.938636] xen-blkback: backend/vbd/21/51712: using 1 queues, protocol 1 (x86_64-abi)
[ 4692.235692] xen-blkback: backend/vbd/21/51712: prepare for reconnect
[ 4694.917436] vif vif-21-0 vif21.0: Guest Rx ready
[ 4694.917800] IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
[ 4694.917918] br02: port 14(vif21.0) entered blocking state
[ 4694.917926] br02: port 14(vif21.0) entered forwarding state
[ 4694.921344] xen-blkback: backend/vbd/21/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants

Debian GNU/Linux 9 caribou hvc0

caribou login: (XEN) [ Xen-4.8.5-pre  x86_64  debug=n   Not tainted ]
(XEN) CPU:32
(XEN) RIP:e008:[] guest_4.o#sh_page_fault__guest_4+0x75d/0x1e30
(XEN) RFLAGS: 00010202   CONTEXT: hypervisor (d8v0)
(XEN) rax: 7fb5797e6580   rbx: 8310f4372000   rcx: 81c0e060
(XEN) rdx:    rsi: 8310f4372000   rdi: 0001fed5
(XEN) rbp: 8310f4372000   rsp: 8340250e7c78   r8:  0001fed5
(XEN) r9:     r10:    r11: 
(XEN) r12: 81c0e06ff6a8   r13: 0407fad6   r14: 830078da7000
(XEN) r15: 8340250e7ef8   cr0: 80050033   cr4: 00372660
(XEN) cr3: 00407ec02001   cr2: 81c0e06ff6a8
(XEN) fsb: 7fb58fc26700   gsb:    gss: 8801fea0
(XEN) ds:    es:    fs:    gs:    ss:    cs: e008
(XEN) Xen code around  (guest_4.o#sh_page_fault__guest_4+0x75d/0x1e30):
(XEN)  ff ff 03 00 4e 8d 24 c1 <49> 8b 0c 24 f6 c1 01 0f 84 b6 06 00 00 48 c1 e1
(XEN) Xen stack trace from rsp=8340250e7c78:
(XEN) 7fb5797e6580 027372df 82d080323600 8310f4372648
(XEN) 8310f43726a8 027372df 8340250e7d50 

Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2018-11-07 Thread Roalt Zijlstra | webpower
On Wed 7 Nov 2018 at 14:30, Hans van Kranenburg wrote:

> Hi,
>
> On 11/7/18 12:48 PM, Roalt Zijlstra | webpower wrote:
> >
> > On Tue 6 Nov 2018 at 18:54, Hans van Kranenburg wrote:
> >
> > Hi,
> >
> > On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> > > Package: src:xen
> > > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > > Severity: important
> > >
> > > Updating Xen to the latest 4.8 version from the security repo
> > > makes servers unstable.
> >
> > Can you confirm that this is the only change that you made between
> > the before/after scenario? I mean, if you downgrade the packages, or
> > you drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's
> > stable again?
> >
> >
> > We have several servers running the previous versions and those are
> > still stable. The servers that we upgraded using 'apt-get update;
> > apt-get upgrade'  were rock solid before the upgrade.
>
> Yes, that's why I was asking. Did that apt-get upgrade also upgrade your
> dom0 kernel? You can look back in /var/log/dpkg.log* about what
> happened. This is very relevant information.
>

Two servers were installed on 2018-04-24 and then upgraded:
2018-10-08 19:40:57 upgrade xen-hypervisor-4.8-amd64:amd64 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-10-08 19:41:14 status installed xen-hypervisor-4.8-amd64:amd64 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-07-31 18:50:01 upgrade xen-hypervisor-4.8-amd64:amd64 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
2018-07-31 18:50:45 status installed xen-hypervisor-4.8-amd64:amd64 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
2018-04-24 16:22:56 install xen-hypervisor-4.8-amd64:amd64 <none> 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5
2018-04-24 16:23:05 status installed xen-hypervisor-4.8-amd64:amd64 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5

The two other servers ran CentOS first and were converted to Debian for
other reasons, and so are fresh installs:
2018-09-26 22:01:34 install xen-hypervisor-4.8-amd64:amd64 <none> 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-09-26 22:01:57 status installed xen-hypervisor-4.8-amd64:amd64 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10


>
> > I did prepare a downgrade script if needed, but atm. the crash interval
> > in days seems to be higher then before. We did have servers crashing
> > every 2 days or even one crashing twice a day.
>
> > > The servers randomly reset without any logs.
> >
> > Do you have the noreboot option set on the Xen hypervisor command
> line?
> >
> >
> > For now one busy servers runs an older 4.9.0-4-amd64 kernel with a 3.16
> > kernel DomU with MySQL server on it. The second busy server runs all
> > domUs with 4.9 (backport) kernels on the lastest 4.9.0-8-amd64 kernel
> > for the Dom0. Currently we are awaiting any crash.
>
> In Debian, 4.9.0-8-amd64 is in the name of the package, but the real
> kernel version is in the version of that package.
>
> So, if you have linux-image-4.9.0-8-amd64, you should always also
> mention the real version, which is now e.g. 4.9.110-3+deb9u6. This means
> it's based on 4.9.110 upstream.
>
> The kernel team follows the 4.9 LTS releases, but only if the changes
> have to break the ABI (so custom modules have to be rebuilt), they up
> the number in the package name to trigger that process.
>

Right, I completely missed that detail.
Two heavily used servers run kernels:
4.9.65-3+deb9u1  with one Jessie DomU kernel: 3.16.57-2
4.9.110-3+deb9u6 with a few Jessie DomU kernels: 4.9.110-3+deb9u5~deb8u1
Two less used servers run:
4.9.110-3+deb9u5 with one Jessie DomU kernel: 3.16.59-1
4.9.110-3+deb9u5 with a few mixed Jessie DomU kernels: 3.16.59-1 and 3.16.57-2


>
> > The last mentioned server was rebooted with the noreboot option, so we
> > could eventually check the console for errors once it crashes.
> > The remain two servers are our fall-back servers and are not that busy.
> > We have seen them crash too, but we noticed that the less busy servers
> > did not crash that often. But once they were busy they crashed as
> > quickly as the master servers.
>
> Ok, that's interesting extra data.
>
> > Are you able to configure and capture output from serial console?
> >
> >
> > Oh wow..  Using old technology for debugging :-) I will need to see how
> > that configuration is done. We could connect up physical serial cables
> > between different machines.
>
> Well... old... It's the best way to capture text after everything
> crashes. On a vga display it scrolls away and you can't copy paste.
>
> If you're using recent Dell hardware, then I guess your drac provides an
> extra emulated serial console. I use HP hardware, there it's the ilo
> virtual serial port.
>

I will look into this. I have never used it before, as most crashes so far
did log errors before things stopped working.


>
> > First interesting thing to know is if it's the Dom0 that crashes, or
> if
> > 

Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2018-11-07 Thread Hans van Kranenburg
Hi,

On 11/7/18 12:48 PM, Roalt Zijlstra | webpower wrote:
> 
> On Tue 6 Nov 2018 at 18:54, Hans van Kranenburg wrote:
> 
> Hi,
> 
> On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> > Package: src:xen
> > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > Severity: important
> >
> > Updating Xen to the latest 4.8 version from the security repo
> makes servers unstable.
> 
> Can you confirm that this is the only change that you made between the
> before/after scenario? I mean, if you downgrade the packages, or you
> drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable
> again?
> 
> 
> We have several servers running the previous versions and those are
> still stable. The servers that we upgraded using 'apt-get update;
> apt-get upgrade'  were rock solid before the upgrade.

Yes, that's why I was asking. Did that apt-get upgrade also upgrade your
dom0 kernel? You can look back in /var/log/dpkg.log* about what
happened. This is very relevant information.
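
As a rough illustration of that dpkg.log check, here is a minimal sketch. The function name dpkg_transitions is made up for this example. It assumes the usual dpkg.log line layout (date, time, action, package, old version, new version), where a fresh install records <none> as the old version.

```shell
#!/bin/sh
# Hypothetical helper: read dpkg.log lines on stdin and print only the
# install/upgrade records for the Xen hypervisor and dom0 kernel packages.
# Assumed line layout: DATE TIME ACTION PACKAGE:ARCH OLD-VERSION NEW-VERSION
dpkg_transitions() {
    awk '($3 == "install" || $3 == "upgrade") &&
         $4 ~ /^(xen-hypervisor|linux-image)-/ {
        print $1, $2, $4, $5, "->", $6
    }'
}

# Typical use on the dom0 (zcat -f also passes uncompressed files through):
#   zcat -f /var/log/dpkg.log* | dpkg_transitions | sort
```

If a linux-image line shows up with the same date as the xen-hypervisor upgrade, then the hypervisor was not the only thing that changed.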

> I did prepare a downgrade script if needed, but atm. the crash interval
> in days seems to be higher then before. We did have servers crashing
> every 2 days or even one crashing twice a day.

> > The servers randomly reset without any logs.
> 
> Do you have the noreboot option set on the Xen hypervisor command line?
> 
>  
> For now one busy servers runs an older 4.9.0-4-amd64 kernel with a 3.16
> kernel DomU with MySQL server on it. The second busy server runs all
> domUs with 4.9 (backport) kernels on the lastest 4.9.0-8-amd64 kernel
> for the Dom0. Currently we are awaiting any crash. 

In Debian, 4.9.0-8-amd64 is in the name of the package, but the real
kernel version is in the version of that package.

So, if you have linux-image-4.9.0-8-amd64, you should always also
mention the real version, which is now e.g. 4.9.110-3+deb9u6. This means
it's based on 4.9.110 upstream.

The kernel team follows the 4.9 LTS releases, but only if the changes
have to break the ABI (so custom modules have to be rebuilt), they up
the number in the package name to trigger that process.
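
To make the distinction concrete, here is a small sketch that splits a package name and package version into the three pieces described above. The function name kernel_versions is hypothetical, and it assumes the upstream base is everything before the first '-' in the package version.

```shell
#!/bin/sh
# Hypothetical helper: given a kernel package name and its package version,
# print the ABI name, the full Debian version and the upstream base version.
#   $1 = package name,    e.g. linux-image-4.9.0-8-amd64
#   $2 = package version, e.g. 4.9.110-3+deb9u6
kernel_versions() {
    abi=${1#linux-image-}     # strip the package name prefix
    upstream=${2%%-*}         # drop the Debian revision after the first '-'
    printf 'abi=%s version=%s upstream=%s\n' "$abi" "$2" "$upstream"
}

# On a running Debian dom0 the two inputs can be looked up with:
#   dpkg-query -W -f='${Package} ${Version}\n' "linux-image-$(uname -r)"
```

For the example above this prints abi=4.9.0-8-amd64 version=4.9.110-3+deb9u6 upstream=4.9.110, and the middle value is the one worth quoting in a bug report.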

> The last mentioned server was rebooted with the noreboot option, so we
> could eventually check the console for errors once it crashes. 
> The remain two servers are our fall-back servers and are not that busy.
> We have seen them crash too, but we noticed that the less busy servers
> did not crash that often. But once they were busy they crashed as
> quickly as the master servers.

Ok, that's interesting extra data.

> Are you able to configure and capture output from serial console?
> 
>  
> Oh wow..  Using old technology for debugging :-) I will need to see how
> that configuration is done. We could connect up physical serial cables
> between different machines.

Well... old... It's the best way to capture text after everything
crashes. On a vga display it scrolls away and you can't copy paste.

If you're using recent Dell hardware, then I guess your drac provides an
extra emulated serial console. I use HP hardware, there it's the ilo
virtual serial port.
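
For reference, a sketch of what the serial console settings could look like in /etc/default/grub.d/xen.cfg. The existing options are taken from the reporter's configuration. The port (com1) and the speed (115200,8n1) are assumptions that depend on which UART the hardware or the BMC exposes, so treat them as placeholders.

```shell
# Sketch only: the serial port and speed below are assumptions, adjust
# com1/com2 and the baud rate to whatever the drac/ilo serial port uses.
# Xen: keep full logging and noreboot, and send the console to com1 and vga.
GRUB_CMDLINE_XEN="dom0_mem=3072M,max:4096M loglvl=all guest_loglvl=all dom0_max_vcpus=4 dom0_vcpus_pin noreboot com1=115200,8n1 console=com1,vga"

# dom0 kernel: log to the Xen console (hvc0), which Xen forwards to com1.
# Note that this replaces GRUB_CMDLINE_LINUX_DEFAULT for the Xen menu entries.
GRUB_CMDLINE_LINUX_XEN_REPLACE_DEFAULT="console=hvc0"

# Afterwards, regenerate the boot configuration with: update-grub
```

With noreboot set, the machine stays halted after a crash, so whatever is attached to the serial port keeps the last messages instead of losing them to the reset.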

> First interesting thing to know is if it's the Dom0 that crashes, or if
> it's the hypervisor itself, and the logging will tell you that.
> 
> > We have serveral Debian Stretch servers running Xen 4.8 and only
> the ones updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > version tend to crash ranging from 'twice a day' to 'once every
> two weeks'. We have already ruled out if hardware was an
> > issue, since we have 4 individual servers which are different in
> hardware setup and also were bought at different times.
> > And these servers ran stable with the previsous version
> 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
> > These servers are acting exactly the same. Every thing works as it
> should, but without any logs it crashes and resets at
> > a certain point.
> >
> > It looks like it could have something to do with DomUs running
> older (3.16) Linux kernels. As a test we applied 4.9 kernels to
> > all Jessie DomU servers and so far it runs for 13 days (but this
> server did crash twice on a day).
> > We have seen this behaviour with Xen on CentOS6 and 7 too, but the
> trouble seems to be fixed after some more updates.
> 
> It can be frustrating that there's not much response on the mailing
> lists. But, these kinds of problems can be really hard to debug and
> solve. Unless there's a clear reproduction scenario and debug output,
> there's often noone who can help you remotely.
> 
>  
> Well we have been having the issues since february this year with
> unstable Xen servers crashing once in a months or so. The first issues
> were on fresh Cent OS 7 servers, but then we also got them with updated
> Cent OS 6 servers. We then decided to use Debian Stretch and the first
> tests were pretty stable. We did install a new R740 with 

Bug#912975: [Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2018-11-07 Thread Roalt Zijlstra | webpower
Hi Hans,


On Tue 6 Nov 2018 at 18:54, Hans van Kranenburg wrote:

> Hi,
>
> On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> > Package: src:xen
> > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > Severity: important
> >
> > Updating Xen to the latest 4.8 version from the security repo makes
> servers unstable.
>
> Can you confirm that this is the only change that you made between the
> before/after scenario? I mean, if you downgrade the packages, or you
> drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable again?
>

We have several servers running the previous versions and those are still
stable. The servers that we upgraded using 'apt-get update; apt-get
upgrade' were rock solid before the upgrade.
I did prepare a downgrade script in case it is needed, but at the moment the
crash interval in days seems to be longer than before. We did have servers
crashing every 2 days, or even one crashing twice a day.


>
> > The servers randomly reset without any logs.
>
> Do you have the noreboot option set on the Xen hypervisor command line?
>
>
For now, one busy server runs an older 4.9.0-4-amd64 kernel with a 3.16
kernel DomU with MySQL server on it. The second busy server runs all domUs
with 4.9 (backport) kernels on the latest 4.9.0-8-amd64 kernel for the
Dom0. Currently we are awaiting any crash.

The last mentioned server was rebooted with the noreboot option, so we
could eventually check the console for errors once it crashes. The remaining
two servers are our fall-back servers and are not that busy. We have seen
them crash too, but we noticed that the less busy servers did not crash
that often. But once they were busy they crashed as quickly as the master
servers.


> Are you able to configure and capture output from serial console?
>

Oh wow..  Using old technology for debugging :-) I will need to see how
that configuration is done. We could connect up physical serial cables
between different machines.


> First interesting thing to know is if it's the Dom0 that crashes, or if
> it's the hypervisor itself, and the logging will tell you that.
>
> > We have serveral Debian Stretch servers running Xen 4.8 and only the
> ones updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > version tend to crash ranging from 'twice a day' to 'once every two
> weeks'. We have already ruled out if hardware was an
> > issue, since we have 4 individual servers which are different in
> hardware setup and also were bought at different times.
> > And these servers ran stable with the previsous version
> 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
> > These servers are acting exactly the same. Every thing works as it
> should, but without any logs it crashes and resets at
> > a certain point.
> >
> > It looks like it could have something to do with DomUs running older
> (3.16) Linux kernels. As a test we applied 4.9 kernels to
> > all Jessie DomU servers and so far it runs for 13 days (but this server
> did crash twice on a day).
> > We have seen this behaviour with Xen on CentOS6 and 7 too, but the
> trouble seems to be fixed after some more updates.
>
> It can be frustrating that there's not much response on the mailing
> lists. But, these kinds of problems can be really hard to debug and
> solve. Unless there's a clear reproduction scenario and debug output,
> there's often noone who can help you remotely.
>

Well, we have been having these issues since February this year, with unstable
Xen servers crashing once a month or so. The first issues were on fresh
CentOS 7 servers, but then we also got them with updated CentOS 6
servers. We then decided to use Debian Stretch and the first tests were
pretty stable. We did install a new R740 with it (Xen 4.8.4-pre) and that
ran for 110 days pretty well.


> > As said.. I cannot provide logs since it simply resets without notice.
>
> It's still the best starting point...


Well, hopefully the server booted with 'noreboot' crashes soon and gives us
some logs. I will check if we can do any serial console tricks.

Roalt


Bug#912975: [Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2018-11-06 Thread Hans van Kranenburg
Hi,

On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> Package: src:xen
> Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> Severity: important
> 
> Updating Xen to the latest 4.8 version from the security repo makes servers 
> unstable.

Can you confirm that this is the only change that you made between the
before/after scenario? I mean, if you downgrade the packages, or you
drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable again?

> The servers randomly reset without any logs.

Do you have the noreboot option set on the Xen hypervisor command line?

Are you able to configure and capture output from serial console?

First interesting thing to know is if it's the Dom0 that crashes, or if
it's the hypervisor itself, and the logging will tell you that.
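
As a rough way to sort a captured console log by origin, here is a sketch based on two prefixes: hypervisor messages carry a "(XEN)" tag, and dom0 kernel messages carry a bracketed printk timestamp. The function name console_origin is made up for this example, and the split is only a heuristic, since Xen also prints a "(XEN)" line of its own when it notices that dom0 has crashed.

```shell
#!/bin/sh
# Hypothetical helper: count hypervisor and dom0 kernel lines in a console
# capture read on stdin. Heuristic only, the lines still need to be read.
console_origin() {
    awk '
        /\(XEN\)/              { xen++; next }  # hypervisor message
        /\[ *[0-9]+\.[0-9]+\]/ { dom0++ }       # dom0 kernel printk timestamp
        END { printf "xen=%d dom0=%d\n", xen, dom0 }
    '
}

# Typical use on a saved capture (the file name is a placeholder):
#   console_origin < serial-capture.log
#   grep -F '(XEN)' serial-capture.log | tail -n 40
```

A register dump on "(XEN)" lines with "CONTEXT: hypervisor" at the end of the capture points at the hypervisor itself, while a kernel oops on timestamped lines points at the dom0 kernel.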

> We have serveral Debian Stretch servers running Xen 4.8 and only the ones 
> updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> version tend to crash ranging from 'twice a day' to 'once every two weeks'. 
> We have already ruled out if hardware was an 
> issue, since we have 4 individual servers which are different in hardware 
> setup and also were bought at different times. 
> And these servers ran stable with the previsous version 
> 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
> These servers are acting exactly the same. Every thing works as it should, 
> but without any logs it crashes and resets at 
> a certain point.
> 
> It looks like it could have something to do with DomUs running older (3.16) 
> Linux kernels. As a test we applied 4.9 kernels to 
> all Jessie DomU servers and so far it runs for 13 days (but this server did 
> crash twice on a day). 
> We have seen this behaviour with Xen on CentOS6 and 7 too, but the trouble 
> seems to be fixed after some more updates.

It can be frustrating that there's not much response on the mailing
lists. But, these kinds of problems can be really hard to debug and
solve. Unless there's a clear reproduction scenario and debug output,
there's often no one who can help you remotely.

> As said.. I cannot provide logs since it simply resets without notice.

It's still the best starting point...

Thanks,
Hans



Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

2018-11-05 Thread Roalt Zijlstra
Package: src:xen
Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
Severity: important

Updating Xen to the latest 4.8 version from the security repo makes servers
unstable. The servers randomly reset without any logs.
We have several Debian Stretch servers running Xen 4.8, and only the ones
updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10 version tend to
crash, at rates ranging from 'twice a day' to 'once every two weeks'. We
have already ruled out hardware as the cause, since we have 4 individual
servers which differ in hardware setup and were also bought at different
times. And these servers ran stable with the previous version
4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
These servers all behave exactly the same: everything works as it should,
but at a certain point the server crashes and resets without any logs.

It looks like it could have something to do with DomUs running older (3.16)
Linux kernels. As a test we applied 4.9 kernels to all Jessie DomU servers
and so far it has run for 13 days (but this server did crash twice in one
day).
We have seen this behaviour with Xen on CentOS 6 and 7 too, but the trouble
seems to have been fixed after some more updates.

As said.. I cannot provide logs since it simply resets without notice.


-- System Information:
Debian Release: 9.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-8-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

xen-hypervisor-4.8-amd64 depends on no packages.

Versions of packages xen-hypervisor-4.8-amd64 recommends:
ii  xen-utils-4.8  4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10

xen-hypervisor-4.8-amd64 suggests no packages.

-- Configuration Files:
/etc/default/grub.d/xen.cfg changed:
echo "Including Xen overrides from /etc/default/grub.d/xen.cfg"
GRUB_CMDLINE_XEN="dom0_mem=3072M,max:4096M loglvl=all guest_loglvl=all dom0_max_vcpus=4 dom0_vcpus_pin noreboot"
if [ "$XEN_OVERRIDE_GRUB_DEFAULT" = "" ]; then
    echo "WARNING: GRUB_DEFAULT changed to boot into Xen by default!"
    echo " Edit /etc/default/grub.d/xen.cfg to avoid this warning."
    XEN_OVERRIDE_GRUB_DEFAULT=1
fi
if [ "$XEN_OVERRIDE_GRUB_DEFAULT" = "1" ]; then
    GRUB_DEFAULT="Debian GNU/Linux, with Xen hypervisor"
fi


-- no debconf information