Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-03, Fatih Acar
On 02/02/2017 16:40, George Dunlap wrote:
> 
> Sorry, I just want to clarify: Is it the case that before you tried to
> upgrade, you often migrated VMs between your Xen 4.1 hosts without
> problems?
> 
> In other words, is it *4.8* that's the problem, or is it *migration*
> that's the problem?
> 
>  -George
> 

We have been using live migration on Xen 4.1 for several years now
without encountering any problem like this.

Domains work fine after a complete restart on Xen 4.8, although we
haven't yet been able to see how 4.8-to-4.8 live migration behaves over
the long term, so the problem is most likely the 4.1-to-4.8 migration
itself.

-- 
Fatih ACAR
Gandi
fatih.a...@gandi.net



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, George Dunlap
On Thu, Feb 2, 2017 at 1:30 PM, Vincent Legout  wrote:
> On Thu, Feb 02, 2017 at 12:05:09PM +, Andrew Cooper wrote :
>> On 02/02/17 07:58, Vincent Legout wrote:
>> > Hello,
>> >
>> > We had some issues after live migrating several domU from xen 4.1 to xen
>> > 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
>> > several days after the migration. All the domU had more than 1 year of
>> > uptime, and for example one crashed several days after the migration
>> > during a high load period.
>> >
>> > All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
>> > have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
>> >
>> > We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
>> > R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
>> > have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
>> > for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
>> > R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
>> > a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
>> >
>> > I've attached the most relevant parts of the domU kernel logs we could
>> > get. It seems the crashes came from different components of the kernel,
>> > though most of them seem to be related to memory.
>> >
>> > Would anyone have any idea if that's something that could be fixed? Or
>> > is it just that migrating from 4.1 to 4.8 is not supported?
>>
>> Do the VMs migrate normally in production?
>
> Thanks for the comment. Yes, the migrations took place normally, at
> least we didn't see anything wrong then. These crashes happened randomly
> on a few VMs only, and at least a few hours after the migration.

Sorry, I just want to clarify: Is it the case that before you tried to
upgrade, you often migrated VMs between your Xen 4.1 hosts without
problems?

In other words, is it *4.8* that's the problem, or is it *migration*
that's the problem?

 -George



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, Andrew Cooper
On 02/02/17 13:30, Vincent Legout wrote:
> On Thu, Feb 02, 2017 at 12:05:09PM +, Andrew Cooper wrote :
>> On 02/02/17 07:58, Vincent Legout wrote:
>>> Hello,
>>>
>>> We had some issues after live migrating several domU from xen 4.1 to xen
>>> 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
>>> several days after the migration. All the domU had more than 1 year of
>>> uptime, and for example one crashed several days after the migration
>>> during a high load period.
>>>
>>> All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
>>> have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
>>>
>>> We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
>>> R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
>>> have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
>>> for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
>>> R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
>>> a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
>>>
>>> I've attached the most relevant parts of the domU kernel logs we could
>>> get. It seems the crashes came from different components of the kernel,
>>> though most of them seem to be related to memory.
>>>
>>> Would anyone have any idea if that's something that could be fixed? Or
>>> is it just that migrating from 4.1 to 4.8 is not supported?
>> Do the VMs migrate normally in production?
> Thanks for the comment. Yes, the migrations took place normally, at
> least we didn't see anything wrong then. These crashes happened randomly
> on a few VMs only, and at least a few hours after the migration.
>
>> This looks like a Linux kernel bug in the Xen suspend/resume paths, and
>> unlikely to be related to the version of the hypervisor in use.
> I agree about the Linux kernel bug. But I still think it should also be
> related to the migration because we never had anything like that without
> migration, on either xen 4.1 or 4.8.

Ok, so it does look like a change in behaviour between 4.1 and 4.8.

Have you observed any further crashes from migrations on 4.8 after the
upgrade?

I am not aware of any alterations to the hypervisor side of things which
would be relevant.

The toolstack, however, has changed quite a lot.

The first change is with the migration stream itself.  A migration like
that will be piped through tools/python/scripts/convert-legacy-stream to
convert from the old format to the new.  It is certainly possible that
there is a bug in that process, although we test it extensively in
XenServer and have never encountered a crash looking like this in any
vintage of PV guest (all the way back to the RHEL 4 days).  Also, the
fact that it didn't abort midway through is a good sign that nothing
unexpected was encountered during the conversion process.
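
(For anyone not familiar with that script: it sits between the legacy
stream produced by the 4.1 sender and the v2 stream that the new libxc
restore code consumes, translating the data record by record.  Purely
as an illustration of the general shape of such a converter -- the
record layouts below are made-up toy structures, not the real Xen
stream formats, and the real logic lives in
tools/python/scripts/convert-legacy-stream:)

    /* Toy sketch of a record-by-record stream converter.  The structures
     * here are hypothetical stand-ins, NOT the real legacy or v2 Xen
     * migration stream formats; see convert-legacy-stream for the real
     * thing. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct legacy_chunk_hdr {      /* hypothetical legacy chunk header */
        uint32_t marker;           /* chunk type */
        uint32_t length;           /* payload bytes that follow */
    };

    struct v2_record_hdr {         /* hypothetical new-format record header */
        uint32_t type;
        uint32_t length;           /* body is padded to 8 bytes */
    };

    static int convert_stream(FILE *in, FILE *out)
    {
        struct legacy_chunk_hdr c;
        struct v2_record_hdr r;
        static uint8_t buf[1u << 20];

        while (fread(&c, sizeof(c), 1, in) == 1) {
            if (c.length > sizeof(buf))
                return -1;                        /* malformed chunk */
            if (fread(buf, 1, c.length, in) != c.length)
                return -1;                        /* truncated stream */

            r.type   = c.marker;                  /* real mapping is per-type */
            r.length = c.length;
            fwrite(&r, sizeof(r), 1, out);
            fwrite(buf, 1, c.length, out);

            /* pad the body out to an 8-byte boundary */
            size_t pad = (8 - (c.length & 7)) & 7;
            fwrite("\0\0\0\0\0\0\0", 1, pad, out);
        }
        return ferror(in) ? -1 : 0;
    }

    int main(void)
    {
        return convert_stream(stdin, stdout) ? EXIT_FAILURE : EXIT_SUCCESS;
    }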

Another area which has changed is the semantics of how the toolstack
returns from the suspend call.  There have been various changes to
support fast resume, all of which revolve around modifying the return
value from the hypercall as observed by the guest.
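
Roughly, the guest-side pattern looks like the sketch below (heavily
simplified from the Linux PV suspend path; the two helpers at the
bottom are invented names for illustration, not real kernel
functions).  The kernel branches on the hypercall's return value, so a
change in what the guest observes there directly changes which resume
path it takes:

    /* Simplified sketch of the guest-side suspend/resume decision; not
     * verbatim kernel code.  xen_suspend_cancelled()/xen_full_resume()
     * are placeholder names for illustration. */
    static int xen_do_suspend(void)
    {
        int cancelled;

        /* SCHEDOP_shutdown(SHUTDOWN_suspend): conventionally returns 0 if
         * the domain really suspended and has now been resumed (possibly
         * on a different host), non-zero if the suspend was cancelled. */
        cancelled = HYPERVISOR_suspend(virt_to_gfn(xen_start_info));

        if (cancelled) {
            /* "Fast" / cancelled resume: we never left this host, so only
             * lightweight re-initialisation is needed. */
            xen_suspend_cancelled();
        } else {
            /* Full resume, possibly on a new host: reconnect event
             * channels, grant tables and frontends, re-read shared info. */
            xen_full_resume();
        }
        return 0;
    }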

~Andrew



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, Vincent Legout
On Thu, Feb 02, 2017 at 12:05:09PM +, Andrew Cooper wrote :
> On 02/02/17 07:58, Vincent Legout wrote:
> > Hello,
> >
> > We had some issues after live migrating several domU from xen 4.1 to xen
> > 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
> > several days after the migration. All the domU had more than 1 year of
> > uptime, and for example one crashed several days after the migration
> > during a high load period.
> >
> > All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
> > have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
> >
> > We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
> > R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
> > have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
> > for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
> > R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
> > a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
> >
> > I've attached the most relevant parts of the domU kernel logs we could
> > get. It seems the crashes came from different components of the kernel,
> > though most of them seem to be related to memory.
> >
> > Would anyone have any idea if that's something that could be fixed? Or
> > is it just that migrating from 4.1 to 4.8 is not supported?
> 
> Do the VMs migrate normally in production?

Thanks for the comment. Yes, the migrations took place normally, at
least we didn't see anything wrong then. These crashes happened randomly
on a few VMs only, and at least a few hours after the migration.

> This looks like a Linux kernel bug in the Xen suspend/resume paths, and
> unlikely to be related to the version of the hypervisor in use.

I agree about the Linux kernel bug. But I still think it should also be
related to the migration because we never had anything like that without
migration, on either xen 4.1 or 4.8.

Vincent



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, Vincent Legout
On Thu, Feb 02, 2017 at 11:22:02AM +, George Dunlap wrote :
> On Thu, Feb 2, 2017 at 7:58 AM, Vincent Legout  
> wrote:
> > Hello,
> >
> > We had some issues after live migrating several domU from xen 4.1 to xen
> > 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
> > several days after the migration. All the domU had more than 1 year of
> > uptime, and for example one crashed several days after the migration
> > during a high load period.
> >
> > All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
> > have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
> >
> > We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
> > R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
> > have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
> > for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
> > R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
> > a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
> >
> > I've attached the most relevant parts of the domU kernel logs we could
> > get. It seems the crashes came from different components of the kernel,
> > though most of them seem to be related to memory.
> >
> > Would anyone have any idea if that's something that could be fixed? Or
> > is it just that migrating from 4.1 to 4.8 is not supported?
> 
> Officially the open-source project only supports migrations between
> single releases -- i.e., 4.1 -> 4.2 or 4.7 -> 4.8.  This is mainly
> because we haven't had the bandwidth to test more than that.  I know
> XenServer tests and supports a broader window for migrations between
> their release versions; other downstreams (Oracle, SLES) may as well;
> but that probably won't help you in your current situation.

Thanks. Yes, we know it's not officially supported, but we wanted to see
if that was something we could consider.

Vincent



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, Andrew Cooper
On 02/02/17 07:58, Vincent Legout wrote:
> Hello,
>
> We had some issues after live migrating several domU from xen 4.1 to xen
> 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
> several days after the migration. All the domU had more than 1 year of
> uptime, and for example one crashed several days after the migration
> during a high load period.
>
> All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
> have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
>
> We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
> R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
> have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
> for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
> R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
> a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
>
> I've attached the most relevant parts of the domU kernel logs we could
> get. It seems the crashes came from different components of the kernel,
> though most of them seem to be related to memory.
>
> Would anyone have any idea if that's something that could be fixed? Or
> is it just that migrating from 4.1 to 4.8 is not supported?

Do the VMs migrate normally in production?

This looks like a Linux kernel bug in the Xen suspend/resume paths, and
unlikely to be related to the version of the hypervisor in use.

~Andrew



Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration

2017-02-02, George Dunlap
On Thu, Feb 2, 2017 at 7:58 AM, Vincent Legout  wrote:
> Hello,
>
> We had some issues after live migrating several domU from xen 4.1 to xen
> 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
> several days after the migration. All the domU had more than 1 year of
> uptime, and for example one crashed several days after the migration
> during a high load period.
>
> All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
> have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
>
> We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
> R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
> have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
> for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
> R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
> a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
>
> I've attached the most relevant parts of the domU kernel logs we could
> get. It seems the crashes came from different components of the kernel,
> though most of them seem to be related to memory.
>
> Would anyone have any idea if that's something that could be fixed? Or
> is it just that migrating from 4.1 to 4.8 is not supported?

Officially the open-source project only supports migrations between
single releases -- i.e., 4.1 -> 4.2 or 4.7 -> 4.8.  This is mainly
because we haven't had the bandwidth to test more than that.  I know
XenServer tests and supports a broader window for migrations between
their release versions; other downstreams (Oracle, SLES) may as well;
but that probably won't help you in your current situation.

 -George
