The version we are using is:
1.10.2-0ubuntu2~cloud0

The version that was not working for us is:
2.0.1+git20140120-0ubuntu2~cloud1

Network:
Intel Corporation I350 Gigabit Network Connection (igb module)
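
(For anyone wanting to check their own setup: the driver bound to an interface can be read straight out of sysfs. A quick Python snippet, where eth0 is just a placeholder interface name:

    import os
    iface = 'eth0'  # placeholder; substitute your real interface name
    drv_link = '/sys/class/net/%s/device/driver' % iface
    print(os.path.basename(os.readlink(drv_link)))  # prints e.g. 'igb'

)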

We were seeing the problem, strangely enough, at the application level, inside the VMs: Hadoop was reporting corrupted data on TCP connections. There were no other messages on the hypervisor or in the VM kernel. Hadoop makes lots of connections to lots of different VMs, moving lots (terabytes) of data as fast as possible. It was also non-deterministic: Hadoop would retry a transfer several times, sometimes succeeding, sometimes giving up. I tried some quick iperf tests, but they worked fine.
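
In hindsight, a plain iperf run is probably too forgiving a test: iperf measures throughput but does not verify payload integrity end to end, and with checksum offload on the NIC (and virtio inside the guests) the 16-bit TCP checksum may never be recomputed in software, so corruption inside the host datapath can reach the application silently. Below is a minimal sketch of the kind of checksummed bulk transfer that might reproduce what Hadoop was seeing; the script name, port, and sizes are made up, and you would run many copies in parallel between VMs to approximate Hadoop's load:

    import hashlib, os, socket, sys

    PORT = 5002        # arbitrary
    CHUNK = 1 << 20    # 1 MiB per send
    CHUNKS = 1024      # 1 GiB per run

    def serve():
        # Receiver: hash everything that arrives, then print the digest.
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(('', PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        h = hashlib.sha256()
        while True:
            data = conn.recv(65536)
            if not data:
                break
            h.update(data)
        print('received %s' % h.hexdigest())

    def send(host):
        # Sender: stream random data, hashing it as it goes out.
        conn = socket.create_connection((host, PORT))
        h = hashlib.sha256()
        for _ in range(CHUNKS):
            chunk = os.urandom(CHUNK)
            h.update(chunk)
            conn.sendall(chunk)
        conn.close()
        print('sent     %s' % h.hexdigest())

    if __name__ == '__main__':
        serve() if sys.argv[1] == 'serve' else send(sys.argv[1])

Run "python xfer_check.py serve" in one VM and "python xfer_check.py <server-ip>" in another; if the two digests ever disagree while TCP itself reports no errors, that is the silent corruption.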

Daniele

On 10/20/14 18:46, Manish Godara wrote:
> We had to do the same downgrade with openvswitch: the newest version corrupts packets in transit under heavy load, but we do not have time to investigate the issue further.

Daniele, what were the openvswitch versions before and after the upgrade? And which Ethernet drivers do you have? The corruption may be related to the drivers you have (the issue may be triggered by the way openvswitch flows are configured in Icehouse vs. Havana).

Thanks.

From: Daniele Venzano <daniele.venz...@eurecom.fr>
Organization: Eurecom
Date: Sunday, October 19, 2014 11:46 PM
To: "openstack-operators@lists.openstack.org <mailto:openstack-operators@lists.openstack.org>" <openstack-operators@lists.openstack.org <mailto:openstack-operators@lists.openstack.org>>
Subject: Re: [Openstack-operators] qemu 1.x to 2.0

We have the same setup (Icehouse on Ubuntu 12.04) and had similar issues. We downgraded qemu from 2.x to 1.x, since we cannot terminate all VMs for all users. We also hit non-resumable VMs in the middle of the 1.x series, and nothing was documented in the changelog. We had to do the same downgrade with openvswitch: the newest version corrupts packets in transit under heavy load, but we do not have time to investigate the issue further.

We plan to warn our users in time for the next major upgrade, to Juno, that all VMs will need to be terminated, probably during the Christmas holidays. I do not think they will be happy. Given all the problems we had upgrading Neutron from OVS to ML2, terminating all VMs is probably the best policy during an OpenStack upgrade anyway. Alternatively, you do lots of live migrations and upgrade qemu one compute host at a time (sketched below), but if something goes wrong you end up with an angry user and a stuck VM.
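
For the host-at-a-time route, something like the following python-novaclient sketch could drain a compute node before its qemu upgrade. The credentials, host name, and the assumption that the scheduler picks a target when no host is given all need verifying against your deployment; this is an outline, not a tested tool:

    import os
    from novaclient import client

    # Admin credentials from the usual OS_* environment variables.
    nova = client.Client('2',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # Every instance on the compute node we want to upgrade next.
    servers = nova.servers.list(search_opts={'host': 'compute-01',
                                             'all_tenants': 1})

    for server in servers:
        # Let the scheduler pick a destination; pass block_migration=True
        # if the deployment has no shared storage.
        server.live_migrate()
        print('migrating %s (%s)' % (server.name, server.id))

You would still want to poll each instance's status and stop at the first failure, which is exactly the stuck-VM risk mentioned above.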

It certainly is a big deal.

On 10/20/14 00:59, Joe Topjian wrote:
Hello,

We recently upgraded an OpenStack Grizzly environment to Icehouse (doing a quick stop-over at Havana). This environment is still running Ubuntu 12.04.

The Ubuntu 14.04 release notes <https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Ubuntu_Server> mention incompatibilities when moving from 12.04 to 14.04 and qemu 2.0. I didn't think this would apply to upgrades that stay on 12.04, but it indeed does.

We found that existing instances could not be live migrated (as per the release notes). Additionally, instances that were hard rebooted, which rebuilds the libvirt XML file, could no longer start either.

The exact error message we saw was:

"Length mismatch: vga.vram: 1000000 in != 800000"

I found a few bugs that look related, but I don't think any of them fully matches the issue I ran into:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1308756
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1291321
https://bugs.launchpad.net/nova/+bug/1312133

We ended up downgrading to the stock Ubuntu 12.04 qemu 1.0 packages and everything is working nicely.

I'm wondering if anyone else has run into this issue and how they dealt with it or plan to deal with it.

Also, I'm curious as to why exactly qemu 1.x and 2.0 are incompatible with each other. Is this just an Ubuntu issue, or is it inherent to qemu?

Unless I'm missing something, this seems like a big deal. If we continue to use Ubuntu's OpenStack packages, we're basically stuck at 12.04 and Icehouse unless we have all users snapshot their instances and re-launch in a new cloud.

Thanks,
Joe



_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
