We've resolved our issues by disabling KSM on the affected nodes. None of
the unaffected nodes had KSM enabled (due to a packaging bug
elsewhere). After disabling KSM, our problems gradually went away over ~3
days.
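
For anyone who wants to check whether KSM is active on a node, here is a minimal sketch. It assumes the standard Linux sysfs control file /sys/kernel/mm/ksm/run (0 = stopped, 1 = running, 2 = stopped with merged pages unmerged). The helper names ksm_state and disable_ksm are hypothetical, chosen for illustration, and this is not necessarily how KSM was disabled on the nodes described here.

```python
from pathlib import Path

# Standard sysfs control file for KSM on Linux.
KSM_RUN = Path("/sys/kernel/mm/ksm/run")

# Meaning of the values the kernel accepts in the run file.
KSM_STATES = {0: "stopped", 1: "running", 2: "stopped, pages unmerged"}


def ksm_state(run_file: Path = KSM_RUN) -> str:
    """Return a readable KSM state, or 'unavailable' if the file can't be read."""
    try:
        value = int(run_file.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return "unavailable"
    return KSM_STATES.get(value, f"unknown ({value})")


def disable_ksm(run_file: Path = KSM_RUN) -> None:
    """Write 2: stop ksmd and unmerge already-merged pages (needs root)."""
    run_file.write_text("2\n")


if __name__ == "__main__":
    print(f"KSM is {ksm_state()}")
```

Writing 2 rather than 0 is a deliberate choice in this sketch: 0 only stops the scanner and leaves already-merged pages shared, while 2 also breaks the existing merges.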
This means we're no longer affected by this issue (and given the other
We haven't been able to reproduce the issues under lab conditions, and
I'm not willing to use our production setup as a guinea pig anymore. These
issues have cost me too much credibility already.
We believe #1326367 is causing this, as we've bisected this issue to be
between 3.13.0-27.50 and
Note that my list of affected nodes also includes migrated VMs, so there
are some false positives (VMs that came from an affected node). The
affected VMs on node 1-8 all seem to be migrated from another node.
--
You received this bug notification because you are a member of qemu-
devel-ml, which
I'm not confident yet we're seeing the exact same problem, but it is
pretty close. We're running a somewhat wide range of hypervisor kernels;
these are our observations so far:
node-1-1 3.13.0-24-generic is affected for 0% of vms
node-1-3 3.13.0-24-generic is affected for 0% of vms
node-1-5