Thanks for the data so far, the guest does not look very "special" other than the ceph storage and that looks fine at a first glance.
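Just for reference, in case you want to re-capture the per-thread view below narrowed down to only the busy threads - the "CPU X/KVM" thread naming is an assumption about your QEMU/libvirt version, so adjust as needed:

  # list the threads of the QEMU process with their CPU usage;
  # the vCPU threads usually show up with names like "CPU 0/KVM"
  $ ps -L -o tid,comm,pcpu -p <qemu-pid>
  $ top -H -p <qemu-pid>        # alternative: live per-thread view

  # attach to one of the busy vCPU threads and time its syscalls
  $ sudo strace -tt -T -p <tid-of-one-vcpu-thread> -e trace=ioctl,futex,read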
It seems 4 of the 6 guest vCPU threads are what is in this 100% hog; none of the other helper threads seems busy. Of these vCPU threads we see that they are about 50% in the host kernel and 50% in the guest and not much more. I wonder what they are doing ...

We can see in the strace that the vCPUs never leave to userspace (which they'd do for heavy exits). Instead they seem to really just spin between kernel and guest, as seen in the CPU utilization:
  ioctl(34, KVM_RUN, 0 <unfinished ...>

Every now and then we can see most of the other threads show up on futex locks. Sometimes a few ceph/rbd related messages also show up, like:
  read(25, 0x7fe9f8006ec0, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.00001

Both could be a red herring or not - no message is clear enough to point that out yet.

So for the next step: the guest still seems to do something (even if it might spin in a bad loop) and regularly exits to the host kernel. Let's try to find where that is. Once you have a guest in that situation you might:
1. check which kind of host exits we see
2. check where the guest is at the moment

If the affected guest is the only one on that host, then for #1 you can e.g. run perf kvm stat like:
  $ sudo perf kvm stat --live

But since you know the PID of one of the vCPU threads that we are interested in, this would be better (also add -d 30 for some more reliability in the numbers):
  $ sudo perf kvm stat --live -d 30 --vcpu 0 --pid=<pid-of-one-vcpu-thread>

Let it run for a while and then report what exits you are seeing in your case. An example of an idle guest is below.

Maybe also worth a look are the KVM tracepoints; you can check them (globally) with:
  $ sudo perf stat -e 'kvm:*' sleep 30s

[2] has more general info on perf counters with KVM, e.g. how to get the kallsyms and modules files. For some of these actions (to get more details) you might want to get a dbgsym of the guest kernel; [1] has more about that.

For your #2 you could maybe run the following (a rough sketch for grabbing the guest kallsyms/modules files is at the end of this comment):
  # Record data to file (on Host)
  $ sudo perf kvm --host --guest --guestkallsyms=kallsyms --guestvmlinux=debug-kernel/usr/lib/debug/boot/vmlinux-5.0.0-13-generic --guestmodules=modules record

  # Host info
  $ sudo perf kvm --host report -i perf.data.kvm
  # Guest info
  $ sudo perf kvm --guest report -i perf.data.kvm

In general it would help to get cleaner results by isolating the host that the affected guest runs on, so that it runs only this guest and nothing else - not sure how doable that is in your case though :-/

Let's see what we get from here ...

[1]: https://wiki.ubuntu.com/Kernel/CrashdumpRecipe#Inspecting_the_crash_dump_using_crash
[2]: https://www.linux-kvm.org/page/Perf_events#Recording_events_for_a_guest

Example idle guest exits:

Analyze events for pid(s) 14470, VCPU 0:

             VM-EXIT    Samples  Samples%     Time%    Min Time      Max Time         Avg time

           MSR_WRITE       1585    75.55%     0.00%      0.00us        7.83us      0.85us ( +-  2.09% )
                 HLT        473    22.55%   100.00%      0.00us   100138.85us  58658.26us ( +-  2.88% )
            MSR_READ         30     1.43%     0.00%      0.00us        2.82us      0.78us ( +-  9.24% )
  EXTERNAL_INTERRUPT          7     0.33%     0.00%      0.00us        6.68us      1.61us ( +- 52.60% )
    PREEMPTION_TIMER          2     0.10%     0.00%      0.00us        0.95us      0.92us ( +-  3.65% )
   PENDING_INTERRUPT          1     0.05%     0.00%      0.00us        0.62us      0.62us ( +-  0.00% )

Total Samples:2098, Total events handled time:27746743.01us.
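For the guestkallsyms/guestmodules files that the perf kvm record line above expects, grabbing them from inside the affected guest should roughly look like this (see [2] for the general idea; how you copy the files out of the guest - ssh, scp, shared dir - is up to you):

  # inside the guest, as root (otherwise kptr_restrict may zero the addresses)
  $ cat /proc/kallsyms > kallsyms
  $ cat /proc/modules > modules
  # then copy both files into the directory on the host you run perf kvm from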
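PS: should perf kvm stat --live not work with the perf version on your hosts, a rough fallback using the raw KVM tracepoints would be something like the below - the exact field layout of the kvm:kvm_exit event can differ between kernels, so treat the grep as a sketch:

  # record ~10s worth of exits of one busy vCPU thread
  $ sudo perf record -e kvm:kvm_exit --pid=<pid-of-one-vcpu-thread> -- sleep 10
  # quick histogram of which exit reasons were hit
  $ sudo perf script | grep -o 'reason [A-Z_0-9]*' | sort | uniq -c | sort -rn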