We use QEMU 4.0.0 (about to switch to 5.0.0) to test our aarch64
images, running in Linux containers on x86_64 hosts alongside other workloads.

We've recently run into issues where it looks like an emulated CPU (out of
four) sometimes stops making progress for ten or more seconds, and we're
trying to characterize the problem. When this happens, the other emulated
CPUs run just fine, though sometimes two will stall out at the same time.

Any suggestions for how to tell if an emulated CPU stopped doing work?

Based on our experiments, the guest-visible clocks and cycle counters
continue to run when a qemu CPU thread is suspended, so it's hard to tell
whether the emulation paused, or if our code is spinning with interrupts
disabled (though evidence is mounting that that's not the case). We're
adding a bunch more instrumentation to our code, but maybe qemu has some
features that will help us out.

I tried to find a way to count the number of TBs executed by an emulated
core over time, but I didn't see a cheap way to do that with the plugin
APIs.

We could maybe turn on instruction tracing, but this problem happens pretty
rarely (<1%), we don't have a repro case yet, and we can't really afford
the cost of slowing down every test run. There's a decent chance that this
is caused by an overloaded host, but our host-side investigations haven't
turned up anything concrete either.
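
One host-side check that might separate "the vCPU thread was not scheduled" from "the vCPU thread was running but the guest made no progress" is to sample each QEMU thread's scheduler statistics. The following is a minimal sketch, not something we have run against our setup. It assumes a Linux host that exposes /proc/<pid>/task/<tid>/schedstat (on-CPU nanoseconds, runqueue-wait nanoseconds, timeslice count), and that vCPU threads can be recognized by name, for example if QEMU was started with thread naming enabled. The function names, the 0.5 threshold, and the one-second interval are made up for illustration.

```python
import os
import sys
import time


def parse_schedstat(text):
    """Parse a schedstat line: on-CPU ns, runqueue-wait ns, timeslice count."""
    run_ns, wait_ns, slices = (int(field) for field in text.split()[:3])
    return run_ns, wait_ns, slices


def read_threads(pid):
    """Return {tid: (thread_name, run_ns, wait_ns)} for every thread of pid."""
    threads = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/comm") as f:
                name = f.read().strip()
            with open(f"{task_dir}/{tid}/schedstat") as f:
                run_ns, wait_ns, _ = parse_schedstat(f.read())
        except OSError:
            continue  # thread exited between listdir() and open()
        threads[tid] = (name, run_ns, wait_ns)
    return threads


def classify(run_delta_ns, wait_delta_ns, interval_ns, threshold=0.5):
    """Label one sampling interval for one thread (threshold is arbitrary)."""
    if run_delta_ns / interval_ns >= threshold:
        return "running"   # thread was on a host CPU most of the interval
    if wait_delta_ns / interval_ns >= threshold:
        return "starved"   # runnable, but mostly waiting on a runqueue
    return "idle"          # mostly blocked or sleeping


def main(pid, interval_s=1.0):
    """Print one line per thread per interval until interrupted."""
    prev = read_threads(pid)
    while True:
        time.sleep(interval_s)
        cur = read_threads(pid)
        interval_ns = interval_s * 1e9
        for tid, (name, run_ns, wait_ns) in sorted(cur.items()):
            if tid not in prev:
                continue  # new thread, no baseline yet
            run_delta = run_ns - prev[tid][1]
            wait_delta = wait_ns - prev[tid][2]
            state = classify(run_delta, wait_delta, interval_ns)
            print(f"{time.time():.0f} tid={tid} name={name!r} "
                  f"run_ms={run_delta / 1e6:.1f} "
                  f"wait_ms={wait_delta / 1e6:.1f} {state}")
        prev = cur


if __name__ == "__main__":
    main(int(sys.argv[1]))
```

Run with the QEMU process ID as the only argument and line the timestamps up against the guest-side stall. If the stalled vCPU's thread shows "starved" during the stall, that points at the host; "running" would suggest the thread was busy (guest code spinning, or QEMU itself doing work); "idle" would suggest the thread was blocked on something.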

Any advice?

--dbort
