How to tell if an emulated aarch64 CPU has stopped doing work?

2020-05-14 Thread Dave Bort
We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our aarch64
images, running in linux containers on x86_64 alongside other workloads.

We've recently run into issues where it looks like an emulated CPU (out of
four) sometimes stops making progress for ten or more seconds, and we're
trying to characterize the problem. When this happens, the other emulated
CPUs run just fine, though sometimes two will stall out at the same time.

Any suggestions for how to tell if an emulated CPU stopped doing work?

Based on our experiments, the guest-visible clocks and cycle counters
continue to run when a qemu CPU thread is suspended, so it's hard to tell
whether the emulation paused, or if our code is spinning with interrupts
disabled (though evidence is mounting that that's not the case). We're
adding a bunch more instrumentation to our code, but maybe qemu has some
features that will help us out.
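For what it's worth, the core of the instrumentation we're adding looks
roughly like the sketch below (our own guest-side code, not a qemu API;
it assumes CNTVCT_EL0 is readable at the exception level our code runs
at). Each CPU periodically stamps a slot with the virtual counter, and
a monitor flags any CPU whose stamp stops advancing even though the
counter itself never pauses:

    #include <stdint.h>

    #define MAX_CPUS 4
    static volatile uint64_t heartbeat[MAX_CPUS];

    static inline uint64_t read_cntvct(void)
    {
        uint64_t t;
        __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(t));
        return t;
    }

    /* Called from each CPU's timer/idle path. */
    void touch_heartbeat(unsigned int cpu)
    {
        heartbeat[cpu] = read_cntvct();
    }

    /* Called from any other CPU (or via a host-side peek at guest
     * memory): a stalled CPU shows a steadily widening age. */
    uint64_t heartbeat_age(unsigned int cpu)
    {
        return read_cntvct() - heartbeat[cpu];
    }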

I tried to find a way to count the number of TBs executed by an emulated
core over time, but I didn't see a cheap way to do that with the plugin
APIs.
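The closest I got is the sketch below (written against the qemu-plugin.h
API as of 4.2/5.0; the per-TB-execution callback is not free, which is
why I wouldn't call it cheap, but it's far lighter than full tracing).
It counts executed TBs per vCPU and dumps the totals at exit:

    /* tbcount.c: build with
     *   gcc -shared -fPIC -o libtbcount.so tbcount.c
     * with QEMU's include/qemu/qemu-plugin.h on the include path,
     * then load it with -plugin ./libtbcount.so. */
    #include <inttypes.h>
    #include <stdio.h>
    #include <qemu-plugin.h>

    QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

    #define MAX_VCPUS 4   /* matches our four emulated cores */
    static uint64_t tb_count[MAX_VCPUS];

    /* Runs on every execution of an instrumented TB. */
    static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
    {
        if (cpu_index < MAX_VCPUS) {
            __atomic_fetch_add(&tb_count[cpu_index], 1, __ATOMIC_RELAXED);
        }
    }

    /* Runs once per translation: attach the exec-time callback. */
    static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
    {
        qemu_plugin_register_vcpu_tb_exec_cb(tb, vcpu_tb_exec,
                                             QEMU_PLUGIN_CB_NO_REGS, NULL);
    }

    static void plugin_exit(qemu_plugin_id_t id, void *p)
    {
        for (unsigned int i = 0; i < MAX_VCPUS; i++) {
            char buf[64];
            snprintf(buf, sizeof(buf), "vcpu %u: %" PRIu64 " TBs\n",
                     i, tb_count[i]);
            qemu_plugin_outs(buf);
        }
    }

    QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                               const qemu_info_t *info,
                                               int argc, char **argv)
    {
        qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
        qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
        return 0;
    }

To spot a stall while it happens we'd still need a thread in the plugin
that dumps the counters every second or so, rather than only at exit.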

We could maybe turn on instruction tracing, but this problem happens
pretty rarely (in under 1% of runs), we don't have a repro case yet, and
we can't really afford the cost of slowing down every test run. There's
a decent chance that this is caused by an overloaded host, but our
host-side investigations haven't turned up anything concrete either.
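(By instruction tracing I mean something like qemu's exec logging:

    qemu-system-aarch64 ... -d exec,nochain -D /tmp/qemu-exec.log

which writes a line for every TB entered; "nochain" stops TBs from
jumping directly to each other so entries aren't silently skipped, but
it also makes the slowdown even worse.)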

Any advice?

--dbort


Re: Using All Cores of CPU on Snapdragon Processor during x86-to-ARM User Space Emulation

2020-05-14 Thread Peter Maydell
On Thu, 14 May 2020 at 11:32, Jakob Bohm  wrote:
> The one exception to this lack was instruction decoding, where certain
> commonly used branch instructions were defined as implicitly picking up
> any changes in instruction memory.  This of course corresponds to the TCG
> checking for needed retranslation of buffers at those points.

As it happens, QEMU will force retranslation of a buffer for x86
guests even if they modify the immediately next insn, rather than
only picking up the change at the next branch. The x86 target
sets TARGET_HAS_PRECISE_SMC, which enables some extra code that
stops execution of the CPU when a write to the current TB is
detected; all other targets don't set this, because architecturally
it's OK for them to finish execution of the current TB before
picking up the changed code. More generally, we detect self-modifying
code by trapping writes to areas of memory which we've translated
code from, rather than by hooking the guest CPU events (such as
icache-flush) which the h/w uses to handle SMC.
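As an illustration, here's a hypothetical guest-side test (not QEMU
code; it assumes an x86-64 Linux guest that permits an RWX anonymous
mapping). The first instruction patches the immediate of the very next
instruction, and with x86's precise-SMC semantics the call returns the
patched value 2 rather than a stale 1:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static const unsigned char code[] = {
        /* mov byte ptr [rip+1], 0x02 ; patch the 0x01 below */
        0xC6, 0x05, 0x01, 0x00, 0x00, 0x00, 0x02,
        /* mov eax, 1                 ; immediate gets rewritten to 2 */
        0xB8, 0x01, 0x00, 0x00, 0x00,
        /* ret */
        0xC3,
    };

    int main(void)
    {
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memcpy(buf, code, sizeof(code));
        int (*fn)(void) = (int (*)(void))buf;
        printf("fn() = %d\n", fn());   /* prints 2 under precise SMC */
        return 0;
    }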

thanks
-- PMM



Re: Using All Cores of CPU on Snapdragon Processor during x86-to-ARM User Space Emulation

2020-05-14 Thread Jakob Bohm

On 13/05/2020 12:02, Alex Bennée wrote:
> Vijay Daita  writes:
>
>> Hello
>>
>> It is my understanding that one would be unable to do x86-to-ARM user
>> space emulation while utilizing all cores because of x86 barriers.
>
> Actually the utilisation of multiple cores (often referred to as MTTCG)
> is a function of system emulation, and you are correct that for
> x86-on-ARM we don't enable MTTCG, because we don't currently add the
> barrier instructions needed to fully emulate the x86 memory model. For
> linux-user, however, we have always followed the guest threading model,
> because the guest clone() is passed down to the host. But because the
> memory modelling isn't perfect, you can run into problems caused by the
> mismatch.
>
>> I wanted to know if there is a difference between what QEMU aims to do
>> and using an interpreter of sorts to convert x86 instructions directly
>> to ARM instructions, so that when run on the system directly, the
>> system can decide, itself, how to apportion the task.
>
> This is what the TCG does - it translates guest instructions into
> groups of host instructions. We could insert the extra barriers for all
> loads and stores, but the effect would be to cripple performance. In an
> ideal world we would only do this for the load/store instructions
> involved in inter-thread synchronisation operations, but that's a
> fairly tricky problem to solve.

Especially because the x86 memory model traditionally has its
barrier/synchronization instructions automatically propagate their
ordering to all other cores/CPUs. As a result, "barrier load" wasn't
really a thing until CMPXCHG was introduced, in a later CPU generation
than the basic sync instructions (8086) and cache coherency mechanisms
(80486). In fact, the LOCK barrier prefix triggers an #UD exception
with most load instructions.

The one exception to this lack was instruction decoding, where certain
commonly used branch instructions were defined as implicitly picking up
any changes in instruction memory. This of course corresponds to the TCG
checking for needed retranslation of buffers at those points.

Additionally, x86 barriers generally guarantee total ordering, relative
to the barrier operation, of all memory accesses that occur before or
after it in program order, which some other CPU families do not.
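As a minimal illustration of that last point, consider this hypothetical
C11 message-passing litmus test (my own sketch, nothing from QEMU). On
x86 the hardware never reorders the two stores, so in practice the
reader prints 42 even with relaxed atomics; under the ARM memory model
it may legitimately print 0 unless the release/acquire orderings noted
in the comments are used:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int data, flag;

    static void *writer(void *arg)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* x86's TSO hardware orders these two stores anyway; ARM
         * needs memory_order_release here for the same guarantee. */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg)
    {
        /* ARM needs memory_order_acquire here. */
        while (!atomic_load_explicit(&flag, memory_order_relaxed)) {
            /* spin */
        }
        printf("data = %d\n",
               atomic_load_explicit(&data, memory_order_relaxed));
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }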


>> I am new to this, so sorry if this doesn't make very much sense.
>>
>> Thank you

Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  http://www.wisemo.com
Transformervej 29, 2860 Soborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded