How to tell if an emulated aarch64 CPU has stopped doing work?
We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our aarch64 images, running in linux containers on x86_64 alongside other workloads. We've recently run into issues where it looks like an emulated CPU (out of four) sometimes stops making progress for ten or more seconds, and we're trying to characterize the problem. When this happens, the other emulated CPUs run just fine, though sometimes two will stall out at the same time.

Any suggestions for how to tell if an emulated CPU stopped doing work? Based on our experiments, the guest-visible clocks and cycle counters continue to run while a qemu CPU thread is suspended, so it's hard to tell whether the emulation paused or whether our code is spinning with interrupts disabled (though evidence is mounting that that's not the case).

We're adding a bunch more instrumentation to our code, but maybe qemu has some features that will help us out. I tried to find a way to count the number of TBs executed by an emulated core over time, but I didn't see a cheap way to do that with the plugin APIs. We could maybe turn on instruction tracing, but this problem happens pretty rarely (<1%), we don't have a repro case yet, and we can't really afford the cost of slowing down every test run.

There's a decent chance that this is caused by an overloaded host, but our host-side investigations haven't turned up anything concrete either. Any advice?

--dbort
Re: Using All Cores of CPU on Snapdragon Processor during x86-to-ARM User Space Emulation
On Thu, 14 May 2020 at 11:32, Jakob Bohm wrote:
> The one exception to this lack was instruction decoding, where certain
> commonly used branch instructions were defined as implicitly picking up
> any changes in instruction memory. This of course corresponds to the TCG
> checking for needed retranslation of buffers at those points.

As it happens, QEMU will force retranslation of a buffer for x86 guests even if they modify the immediately next insn, rather than only picking up the change at the next branch. The x86 target sets TARGET_HAS_PRECISE_SMC, which enables some extra code that stops execution of the CPU when a write to the current TB is detected; all other targets don't set this, because architecturally it's OK for them to finish execution of the current TB before picking up the changed code.

More generally, we detect self-modifying code by trapping writes to areas of memory which we've translated code from, rather than by acting on the guest CPU events, like icache flushes, which the h/w uses to handle SMC.

thanks
-- PMM
Re: Using All Cores of CPU on Snapdragon Processor during x86-to-ARM User Space Emulation
On 13/05/2020 12:02, Alex Bennée wrote:
> Vijay Daita writes:
>> Hello
>>
>> It is my understanding that one would be unable to do x86-to-ARM user
>> space emulation while utilizing all cores because of x86 barriers.
>
> Actually the utilisation of multiple cores (often referred to as MTTCG)
> is a function of system emulation, and you are correct: for x86-on-ARM we
> don't enable MTTCG, because we don't currently add barrier instructions
> to fully emulate the x86 memory model. For linux-user, however, we have
> always followed the guest threading model, because the guest clone() is
> passed down to the host. Since the memory modelling isn't perfect, you
> can run into problems caused by the mismatch.
>
>> I wanted to know if there is a difference between what QEMU aims to do
>> and using an interpreter of sorts to convert x86 instructions directly
>> to ARM instructions, so that when run on the system directly, the
>> system can decide, itself, how to apportion the task.
>
> This is what the TCG does - it translates guest instructions into groups
> of host instructions. We could insert the extra barriers for all loads
> and stores, but the effect would be to cripple performance. In an ideal
> world we would only do these for the load/store instructions involved in
> inter-thread synchronisation operations, but that's a fairly tricky
> problem to solve.

Especially because the x86 memory model traditionally has barrier/synchronization instructions automatically push their ordering through to all other cores/CPUs; as a result, "barrier load" wasn't really a thing until CMPXCHG was introduced, in a later CPU generation than the basic sync instructions (8086) and cache coherency mechanisms (80486). In fact, the LOCK barrier prefix triggers an #UD exception with most load instructions.

The one exception to this lack was instruction decoding, where certain commonly used branch instructions were defined as implicitly picking up any changes in instruction memory. This of course corresponds to the TCG checking for needed retranslation of buffers at those points.

Additionally, x86 barriers generally guarantee total ordering, relative to the barrier operation, of all memory accesses that occur before or after it in program order, which some other CPU families do not.

>> I am new to this, so sorry if this doesn't make very much sense.
>> Thank you

Enjoy

Jakob
-- 
Jakob Bohm, CIO, Partner, WiseMo A/S. http://www.wisemo.com
Transformervej 29, 2860 Soborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded