> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard > Henderson > Sent: Monday, 24 July 2017 15:03 > > On 07/24/2017 02:23 PM, Emilio G. Cota wrote: > > (Adding some Cc's) > > > > On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel > wrote: > >> Hi all, > >> > >> I'm trying to track down what appears to be a translation bug in either > >> the aarch64 target or x86_64 TCG (in multithreaded mode). The > symptoms > > I assume this is really x86_64 and not i686 as host.
Yes. > >> are entirely consistent with a torn read/write -- that is, a 64-bit > >> load or store that was translated to two 32-bit loads and stores -- > >> but that's obviously not what happens in the common path through the > >> translation for this code, so I'm wondering: are there any cases in > >> which qemu will split a 64-bit memory access into two 32-bit accesses? > > > > That would be a bug in MTTCG. > > > >> The code: Guest CPU A writes a 64-bit value to an aligned memory > >> location that was previously 0, using a regular store; e.g.: > >> f9000034 str x20,[x1] > >> > >> Guest CPU B (who is busy-waiting) reads a value from the same location: > >> f9400280 ldr x0,[x20] > >> > >> The symptom: CPU B loads a value that is neither NULL nor the value > >> written. Instead, x0 gets only the low 32-bits of the value written > >> (high bits are all zero). By the time this value is dereferenced (a > >> few instructions later) and the exception handlers run, the memory > >> location from which it was loaded has the correct 64-bit value with > >> a non-zero upper half. > >> > >> Obviously on a real ARM memory barriers are critical, and indeed > >> the code has such barriers in it, but I'm assuming that any possible > >> mistranslation of the barriers is irrelevant because for a 64-bit load > >> and a 64-bit store you should get all or nothing. Other clues that may > >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting > >> is used to resolve a race when updating another variable), and the > >> busy-wait loop has a yield instruction in it (although those appear > >> to be no-ops with MTTCG). > > > > This might have to do with how ldrex/strex is emulated; are you relying > > on the exclusive pair detecting ABA? If so, your code won't work in > > QEMU since it uses cmpxchg to emulate ldrex/strex. > > ABA problem is nothing to do with tearing. And cmpxchg will definitely not > create tearing problems. In fact, the ldrex/strex are there implementing a cmpxchg. :) > I don't know how we would manage 64-bit tearing on a 64-bit host, at least > for > the aarch64 guest, for which I believe we have a good emulation. > > > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then? > > A good suggestion. Thanks for the suggestions. I've been running this configuration all day, and haven't seen a repro yet in hundreds of attempts. I'll report if that changes. The problem is that this test isn't very conclusive... I don't know how often the VCPU threads will context switch, but I suspect it's pretty rare relative to the ability to expose races when they run directly on different host cores. And if the race really doesn't exist when running on a single core, that to me suggests a hardware problem, which is even less likely. > > - Force the emulation of cmpxchg via EXCP_ATOMIC with: > > > > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c > > index 87f673e..771effe5 100644 > > --- a/tcg/tcg-op.c > > +++ b/tcg/tcg-op.c > > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 > retv, TCGv addr, TCGv_i64 cmpv, > > } > > tcg_temp_free_i64(t1); > > } else if ((memop & MO_SIZE) == MO_64) { > > -#ifdef CONFIG_ATOMIC64 > > +#if 0 > > I suspect this will simply alter the timing. However, give it a go by all > means. I haven't tried this yet. If it behaves as you describe, it seems likely to make the race harder to hit. > If there's a test case that you can share, that would be awesome. I wish we could :( > Especially if you can prod it to happen with a standalone minimal binary. > With > luck you can reproduce via aarch64-linux-user too, and simply signal an error > via branch to __builtin_trap. Initial attempts to repro this with a tight loop failed, but I'll take another stab at producing a standalone test program that we can share. Andrew