> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard Henderson
> Sent: Monday, 24 July 2017 15:03
> 
> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> > (Adding some Cc's)
> >
> > On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel wrote:
> >> Hi all,
> >>
> >> I'm trying to track down what appears to be a translation bug in either
> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
> 
> I assume this is really x86_64 and not i686 as host.

Yes.

> >> are entirely consistent with a torn read/write -- that is, a 64-bit
> >> load or store that was translated to two 32-bit loads and stores --
> >> but that's obviously not what happens in the common path through the
> >> translation for this code, so I'm wondering: are there any cases in
> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
> >
> > That would be a bug in MTTCG.
> >
> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
> >> location that was previously 0, using a regular store; e.g.:
> >>    f9000034 str         x20,[x1]
> >>
> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
> >>    f9400280 ldr         x0,[x20]
> >>
> >> The symptom: CPU B loads a value that is neither NULL nor the value
> >> written. Instead, x0 gets only the low 32-bits of the value written
> >> (high bits are all zero). By the time this value is dereferenced (a
> >> few instructions later) and the exception handlers run, the memory
> >> location from which it was loaded has the correct 64-bit value with
> >> a non-zero upper half.
> >>
> >> Obviously on a real ARM memory barriers are critical, and indeed
> >> the code has such barriers in it, but I'm assuming that any possible
> >> mistranslation of the barriers is irrelevant because for a 64-bit load
> >> and a 64-bit store you should get all or nothing. Other clues that may
> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> >> is used to resolve a race when updating another variable), and the
> >> busy-wait loop has a yield instruction in it (although those appear
> >> to be no-ops with MTTCG).
> >
> > This might have to do with how ldrex/strex is emulated; are you relying
> > on the exclusive pair detecting ABA? If so, your code won't work in
> > QEMU since it uses cmpxchg to emulate ldrex/strex.
> 
> The ABA problem has nothing to do with tearing.  And cmpxchg will definitely
> not create tearing problems.

In fact, the ldrex/strex are there implementing a cmpxchg. :)
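
Concretely, it's the usual exclusive-pair loop, something of this shape (an
illustrative sketch, not our actual source; as Emilio notes above, QEMU
emulates the exclusive pair using a host cmpxchg):

#include <stdbool.h>
#include <stdint.h>

/* cmpxchg built from an exclusive pair (LDXR/STXR); illustrative only. */
static bool cmpxchg64(uint64_t *p, uint64_t expected, uint64_t desired)
{
    uint64_t observed;
    uint32_t status;
    __asm__ volatile(
        "1: ldxr  %0, %2\n"        /* exclusive load                 */
        "   cmp   %0, %3\n"
        "   b.ne  2f\n"            /* value differs: fail            */
        "   stxr  %w1, %4, %2\n"   /* exclusive store, status in %w1 */
        "   cbnz  %w1, 1b\n"       /* lost exclusivity: retry        */
        "2:"
        : "=&r"(observed), "=&r"(status), "+Q"(*p)
        : "r"(expected), "r"(desired)
        : "cc", "memory");
    return observed == expected;
}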

> I don't know how we would manage 64-bit tearing on a 64-bit host, at least
> for the aarch64 guest, for which I believe we have a good emulation.
> 
> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
> 
> A good suggestion.

Thanks for the suggestions. I've been running this configuration all day, and 
haven't seen a repro yet in hundreds of attempts. I'll report if that changes.

The problem is that this test isn't very conclusive... I don't know how often
the VCPU threads context-switch when pinned to a single core, but I suspect
it's rare compared with running them directly on different host cores, which
exposes races far more readily. And if the race genuinely doesn't exist when
running on a single core, that to me suggests a host hardware problem, which
is even less likely.

> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
> >
> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> > index 87f673e..771effe5 100644
> > --- a/tcg/tcg-op.c
> > +++ b/tcg/tcg-op.c
> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
> >           }
> >           tcg_temp_free_i64(t1);
> >       } else if ((memop & MO_SIZE) == MO_64) {
> > -#ifdef CONFIG_ATOMIC64
> > +#if 0
> 
> I suspect this will simply alter the timing.  However, give it a go by all
> means.

I haven't tried this yet. Since forcing the non-CONFIG_ATOMIC64 path makes the
cmpxchg raise EXCP_ATOMIC and run serialized, with all other VCPUs stopped, if
it simply alters the timing as you suspect, it seems likely to make the race
harder to hit.

> If there's a test case that you can share, that would be awesome.

I wish we could :(

> Especially if you can prod it to happen with a standalone minimal binary.
> With luck you can reproduce via aarch64-linux-user too, and simply signal
> an error via branch to __builtin_trap.

Initial attempts to repro this with a tight loop failed, but I'll take another 
stab at producing a standalone test program that we can share.
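
For reference, the shape I have in mind is something like the sketch below
(illustrative only; the names, the value pattern, and the build line are mine,
not the actual guest code): one thread keeps storing a 64-bit value whose
halves match, and the other traps the moment it observes mismatched halves,
i.e. a torn access.

/* Build with: aarch64-linux-gnu-gcc -O2 -pthread torn.c
 * Run under qemu-aarch64 (linux-user); __builtin_trap signals a torn read. */
#include <pthread.h>
#include <stdint.h>

static uint64_t shared;

static void *writer(void *arg)
{
    (void)arg;
    for (uint64_t i = 1;; i++) {
        /* Both halves equal, so any torn read is detectable. */
        uint64_t v = (i << 32) | (uint32_t)i;
        __atomic_store_n(&shared, v, __ATOMIC_RELEASE);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    for (;;) {
        uint64_t v = __atomic_load_n(&shared, __ATOMIC_ACQUIRE);
        if ((uint32_t)(v >> 32) != (uint32_t)v) {
            __builtin_trap(); /* torn 64-bit access observed */
        }
    }
}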

Andrew
