On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
(Adding some Cc's)
On Mon, Jul 24, 2017 at 19:05:33 +0000, Andrew Baumann via Qemu-devel wrote:
Hi all,
I'm trying to track down what appears to be a translation bug in either
the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
are entirely consistent with a torn read/write -- that is, a 64-bit
load or store that was translated to two 32-bit loads and stores --
but that's obviously not what happens in the common path through the
translation for this code, so I'm wondering: are there any cases in
which qemu will split a 64-bit memory access into two 32-bit accesses?
I assume this is really x86_64 and not i686 as host.
That would be a bug in MTTCG.
The code: Guest CPU A writes a 64-bit value to an aligned memory
location that was previously 0, using a regular store; e.g.:
f9000034 str x20,[x1]
Guest CPU B (who is busy-waiting) reads a value from the same location:
f9400280 ldr x0,[x20]
The symptom: CPU B loads a value that is neither NULL nor the value
written. Instead, x0 gets only the low 32-bits of the value written
(high bits are all zero). By the time this value is dereferenced (a
few instructions later) and the exception handlers run, the memory
location from which it was loaded has the correct 64-bit value with
a non-zero upper half.
Obviously, on a real ARM, memory barriers are critical, and indeed
the code has such barriers in it, but I'm assuming that any possible
mistranslation of the barriers is irrelevant because for a 64-bit load
and a 64-bit store you should get all or nothing. Other clues that may
be relevant: the code is _near_ an LDREX/STREX pair (the busy-waiting
is used to resolve a race when updating another variable), and the
busy-wait loop has a yield instruction in it (although those appear
to be no-ops with MTTCG).
This might have to do with how ldrex/strex is emulated; are you relying
on the exclusive pair detecting ABA? If so, your code won't work in
QEMU since it uses cmpxchg to emulate ldrex/strex.
The ABA problem has nothing to do with tearing. And cmpxchg will definitely not
create tearing problems.
I don't know how we would manage 64-bit tearing on a 64-bit host, at least for
the aarch64 guest, for which I believe we have a good emulation.
- Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
A good suggestion.
- Force the emulation of cmpxchg via EXCP_ATOMIC with:
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
}
tcg_temp_free_i64(t1);
} else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0
I suspect this will simply alter the timing. However, give it a go by all
means.
If there's a test case that you can share, that would be awesome.
Especially if you can prod it to happen with a standalone minimal binary. With
luck you can reproduce via aarch64-linux-user too, and simply signal an error
via branch to __builtin_trap.
r~