Re: apparent race condition in mttcg memory handling

Pierrick Bouvier Mon, 21 Jul 2025 10:55:33 -0700

On 7/21/25 10:31 AM, Peter Maydell wrote:

On Mon, 21 Jul 2025 at 18:26, Pierrick Bouvier
<pierrick.bouv...@linaro.org> wrote:


On 7/21/25 10:14 AM, Michael Tokarev wrote:

rr is the first thing I tried.  Nope, it's absolutely hopeless.   It
tried to boot just the kernel for over 30 minutes, after which I just
gave up.


I had a similar thing to debug recently, and with a simple loop, I
couldn't expose it easily. The bug I had was triggered with 3%
probability, which seems close to yours.
As rr record -h is single threaded, I found it useful to write a wrapper
script [1] to run one instance, and then run it in parallel using:
./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)

With that, I could expose the bug in 2 minutes reliably (vs trying for
more than one hour before). With your 64 cores, I'm sure it will quickly
expose it.
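
For illustration only, the "run many instances in parallel" idea above can be sketched without GNU parallel. This is not the wrapper script from [1]: the function name run_many and its arguments are hypothetical, and it assumes an xargs that supports -P (GNU and BSD both do). It prints the index of every run that exits non-zero, so a failing run is easy to spot.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the script referenced as [1].
# Usage: run_many RUNS JOBS COMMAND [ARGS...]
# Runs COMMAND RUNS times, at most JOBS at once, and prints the index of
# each run that exited with a non-zero status (order is not deterministic).
run_many() {
    local runs=$1 jobs=$2
    shift 2
    # seq emits one index per run; xargs starts up to $jobs shells at a time.
    # {} is replaced by the run index, "$@" inside sh is COMMAND and ARGS.
    seq "$runs" | xargs -P "$jobs" -I{} \
        sh -c '"$@" >/dev/null 2>&1 || echo "{}"' _ "$@"
}
```

Something like `run_many 10000 "$(nproc)" ./run_one.sh` would then play the role of the pipeline above, assuming run_one.sh returns non-zero when the bug is hit.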


I think the problem here is that the whole runtime to get to
point-of-potential failure is too long, not that it takes too
many runs to get a failure.

For that kind of thing I have had success in the past with
making a QEMU snapshot close to the point of failure so that
the actual runtime that it's necessary to record under rr is
reduced.

That's a good idea indeed. In the bug I had, it was due to the KASLR
address chosen, so by using a snapshot I would not have exposed the
random aspect.
In the case of the current bug, it seems to be a proper race condition,
so trying more combinations with a preloaded snapshot to save a few
seconds per run is a good point.
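
A rough sketch of the snapshot approach, under stated assumptions: the guest is first brought close to the failure point and saved with `savevm <tag>` in the QEMU monitor (which, as far as I know, needs a snapshot-capable disk image such as qcow2), and each later run is started with QEMU's -loadvm option. The helper name from_snapshot and the tag are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append QEMU's -loadvm option to a command line so
# each run resumes from a saved snapshot instead of booting from scratch.
# Usage: from_snapshot TAG COMMAND [ARGS...]
# TAG is the name that was given to `savevm` in the QEMU monitor.
from_snapshot() {
    local tag=$1
    shift
    # Run the given command with "-loadvm TAG" added at the end.
    "$@" -loadvm "$tag"
}
```

Each recorded run would then look something like `from_snapshot near-failure rr record -h <usual QEMU command line>`, where both the tag and the command line are placeholders.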

-- PMM
