On 7/21/25 10:31 AM, Peter Maydell wrote:
On Mon, 21 Jul 2025 at 18:26, Pierrick Bouvier
<pierrick.bouv...@linaro.org> wrote:

On 7/21/25 10:14 AM, Michael Tokarev wrote:
rr is the first thing I tried.  Nope, it's absolutely hopeless.   It
tried to boot just the kernel for over 30 minutes, after which I just
gave up.


I had a similar thing to debug recently, and with a simple loop, I
couldn't expose it easily. The bug I had was triggered with 3%
probability, which seems close to yours.
As rr record -h is single threaded, I found it useful to write a wrapper
script [1] to run one instance, and then run it in parallel using:
./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)

With that, I could expose the bug in 2 minutes reliably (vs trying for
more than one hour before). With your 64 cores, I'm sure it will quickly
expose it.
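
For anyone without GNU parallel at hand, the same idea (many independent
runs, several at a time, keep the ones that fail) can be sketched in a few
lines of Python. This is only an illustration under assumptions: the names
find_failures and run_once are made up for this example, the command passed
in stands for whatever a wrapper script would run, and a non-zero exit
status is assumed to mean "the bug was hit".

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, index):
    """Run cmd once, discarding its output; return (run index, exit code)."""
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return index, proc.returncode


def find_failures(cmd, runs, jobs=None):
    """Run cmd `runs` times, at most `jobs` at once.

    Returns the sorted indices of the runs that exited non-zero.
    Assumption: a non-zero exit status marks a reproduced failure.
    """
    jobs = jobs or os.cpu_count() or 1
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda i: run_once(cmd, i), range(runs))
        return sorted(i for i, rc in results if rc != 0)
```

Something like find_failures(["./run_one.sh"], runs=10000) would then play
the role of the pipeline above, with the script itself doing the rr
recording. Unlike the pipeline, this sketch does not stop at the first
failure; it just reports which runs failed.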

I think the problem here is that the whole runtime to get to
point-of-potential failure is too long, not that it takes too
many runs to get a failure.

For that kind of thing I have had success in the past with
making a QEMU snapshot close to the point of failure so that
the actual runtime that it's necessary to record under rr is
reduced.


That's a good idea indeed. In the bug I had, the failure depended on the
KASLR address chosen, so a snapshot would not have exposed the random
aspect. The current bug seems to be a proper race condition, so trying
more combinations with a preloaded snapshot, saving a few seconds per
run, is a good point.

-- PMM

