On 7/21/25 10:31 AM, Peter Maydell wrote:
On Mon, 21 Jul 2025 at 18:26, Pierrick Bouvier
<pierrick.bouv...@linaro.org> wrote:
On 7/21/25 10:14 AM, Michael Tokarev wrote:
rr is the first thing I tried. Nope, it's absolutely hopeless. It
tried to boot just the kernel for over 30 minutes, after which I just
gave up.
I had a similar thing to debug recently, and with a simple loop, I
couldn't expose it easily. The bug I had was triggered with 3%
probability, which seems close to yours.
As rr record -h is single threaded, I found it useful to write a wrapper
script [1] to run one instance, and then run it in parallel using:
./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)
With that, I could expose the bug in 2 minutes reliably (vs trying for
more than one hour before). With your 64 cores, I'm sure it will quickly
expose it.
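For readers without access to the script referenced as [1], here is a minimal sketch of what such a per-run wrapper could look like. It is not that script: the names run_one, RECORD_CMD and TRACE_ROOT are made up for illustration, and it assumes rr stores its traces under the directory named by the _RR_TRACE_DIR environment variable. Each run gets a private trace directory, which is deleted on success and kept on failure so that parallel jobs do not collide.

```shell
# Hypothetical sketch, not the wrapper script referenced as [1].
# RECORD_CMD, TRACE_ROOT and run_one are illustrative names.
RECORD_CMD=${RECORD_CMD:-"rr record -h"}   # -h enables rr's chaos mode
TRACE_ROOT=${TRACE_ROOT:-./traces}

# Record one run of the given command; keep the trace only if the run fails.
run_one() {
    local dir status
    mkdir -p "$TRACE_ROOT" || return 2
    # Private directory per run, so concurrent jobs never share a trace.
    dir=$(mktemp -d "$TRACE_ROOT/run.XXXXXX") || return 2
    # RECORD_CMD is left unquoted on purpose so it splits into words.
    _RR_TRACE_DIR="$dir" $RECORD_CMD "$@" >"$dir/output.log" 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        rm -rf "$dir"            # uninteresting run, drop trace and log
    else
        echo "FAILED (status $status): trace kept in $dir"
    fi
    return "$status"
}
```

Calling run_one with the command under test as its arguments records a single attempt; a script built around it can then be started once per job by the parallel pipeline above.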
I think the problem here is that the whole runtime to get to
point-of-potential failure is too long, not that it takes too
many runs to get a failure.
For that kind of thing I have had success in the past with
making a QEMU snapshot close to the point of failure so that
the actual runtime that it's necessary to record under rr is
reduced.
That's a good idea indeed. In the bug I had, it was due to the KASLR
address chosen, so with a snapshot I would not have exposed the random
aspect.
In the case of the current bug, it seems to be a proper race condition, so
trying more combinations with a preloaded snapshot to save a few seconds
per run is a good point.
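One possible way to set up the snapshot approach, as a rough sketch only: it assumes the snapshot is an internal one created with the QEMU monitor's savevm command (which needs a writable qcow2 disk) and restored with the -loadvm command-line option. QEMU_BIN, SNAPSHOT_TAG, RECORD_CMD and record_from_snapshot are hypothetical names, and the machine options passed in must match those used when the snapshot was taken.

```shell
# Hypothetical names: QEMU_BIN, SNAPSHOT_TAG, RECORD_CMD, record_from_snapshot.
QEMU_BIN=${QEMU_BIN:-qemu-system-x86_64}
SNAPSHOT_TAG=${SNAPSHOT_TAG:-near-failure}
RECORD_CMD=${RECORD_CMD:-"rr record -h"}   # -h enables rr's chaos mode

# One-time setup, typed in the QEMU monitor of an unrecorded run once the
# guest is close to the point of potential failure:
#   (qemu) savevm near-failure

# Record a run that starts from the saved snapshot instead of a cold boot.
# Arguments are the machine options, identical to the ones used at savevm time.
record_from_snapshot() {
    # RECORD_CMD is left unquoted on purpose so it splits into words.
    $RECORD_CMD "$QEMU_BIN" "$@" -loadvm "$SNAPSHOT_TAG"
}
```

Only the part of the run after the snapshot is recorded, so each attempt under rr is much shorter than a full boot.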
-- PMM