On Mon, 21 Jul 2025 at 18:26, Pierrick Bouvier
<pierrick.bouv...@linaro.org> wrote:
>
> On 7/21/25 10:14 AM, Michael Tokarev wrote:
> > rr is the first thing I tried.  Nope, it's absolutely hopeless.   It
> > tried to boot just the kernel for over 30 minutes, after which I just
> > gave up.
> >
>
> I had a similar thing to debug recently, and with a simple loop, I
> couldn't expose it easily. The bug I had was triggered with 3%
> probability, which seems close to yours.
> As rr record -h is single threaded, I found it useful to write a wrapper
> script [1] to run one instance, and then run it in parallel using:
> ./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)
>
> With that, I could expose the bug in 2 minutes reliably (vs trying for
> more than one hour before). With your 64 cores, I'm sure it will quickly
> expose it.

I think the problem here is that the runtime needed to reach the
point of potential failure is too long, not that it takes too
many runs to get a failure.

For that kind of thing I have had success in the past with
making a QEMU snapshot close to the point of failure, so that
the part of the run that actually has to be recorded under rr
is shorter.
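
A rough sketch of that workflow, in case it is useful. It assumes
the guest disk is a qcow2 image (savevm needs a snapshot-capable
block device to store the VM state in) and that the machine options
are identical between saving and loading. The binary name, disk
path, snapshot tag and the helper function are all placeholders of
mine, not anything taken from this thread. By default the script
only prints the rr command rather than running it.

```shell
#!/bin/sh
# Sketch only: the binary, disk image and snapshot tag are placeholders.
QEMU=${QEMU:-qemu-system-x86_64}
DISK=${DISK:-disk.qcow2}
TAG=${TAG:-pre-bug}

# Step 1 (once, outside rr): boot the guest with the same options as
# below minus -loadvm, get close to the failure point, then type this
# in the QEMU (HMP) monitor:
#   (qemu) savevm pre-bug
#
# Step 2: print the command that records only the tail of the run
# under rr. Extra arguments are appended unchanged, so the remaining
# machine options can be made to match step 1.
rr_from_snapshot() {
    echo rr record "$QEMU" -drive "file=$DISK,format=qcow2" -loadvm "$TAG" "$@"
}

rr_from_snapshot
```

Once the printed command looks right for the setup in question, it
can be run directly, or fed to a parallel loop like the one quoted
above.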

-- PMM
