On Mon, 21 Jul 2025 at 18:26, Pierrick Bouvier <pierrick.bouv...@linaro.org> wrote:
>
> On 7/21/25 10:14 AM, Michael Tokarev wrote:
> > rr is the first thing I tried. Nope, it's absolutely hopeless. It
> > tried to boot just the kernel for over 30 minutes, after which I just
> > gave up.
> >
>
> I had a similar thing to debug recently, and with a simple loop, I
> couldn't expose it easily. The bug I had was triggered with 3%
> probability, which seems close from yours.
> As rr record -h is single threaded, I found useful to write a wrapper
> script [1] to run one instance, and then run it in parallel using:
> ./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)
>
> With that, I could expose the bug in 2 minutes reliably (vs trying for
> more than one hour before). With your 64 cores, I'm sure it will quickly
> expose it.

I think the problem here is that the runtime needed to reach the
point of potential failure is too long, not that it takes too many
runs to get a failure. For that kind of thing I have had success in
the past with taking a QEMU snapshot close to the point of failure,
so that the runtime that actually has to be recorded under rr is
much shorter.

-- PMM
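
A rough sketch of that snapshot approach, in case it is useful. Everything
specific here is an assumption for illustration: the qemu-system-x86_64
binary, the guest.qcow2 image, the -m 2G machine options and the
"before-bug" snapshot tag are all hypothetical placeholders. Only the
savevm monitor command, the -loadvm option and "rr record" are the real
pieces being relied on. The script just prints the two command lines
rather than running them:

```shell
#!/bin/sh
# Sketch only: binary, image name, machine options and snapshot tag are
# hypothetical placeholders, not taken from the failing setup.
QEMU=qemu-system-x86_64
IMG=guest.qcow2        # savevm needs a snapshot-capable disk such as qcow2
TAG=before-bug
OPTS="-m 2G -drive file=$IMG,format=qcow2"

# Step 1 (not under rr): boot normally with a monitor on stdio. Shortly
# before the failure is expected, type this at the (qemu) prompt:
#     savevm before-bug
boot_cmd() {
    echo "$QEMU $OPTS -monitor stdio"
}

# Step 2: record under rr, resuming from the snapshot so that only the
# short tail of the run is recorded. The machine options have to match
# the ones used when the snapshot was taken.
record_cmd() {
    echo "rr record $QEMU $OPTS -loadvm $TAG"
}

boot_cmd
record_cmd
```

If the failure is also intermittent, the step 2 command is the one that
would go into a loop or a parallel wrapper.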