It looks like we have a solution to the RCU patch which was causing problems 
with the func-alpha-replay test (see 
[email protected]).
While this was going on I spent a bit of time investigating repeatability in 
record/replay, and I think there may be broader problems with it.

While running the func-alpha-replay test we have two threads reading or writing 
the replay event log: the "main" thread running qemu_main_loop and the "RR" 
(round robin) thread running rr_cpu_thread_fn. Both of these use 
replay_mutex_lock() and bql_lock() to synchronize some actions. A third thread 
runs RCU maintenance; it also uses bql_lock(), but not replay_mutex_lock().

replay_mutex_lock() has some extra logic to improve fairness of locking. This 
means that whichever thread calls replay_mutex_lock() first should obtain the 
lock first. However, so far as I can see, this doesn't make the scheduling of 
the main and RR threads deterministic. I have observed times when neither of 
those threads holds the lock, and at those points there's no way to predict 
which will call replay_mutex_lock() first. This means the ordering of events 
during either recording or replay is not deterministic.
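
To illustrate the point, here is a minimal sketch of a first-come-first-served 
lock shared by two threads. It is not QEMU's code: the ticket-lock shape and 
every name in it (tl_lock, tl_acquire, tl_worker, tl_record_run and so on) are 
assumptions made for the example. The lock serves callers strictly in arrival 
order, but between a release and the next acquire nobody holds it, so the OS 
scheduler decides who arrives first.

```c
#include <assert.h>
#include <pthread.h>

#define TL_ROUNDS 1000

/* Hypothetical ticket lock: waiters are served in the order they arrived. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t cond;
    unsigned long head; /* ticket currently allowed to hold the lock */
    unsigned long tail; /* next ticket to hand out */
} tl_lock;

static tl_lock tl = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };
static int tl_log[2 * TL_ROUNDS]; /* id of the thread that logged each event */
static int tl_count;              /* number of events logged so far */
static int tl_busy;               /* set while a thread is inside the lock */

static void tl_acquire(tl_lock *l)
{
    pthread_mutex_lock(&l->mu);
    unsigned long ticket = l->tail++;   /* arrival order is fixed here */
    while (ticket != l->head) {
        pthread_cond_wait(&l->cond, &l->mu);
    }
    pthread_mutex_unlock(&l->mu);
}

static void tl_release(tl_lock *l)
{
    pthread_mutex_lock(&l->mu);
    l->head++;                          /* admit the next ticket holder */
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mu);
}

static void *tl_worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < TL_ROUNDS; i++) {
        tl_acquire(&tl);
        assert(!tl_busy);               /* mutual exclusion still holds */
        tl_busy = 1;
        tl_log[tl_count++] = id;        /* stand-in for a replay log event */
        tl_busy = 0;
        tl_release(&tl);
        /* Gap: the lock may be free here, so which thread takes the next
         * ticket depends on the scheduler, not on the lock. */
    }
    return NULL;
}

/* Run both threads to completion; the event order is left in tl_log. */
static int tl_record_run(void)
{
    pthread_t threads[2];
    int ids[2] = { 0, 1 };

    tl_count = 0;
    tl.head = tl.tail = 0;
    for (int i = 0; i < 2; i++) {
        pthread_create(&threads[i], NULL, tl_worker, &ids[i]);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    return tl_count;
}
```

Calling tl_record_run() always logs 2 * TL_ROUNDS events with no overlap, but 
the sequence of thread ids in tl_log can change from run to run. Fairness 
fixes the order of service among waiters; it does not fix the order of arrival.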

It is possible to alter the lock function such that the two threads run in 
lockstep; see 
https://gitlab.com/jmacarthur/qemu-jmac-development/-/commits/jmac/replay-tick-tock
for a rough demonstration. Adding this significantly reduced timeouts on 
func-alpha-replay. I can also see that the replay recordings are much more 
consistent from one recording to the next: they typically diverge around the 
380000th event, rather than around the 20th event without this hack. 
This is not a good fix, since it slows QEMU down significantly and may be prone 
to deadlocks, but I think it demonstrates that the current system is not 
perfect.
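
For readers who don't want to open the branch, the general idea of a lockstep 
lock can be sketched as below. This is not the code from that branch; it is a 
simplified illustration under the assumption of exactly two threads, and all 
names (tt_lock, tt_acquire, tt_release, tt_record_run) are made up for the 
example. Each thread may only take the lock on its own turn, and releasing the 
lock hands the turn to the other thread.

```c
#include <assert.h>
#include <pthread.h>

#define TT_STEPS 1000

/* Hypothetical "tick-tock" lock: two threads must strictly alternate. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t cond;
    int turn; /* id (0 or 1) of the only thread allowed to acquire next */
} tt_lock;

static tt_lock tt = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
static int tt_log[2 * TT_STEPS]; /* id of the thread that logged each event */
static int tt_count;             /* number of events logged so far */

static void tt_acquire(tt_lock *l, int id)
{
    pthread_mutex_lock(&l->mu);
    while (l->turn != id) {             /* wait even if the lock is free */
        pthread_cond_wait(&l->cond, &l->mu);
    }
    pthread_mutex_unlock(&l->mu);
}

static void tt_release(tt_lock *l, int id)
{
    pthread_mutex_lock(&l->mu);
    l->turn = 1 - id;                   /* hand the turn to the other thread */
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mu);
}

static void *tt_worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < TT_STEPS; i++) {
        tt_acquire(&tt, id);
        tt_log[tt_count++] = id;        /* stand-in for a replay log event */
        tt_release(&tt, id);
    }
    return NULL;
}

/* Run both threads to completion; the event order is left in tt_log. */
static int tt_record_run(void)
{
    pthread_t threads[2];
    int ids[2] = { 0, 1 };

    tt_count = 0;
    tt.turn = 0;                        /* thread 0 always goes first */
    for (int i = 0; i < 2; i++) {
        pthread_create(&threads[i], NULL, tt_worker, &ids[i]);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    return tt_count;
}
```

In this sketch the log is always 0, 1, 0, 1, ... regardless of scheduling. The 
costs are also visible: each thread idles until the other has taken its turn, 
and if one thread ever stops taking the lock, the other waits forever.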

Do you agree with my analysis above? Is there something I've missed which is 
meant to deterministically schedule these two threads?

Jim MacArthur
