On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
> Nicholas Piggin <npig...@gmail.com> writes:
>
> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
> >
> > That commit causes reverse_debugging.py test failures, and does
> > not seem to solve the root cause of the problem: x86-64 still
> > hangs in record/replay tests.
>
> I'm still finding the reverse debugging tests failing with this series.

:(

In gitlab CI or your own testing? What are you running exactly?

I found 440fx must have some bug because it's quite bad, but
q35 was more stable.

>
> > The problem with short-cutting the iowait that was taken during
> > record phase is that related events will not get consumed at the
> > same points (e.g., reading the clock).
> >
> > A hang with zero icount always seems to be a symptom of an earlier
> > problem that has caused the recording to become out of synch with
> > the execution and consumption of events by replay.
>
> Would it be possible to still detect the failure mode rather than a full
> revert?

I'm not actually sure exactly how this patch causes test failures
(or how it was fixing the ones you saw). If I can reproduce the
ones you see I could look a bit deeper at it.

I have been looking at some ways to try to detect/report record/replay
errors more sanely because it can be very hard to debug them at
the moment. No patches quite ready to post yet, but yes if there's
something we could do here to help we should.
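
To illustrate the idea of reporting a divergence at the point where it
happens, rather than as a later hang with zero icount, here is a toy
sketch. It is not QEMU code: the toy_event struct, the event kinds and
toy_first_divergence() are all hypothetical names made up for this
example, and the real replay log format is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event kinds for this toy model; not QEMU's replay enum. */
enum toy_event_kind { TOY_EV_CLOCK, TOY_EV_INTERRUPT, TOY_EV_END };

struct toy_event {
    unsigned long icount;      /* instruction count when the event occurred */
    enum toy_event_kind kind;  /* what was recorded or requested */
};

/*
 * Walk the recorded log alongside the sequence of events that replay
 * actually asks for. Return the index of the first entry where they
 * disagree, or n if all n entries match. The mismatch is reported
 * immediately, at the point of divergence.
 */
static size_t toy_first_divergence(const struct toy_event *recorded,
                                   const struct toy_event *replayed,
                                   size_t n)
{
    assert(recorded != NULL && replayed != NULL);

    for (size_t i = 0; i < n; i++) {
        if (recorded[i].icount != replayed[i].icount ||
            recorded[i].kind != replayed[i].kind) {
            fprintf(stderr,
                    "replay diverged at entry %zu: "
                    "recorded icount=%lu kind=%d, "
                    "replay wanted icount=%lu kind=%d\n",
                    i, recorded[i].icount, (int)recorded[i].kind,
                    replayed[i].icount, (int)replayed[i].kind);
            return i;
        }
    }
    return n;
}
```

In this model, a replay that skips a wait and so asks for a different
event than the one recorded at that icount is flagged at that entry,
instead of surfacing much later as an unexplained hang.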

Thanks,
Nick

>
> >
> > Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> > ---
> >  include/sysemu/replay.h      |  5 -----
> >  accel/tcg/tcg-accel-ops-rr.c |  2 +-
> >  replay/replay.c              | 21 ---------------------
> >  3 files changed, 1 insertion(+), 27 deletions(-)
> >
> > diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
> > index f229b2109c..8102fa54f0 100644
> > --- a/include/sysemu/replay.h
> > +++ b/include/sysemu/replay.h
> > @@ -73,11 +73,6 @@ int replay_get_instructions(void);
> >  /*! Updates instructions counter in replay mode. */
> >  void replay_account_executed_instructions(void);
> >  
> > -/**
> > - * replay_can_wait: check if we should pause for wait-io
> > - */
> > -bool replay_can_wait(void);
> > -
> >  /* Processing clocks and other time sources */
> >  
> >  /*! Save the specified clock */
> > diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
> > index 894e73e52c..a942442a33 100644
> > --- a/accel/tcg/tcg-accel-ops-rr.c
> > +++ b/accel/tcg/tcg-accel-ops-rr.c
> > @@ -109,7 +109,7 @@ static void rr_wait_io_event(void)
> >  {
> >      CPUState *cpu;
> >  
> > -    while (all_cpu_threads_idle() && replay_can_wait()) {
> > +    while (all_cpu_threads_idle()) {
> >          rr_stop_kick_timer();
> >          qemu_cond_wait_bql(first_cpu->halt_cond);
> >      }
> > diff --git a/replay/replay.c b/replay/replay.c
> > index b8564a4813..895fa6b67a 100644
> > --- a/replay/replay.c
> > +++ b/replay/replay.c
> > @@ -451,27 +451,6 @@ void replay_start(void)
> >      replay_enable_events();
> >  }
> >  
> > -/*
> > - * For none/record the answer is yes.
> > - */
> > -bool replay_can_wait(void)
> > -{
> > -    if (replay_mode == REPLAY_MODE_PLAY) {
> > -        /*
> > -         * For playback we shouldn't ever be at a point we wait. If
> > -         * the instruction count has reached zero and we have an
> > -         * unconsumed event we should go around again and consume it.
> > -         */
> > -        if (replay_state.instruction_count == 0 && replay_state.has_unread_data) {
> > -            return false;
> > -        } else {
> > -            replay_sync_error("Playback shouldn't have to iowait");
> > -        }
> > -    }
> > -    return true;
> > -}
> > -
> > -
> >  void replay_finish(void)
> >  {
> >      if (replay_mode == REPLAY_MODE_NONE) {

