Hey Gabe,
  Thanks for the suggestion. This work-around doesn't appear to work. In
the O3CPU, the instruction still does not get committed due to the fault
(DefaultCommit<Impl>::commitHead(<suspend instruction>) generates a trap
and returns that the instruction cannot be committed). After thread
reactivation, the instruction is executed again, causing the thread to
suspend again. The SimpleTiming CPU has a similar issue: it executes the
fault and suspends the thread, but while the thread is suspended, the core
appears to just continue trying to execute the suspend instruction.

  It seems like the right way to fix this may be to introduce a
ThreadContext state, say "Activating", which the thread is put into when
activate() is called on it, and the thread is not allowed to enter the
Active state until the ROB has been cleared (i.e. any remaining
instructions from before the suspend are squashed and retired). Does this
sound reasonable?
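
  To make the idea concrete, here is a rough sketch of the gating I have in
mind. It is a toy model only: ToyThread, ThreadStatus, squashOne() and
canFetch() are made-up names for illustration and do not correspond to the
actual ThreadContext or O3 interfaces.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical thread states. "Activating" is the proposed new state;
// these names are illustrative, not gem5's real ThreadContext::Status.
enum class ThreadStatus { Suspended, Activating, Active };

// Toy stand-in for one hardware thread and its share of the ROB.
class ToyThread {
  public:
    // The thread suspends while `inFlight` older instructions are
    // still sitting in the ROB.
    void suspend(std::size_t inFlight) {
        status_ = ThreadStatus::Suspended;
        robEntries_ = inFlight;
    }

    // Wake-up request: go straight to Active only if the ROB is already
    // empty, otherwise wait in Activating.
    void activate() {
        if (status_ != ThreadStatus::Suspended)
            return;
        status_ = (robEntries_ == 0) ? ThreadStatus::Active
                                     : ThreadStatus::Activating;
    }

    // One stale instruction is squashed and removed from the ROB.
    void squashOne() {
        if (robEntries_ > 0)
            --robEntries_;
        if (status_ == ThreadStatus::Activating && robEntries_ == 0)
            status_ = ThreadStatus::Active;
    }

    // Fetch is only allowed once the thread is fully Active.
    bool canFetch() const { return status_ == ThreadStatus::Active; }

    ThreadStatus status() const { return status_; }

  private:
    ThreadStatus status_ = ThreadStatus::Active;
    std::size_t robEntries_ = 0;
};
```

  This only shows the state gating. It says nothing about which PC fetch
should resume from, which would still need to be sorted out in commit.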

  Thanks,
  Joel


On Tue, Jan 20, 2015 at 12:32 PM, Gabe Black via gem5-dev <gem5-dev@gem5.org> wrote:

> It sounds like a bug/race condition in the O3 CPU, which I think you
> already knew. You could try moving the suspend call into a fault returned
> by the MicroHalt microop instead of the instruction itself. That might
> break the race, although it's not really fixing the issue with O3.
>
> Gabe
>
> On Tue, Jan 20, 2015 at 8:43 AM, Joel Hestness via gem5-dev <
> gem5-dev@gem5.org> wrote:
>
> > Hi guys,
> >   I'm running into a very tricky problem with halt/suspend x86 instructions
> > with the O3 CPU. This might be a question for Nilay, Gabe B. or Mitch H.,
> > and I'm really hoping for input given the complexity of this one.
> >
> >   The specific problem is that when calling suspend from the execute stage
> > of an instruction (e.g. a pseudoinstruction or the x86 MicroHalt microop),
> > the CPU context gets suspended, but after reactivating the context later,
> > the instruction gets squashed and replayed, potentially causing the context
> > to get suspended again immediately. The pseudoinstruction that I'm using
> > doesn't do anything except call the thread context suspend, and the
> > functionality is nearly identical to that of the MicroHalt op (I've now
> > tried swapping in the MicroHalt and run into the same problem, so I suspect
> > this may also affect the MWAIT implementation). The instruction that
> > suspends the context moves to the commit buffer in the core, but cannot be
> > committed before the thread is suspended. When the thread is restarted, the
> > commit stage squashes all instructions, retiring the suspend instruction,
> > and fetch starts back at the PC of the suspend instruction. In cases that
> > appear to execute correctly, the pipeline re-fetches the suspend
> > instruction, but it gets squashed from the commit stage and removed from
> > the instruction list. In apparently broken cases, the instruction does not
> > get squashed, so the thread goes back to sleep. Interrupts can jar the CPU
> > out of the incorrect suspend loop, but sometimes it takes 3-6 interrupts
> > (i.e. up to tens of milliseconds).
> >
> >   Some details: I'm currently using gem5 revision 10237:b2850bdcec07 and
> > the bug occurs in long-running sims in FS mode (single-threaded cores, no
> > SMT). I've also pulled some more recent changesets and applied them to my
> > repo, since they address O3 CPU issues: 10239
> > <http://repo.gem5.org/gem5/rev/592f0bb6bd6f>, 10327
> > <http://repo.gem5.org/gem5/rev/5b6279635c49>, 10328
> > <http://repo.gem5.org/gem5/rev/867b536a68be>, 10329
> > <http://repo.gem5.org/gem5/rev/12e3be8203a5>, 10331
> > <http://repo.gem5.org/gem5/rev/ed05298e8566>, 10332
> > <http://repo.gem5.org/gem5/rev/1ba825974ee6>, 10340
> > <http://repo.gem5.org/gem5/rev/40d24a672351>. I'm unable to reproduce the
> > bug in SE mode, and I suspect that sporadic interrupt handling in O3 may be
> > part of the problem, since the examples that I can generate show CPU
> > interrupts raised in close proximity to the thread suspend and activate
> > activity.
> >
> >   I've attached annotated O3 execution traces for seemingly correct and
> > incorrect instances. Here are some specific questions I'm hoping for help
> > with:
> >
> >   1) Are there any other known changes in the mainline repo that might fix
> > this?
> >
> >   2) If the answer to (1) is no, I'm not clear on the purpose of retiring
> > the suspend instruction (rather than committing it) after reactivating the
> > thread context. I can understand that the full pipeline squash would be a
> > standard procedure after reactivating a thread. However, the suspend
> > instruction finishes execution in the same cycle that thread suspend
> > starts, so it should be free to commit. Since the suspend instruction
> > doesn't commit, it is pointed to as the next PC after thread reactivation,
> > which allows it to be reexecuted incorrectly. Is this retirement process
> > the intended behavior for thread suspend instructions like these? It seems
> > like this might be a corner case where the suspend instruction should get
> > committed rather than squashed/retired, even though the pipeline is being
> > squashed for the suspend process.
> >
> >   3) I'm also not clear why the suspend instruction after reactivation does
> > (correct) or does not (incorrect) get squashed. The suspend instruction is
> > refetched from the icache in the correct case, and by the time it gets to
> > the commit stage, the interrupt that reactivated the thread has caused the
> > commit stage to enter the TrapPending state. This is why the refetched
> > suspend instruction is squashed after thread reactivation, but it is not
> > marked as previously squashed/retired.
> >   In the incorrect case, it looks like the instruction is refetched from a
> > fetch buffer rather than the icache, so the instruction is created
> > immediately (after thread reactivation, shouldn't fetch be completely
> > squashed/drained?). Given that the instruction gets to the ROB so quickly,
> > the pending interrupt has not been initialized by commit, so commit has not
> > yet entered the TrapPending state. This is why the refetched suspend
> > instruction is able to get to execution and (incorrectly) resuspend the
> > thread before the instruction can get squashed or the interrupt can be
> > handled.
> >   If it is intended that the suspend instruction be refetched after thread
> > reactivation like this, should there be a more rigorous way of identifying
> > it as a suspend instruction and ensuring that it gets squashed?
> >
> >   4) Any thoughts would be appreciated about the "intended" or "right way"
> > for this to function. I might be wrong, but it seems like thread suspension
> > triggered by the thread executing an instruction may be somewhat untested
> > in the O3 CPU, so any thoughts on how this should work would probably be
> > useful.
> >
> >
> >   Thanks!
> >   Joel
> >
> >
> > --
> >   Joel Hestness
> >   PhD Candidate, Computer Architecture
> >   Dept. of Computer Science, University of Wisconsin - Madison
> >   http://pages.cs.wisc.edu/~hestness/
> >
> > _______________________________________________
> > gem5-dev mailing list
> > gem5-dev@gem5.org
> > http://m5sim.org/mailman/listinfo/gem5-dev
> >
> >



-- 
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/