Let’s move this conversation to just the email thread.

I suspect we may be talking past each other, so let's talk about complete 
implementations, not just Ruby.  There are multiple ways one can implement the 
store portion of x86-TSO.  I'm not sure what the O3 model does, but here are a 
few possibilities:

- Do not issue any part of the store to the memory system when the 
instruction is executed.  Instead, simply buffer it in the LSQ until the 
instruction retires, then move it to the store buffer after retirement.  Only 
when the store reaches the head of the store buffer is it issued to Ruby.  The 
next store is not issued to Ruby until the previous head store completes, 
maintaining correct store ordering.

- Do not issue any part of the store to the memory system when the 
instruction is executed.  Instead, simply buffer it in the LSQ until the 
instruction retires.  Once it retires and enters the store buffer, issue the 
address request to Ruby (no L1 data update).  Ruby forwards 
probes/replacements to the store buffer, and if the store buffer sees a 
probe/replacement to an address whose address request has already completed, 
the store buffer reissues the request.  Once the store reaches the head of the 
store buffer, double check with Ruby that write permissions still exist in the 
L1.

- Issue the store address (no L1 data update) to Ruby when the 
instruction is executed.  When it retires, it enters the store buffer.  Ruby 
forwards probes/replacements to the LSQ+store buffer, and if either sees a 
probe/replacement to an address whose address request has already completed, 
the request reissues (several policies exist for when to reissue the request).  
Once the store reaches the head of the store buffer, double check with Ruby 
that write permissions still exist in the L1.
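To make the first policy concrete, here is a small behavioral sketch in plain 
Python (not gem5 code; the class and method names are invented for 
illustration).  It models a store buffer that drains strictly in order, with 
at most one store outstanding to the memory system at a time:

```python
from collections import deque

class InOrderStoreBuffer:
    """Toy model of the first policy: retired stores sit in the buffer,
    and only the head store is issued to the memory system.  The next
    store cannot issue until the previous one completes."""

    def __init__(self, issue_to_memory):
        self.issue_to_memory = issue_to_memory  # callback: (addr, data)
        self.buffer = deque()
        self.in_flight = None  # at most one outstanding store

    def retire(self, addr, data):
        """Called when a store instruction retires from the ROB."""
        self.buffer.append((addr, data))

    def tick(self):
        """Issue the head store only if nothing is outstanding."""
        if self.in_flight is None and self.buffer:
            self.in_flight = self.buffer.popleft()
            self.issue_to_memory(*self.in_flight)

    def store_complete(self):
        """Memory system acknowledged the in-flight store."""
        self.in_flight = None


# Stores become globally visible in exactly the order they retired.
order = []
sb = InOrderStoreBuffer(lambda addr, data: order.append(addr))
sb.retire(0x100, 1)
sb.retire(0x200, 2)
sb.tick()            # issues 0x100
sb.tick()            # 0x200 must wait: 0x100 is still outstanding
sb.store_complete()
sb.tick()            # now 0x200 issues
print(order)         # [256, 512] -- i.e. 0x100 strictly before 0x200
```

This is the simplest policy to get right, at the cost of serializing store 
bandwidth; the other two policies overlap the address requests to hide that 
latency.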

Do those scenarios make sense to you?  I believe we can implement any one of 
them without modifying Ruby’s core functionality.  If you are envisioning or if 
O3 implements something completely different, please let me know.

Brad



From: Brad Beckmann [mailto:[email protected]]
Sent: Friday, October 28, 2011 3:01 PM
To: Nilay Vaish; Beckmann, Brad; Default
Subject: Re: Review Request: Forward invalidations from Ruby to O3 CPU

This is an automatically generated e-mail. To reply, visit: 
http://reviews.m5sim.org/r/894/



On October 27th, 2011, 10:35 p.m., Brad Beckmann wrote:

Thanks for the heads up on this patch.  I'm glad you found the time to dive 
into it.







I'm confused that the comment mentions a "list of ports", but I don't see a 
list of ports in the code, and I'm not sure how it would even be used.



The two questions you pose are good ones.  Hopefully someone who understands 
the O3 LSQ can answer the first, and I would suggest creating a new directed 
test that can manipulate the enqueue latency on the mandatory queue to create 
the necessary test situations.



Also, I have a couple high-level comments right now:







- Ruby doesn't implement any particular memory model.  It just implements the 
cache coherence protocol, and more specifically invalidation based protocols.  
The protocol, in combination with the core model, results in the memory model.





- I don't think it is sufficient to forward only those probes that hit valid 
copies to the O3 model.  What about replacements of blocks that have serviced a 
speculative load?  Instead, my thought would be to forward all probes to the O3 
LSQ and devise cpu-controlled policies to filter out unnecessary probes.
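As a minimal sketch of what "forward all probes and squash" could mean (plain 
Python; the class and method names are invented, this is not the actual O3 LSQ 
interface), the LSQ tracks speculatively completed loads and squashes any 
whose cache block is probed or replaced:

```python
BLOCK_SIZE = 64  # assumed cache block size

class SpeculativeLoadTracker:
    """Toy model: the LSQ records loads that completed speculatively.
    Every probe/replacement from Ruby is forwarded here; speculative
    loads to the probed cache block must be squashed and replayed."""

    def __init__(self):
        self.spec_loads = []  # (seq_num, addr) pairs

    def record_load(self, seq_num, addr):
        self.spec_loads.append((seq_num, addr))

    def on_probe(self, probe_addr):
        """Return the sequence numbers of loads that must be squashed."""
        blk = probe_addr // BLOCK_SIZE
        squashed = [s for s, a in self.spec_loads
                    if a // BLOCK_SIZE == blk]
        self.spec_loads = [(s, a) for s, a in self.spec_loads
                           if a // BLOCK_SIZE != blk]
        return squashed

lsq = SpeculativeLoadTracker()
lsq.record_load(10, 0x1000)   # speculative load in one block
lsq.record_load(11, 0x2040)   # load in a different block
print(lsq.on_probe(0x1008))   # [10] -- same 64B block as 0x1000
```

A cpu-controlled filter, in this picture, would simply drop probes whose block 
matches no entry in `spec_loads` before they cause any squash work.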

On October 28th, 2011, 3:32 a.m., Nilay Vaish wrote:

Hi Brad, thanks for the response.



* A list of ports has been added to RubyPort.hh; ports are added to the
  list whenever a new M5Port is created.

* As long as the core waits for an ack from the memory system for every
  store before issuing the next one, I can understand that the memory
  model is independent of how the memory system is implemented.  But
  suppose the caches are multi-ported.  Will the core then use only one
  of the ports for stores and wait for an ack?  The current LSQ
  implementation uses as many ports as are available.  In that case,
  would the memory system not need to ensure the order in which the
  stores are performed?

* I think the current implementation handles blocks whose coherence
  permissions were speculatively fetched.  If the cache loses permissions
  on such a block, it forwards the probe to the CPU.  If the cache again
  receives a probe for this block, I don't think the CPU will have any
  instruction using the value from that block.

* For testing, Prof. Wood suggested having something similar to TSOtool.

On October 28th, 2011, 9:55 a.m., Brad Beckmann wrote:

Hmm...I'm now even more confused.  I have not looked at the O3 LSQ, but from 
your description it sounds like one particular instantiation of the LSQ will 
use N ports, not just a single port to the L1D.  So does N equal the number of 
simultaneous loads and stores that can be issued per cycle, or is N equal to 
the number of outstanding loads and stores supported by the LSQ?  Or does it 
equal something completely different?



Stores to different cache blocks can be issued to the memory system 
out-of-order and in parallel.  Ruby already supports such functionality.  The 
key is that the store buffer must be drained in order, and it is up to the 
store buffer's logic to get that right.  Ruby can assist by providing 
interfaces for checking permission state and by forwarding probes upstream, but 
it is up to the LSQ/store buffer to act appropriately and retry requests when 
necessary.  I don't believe Ruby needs any fundamental changes to support 
x86-TSO.  Instead, Ruby just needs to provide more information back to the LSQ.



Earlier I didn't notice that you also squash speculation on replacements, in 
addition to probes.  Yeah, I think those changes take care of correctly 
squashing speculative loads.  However, as I mentioned above, I still think we 
need to figure out how to provide the necessary information to allow stores to 
be issued in parallel, while still retiring in-order.



Implementing something similar to TSOtool would be great.  However, I think 
there is benefit in doing some quick tests using a DirectedTester before 
creating something like TSOtool.





On October 28th, 2011, 2:13 p.m., Nilay Vaish wrote:

Brad,



My understanding is that the LSQ can issue at most N loads and stores to
the memory system in each cycle.

For parallel stores, it seems that the core should hold permissions for
all of these cache blocks at the same time.  Even if Ruby fetches
coherence permissions out-of-order, it would still have to ensure, for
SC or TSO, that stores that happened logically later in time become
visible only after all the earlier ones are visible to the rest of the
system.  As of now, I disagree with the statement that --

          ''Stores to different cache blocks can be issued to the
             memory system out-of-order and in parallel''

Unless we have some kind of guarantee on the order in which these stores
become visible to the rest of the system, I don't see how we can
separate the memory system's behavior from the consistency model.

I was thinking of writing a tester that reads in a trace of memory
operations performed by a multi-processor system and the times at which
they are performed.  Then we can check the load values against the
expected load values.  I think the underlying assumption is that
everything behaves in a deterministic fashion.  What do you think?

Thanks for confirming the O3 LSQ requirement for N ports.  I've got no further 
questions on that.



Stores can certainly be issued out-of-order in modern x86 processors.  It is 
the store buffer's responsibility to ensure that stores become globally visible 
in program order.  Maybe what you're getting at is that Ruby needs to support a 
two-phase store scheme, so that the initial writeHitCallback supplies data to 
the CPU but does not update the L1 D cache block.  I would agree with that.  My 
point is that Ruby should only be responsible for providing the necessary 
information and interfaces to the LSQ logic.  There is no reason to change the 
logic of Ruby's invalidation-based coherence protocols.  It is the LSQ's 
(including the store buffer's) responsibility to ensure the correct order of 
store visibility.
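A two-phase store along those lines might be sketched as follows (a 
hypothetical Python model, not Ruby's actual interface; only the 
writeHitCallback name is borrowed from the discussion, everything else is 
invented for illustration):

```python
class TwoPhaseL1:
    """Toy model of a two-phase store: phase 1 obtains write permission
    and returns data to the CPU without modifying the L1 block; phase 2,
    at store-buffer head, commits the data only if permission survived."""

    def __init__(self):
        self.data = {}         # addr -> value
        self.writable = set()  # blocks this cache may currently write

    def phase1_write_hit(self, addr):
        """Like writeHitCallback: grant permission, don't update data."""
        self.writable.add(addr)
        return self.data.get(addr, 0)

    def probe(self, addr):
        """An invalidation from Ruby revokes write permission."""
        self.writable.discard(addr)

    def phase2_commit(self, addr, value):
        """Commit from the store-buffer head; False means reissue."""
        if addr not in self.writable:
            return False       # permission lost: LSQ must retry phase 1
        self.data[addr] = value
        return True

l1 = TwoPhaseL1()
l1.phase1_write_hit(0x40)
l1.probe(0x40)                        # another core stole the block
assert not l1.phase2_commit(0x40, 7)  # store must be reissued
l1.phase1_write_hit(0x40)
assert l1.phase2_commit(0x40, 7)      # now it becomes globally visible
```

The point of the split is exactly the separation argued for above: the cache 
only reports permission state, while the store buffer decides when a value may 
become visible.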



Yes, your tester idea is essentially what I had in mind.  The only thing I want 
to point out is that it may be beneficial to include both the time the request 
should issue and a delta for how long the request should be stalled in the 
mandatory queue.  That way you can instigate races where younger memory ops 
deterministically bypass older ops.
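Such a trace entry might look like the following (a hypothetical format 
sketched in Python; the field names are invented, not an existing gem5 trace 
format):

```python
from collections import namedtuple

# issue_tick: when the directed tester injects the request;
# stall_delta: extra ticks the request sits in the mandatory queue.
TraceOp = namedtuple("TraceOp", "issue_tick stall_delta op addr value")

def visibility_order(trace):
    """The tick at which each op reaches the cache controller decides
    the race; stalling an older op lets a younger one bypass it."""
    return sorted(trace, key=lambda t: t.issue_tick + t.stall_delta)

trace = [
    TraceOp(100, 50, "ST", 0x100, 1),  # older store, stalled 50 ticks
    TraceOp(110,  0, "LD", 0x100, 0),  # younger load, not stalled
]
# The younger load deterministically bypasses the older store:
print([t.op for t in visibility_order(trace)])  # ['LD', 'ST']
```

Because both fields are fixed in the trace, the race resolves the same way on 
every run, which is what makes the expected load values checkable.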


- Brad


On October 17th, 2011, 11:50 p.m., Nilay Vaish wrote:
Review request for Default.
By Nilay Vaish.

Updated 2011-10-17 23:50:47

Description

This patch implements the functionality for forwarding invalidations
and replacements from the L1 cache of the Ruby memory system to the O3
CPU.  The implementation adds a list of ports to RubyPort.  Whenever a
replacement or an invalidation is performed, the L1 cache forwards it to
all the ports, which I believe is the LSQ in the case of the O3 CPU.
Those who understand the O3 LSQ should take a close look at the
implementation and figure out (at least qualitatively) if something is
missing or erroneous.

This patch only modifies the MESI CMP directory protocol.  I will modify
the other protocols once we sort out the major issues surrounding this
patch.

My understanding is that this should ensure an SC execution, as long as
Ruby can support SC.  But I think Ruby does not support any memory model
currently.  A couple of issues need discussion --

* Can this get into a deadlock?  A CPU may not be able to proceed if
  a particular cache block is repeatedly invalidated before the CPU
  can retire the actual load/store instruction.  How do we ensure that
  at least one instruction retires before an invalidation/replacement
  is processed?

* How do we test this implementation?  Is it possible to implement some
  of the tests that we regularly come across in papers on consistency
  models, or those present in manuals from AMD and Intel?  I have tested
  that Ruby forwards the invalidations, but not the part where the LSQ
  needs to act on them.

Diffs

 *   build_opts/ALPHA_SE_MESI_CMP_directory (92ba80d63abc)
 *   configs/example/se.py (92ba80d63abc)
 *   configs/ruby/MESI_CMP_directory.py (92ba80d63abc)
 *   src/mem/protocol/MESI_CMP_directory-L1cache.sm (92ba80d63abc)
 *   src/mem/protocol/RubySlicc_Types.sm (92ba80d63abc)
 *   src/mem/ruby/system/RubyPort.hh (92ba80d63abc)
 *   src/mem/ruby/system/RubyPort.cc (92ba80d63abc)
 *   src/mem/ruby/system/Sequencer.hh (92ba80d63abc)
 *   src/mem/ruby/system/Sequencer.cc (92ba80d63abc)

View Diff<http://reviews.m5sim.org/r/894/diff/>


_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
