Thanks Joel!

This email is excellent!  I will make sure it is "required reading" for all gem5 
developers at AMD.  I have several thoughts below:

1) This is a great reminder that actions should be designed to be simple.  
Conditional checks should be handled by generating different events.  
Unfortunately, we have protocols today that have pushed far too much logic 
into actions.  That leads to conservative resource checking and protocols that 
are very difficult to debug.  I definitely want to take that lesson forward 
when reviewing future SLICC code.

2) You have a "ship it" from me.  Thank you for catching and fixing the bug!

3) Yes, this has been a source of confusion for a long time.  I remember 
looking at this a while back and noticing that different protocols were not 
consistent either.  I think the solution here is not to use an integer, but 
to introduce a priority data type that is used consistently in Ruby, 
SimpleNetwork, and Garnet.

4) I personally think the solution here is to get rid of the mandatory queue.  
I do not think it models a real resource, and stalling it can lead to *huge* 
performance problems.  Instead, I would like to see all RubyPorts include two 
maps: one for buffered, but not yet issued, requests, and another for issued 
requests.
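
To make the idea concrete, here is a toy Python sketch of the two-map scheme 
(this is illustrative only, not gem5 code; the class and method names are all 
made up, and "max_issued" just stands in for however much the controller can 
accept):

```python
# Toy sketch of a RubyPort holding two maps: requests wait in a 'buffered'
# map until they can be issued, then move to an 'issued' map. This replaces
# stalling a mandatory queue. All names here are hypothetical.

class TwoMapPort:
    def __init__(self, max_issued):
        self.max_issued = max_issued      # stand-in for controller acceptance capacity
        self.buffered = {}                # seq_num -> request, not yet issued
        self.issued = {}                  # seq_num -> request, outstanding

    def submit(self, seq_num, req):
        """Accept a request; issue immediately if the controller has room."""
        if len(self.issued) < self.max_issued:
            self.issued[seq_num] = req
        else:
            self.buffered[seq_num] = req  # hold here instead of stalling a queue

    def complete(self, seq_num):
        """Retire an issued request and promote one buffered request."""
        del self.issued[seq_num]
        if self.buffered:
            k, req = self.buffered.popitem()
            self.issued[k] = req
```

The point of the sketch is that back-pressure lives entirely in the port, so 
nothing downstream of the RubyPort ever has to model the stall.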

5) Have you posted a patch for this?  If it is a simple fix, then I think it 
is reasonable for the SimpleNetwork.  However, we need to make sure we keep the 
SimpleNetwork simple.  This absolutely sounds like an issue that Garnet should 
handle correctly.

6 & 7) So does this cover all usage of PacketQueues in Ruby (both 
QueuedMasterPorts and QueuedSlavePorts)?  I believe all RubyPorts currently use 
the packet queues.  I would like to see them all fixed, and we are willing to 
help with this.

8) I think we have different ideas of what hardware looks like.  Hardware does 
not access the cache tag and data arrays multiple times for coalesced requests; 
there is only one access.  I believe your #9 agrees with that and somewhat 
contradicts what you are saying here.  Currently, every request sent to the 
controller requires a separate access to the tags and data, as specified in the 
.sm file.  Also note that while the GPUCoalescer expects RfO protocol behavior, 
the VIPERCoalescer does not.

9) Coalescing in the caches adds a lot of unnecessary complexity to the 
protocol.  It is far easier to do that in the RubyPorts; that is the point of 
the Sequencer and Coalescers (i.e. RubyPorts).  Do the protocol-independent 
work in the RubyPorts, and then implement only the necessary protocol 
complexity in the cache controllers.  I believe we all agree the RubyPorts need 
to be fixed, but I think we fix them by adding a second map for stalled, 
not-yet-issued requests, and by removing the mandatory queue.

10) I'm a bit confused here.  Transitions that allocate TBE entries already 
check for TBE availability.  As long as protocols do not use z_stalls on the 
mandatory queue (BTW, we should make it a goal to remove all support for 
z_stalls), independent events that do not need a TBE entry should be triggered.  
Are you seeing different behavior?  If so, in what protocol?

Thanks again Joel.  This is a great reference and lists several items we should 
work on.

Brad


-----Original Message-----
From: gem5-dev [mailto:[email protected]] On Behalf Of Joel Hestness
Sent: Tuesday, February 02, 2016 12:50 PM
To: gem5 Developer List
Subject: [gem5-dev] Toward higher-fidelity Ruby memory: Using (actually) finite 
buffering

Hi guys,

  I've been working on some benchmarks that place unique/new stresses on 
heterogeneous CPU-GPU memory hierarchies. In trying to tighten up the hierarchy 
performance, I've run into a number of strange cache buffering/flow control 
issues in Ruby. We've talked about fixing these things, but I've found a need 
to inventory all the places where buffering/prioritization needs work. Below is 
my list, which can hopefully serve as a starting point and offer a broader 
picture to anyone who wishes to use Ruby with more realistic memory hierarchy 
buffering. I've included my current status addressing each.

  Please let me know if you have any input or would like to help address the 
issues. Any help would be appreciated.

  Thank you!
  Joel


1) [status: not started] SLICC parses actions to accumulate the total buffer 
capacity as required by all enqueue operations within the action.
Unfortunately, the resource checking is usually overly conservative resulting 
in two problems:
   A) Many actions contain multiple code paths, and not all paths that push 
requests into buffers get executed. The actual resources required are 
frequently less than SLICC's parsed value. For example, an action with an 
if-else block containing an enqueue() on both paths will parse as requiring two 
buffer slots even though only one of the enqueue() calls will execute.
   B) The resource checking can result in poorly prioritized transitions if 
they require allocating more resources than other transitions. For instance, a 
high-priority transition (e.g. responses) may require a slot in two separate 
buffers, while a lower-priority transition (e.g. requests) may require only one 
of those slots. If the higher-priority transition gets blocked, the 
lower-priority transition can be allowed to proceed, resulting in priority 
inversion and possibly even starvation.
  Performance debugging these issues can be exceptionally difficult. As an 
example of performance issues, in MOESI_hammer, a directory transition that 
would involve activity only when using a full-bit directory may register 
excessive buffer requirements and block waiting for the unnecessary buffers 
(even though the directory is not configured to use the full-bit data!).
  By manually hacking generated files to avoid these incorrect buffer 
requirements, I've already witnessed performance improvements of greater than 
3%, and I haven't even stressed the memory hierarchy.
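
To illustrate the gap between the parsed and actual requirements, here is a toy 
Python sketch (not real SLICC; SLICC's parser is far more involved, and the 
'action' encoding here is invented for illustration). A conservative count sums 
every enqueue() it sees; a path-aware count would take the max over mutually 
exclusive branches:

```python
# Toy model of an action body: a list whose items are either the string
# 'enqueue' (one buffer slot) or a tuple ('if', then_branch, else_branch)
# marking two mutually exclusive paths. Purely illustrative.

def conservative_count(action):
    """Sum every enqueue() on every path (what the parsed value behaves like)."""
    total = 0
    for node in action:
        if node == 'enqueue':
            total += 1
        elif isinstance(node, tuple) and node[0] == 'if':
            total += conservative_count(node[1]) + conservative_count(node[2])
    return total

def path_aware_count(action):
    """Take the max over mutually exclusive branches: the worst real path."""
    total = 0
    for node in action:
        if node == 'enqueue':
            total += 1
        elif isinstance(node, tuple) and node[0] == 'if':
            total += max(path_aware_count(node[1]), path_aware_count(node[2]))
    return total
```

For the if-else example above, the conservative count demands two slots while 
only one is ever used on any real execution path.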

2) [status: complete, posted] Finite-sized buffers are not actually 
finite-sized: When a Ruby controller stalls a message in a MessageBuffer, that 
message is removed from the MessageBuffer's m_prio_heap and placed in 
m_stall_msg_map queues until the queue is reanalyzed by a wake-up activity in 
the controller. Unfortunately, when checking whether there is enough space in 
the buffer to add more requests, the measured size only considers the size of 
m_prio_heap, not messages that might be in m_stall_msg_map. In extreme cases, 
I've seen the m_stall_msg_map hold >500 messages in a MessageBuffer with size = 
10. Here's a patch that fixes this:
http://reviews.gem5.org/r/3283/
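
A minimal Python sketch of the occupancy bug (illustrative only; the real 
MessageBuffer is C++, and this toy class only borrows the two field names 
loosely):

```python
# A buffer whose size check ignores stalled messages can grow far past its
# nominal capacity. Toy illustration, not the Ruby MessageBuffer.

class ToyMessageBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.m_prio_heap = []        # messages awaiting analysis
        self.m_stall_msg_map = {}    # addr -> list of stalled messages

    def total_occupancy(self):
        stalled = sum(len(q) for q in self.m_stall_msg_map.values())
        return len(self.m_prio_heap) + stalled

    def are_n_slots_available(self, n, count_stalled=True):
        # count_stalled=False reproduces the buggy check; True is the fix
        used = self.total_occupancy() if count_stalled else len(self.m_prio_heap)
        return used + n <= self.max_size
```

With ten messages parked in the stall map, the buggy check still reports free 
slots in a size-10 buffer, while the fixed check correctly refuses.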

3) [status: not started] Virtual channel specification and prioritization is 
inconsistent: Currently, in each cycle, the PerfectSwitch in the Ruby simple 
network iterates through virtual channels from highest ID to lowest ID, 
indicating that higher IDs have higher priority. By contrast, Garnet cycles 
through virtual channel IDs from lowest to highest, indicating that lower IDs 
have higher priority. Since SLICC controller files specify virtual channels 
independent of the interconnect that is used with Ruby, the virtual channel 
prioritization may be inverted depending on the network that is used. The 
different Ruby network models need to agree on the prioritization in order to 
avoid potential priority inversion.
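
The inversion is easy to see in a tiny sketch (illustrative Python, not the 
PerfectSwitch or Garnet code): the same set of ready virtual channels yields a 
different winner depending purely on scan direction.

```python
# Pick the first ready virtual channel, scanning either from the highest ID
# down (SimpleNetwork-style) or the lowest ID up (Garnet-style).

def pick_vc(ready_vcs, num_vcs, high_id_first):
    order = range(num_vcs - 1, -1, -1) if high_id_first else range(num_vcs)
    for vc in order:
        if vc in ready_vcs:
            return vc
    return None
```

A protocol that assigns its response class to VC 3 and its request class to 
VC 1 gets responses first under one network and requests first under the other.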

4) [status: not started] Sequencers push requests into mandatory queues 
regardless of whether the mandatory queue is finite-sized and possibly full. 
With poorly configured sequencer and L1 mandatory queue, it is possible to fill 
the L1 mandatory queue, but still have space in the sequencer's requestTable. 
Since the Sequencer doesn't check whether the mandatory queue has slots 
available, it cannot honor the mandatory queue's capacity correctly. This 
should be fixed and/or a warn/fatal should be raised to let the user know about 
poor configuration.

5) [status: complete, revising] SimpleNetwork access prioritization is not 
suited for finite buffering + near-peak bandwidth: The PerfectSwitch uses a 
round-robin prioritization scheme to select the input ports that have priority 
to issue to an output port, and it steps through input ports in ascending 
order to find one with ready messages. When some input port buffers are full 
and others are empty, the lower-ID input ports effectively get priority 
when the round-robin ID is greater than the highest input port ID that has 
messages. For fair prioritization, input ports with ready messages should not 
be allowed to issue twice before other input ports with ready messages are 
allowed to issue once. My cursory inspection of Garnet routers suggests that 
they probably suffer from the same arbitration issue.
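
For reference, here is a sketch of the fairness property I'm after, in toy 
Python (not the PerfectSwitch implementation): after a grant, priority moves to 
the port just past the winner, so no ready port can issue twice while another 
ready port waits.

```python
# Fair round-robin arbiter sketch: the rotating priority pointer advances
# past each winner, guaranteeing every ready port is served within one
# full rotation. Illustrative only.

class RoundRobinArbiter:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_priority = 0   # port checked first next grant

    def grant(self, ready):
        """ready: set of port IDs with messages; returns the winner or None."""
        for i in range(self.num_ports):
            port = (self.next_priority + i) % self.num_ports
            if port in ready:
                self.next_priority = (port + 1) % self.num_ports
                return port
        return None
```

With ports 0 and 2 continuously ready, grants alternate 0, 2, 0, 2, ... rather 
than letting the lower-ID port issue back-to-back.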

6) [status: complete, revising] QueuedMasterPort used for requests from Ruby 
directories to memory controllers: This fills up very quickly with a GPU 
requester, and results in the PacketQueue triggering the panic that the queue 
is too large (>100 packets). The RubyMemoryControl has infinite input port 
queuing, so it can be used, but other memory controllers cannot.
Further, I have measured that even with roughly reasonable buffering throughout 
the memory hierarchy, average memory stall cycles in the Ruby memory controller 
in-queue can be upwards of 5,000 cycles (which is nonsensical). To fix this, we 
need to pull the queuing out of the Directory_Controller memory port and into a 
finite queue managed by the *-dir.sm files, and handle flow control in the 
port to memory. I have mostly implemented this and will post a patch for 
review soon.

7) [status: partially implemented] QueuedSlavePort from memory controllers back 
to Ruby directories: After fixing the memory controller input queues, bloated 
buffering immediately jumps to the memory controller response queues, which are 
implemented as QueuedSlavePorts. I've started trying to fix this up in the 
DRAMCtrl, but given the complexity, have yet to finish it. The 
RubyMemoryControl has the same issue, but as we have deprecated it, I don't 
feel it would be a good idea to invest effort to fix it.

8) [status: partially implemented] Allowing multiple requests to a single cache 
line into Ruby cache controllers: Currently, Ruby Sequencers block multiple 
outstanding requests to a single cache line, while the new GPUCoalescer will 
buffer the requests before they can enter the cache controllers. Both of these 
schemes introduce significant inaccuracy compared to hardware, which can accept 
multiple accesses per line and queue them as appropriate (e.g. using MSHRs if 
the line is in an intermediate state, waiting on a request outstanding to a 
lower level of the hierarchy, etc.). In order to get reasonable modeling, 
Sequencers will need to pass memory requests to the cache controllers 
regardless of whether they access a line in an intermediate state. I have 
implemented this for stores in no-RFO GPU caches, and the performance 
difference can be massive (e.g.
1.5-3x). The GPUCoalescer will not suffice for this use case, because it 
requires RFO access to the line.

9) [status: not started] Coalescing within the caches: With the addition of 
per-byte dirty bits and AMD's GPU cache controllers, there appear to be places 
where request coalescing can/should be implemented in caches. For example, most 
L1 cache controllers block stores in mandatory queues while the line is in an 
intermediate state, but often these stores can be accumulated into a single 
MSHR and written to the cache block as the cache array is filled with the line. 
This can have a substantial effect on performance by cutting L1->L2 and L2->MC 
accesses by factors up to 32+.

10) [status: not started] TBE (MSHR) allocation: Currently, TBETables are 
finite-sized and disallow over-allocation. If the TBETable size is not set 
reasonably large, over-allocation results in assertion failures. Often the 
required sizing to avoid assertion failures is unrealistic (e.g. an RFO GPU
L1 cache with 16kB capacity might need as many TBEs as there are entries in the 
cache itself). This limits the ability to test more reasonable TBE 
restrictions. It should be straightforward to assess which transitions need to 
allocate TBEs, so we can test for TBE availability in the controller wakeup 
functions. This would allow wakeup to skip over transitions that need TBEs when 
there are none available.
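
A toy Python sketch of the proposed wakeup-side check (illustrative only; the 
per-transition "needs_tbe" tag is exactly the assessment proposed above, and 
none of these names come from gem5):

```python
# Transitions known to allocate a TBE are skipped while the table is full,
# instead of over-allocating and tripping an assertion later.

class ToyTBETable:
    def __init__(self, size):
        self.size = size
        self.entries = set()   # addresses with live TBEs

    def is_available(self):
        return len(self.entries) < self.size

def wakeup(pending_transitions, tbe_table):
    """Fire the first transition whose TBE requirement can be satisfied.

    pending_transitions: list of (name, needs_tbe) pairs, in priority order.
    """
    for name, needs_tbe in pending_transitions:
        if needs_tbe and not tbe_table.is_available():
            continue  # skip: would over-allocate; retry after a TBE frees up
        return name
    return None
```

This lets a TBE-free transition (e.g. a local hit) proceed past a blocked 
TBE-allocating one, rather than asserting or stalling the whole queue.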

--
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev