Hi Brad,

Thanks for the quick feedback. Much appreciated! I've inlined some more notes below, though I'm still marinating on your other comments:
>> 1) [status: not started] SLICC parses actions to accumulate the total buffer capacity required by all enqueue operations within the action. Unfortunately, the resource checking is usually overly conservative, resulting in two problems:
>> A) Many actions contain multiple code paths, and not all paths get executed to push requests into buffers. The actual resources required are frequently less than SLICC's parsed value. As an example, an action with an if-else block containing an enqueue() on both paths will parse to require two buffers, even though only one or the other enqueue() will be called, but not both.
>> B) The resource checking can result in poorly prioritized transitions if they require allocating more resources than other transitions. For instance, a high-priority transition (e.g. responses) may require a slot in 2 separate buffers, while a lower-priority transition (e.g. requests) may require one of those slots. If the higher-priority transition gets blocked, the lower-priority transition can be allowed to proceed, resulting in priority inversion and possibly even starvation.
>> Performance debugging these issues can be exceptionally difficult. As an example of performance issues, in MOESI_hammer, a directory transition that would involve activity iff using a full-bit directory may register excessive buffer requirements and block waiting for the unnecessary buffers (even though the directory is not configured to use the full-bit data!). By manually hacking generated files to avoid these incorrect buffer requirements, I've already witnessed performance improvements of greater than 3%, and I haven't even stressed the memory hierarchy.

> 1) This is a great reminder that actions should be designed to be simple. Conditional checks should be handled by generating different events. Unfortunately we have protocols today that have pushed far too much logic within actions.
> That leads to the conservative resource checking and very difficult protocols to debug. I definitely want to take that lesson forward when reviewing future SLICC code.

Yes. Another way to address the issue in (1A) without changing SLICC is to change the actions that have separate enqueue code paths so that they instead set up the enqueued message along a single code path and use a single enqueue() call. I consider this a bandaid, since it would be nice if SLICC either required the programmer to write this way, or recognized that only one of the enqueues is called.

>> 5) [status: complete, revising] SimpleNetwork access prioritization is not suited for finite buffering + near-peak bandwidth: The PerfectSwitch uses a round-robin prioritization scheme to select the input ports that have priority to issue to an output port, and it steps through input ports in ascending order to find one with ready messages. When some input port buffers are full and others are empty, the lower-ID input ports effectively get priority when the round-robin ID is greater than the highest-ID input port that has messages. For fair prioritization, input ports with ready messages should not be allowed to issue twice before other input ports with ready messages are allowed to issue once. My cursory inspection of Garnet routers suggests that they probably suffer from the same arbitration issue.

> 5) Have you posted a patch on this? If it is a simple fix, then I think it reasonable for the SimpleNetwork. However, we need to make sure we keep the SimpleNetwork simple. This absolutely sounds like an issue that Garnet should handle correctly.

I haven't posted a patch yet, because my solution is still hacky. Let me know if you'd like to see my solution, and I can send it to you. I also have not tried to fix this in Garnet.
>> 6) [status: complete, revising] QueuedMasterPort used for requests from Ruby directories to memory controllers: This fills up very quickly with a GPU requester, and results in the PacketQueue triggering the panic that the queue is too large (>100 packets). The RubyMemoryControl has infinite input port queuing, so it can be used, but other memory controllers cannot. Further, I have measured that even with roughly reasonable buffering throughout the memory hierarchy, average memory stall cycles in the Ruby memory controller in-queue can be upwards of 5,000 cycles (which is nonsensical). To fix this, we need to pull the queuing out of the Directory_Controller memory port and into a finite queue managed by the *-dir.sm files, and handle flow control in the port to memory. I have mostly implemented this, and will post a patch for review soon.

>> 7) [status: partially implemented] QueuedSlavePort from memory controllers back to Ruby directories: After fixing the memory controller input queues, bloated buffering immediately jumps to the memory controller response queues, which are implemented as QueuedSlavePorts. I've started trying to fix this up in the DRAMCtrl, but given the complexity, have yet to finish it. The RubyMemoryControl has the same issue, but as we have deprecated it, I don't feel it would be a good idea to invest effort to fix it.

> 6 & 7) So does this cover all usage of PacketQueues in Ruby (both QueuedMasterPorts and QueuedSlavePorts)? I believe all RubyPorts currently use the packet queues. I would like to see them all fixed, and we are willing to help on this.

7.1) Ah, good point. Yes, the RubyPort uses a QueuedSlavePort to respond to the core when complete. In my testing so far, I have never tripped the packet buffering limit on that PacketQueue, because I have gem5-gpu configured to ensure that the GPU cores are always able to pull responses faster than the L1 can send them.
I'll add this item to my list as (7.1), and I might be able to help fix it also.

>> 8) [status: partially implemented] Allowing multiple requests to a single cache line into Ruby cache controllers: Currently, Ruby Sequencers block multiple outstanding requests to a single cache line, while the new GPUCoalescer will buffer the requests before they can enter the cache controllers. Both of these schemes introduce significant inaccuracy compared to hardware, which can accept multiple accesses per line and queue them as appropriate (e.g. using MSHRs if the line is in an intermediate state, waiting on a request outstanding to a lower level of the hierarchy, etc.). In order to get reasonable modeling, Sequencers will need to pass memory requests to the cache controllers regardless of whether they access a line in an intermediate state. I have implemented this for stores in no-RFO GPU caches, and the performance difference can be massive (e.g. 1.5-3x). The GPUCoalescer will not suffice for this use case, because it requires RFO access to the line.

>> 9) [status: not started] Coalescing within the caches: With the addition of per-byte dirty bits and AMD's GPU cache controllers, there appear to be places where request coalescing can/should be implemented in caches. For example, most L1 cache controllers block stores in mandatory queues while the line is in an intermediate state, but often these stores can be accumulated into a single MSHR and written to the cache block as the cache array is filled with the line. This can have a substantial effect on performance by cutting L1->L2 and L2->MC accesses by factors up to 32+.

> 8) I think we have a different idea of what hardware looks like. Hardware does not access the cache tag and data arrays multiple times for coalesced requests. There is only one access. I believe your #9 agrees with that and somewhat contradicts what you are saying here.
> Currently all requests sent to the controller require a separate access to the tags and data as specified in the .sm file. Also note that while the GPUCoalescer expects RfO protocol behavior, the VIPERCoalescer does not.

> 9) Coalescing in the caches adds a lot of unnecessary complexity to the protocol. It is far easier to do that in the RubyPorts. That is the point of the Sequencer and Coalescers (i.e. RubyPorts). Do the protocol-independent work in the RubyPorts and then only implement the necessary protocol complexity in the cache controllers. I believe we all agree the RubyPorts need to be fixed, but I think we fix them by adding a second map for stalled, not-yet-issued requests, and removing the mandatory queue.

I think we probably agree on what hardware looks like, but maybe disagree on how we should be modeling it. First, I'm sure we agree that there are 3-4 levels of coalescing that occur in state-of-the-art GPU memory hierarchies:
(1) GPU core: subgroups (warps) coalesce across their workitems (threads) if those workitems access common cache lines, so only one access per cache line goes to the cache from each subgroup instruction.
(2) L1 cache: if a single subgroup accesses the same line in successive instructions, or separate subgroups access the same cache line, these can be queued/coalesced in MSHRs of the L1 cache to limit blocking or outstanding requests to the L2.
(3) L2 cache/memory controllers: if multiple GPU cores access the same cache lines in temporal proximity, the L2 can queue/coalesce these in MSHRs (to unblock input queues), and/or, depending on L2 policies, the memory controller might do bundling and access other buffers to service requests.
Currently, if my read of the code is correct, the AMD GPU uses the GPUCoalescer to partially handle both (1) and (2).
I say "partially" because it seems like, if the GPUCoalescer is configured with finite buffers, it could truncate sets of accesses that would otherwise be coalesced by (1) or (2). If this read is correct, I feel strongly that this is an unreasonable abstraction, because it leaves out important timing, contention, and backpressure modeling. I've also inspected the AMD GPU cache controllers, and I don't think (3) is being modeled in L2s, though it's possible I have overlooked something.

gem5-gpu models (1) in the GPU cores before accesses are sent to Ruby Sequencers, and we've validated that this coalescing is quite accurate for NVIDIA hardware. In this setting, the GPUCoalescer might functionally work to model (2), but it has problems that will cause inaccuracy: it cannot be used to enforce accurate MSHR allocation and blocking, because its buffering is separate from the actual MSHR structures, and it does not model valid tag/data/MSHR access counts, etc., in the L1 cache. gem5-gpu currently sort of models (2) with per-line load buffering in the GPU LSQs, but does not model (3). These coalescing spots have not manifested as major performance impacts until I've started running particular optimized workloads and tightening up finite buffers through the memory hierarchy.

Overall, my opinion is that modeling of coalescing should occur where coalescing actually happens in hardware. While you're right that it's not strictly necessary to model coalescing in cache controllers, I disagree that it is complicated to do so: in fact, it's as simple as adding 1-2 actions and changing the behavior of a few transitions that would otherwise stall. I'll admit I haven't deeply investigated the VIPERCoalescer, which might be better than the GPUCoalescer, but I have a strong distaste for the GPUCoalescer, which appears as though its coalescing will be hard to make accurate and extensible.
>> 10) [status: not started] TBE (MSHR) allocation: Currently, TBETables are finite-sized and disallow over-allocation. If the TBETable size is not set reasonably large, over-allocation results in assertion failures. Often the required sizing to avoid assertion failures is unrealistic (e.g. an RFO GPU L1 cache with 16kB capacity might need as many TBEs as there are entries in the cache itself). This limits the ability to test more reasonable TBE restrictions. It should be straightforward to assess which transitions need to allocate TBEs, so we can test for TBE availability in the controller wakeup functions. This would allow wakeup to skip over transitions that need TBEs when there are none available.

> 10) I'm a bit confused here. Transitions that allocate TBE entries already check for TBE availability. As long as protocols do not use z_stalls on the mandatory queue (BTW, we should make a goal of removing all support for z_stalls), independent events that do not need a TBE entry should be triggered. Are you seeing different behavior? If so, what protocol?

Shoot, I sent the wrong issue here: I had some confusion about TBE blocking when I started this list, and forgot to copy the changed description to my email. To clarify: yes, transitions already check TBE availability. Here's the issue I meant to send:

10) [status: not started] TBE (MSHR) tracking: Currently, only one TBE can be allocated per cache line address in TBETables, and there isn't really a good way to queue accesses in TBEs. A cache design could avoid blocking the mandatory/request queue of a controller by allocating TBEs or TBE queue slots for accesses as they become ready on the request queue, regardless of whether there is an outstanding memory access.
For instance, when buffering stores, the store data could be sent to the next level of the memory hierarchy, but it may be desirable to keep information (a TBE) about the store to await an ack of receipt by the downstream controller (e.g. for enforcing fences). If only one TBE is allowed per cache line address, then only a single outstanding store to the line would be allowed to the lower level of the hierarchy. We might also desire to allocate multiple GET TBEs per line address and to forward the returned data among those TBEs as a queue.

On review, I realize that this one might be more a matter of one's willingness to hack on Ruby/SLICC to get the desired functionality. I feel it would be nice to have this functionality available in mainline, though others may disagree.

Thanks!
Joel

-----Original Message-----
> From: gem5-dev [mailto:[email protected]] On Behalf Of Joel Hestness
> Sent: Tuesday, February 02, 2016 12:50 PM
> To: gem5 Developer List
> Subject: [gem5-dev] Toward higher-fidelity Ruby memory: Using (actually) finite buffering
>
> Hi guys,
>
> I've been working on some benchmarks that place unique/new stresses on heterogeneous CPU-GPU memory hierarchies. In trying to tighten up the hierarchy performance, I've run into a number of strange cache buffering/flow control issues in Ruby. We've talked about fixing these things, but I've found a need to inventory all the places where buffering/prioritization needs work. Below is my list, which can hopefully serve as a starting point and offer a broader picture to anyone who wishes to use Ruby with more realistic memory hierarchy buffering. I've included my current status addressing each.
>
> Please let me know if you have any input or would like to help address the issues. Any help would be appreciated.
>
> Thank you!
> Joel
>
> 2) [status: complete, posted] Finite-sized buffers are not actually finite-sized: When a Ruby controller stalls a message in a MessageBuffer, that message is removed from the MessageBuffer m_prio_heap and placed in m_stall_msg_map queues until the queue is reanalyzed by a wake-up activity in the controller. Unfortunately, when checking whether there is enough space in the buffer to add more requests, the measured size only considers the size of m_prio_heap, but not messages that might be in the m_stall_msg_map. In extreme cases, I've seen the m_stall_msg_map hold >500 messages in a MessageBuffer with size = 10. Here's a patch that fixes this: http://reviews.gem5.org/r/3283/
>
> 3) [status: not started] Virtual channel specification and prioritization is inconsistent: Currently, in each cycle, the PerfectSwitch in the Ruby simple network iterates through virtual channels from highest ID to lowest ID, indicating that higher IDs have higher priority. By contrast, Garnet cycles through virtual channel IDs from lowest to highest, indicating that lower IDs have higher priority. Since SLICC controller files specify virtual channels independent of the interconnect that is used with Ruby, the virtual channel prioritization may be inverted depending on the network that is used. The different Ruby network models need to agree on the prioritization in order to avoid potential priority inversion.
>
> 4) [status: not started] Sequencers push requests into mandatory queues regardless of whether the mandatory queue is finite-sized and possibly full. With a poorly configured sequencer and L1 mandatory queue, it is possible to fill the L1 mandatory queue but still have space in the sequencer's requestTable. Since the Sequencer doesn't check whether the mandatory queue has slots available, it cannot honor the mandatory queue's capacity correctly.
> This should be fixed, and/or a warn/fatal should be raised to let the user know about the poor configuration.
>
> --
> Joel Hestness
> PhD Candidate, Computer Architecture
> Dept. of Computer Science, University of Wisconsin - Madison
> http://pages.cs.wisc.edu/~hestness/
> _______________________________________________
> gem5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/gem5-dev

--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
