Hi guys, I've been working on some benchmarks that place unique/new stresses on heterogeneous CPU-GPU memory hierarchies. In trying to tighten up the hierarchy performance, I've run into a number of strange cache buffering/flow control issues in Ruby. We've talked about fixing these things, but I've found a need to inventory all the places where buffering/prioritization needs work. Below is my list, which can hopefully serve as a starting point and offer a broader picture to anyone who wishes to use Ruby with more realistic memory hierarchy buffering. I've included my current status addressing each.
Please let me know if you have any input or would like to help address these issues. Any help would be appreciated. Thank you!

Joel

1) [status: not started] SLICC parses actions to accumulate the total buffer capacity required by all enqueue operations within the action. Unfortunately, this resource checking is usually overly conservative, resulting in two problems:

A) Many actions contain multiple code paths, and not all paths push requests into buffers when executed. The actual resources required are frequently less than SLICC's parsed value. For example, an action with an if-else block containing an enqueue() on both paths will parse as requiring two buffer slots, even though one or the other enqueue() will be called, but never both.

B) The resource checking can result in poorly prioritized transitions if they require allocating more resources than other transitions. For instance, a high-priority transition (e.g. responses) may require a slot in 2 separate buffers, while a lower-priority transition (e.g. requests) may require only one of those slots. If the higher-priority transition gets blocked, the lower-priority transition can be allowed to proceed, resulting in priority inversion and possibly even starvation. Performance debugging these issues could be exceptionally difficult. As an example of the performance impact, in MOESI_hammer, a directory transition that would involve activity iff using a full-bit directory may register excessive buffer requirements and block waiting for the unnecessary buffers (even though the directory is not configured to use the full-bit data!). By manually hacking generated files to avoid these incorrect buffer requirements, I've already witnessed performance improvements of greater than 3%, and I haven't even stressed the memory hierarchy yet.
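To make problem A concrete, here is a toy Python sketch (not actual SLICC parser code; the tuple encoding of action bodies and both function names are invented for illustration) contrasting a conservative count that sums every enqueue() in an action with a path-sensitive count that takes the maximum over mutually exclusive if/else arms:

```python
# Toy model of SLICC-style buffer-resource accounting.
# Action bodies are encoded as nested tuples:
#   ("enqueue",)                 - one enqueue() call
#   ("if", then_body, else_body) - an if/else; only one arm executes
#   ("seq", [stmt, ...])         - sequential statements

def conservative_count(node):
    """Sum enqueues on every path, as a conservative parser would."""
    if node[0] == "enqueue":
        return 1
    if node[0] == "if":
        return conservative_count(node[1]) + conservative_count(node[2])
    if node[0] == "seq":
        return sum(conservative_count(s) for s in node[1])
    return 0

def path_sensitive_count(node):
    """Take the max over mutually exclusive branches instead of the sum."""
    if node[0] == "enqueue":
        return 1
    if node[0] == "if":
        return max(path_sensitive_count(node[1]),
                   path_sensitive_count(node[2]))
    if node[0] == "seq":
        return sum(path_sensitive_count(s) for s in node[1])
    return 0

# An action with an enqueue() on each arm of an if/else:
action = ("seq", [("if", ("enqueue",), ("enqueue",))])
print(conservative_count(action))    # 2 slots demanded
print(path_sensitive_count(action))  # 1 slot actually needed
```

The path-sensitive count is still static (a worst case over any single path), so it would remain safe for resource checks while avoiding the double-counting described above.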
2) [status: complete, posted] Finite-sized buffers are not actually finite-sized: When a Ruby controller stalls a message in a MessageBuffer, that message is removed from the MessageBuffer's m_prio_heap and placed in the m_stall_msg_map queues until the queue is reanalyzed by a wake-up activity in the controller. Unfortunately, when checking whether there is enough space in the buffer to add more requests, the measured size only considers the size of m_prio_heap, not messages that might be in m_stall_msg_map. In extreme cases, I've seen m_stall_msg_map hold >500 messages in a MessageBuffer with size = 10. Here's a patch that fixes this: http://reviews.gem5.org/r/3283/

3) [status: not started] Virtual channel specification and prioritization are inconsistent: Currently, in each cycle, the PerfectSwitch in the Ruby simple network iterates through virtual channels from highest ID to lowest, so higher IDs have higher priority. By contrast, Garnet cycles through virtual channel IDs from lowest to highest, so lower IDs have higher priority. Since SLICC controller files specify virtual channels independently of the interconnect used with Ruby, the virtual channel prioritization may be inverted depending on the network that is used. The different Ruby network models need to agree on the prioritization in order to avoid potential priority inversion.

4) [status: not started] Sequencers push requests into mandatory queues regardless of whether the mandatory queue is finite-sized and possibly full. With a poorly configured sequencer and L1 mandatory queue, it is possible to fill the L1 mandatory queue but still have space in the Sequencer's requestTable. Since the Sequencer doesn't check whether the mandatory queue has slots available, it cannot honor the mandatory queue's capacity correctly. This should be fixed, and/or a warn/fatal should be raised to let the user know about the poor configuration.
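The size-accounting bug in item 2 can be illustrated with a minimal Python model (hypothetical class and method names chosen to mirror the Ruby members; this is not the actual gem5 C++ code or the posted patch): the occupancy check must count both m_prio_heap and m_stall_msg_map, otherwise stalled messages let a "finite" buffer grow without bound:

```python
# Toy MessageBuffer: stalled messages still occupy buffer space,
# so the occupancy check counts them alongside the priority heap.
import heapq
from collections import defaultdict, deque

class MessageBuffer:
    def __init__(self, max_size):
        self.max_size = max_size
        self.m_prio_heap = []                       # (arrival_time, msg)
        self.m_stall_msg_map = defaultdict(deque)   # addr -> stalled msgs

    def total_size(self):
        # The fix: occupancy = ready messages + stalled messages.
        # The buggy behavior counted only len(self.m_prio_heap).
        stalled = sum(len(q) for q in self.m_stall_msg_map.values())
        return len(self.m_prio_heap) + stalled

    def are_n_slots_available(self, n):
        return self.total_size() + n <= self.max_size

    def enqueue(self, time, msg):
        assert self.are_n_slots_available(1)
        heapq.heappush(self.m_prio_heap, (time, msg))

    def stall(self, addr):
        # Move the head message aside; it still holds a buffer slot.
        _, msg = heapq.heappop(self.m_prio_heap)
        self.m_stall_msg_map[addr].append(msg)

buf = MessageBuffer(max_size=2)
buf.enqueue(1, "GETS A")
buf.stall("A")
buf.enqueue(2, "GETS B")
print(buf.are_n_slots_available(1))  # False: the stalled message is counted
```

With the buggy heap-only check, the final call would return True and the stall map could keep absorbing messages far past the configured size, which is exactly the >500-messages-in-a-size-10-buffer behavior described above.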
5) [status: complete, revising] SimpleNetwork access prioritization is not suited for finite buffering + near-peak bandwidth: The PerfectSwitch uses a round-robin scheme to select the input port that has priority to issue to an output port, and it steps through input ports in ascending order to find one with ready messages. When some input port buffers are full and others are empty, the lower-ID input ports effectively get priority whenever the round-robin ID is greater than the highest-ID input port that has messages. For fair arbitration, input ports with ready messages should not be allowed to issue twice before other input ports with ready messages have issued once. My cursory inspection of Garnet routers suggests that they probably suffer from the same arbitration issue.

6) [status: complete, revising] QueuedMasterPort used for requests from Ruby directories to memory controllers: This fills up very quickly with a GPU requester, and results in the PacketQueue triggering the panic that the queue is too large (>100 packets). The RubyMemoryControl has infinite input port queuing, so it can be used, but other memory controllers cannot. Further, I have measured that even with roughly reasonable buffering throughout the memory hierarchy, average memory stall cycles in the Ruby memory controller input queue can be upwards of 5,000 cycles (which is nonsensical). To fix this, we need to pull the queuing out of the Directory_Controller memory port and into a finite queue managed by the *-dir.sm files, and handle flow control in the port to memory. I have mostly implemented this and will post a patch for review soon.

7) [status: partially implemented] QueuedSlavePort from memory controllers back to Ruby directories: After fixing the memory controller input queues, the bloated buffering immediately jumps to the memory controller response queues, which are implemented as QueuedSlavePorts.
I've started trying to fix this up in DRAMCtrl, but given the complexity, I have yet to finish it. The RubyMemoryControl has the same issue, but since we have deprecated it, I don't think it would be a good use of effort to fix it.

8) [status: partially implemented] Allowing multiple requests to a single cache line into Ruby cache controllers: Currently, Ruby Sequencers block multiple outstanding requests to a single cache line, while the new GPUCoalescer buffers such requests before they can enter the cache controllers. Both of these schemes introduce significant inaccuracy compared to hardware, which can accept multiple accesses per line and queue them as appropriate (e.g. using MSHRs if the line is in an intermediate state, waiting on a request outstanding to a lower level of the hierarchy, etc.). To get reasonable modeling, Sequencers will need to pass memory requests to the cache controllers regardless of whether they access a line in an intermediate state. I have implemented this for stores in no-RFO GPU caches, and the performance difference can be massive (e.g. 1.5-3x). The GPUCoalescer will not suffice for this use case, because it requires RFO access to the line.

9) [status: not started] Coalescing within the caches: With the addition of per-byte dirty bits and AMD's GPU cache controllers, there appear to be places where request coalescing can/should be implemented in the caches. For example, most L1 cache controllers block stores in mandatory queues while the line is in an intermediate state, but often these stores could be accumulated into a single MSHR and written to the cache block when the cache array is filled with the line. This can have a substantial effect on performance by cutting L1->L2 and L2->MC accesses by factors of up to 32+.

10) [status: not started] TBE (MSHR) allocation: Currently, TBETables are finite-sized and disallow over-allocation.
If the TBETable size is not set reasonably large, over-allocation results in assertion failures. Often the sizing required to avoid assertion failures is unrealistic (e.g. an RFO GPU L1 cache with 16kB capacity might need as many TBEs as there are entries in the cache itself). This limits the ability to test more reasonable TBE restrictions. It should be straightforward to assess which transitions need to allocate TBEs, so we can test for TBE availability in the controller wakeup functions. This would allow wakeup to skip over transitions that need TBEs when none are available.

--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
