On 3/27/2023 6:13 AM, gabriel.busnot--- via gem5-users wrote:
Thanks, Gabriel, for your response a month ago. I want to turn my attention back to this now ... :-)
I can’t give you a definitive answer, but I’ve also been looking at CXL recently, so here is what I understand so far.
From a functional perspective, the classic cache system seems able to support the hierarchical coherency aspects just fine, with the coherent Xbar of each chip connected to a CPU-side port of the other chip’s Xbar. The performance will probably be quite off, though. You could improve on it by implementing a kind of throttle-adapter SimObject that would model the CXL link layer between the two Xbars. Snoop performance modeling will remain atomic/blocking, just as with any classic cache configuration.
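For what it's worth, the timing side of such a throttle adapter mostly boils down to serialization delay plus propagation delay, with back-pressure when the link is busy. A minimal stand-alone sketch of that arithmetic (the class name and the 32 bytes/ns and 10 ns numbers are made up for illustration, not taken from gem5 or the CXL spec):

```python
# Toy timing model of a CXL-style link adapter sitting between the two
# coherent Xbars. Purely illustrative: gem5 would express this as a
# SimObject with ports; here it is plain Python so the arithmetic is
# easy to check.
class CXLLinkModel:
    """Charges serialization delay (size / bandwidth) plus a fixed
    propagation delay, and queues packets that overlap in time."""

    def __init__(self, bandwidth_bytes_per_ns, propagation_ns):
        self.bw = bandwidth_bytes_per_ns
        self.prop = propagation_ns
        self.link_free_at = 0.0  # time (ns) at which the link goes idle

    def send(self, now_ns, size_bytes):
        """Return the arrival time (ns) of a packet injected at now_ns."""
        start = max(now_ns, self.link_free_at)   # wait if link is busy
        self.link_free_at = start + size_bytes / self.bw
        return self.link_free_at + self.prop

# Example: 32 bytes/ns of bandwidth, 10 ns propagation delay.
link = CXLLinkModel(32.0, 10.0)
print(link.send(0, 64))  # 64 B packet: 2 ns serialization + 10 ns -> 12.0
print(link.send(0, 64))  # queued behind the first packet -> 14.0
```

A real adapter would also have to decide what to do with snoops, which in the classic system are not delayed this way.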
I'm trying to envision doing this in a way that would work. First, I interpret you as saying that each component that plays this CXL "game" has its own coherent Xbar, and that these Xbars would be cross-connected. Suppose we have two devices, X and Y: the mem side of X would be connected to the cpu side of Y, and the mem side of Y to the cpu side of X. What confuses me about this is that it seems it would lead to infinite forwarding between the mem sides. It also seems to make it difficult to offer a single point of coherence.

A second arrangement I thought of is that CXL memories could be "level-infinity" caches, i.e., they act like caches except that the set of lines they hold is fixed and their lines are always valid. Their mem sides would go to a final coherency Xbar that would serve as the point of coherence of the system. A CXL memory would always fast-route requests having to do with addresses outside its own address space to this coherency bus, so that some other memory could respond.

A third arrangement would be a variation on the second one: CXL memories are level-infinity caches on the other side of a coherent Xbar "memory bus", with routing such that each CXL memory gets only the requests pertaining to its part of the physical address space. A CXL device that has its own cache would connect like a cpu+cache to the memory bus; a CXL device that has no cache could connect directly to the coherent Xbar memory bus. It is not clear to me how that differs from the current sort of arrangement.
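To make the third arrangement concrete, here is roughly how it might look as a classic-mode gem5 config fragment. This is a sketch under my own assumptions (component choices, sizes, and latencies are placeholders; port names are the current cpu_side_ports/mem_side_ports spelling), not a tested configuration:

```python
# Rough gem5 classic-mode sketch of "arrangement 3": one coherent Xbar
# acts as the point of coherence, and each memory owns a slice of the
# physical address space so the Xbar routes requests by address.
from m5.objects import *

system = System()
system.membus = SystemXBar()  # the coherent "memory bus"

# Two CXL memories, each responsible for its own address range, so no
# memory ever sees requests belonging to the other.
system.mem0 = SimpleMemory(range=AddrRange('0GB', size='4GB'))
system.mem1 = SimpleMemory(range=AddrRange('4GB', size='4GB'))
system.mem0.port = system.membus.mem_side_ports
system.mem1.port = system.membus.mem_side_ports

# A CXL device with its own cache connects like a cpu+cache
# (cache parameters are placeholders).
system.dev_cache = Cache(size='256kB', assoc=8, tag_latency=4,
                         data_latency=4, response_latency=4,
                         mshrs=16, tgts_per_mshr=8)
system.dev_cache.mem_side = system.membus.cpu_side_ports

# A cacheless CXL device would instead connect its port directly to
# system.membus.cpu_side_ports.
```

The point of this wiring is that snoops still reach the device cache through the same Xbar that serves as the point of coherence, while the address-range routing keeps each memory's traffic separate.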
The setup I would like to be able to assemble is this:
- Regular cpu cores with a regular L1/L2/L3 cache hierarchy
- A memory system like the smart memory cube [SMC] - highly parallel
- A processor-in-memory [PIM] that:
  - has more-direct access to the SMC, but that access is still coherent
  - has a private scratchpad memory (non-coherent)
  - has its own cache that is coherent with the regular cores' memory hierarchy
  - has its own DMA units that transport data between coherent memory and the private scratchpad

I have previously built most of this, but the PIM's cache and DMA were not coherent, and going through extra protocols to deal with that dragged performance down.
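A rough classic-mode sketch of how the coherent/non-coherent split for such a PIM might be wired in gem5 (again, names, sizes, and latencies are placeholders of mine, and the DMA engine is only indicated in comments):

```python
# Sketch: PIM cache on the coherent bus, scratchpad on a separate
# non-coherent bus so snoops never reach it. Placeholder parameters.
from m5.objects import *

system = System()
system.membus = SystemXBar()  # single point of coherence for cores + PIM

# The PIM's cache sits on the coherent bus, so the usual snoop
# machinery keeps it coherent with the host cores' L1/L2/L3 hierarchy.
# (pim_cache.cpu_side would connect to the PIM core itself.)
system.pim_cache = Cache(size='64kB', assoc=4, tag_latency=2,
                         data_latency=2, response_latency=2,
                         mshrs=8, tgts_per_mshr=8)
system.pim_cache.mem_side = system.membus.cpu_side_ports

# Private, non-coherent scratchpad: give it its own Xbar that is NOT
# connected to the coherent membus, so no snoops ever touch it.
system.spm_bus = NoncoherentXBar(frontend_latency=1, forward_latency=1,
                                 response_latency=1, width=16)
system.scratchpad = SimpleMemory(range=AddrRange(0x100000000,
                                                 size='256MB'),
                                 latency='10ns')
system.scratchpad.port = system.spm_bus.mem_side_ports

# A DMA engine would then have two ports: one into the coherent membus
# (so its transfers are snooped, keeping them coherent) and one into
# spm_bus for the scratchpad side.
```

The separation into two Xbars is what gives you "coherent cache + DMA, non-coherent scratchpad" without extra protocol work: coherence is simply never asked of the scratchpad path.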
As for Ruby, the goal is further away. AFAIK, no protocol supports hierarchical coherency (home-node-to-home-node requests, snoopable home nodes, etc.). If you don’t care too much about these details, then I would argue that configuring any Ruby protocol as usual and shaping your topology to force cross-chip traffic through a single link could get you closer to a CXL-style configuration. You could also implement a link adapter/bridge component to model the CXL link layer better.
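As a very rough illustration of the "force traffic through a single link" idea, a custom Ruby topology in the style of gem5's configs/topologies could split the controllers across two routers joined by one high-latency link pair. This follows the SimpleTopology/makeTopology convention, but exact signatures vary across gem5 versions, so treat it as a sketch:

```python
# Hypothetical two-chip Ruby topology: all cross-chip traffic funnels
# through a single router-to-router link pair whose latency stands in
# for the CXL link. Styled after gem5's configs/topologies/*.py; check
# the makeTopology signature against your gem5 version.
from topologies.BaseTopology import SimpleTopology

class TwoChipCXL(SimpleTopology):
    description = "TwoChipCXL"

    def makeTopology(self, options, network, IntLink, ExtLink, Router):
        # One router per chip; split the controllers between them.
        routers = [Router(router_id=i) for i in range(2)]
        network.routers = routers

        half = len(self.nodes) // 2
        network.ext_links = [
            ExtLink(link_id=i, ext_node=ctrl,
                    int_node=routers[0] if i < half else routers[1])
            for i, ctrl in enumerate(self.nodes)]

        # The single inter-chip link pair, with a CXL-ish latency
        # (the value 100 is a placeholder, in router cycles).
        network.int_links = [
            IntLink(link_id=0, src_node=routers[0],
                    dst_node=routers[1], latency=100),
            IntLink(link_id=1, src_node=routers[1],
                    dst_node=routers[0], latency=100)]
```

This only shapes the traffic; it does not give you home-to-home requests or a snoopable home node, which is why I said the goal is further away with Ruby.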
I'm not really interested in Ruby - I've generally "rolled my own", so to speak. Maybe it would be useful to set up a Zoom meeting where we can sketch system diagrams or something!

Best wishes - Eliot

_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org