On 3/27/2023 6:13 AM, gabriel.busnot--- via gem5-users wrote:

Thanks, Gabriel, for your response, now a month ago.  I want to turn my
attention back to this ... :-)

> I can’t give you a definitive answer, but I’ve also been looking at
> CXL recently, so here is what I understand so far.

> From a functional perspective, the classic cache system seems able to
> support the hierarchical coherency aspects just fine, with the coherent
> Xbar of each chip connected to a CPU-side port of the other chip’s Xbar.
> The performance will probably be quite far off, though. You could improve
> on it by implementing a kind of throttle-adapter SimObject that would
> model the CXL link layer between the two Xbars. Snoop performance modeling
> will remain atomic/blocking, just as with any classic cache configuration.
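For concreteness, here is how I picture that wiring in a config script. This is an untested sketch: the port names are from recent gem5, the latencies and the 2GB range are placeholders, and a plain Bridge stands in for the throttle adapter even though a Bridge does not forward snoops, so a real model would need a custom coherent link adapter in its place.

```python
# Hypothetical sketch (untested): model the CXL link between two chips'
# coherent crossbars, with a classic Bridge as a throttle stand-in.
# NOTE: Bridge does not forward snoops, so a faithful CXL model would
# need a custom coherent link-adapter SimObject here instead.
from m5.objects import CoherentXBar, Bridge, SimpleMemory, AddrRange

xbar_a = CoherentXBar(frontend_latency=1, forward_latency=1,
                      response_latency=1, snoop_response_latency=1, width=16)
xbar_b = CoherentXBar(frontend_latency=1, forward_latency=1,
                      response_latency=1, snoop_response_latency=1, width=16)

# Chip B's local memory, visible across the link (range is a placeholder)
mem_b = SimpleMemory(range=AddrRange('2GB', size='2GB'))
mem_b.port = xbar_b.mem_side_ports

# "CXL link" from chip A to chip B: extra delay models the link layer
link_ab = Bridge(delay='100ns', ranges=[AddrRange('2GB', size='2GB')])
xbar_a.mem_side_ports = link_ab.cpu_side_port
link_ab.mem_side_port = xbar_b.cpu_side_ports
```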

I'm trying to envision doing this in a way that would work.  First, I
interpret you as saying that each component that plays this CXL "game" has its
own coherent Xbar.  You seemed to say that they would be cross connected.
Suppose we have two devices, X and Y.  The mem side of X would be connected to
the cpu side of Y, and mem side of Y to the cpu side of X.  What confuses me
about this is that it seems it would lead to infinite forwarding between the
two mem sides.
It also seems to make it difficult to offer a single point of coherence.
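To make the forwarding worry concrete, here is a throwaway Python toy (no gem5, all names made up) in which each device forwards any request outside its own range to the other device's cpu side. An address that neither device owns just ping-pongs until a hop limit trips:

```python
# Toy model of the cross-connected-Xbar worry: each device forwards
# requests outside its own range to the other device. An address
# neither device owns bounces forever; a hop limit makes that visible.
def route(addr, owner_ranges, start, max_hops=8):
    """Forward from `start` until some device owns `addr` or hops run out."""
    dev, hops = start, 0
    while hops < max_hops:
        lo, hi = owner_ranges[dev]
        if lo <= addr < hi:
            return dev, hops                 # request serviced locally
        dev = 'Y' if dev == 'X' else 'X'     # mem side -> other chip's cpu side
        hops += 1
    return None, hops                        # loop: nobody claimed the address

ranges = {'X': (0, 0x1000), 'Y': (0x1000, 0x2000)}
print(route(0x1800, ranges, 'X'))  # owned by Y: serviced after 1 hop
print(route(0x9999, ranges, 'X'))  # unowned: bounces until the hop limit
```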

A second arrangement I thought of is that CXL memories could be "level
infinity" caches, i.e., act like caches though the set of lines they hold is
fixed, and their lines are always valid.  Their mem sides would go to a final
coherency Xbar that would serve as the point-of-coherence of the system.  A
CXL memory would always fast-route requests having to do with things outside
its address space to this coherency bus, so that some other memory could
respond.
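A throwaway Python toy (again no gem5, hypothetical names) of how I picture the routing in arrangement two: each memory services its own range and fast-routes everything else to a single coherency bus, which dispatches to the owner, so there is one point of coherence and no forwarding cycle:

```python
# Toy sketch of arrangement two: each "level-infinity" memory services
# its own range and fast-routes everything else to a single coherency
# bus, which dispatches to the owning memory.
RANGES = {'mem0': (0x0000, 0x1000), 'mem1': (0x1000, 0x2000)}

def coherency_bus(addr):
    """Single point of coherence: dispatch to whichever memory owns addr."""
    for name, (lo, hi) in RANGES.items():
        if lo <= addr < hi:
            return name
    raise ValueError(f"unmapped address {addr:#x}")

def access(mem, addr):
    """A CXL memory handles local addresses; others go up to the bus."""
    lo, hi = RANGES[mem]
    if lo <= addr < hi:
        return mem                      # serviced locally
    return coherency_bus(addr)          # fast-routed to point of coherence

print(access('mem0', 0x0800))   # -> 'mem0' (local)
print(access('mem0', 0x1800))   # -> 'mem1' (via coherency bus)
```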

A third arrangement would be a variation on the second one: CXL memories are
level-infinity caches on the other side of a coherent Xbar "memory bus" with
routing such that each CXL memory gets requests pertaining only to its part of
the physical address space.  A CXL device that has its own cache would connect
like a cpu+cache to the memory bus.  A CXL device that has no cache could
connect directly to the coherent Xbar memory bus.  It is not clear to me how
that is different from the current sort of arrangement.
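In config-script terms, I picture arrangement three roughly like this (untested sketch; all names, sizes, and latencies are made up for illustration). The Xbar's range-based routing sends each request to exactly one memory, and a cached CXL device hangs off the cpu side like a core+cache:

```python
# Hypothetical sketch of arrangement three (untested). CXL memories sit
# behind a coherent "memory bus" with disjoint ranges; a cached CXL
# device connects on the cpu side like a core with its cache.
from m5.objects import CoherentXBar, SimpleMemory, Cache, AddrRange

membus = CoherentXBar(frontend_latency=1, forward_latency=1,
                      response_latency=1, snoop_response_latency=1, width=16)

# Level-infinity memories: fixed, always-valid lines, disjoint ranges
cxl_mem0 = SimpleMemory(range=AddrRange(0, size='1GB'))
cxl_mem1 = SimpleMemory(range=AddrRange('1GB', size='1GB'))
cxl_mem0.port = membus.mem_side_ports
cxl_mem1.port = membus.mem_side_ports

# A CXL device with its own cache connects like a cpu+cache
dev_cache = Cache(size='256kB', assoc=8, tag_latency=2, data_latency=2,
                  response_latency=2, mshrs=16, tgts_per_mshr=8)
dev_cache.mem_side = membus.cpu_side_ports
# dev_cache.cpu_side would connect to the device itself;
# a cacheless CXL device would connect straight to membus.cpu_side_ports
```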

The setup I would like to be able to assemble is this:

- Regular cpu cores with a regular L1/L2/L3 cache hierarchy
- A memory system like the Smart Memory Cube [SMC] - highly parallel
- A processor-in-memory [PIM] that:
  - has more-direct access to the SMC, but that access is still coherent
  - has a private scratch pad memory (non-coherent)
  - has its own cache that is coherent with the regular cores' memory
    hierarchy
  - has its own DMA units that transport data between coherent memory and the
    private scratchpad
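In rough classic-memory-system terms, that target looks something like the untested skeleton below; every name and size is a placeholder, and the PIM itself would be a custom SimObject. The key structural point is that the PIM's cache and DMA master the coherent membus, while its scratchpad stays on a private, non-coherent port:

```python
# Rough, untested skeleton of the target system (all names/sizes are
# assumptions). PIM cache and DMA are coherent via the membus; the
# scratchpad is private and outside the coherence domain.
from m5.objects import CoherentXBar, Cache, SimpleMemory, AddrRange

membus = CoherentXBar(frontend_latency=1, forward_latency=1,
                      response_latency=1, snoop_response_latency=1,
                      width=16)                # point of coherence

# PIM-side cache: coherent with the regular cores' hierarchy
pim_cache = Cache(size='64kB', assoc=4, tag_latency=1, data_latency=1,
                  response_latency=1, mshrs=8, tgts_per_mshr=4)
pim_cache.mem_side = membus.cpu_side_ports

# PIM DMA engines would also master the coherent membus, e.g. via a
# DmaDevice-style port:  pim_dma.dma = membus.cpu_side_ports

# Private scratchpad: reachable only from the PIM, outside coherence
scratchpad = SimpleMemory(range=AddrRange(0x100000000, size='64MB'))
# scratchpad.port would connect to a PIM-private, non-coherent xbar
```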

I have previously built most of this, but the PIM's cache and DMA were not
coherent, and going through extra protocols to deal with that dragged
performance down.

> As for Ruby, the goal is further away. AFAIK, no protocol supports
> hierarchical coherency (home-node-to-home-node requests, snoopable home
> nodes, etc.). If you don’t care too much about these details, then I would
> argue that configuring any Ruby protocol as usual and configuring your
> topology to force traffic through a single link could get you closer to a
> CXL-style configuration. You could also implement a link adapter/bridge
> component to model the CXL link layer better.

I'm not really interested in Ruby - I've generally "rolled my own", so to
speak.

Maybe it would be useful to set up a Zoom meeting where we can sketch systems
diagrams or something!

Best wishes - Eliot
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org