Steve,
First of all, thank you for taking the time to provide an explanation.
Much appreciated :)
So if I understand correctly, when you say "it absolutely needs to know
whether that invalidation belongs to a request that precedes or succeeds
its own request", you mean succeeds or precedes the moment that the
downstream request arrives on the common bus (since the first common bus
is the authority for determining order)? And the issue would essentially
be that an upper-level cache has no way of knowing when the request it
sent downstream hits the common bus? That would indeed mean it also has
no way of determining the relative ordering between its own request and
an incoming invalidate that got instantly propagated upwards, because it
takes time for the request to trickle down to the common bus and for the
response to move back up; the invalidate can therefore be received while
the request is still making its way downstream, or while the response is
already moving back up but hasn't arrived yet.
That would also make sense with regard to the cache's reliance on the
instantaneous hit/miss processing by the downstream device that I
mentioned in my first post: if the bus the request is sent onto is the
common bus, then the upstream MSHRs have to be notified instantly so
that they know the request got there first by the time the invalidate
passes through the bus and (instantly) back up through the cache.
So then, come to think of it, downstreamPending could be interpreted
as a sort of "still has levels to go until the common bus" flag for the
downstream request. One case that then inevitably pops up in my head
(although I'm not actually sure if it's plausible -- I still have many
insights to gain in cache coherence) is if a cache snoops an
invalidating request from a peer on its local bus rather than from a
downstream cache, while the miss request is still moving downstream long
past the local bus.
In this case, the common bus would be the cache's local bus, but when
the invalidate comes in from the peer, downstreamPending will still be
set as the request is still pending somewhere downstream. But I guess
that would be where the difference between an express snoop and a snoop
on the local bus comes in?
In my particular S-NUCA setup, I know that the L2 will always be the
last-level cache, so I think I should indeed be able to get away with
marking the upstream MSHR as downstreamPending as soon as the request is
received on the main CPU-side port, and then calling
clearDownstreamPending() only if the request hit in the bank. In case of
a miss, downstreamPending can simply remain true. If the bank then
maintains its regular behaviour when sending the miss request further
downstream to main memory, clearDownstreamPending should also be called
after it notices that no MSHR was allocated downstream.
That is, of course, provided again that the request leaving the bank can
be instantly propagated to the local bus and doesn't remain queued in an
internal port. It wouldn't make a difference for "seeing" whether an
MSHR was allocated (since it's the LLC anyway), but I imagine the
request actually has to appear on the bus so that any other caches
peered to it can observe it before sending any invalidates. In a single-LLC
setup, though, this argument no longer applies and I think it should be
safe to queue outgoing downstream requests.
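To make it concrete, here's roughly what I have in mind (a sketch only;
the class and method names below are placeholders for my own wrapper
code, not existing M5 API, and I'm assuming I can poke downstreamPending
and call clearDownstreamPending() directly like this):

    // Sketch only: SNUCACache and its methods are placeholder names for
    // my wrapper; "upstreamMshr" is the MSHR handed to us via senderState.
    void
    SNUCACache::handleCpuSideRequest(PacketPtr pkt)
    {
        MSHR *upstreamMshr = dynamic_cast<MSHR*>(pkt->senderState);
        if (upstreamMshr) {
            // Pessimistic default: assume the request still has levels to go.
            upstreamMshr->downstreamPending = true;
        }
        queueInInternalPort(pkt);   // may sit queued until the bank can take it
    }

    // Later, once the bank has actually processed the request:
    void
    SNUCACache::bankProcessedRequest(MSHR *upstreamMshr, bool hitInBank)
    {
        if (hitInBank) {
            // A response is coming back from this level, so the pessimistic
            // flag can be dropped again.
            upstreamMshr->clearDownstreamPending();
        }
        // On a miss, downstreamPending stays true; the bank's regular
        // memside path clears it once it sees that no MSHR was allocated
        // downstream (i.e. the miss went to main memory).
    }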
Am I on the right track here?
Cheers,
Jeroen
Thanks for the detailed analysis... this code is complex, I agree; I
wrote it, and when I have to get back into it to fix a bug it always
takes a while to recall all the detailed interactions.
I don't completely follow why downstreamPending is causing problems
for you, but I can elaborate a little on its purpose, which I hope may
help. The protocol assumes an arbitrarily deep hierarchy of busses,
and conflicting accesses are ordered according to which access is the
first to reach the nearest bus that's common to both requesters.
Also, as an invalidating request hits each level of bus, the
invalidation is atomically and instantaneously broadcast to all caches
above that bus (using the expressSnoop feature).
The basic problem is that once a cache has sent out a request and is
awaiting the response, it can snoop an invalidation that's been
propagated upward, and it absolutely needs to know whether that
invalidation belongs to a request that precedes or succeeds its own
request. However, that isn't easily determined, since it takes time
both for the request to propagate downward until it is satisfied, and
for the response to propagate back upwards. So the invalidation could
have arrived at a lower-level bus before the cache's request got there
(meaning the invalidation comes first), or the cache's request could
have come first but the invalidation could have passed the response on
the way back up (since the invalidations use the magic expressSnoop
path). The downstreamPending flags and the clearDownstreamPending()
mechanism solve this problem by providing an instantaneous mechanism
to notify all the upstream caches when a request is satisfied. If an
invalidation is snooped while downstreamPending is true, the
invalidation is ordered before the request; if downstreamPending is
false, the request has already been satisfied at some cache level and
hence is ordered before the invalidation.
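In code terms, the decision at the snooping cache boils down to
something like this (a simplified sketch, not the literal source; the
real check in MSHR::handleSnoop also factors in inService and
isExpressSnoop()):

    // Simplified sketch of the ordering decision on a snooped invalidation:
    if (mshr->downstreamPending) {
        // Our request hasn't been satisfied at any level yet, so the
        // invalidation reached the common bus first: it logically precedes
        // our request and is handled by the cache in the normal way.
    } else {
        // Some downstream level already satisfied our request (and told us
        // so instantaneously via clearDownstreamPending()), so our request
        // is ordered first and the invalidation applies to the data we're
        // about to receive.
    }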
Now that I've gone through all that, it seems like the problem is that
the requesting cache's downstreamPending is false by default, and is
only set to true if the downstream cache misses; you kind of want the
opposite, which is that downstreamPending is true by default and only
cleared if the downstream cache hits. You could get that effect by
setting downstreamPending in the requesting cache's MSHR when you
buffer the request, then explicitly calling clearDownstreamPending()
if it hits in the downstream cache, or perhaps you could actually
change the default setting in the cache to avoid these contortions in
your code.
The whole flow control/retry interface is one that we've gone back and
forth on quite a bit, but despite the limitations of the current setup
we've never come up with a better replacement. As you point out,
doing something like using address ranges would possibly be a big
change in a number of places (though most devices do derive from
SimpleTimingPort, so maybe it's more localized there than it seems).
Incidentally, the complexity and fragility of the coherence protocol
is one of the reasons we're integrating the GEMS Ruby memory model, to
provide a more flexible memory system. Unfortunately that's still in
progress, and right now Ruby is also quite a bit slower, but we're
working on that.
Steve
On Sat, Nov 27, 2010 at 11:45 AM, Jeroen DR <[email protected]> wrote:
Hi,
I'm currently implementing S-NUCA, and I've run into an issue with
the way M5's MSHR and blocking mechanisms work while attempting
to distribute incoming packets to several distinct UCA caches.
I've modelled the S-NUCA as a container of multiple individual
regular UCA caches that serve as the banks, each with their own
(smaller) hit latency plus interconnect latency depending on which
CPU is accessing the bank. Since each port must be connected to
one and only one peer port, I've created a bunch of internal
SimpleTimingPorts to serve as the peers of the individual banks'
cpuSide and memSide ports.
The idea is that upon receiving a packet on the main CPU-side
port, we examine which bank to send the request to (based on the
low-order bits of the set index) and schedule it for departure
from the associated internal port. Because each bank has its own
interconnect latency, the overall access time for banks closeby
may be lower than that for banks that are farther away.
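Concretely, the bank selection is just the following (a sketch;
blkShift, setMask and numBanks are my own parameters, with numBanks a
power of two):

    // Sketch of the bank selection; blkShift/setMask/numBanks are my own
    // parameters, with numBanks a power of two.
    int
    SNUCACache::selectBank(PacketPtr pkt) const
    {
        int setIndex = (pkt->getAddr() >> blkShift) & setMask;
        return setIndex & (numBanks - 1);   // low-order bits of the set index
    }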
An advantage of S-NUCA is that the entire cache needn't block if a
single bank blocks. This is supported by means of the internal
ports, as any packet sent to a blocked bank may remain queued in
the internal port until it can be serviced by the bank. Meanwhile,
the main CPU-side port can continue to accept packets for other
banks. To implement this, I have the main CPU-side port distribute
the packets to the internal ports and always signify success to
its peer (unless of course all banks are blocked).
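Roughly like this (sketch only; internalPort, latencyTo(),
allBanksBlocked() and scheduleSend() are placeholders in my wrapper,
with scheduleSend() wrapping the internal port's usual transmitList
scheduling path):

    // Sketch of the distribution on the main CPU-side port.
    bool
    SNUCACache::CpuSidePort::recvTiming(PacketPtr pkt)
    {
        if (snuca->allBanksBlocked())
            return false;                      // only now force a retry

        int bank = snuca->selectBank(pkt);
        // Queue the packet in the bank's internal port; if that particular
        // bank is blocked, it just waits there while other banks keep
        // accepting traffic.
        snuca->internalPort[bank]->scheduleSend(pkt,
                                                curTick + snuca->latencyTo(bank));
        return true;                           // signify success to our peer
    }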
In the interest of validating my S-NUCA implementation with a
single bank against a regular UCA cache with the same parameters,
I've temporarily set the interconnect latencies to 0 and modified
the internal ports to accept scheduling calls at curTick, as these
normally only allow for scheduled packets to be sent at the next
cycle. This basically works by inserting the packet at the right
position in the transmitList in the exact same way it normally
happens, and then immediately calling sendEvent->process() if the
packet got inserted at the front of the queue. This works well.
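For reference, the tweak amounts to something like this (a sketch based
on my reading of SimpleTimingPort; insertIntoTransmitList() is shorthand
for the existing transmitList insertion logic):

    // Sketch of the internal-port tweak.
    void
    InternalPort::scheduleSend(PacketPtr pkt, Tick when)
    {
        assert(when >= curTick);            // relaxed from "strictly later"

        bool atHead = insertIntoTransmitList(pkt, when);

        // If the packet ended up at the front and is due right now, fire the
        // send event immediately instead of waiting for the next cycle.
        if (atHead && when == curTick)
            sendEvent->process();
    }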
While digging through the codebase to find an explanation for some
of the remaining timing differences I encountered, I found that
the way that a Cache's memside port sends packets to the bus and
how that interacts with the MSHR chain is posing a problem for the
way I'd like my S-NUCA to work.
It basically comes down to the fact that a regular Cache's memside
port, when it successfully sends an MSHR request downstream,
relies on the downstream device to already have processed the
request and to have allocated an MSHR in case of a miss. This is
supported by the bus, which basically takes the packet sent by the
Cache's memside port and directly invokes sendTiming() on the
downstream device. If the downstream device is a cache, this
causes it to perform its whole timingAccess() call, which checks
for a hit or a miss and allocates an MSHR. In other words, when
the cache's memside port receives the return value "true" for its
sendTiming call, it relies on the fact that at that time an MSHR
must have already been allocated downstream if the request missed.
From studying the MSHR code, I understand that this is done in
order to maintain the downstreamPending flags across the MSHR
chain. A cache has no way of knowing whether its downstream device
is going to be another cache or main memory, so it also has no way
of knowing whether the MSHR request will receive a response at
this level (because it might miss in another downstream cache). I
also understand that for this reason, MSHRs are passed down in
senderState, and that upon allocation, the downstreamPending flag
of the "parent" MSHR is set.
In this way, the mere fact of an MSHR getting allocated in a
downstream device will cause the downstreamPending flag on the
current-level MSHR to be set. A regular Cache relies on this
behaviour to determine whether it is going to receive a response
to the MSHR request it just sent out at this level; it can simply
check whether the MSHR's downstreamPending flag was set, because
if the request missed at the downstream device, the downstream
device must have been a cache which must have allocated an MSHR,
which must in turn have caused the downstreamPending flag in
/this/ cache's MSHR to be set:
from Cache::MemSidePort::sendPacket:

    MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);
    // this assumes instant request processing by the peer
    bool success = sendTiming(pkt);
    waitingOnRetry = !success;
    if (waitingOnRetry) {
        DPRINTF(CachePort, "now waiting on a retry\n");
        if (!mshr->isForwardNoResponse()) {
            delete pkt;
        }
    } else {
        // this assumes the mshr->downstreamPending flag to have been
        // correctly set (or correctly remained untouched) by the
        // downstream device at this point
        myCache()->markInService(mshr, pkt);
    }
It's the markInService() call that will check whether the
downstreamPending flag is set. If it isn't set, then no MSHR was
allocated downstream, signifying that it will receive a response.
However, this only works if the call is indeed immediately
processed by the downstream device to check for a hit or a miss.
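(For reference, my understanding of how the flag ends up set in the
first place -- paraphrased from my reading of the MSHR code, so the
exact spot and names may differ:)

    // When the downstream cache misses and allocates its own MSHR, it digs
    // the upstream MSHR out of senderState and marks it pending...
    MSHR *parent = dynamic_cast<MSHR*>(pkt->senderState);
    if (parent != NULL)
        parent->downstreamPending = true;

    // ...which is exactly what the upstream cache's markInService() then
    // tests to decide whether a response will come back to it.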
In order to avoid blocking the entire cache, my S-NUCA
implementation might return success but have the packet queued in
an internal port still, waiting for departure. The upper-level
MSHR then checks its downstreamPending flag, and may incorrectly
conclude that the request won't miss, even though it still might
when the packet is eventually sent from the internal port queue.
So at this point, I'm a bit at a loss as to what my options are. I
tried to find out what the downstreamPending flag is used for, to see
if there's anything I can do about this problem, and I found the
comments below, but I don't understand a word of what is going on
here:
from MSHR::handleSnoop:

    if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
        // Request has not been issued yet, or it's been issued
        // locally but is buffered unissued at some downstream cache
        // which is forwarding us this snoop. Either way, the packet
        // we're snooping logically precedes this MSHR's request, so
        // the snoop has no impact on the MSHR, but must be processed
        // in the standard way by the cache. The only exception is
        // that if we're an L2+ cache buffering an UpgradeReq from a
        // higher-level cache, and the snoop is invalidating, then our
        // buffered upgrades must be converted to read exclusives,
        // since the upper-level cache no longer has a valid copy.
        // That is, even though the upper-level cache got out on its
        // local bus first, some other invalidating transaction
        // reached the global bus before the upgrade did.
        if (pkt->needsExclusive()) {
            targets->replaceUpgrades();
            deferredTargets->replaceUpgrades();
        }
        return false;
    }
This is obviously related to cache coherence and handling snoops
at the upper-level cache, which I guess may occur at any time
while a packet is pending in an internal port downstream in the
S-NUCA, so I suspect things may get hairy moving forward.
Another option I've considered is modifying the retry mechanism to
no longer be an opaque "yes, keep sending me stuff"/"no I'm
blocked, wait for my retry" but instead issue retries for
particular address ranges. This would allow my S-NUCA to
selectively issue retries to address ranges for which the internal
bank is blocked, but then SimpleTimingPort would also have to be
modified to not just push the failed packet back to the front of
the list and wait for an opaque retry, but continue searching down
the transmitList to find any ready packets to address ranges it
hasn't been told are blocked yet. I think it's an interesting
idea, but I imagine there's a whole slew of fairness issues with
that approach.
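Very roughly, I picture something like the following (entirely
hypothetical -- none of these methods exist in M5 today):

    // Entirely hypothetical sketch; none of these methods exist in M5 today.
    struct RangeAwarePort
    {
        // The peer announces that a given address range is blocked...
        virtual void recvRangeBlocked(const Range<Addr> &range) = 0;
        // ...and later retries just that range, so the sender can keep
        // walking its transmitList for packets to other, unblocked ranges
        // instead of stalling behind the failed one.
        virtual void recvRangeRetry(const Range<Addr> &range) = 0;
        virtual ~RangeAwarePort() {}
    };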
On a side note, I've extensively documented all the behaviour I
discussed previously in code, and would be more than willing to
contribute this back to the community. These timing issues turned
out to be very important for my purposes, but were hidden away
behind three levels of calls and an innocuous-looking, entirely
uncommented if (!downstreamPending) check buried somewhere in
MSHR::markInService, so some comments in there about all these
underlying assumptions definitely wouldn't hurt.
Cheers,
-- Jeroen
_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users