Steve,
First of all, thank you for taking the time to provide an explanation.
Much appreciated :)
So if I understand correctly, when you say "it absolutely needs to know
whether that invalidation belongs to a request that precedes or succeeds
its own request", you mean succeeds or precedes the moment that the
downstream request arrives on the common bus (since the first common bus
is the authority for determining order)? And the issue would essentially
be that an upper-level cache has no way of knowing when the request it
sent downstream hits the common bus? That would indeed mean it also has
no way of determining the relative ordering between its own request and
an incoming invalidate that got instantly propagated upwards, because it
takes time for the request to trickle down to the common bus and for the
response to move back up; the invalidate can therefore be received while
the request is still making its way downstream, or while the response is
already moving back up but hasn't arrived yet.
That would also make sense with regard to the cache's reliance on the
instantaneous hit/miss processing by the downstream device that I
mentioned in my first post: if the bus the request is sent onto is the
common bus, then the upstream MSHRs have to be notified instantly so
that they know the request got there first by the time the invalidate
passes through the bus and (instantly) back up through the cache.
So then, come to think of it, downstreamPending could be interpreted
as a sort of "still has levels to go until the common bus" flag for the
downstream request. One case that then inevitably pops up in my head
(although I'm not actually sure if it's plausible -- I still have many
insights to gain in cache coherence) is if a cache snoops an
invalidating request from a peer on its local bus rather than from a
downstream cache, while the miss request is still moving downstream long
past the local bus.
In this case, the common bus would be the cache's local bus, but when
the invalidate comes in from the peer, downstreamPending will still be
set as the request is still pending somewhere downstream. But I guess
that would be where the difference between an express snoop and a snoop
on the local bus comes in?
In my particular S-NUCA setup, I know that the L2 will always be the
last-level cache, so I think I should indeed be able to get away with
marking the upstream MSHR as downstreamPending as soon as the request is
received on the main CPU-side port, and then calling
clearDownstreamPending() only if the request hit in the bank. In case of
a miss, downstreamPending can simply remain true. If the bank then
maintains its regular behaviour when sending the miss request further
downstream to main memory, clearDownstreamPending should also be called
after it notices that no MSHR was allocated downstream.
That is, of course, provided again that the request leaving the bank can
be instantly propagated to the local bus and doesn't remain queued in an
internal port. It wouldn't make a difference for "seeing" whether an
MSHR was allocated (since it's the LLC anyway), but I imagine the
request actually has to appear on the bus so that any other caches
peered to it can observe it before sending any invalidates. In a single-LLC
setup, though, this argument no longer applies and I think it should be
safe to queue outgoing downstream requests.
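To make it concrete, here's roughly what I have in mind (a sketch only;
the class and method names below are placeholders for my own wrapper
code, not existing M5 API, and I'm assuming I can poke downstreamPending
and call clearDownstreamPending() directly like this):

    // Sketch only: SNUCACache and its methods are placeholder names for
    // my wrapper; "upstreamMshr" is the MSHR handed to us via senderState.
    void
    SNUCACache::handleCpuSideRequest(PacketPtr pkt)
    {
        MSHR *upstreamMshr = dynamic_cast<MSHR*>(pkt->senderState);
        if (upstreamMshr) {
            // Pessimistic default: assume the request still has levels to go.
            upstreamMshr->downstreamPending = true;
        }
        queueInInternalPort(pkt);   // may sit queued until the bank can take it
    }

    // Later, once the bank has actually processed the request:
    void
    SNUCACache::bankProcessedRequest(MSHR *upstreamMshr, bool hitInBank)
    {
        if (hitInBank) {
            // A response is coming back from this level, so the pessimistic
            // flag can be dropped again.
            upstreamMshr->clearDownstreamPending();
        }
        // On a miss, downstreamPending stays true; the bank's regular
        // memside path clears it once it sees that no MSHR was allocated
        // downstream (i.e. the miss went to main memory).
    }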
Am I on the right track here?
Cheers,
Jeroen
Thanks for the detailed analysis... this code is complex, I agree; I
wrote it, and when I have to get back into it to fix a bug it always
takes a while to recall all the detailed interactions.
I don't completely follow why downstreamPending is causing problems
for you, but I can elaborate a little on its purpose, which I hope may
help. The protocol assumes an arbitrarily deep hierarchy of busses,
and conflicting accesses are ordered according to which access is the
first to reach the nearest bus that's common to both requesters.
Also, as an invalidating request hits each level of bus, the
invalidation is atomically and instantaneously broadcast to all caches
above that bus (using the expressSnoop feature).
The basic problem is that once a cache has sent out a request and is
awaiting the response, it can snoop an invalidation that's been
propagated upward, and it absolutely needs to know whether that
invalidation belongs to a request that precedes or succeeds its own
request. However, that isn't easily determined, since it takes time
both for the request to propagate downward until it is satisfied, and
for the response to propagate back upwards. So the invalidation could
have arrived at a lower-level bus before the cache's request got there
(meaning the invalidation comes first), or the cache's request could
have come first but the invalidation could have passed the response on
the way back up (since the invalidations use the magic expressSnoop
path). The downstreamPending flags and the clearDownstreamPending()
mechanism solve this problem by providing an instantaneous mechanism
to notify all the upstream caches when a request is satisfied. If an
invalidation is snooped while downstreamPending is true, the
invalidation is ordered before the request; if downstreamPending is
false, the request has already been satisfied at some cache level and
hence is ordered before the invalidation.
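In code terms, the decision at the snooping cache boils down to
something like this (a simplified sketch, not the literal source; the
real check in MSHR::handleSnoop also factors in inService and
isExpressSnoop()):

    // Simplified sketch of the ordering decision on a snooped invalidation:
    if (mshr->downstreamPending) {
        // Our request hasn't been satisfied at any level yet, so the
        // invalidation reached the common bus first: it logically precedes
        // our request and is handled by the cache in the normal way.
    } else {
        // Some downstream level already satisfied our request (and told us
        // so instantaneously via clearDownstreamPending()), so our request
        // is ordered first and the invalidation applies to the data we're
        // about to receive.
    }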
Now that I've gone through all that, it seems like the problem is that
the requesting cache's downstreamPending is false by default, and is
only set to true if the downstream cache misses; you kind of want the
opposite, which is that downstreamPending is true by default and only
cleared if the downstream cache hits. You could get that effect by
setting downstreamPending in the requesting cache's MSHR when you
buffer the request, then explicitly calling clearDownstreamPending()
if it hits in the downstream cache, or perhaps you could actually
change the default setting in the cache to avoid these contortions in
your code.
The whole flow control/retry interface is one that we've gone back and
forth on quite a bit, but despite the limitations of the current setup
we've never come up with a better replacement. As you point out,
doing something like using address ranges would possibly be a big
change in a number of places (though most devices do derive from
SimpleTimingPort, so maybe it's more localized there than it seems).
Incidentally, the complexity and fragility of the coherence protocol
is one of the reasons we're integrating the GEMS Ruby memory model, to
provide a more flexible memory system. Unfortunately that's still in
progress, and right now Ruby is also quite a bit slower, but we're
working on that.
Steve
On Sat, Nov 27, 2010 at 11:45 AM, Jeroen DR <[email protected]> wrote:
Hi,
I'm currently implementing S-NUCA, and I've run into an issue with
the way M5's MSHR and blocking mechanisms work while attempting
to distribute incoming packets to several distinct UCA caches.
I've modelled the S-NUCA as a container of multiple individual
regular UCA caches that serve as the banks, each with their own
(smaller) hit latency plus interconnect latency depending on which
CPU is accessing the bank. Since each port must be connected to
one and only one peer port, I've created a bunch of internal
SimpleTimingPorts to serve as the peers of the individual banks'
cpuSide and memSide ports.
The idea is that upon receiving a packet on the main CPU-side
port, we examine which bank to send the request to (based on the
low-order bits of the set index) and schedule it for departure
from the associated internal port. Because each bank has its own
interconnect latency, the overall access time for banks closeby
may be lower than that for banks that are farther away.
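Concretely, the bank selection is just the following (a sketch;
blkShift, setMask and numBanks are my own parameters, with numBanks a
power of two):

    // Sketch of the bank selection; blkShift/setMask/numBanks are my own
    // parameters, with numBanks a power of two.
    int
    SNUCACache::selectBank(PacketPtr pkt) const
    {
        int setIndex = (pkt->getAddr() >> blkShift) & setMask;
        return setIndex & (numBanks - 1);   // low-order bits of the set index
    }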
An advantage of S-NUCA is that the entire cache needn't block if a
single bank blocks. This is supported by means of the internal
ports, as any packet sent to a blocked bank may remain queued in
the internal port until it can be serviced by the bank. Meanwhile,
the main CPU-side port can continue to accept packets for other
banks. To implement this, I have the main CPU-side port distribute
the packets to the internal ports and always signify success to
its peer (unless of course all banks are blocked).
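Roughly like this (sketch only; internalPort, latencyTo(),
allBanksBlocked() and scheduleSend() are placeholders in my wrapper,
with scheduleSend() wrapping the internal port's usual transmitList
scheduling path):

    // Sketch of the distribution on the main CPU-side port.
    bool
    SNUCACache::CpuSidePort::recvTiming(PacketPtr pkt)
    {
        if (snuca->allBanksBlocked())
            return false;                      // only now force a retry

        int bank = snuca->selectBank(pkt);
        // Queue the packet in the bank's internal port; if that particular
        // bank is blocked, it just waits there while other banks keep
        // accepting traffic.
        snuca->internalPort[bank]->scheduleSend(pkt,
                                                curTick + snuca->latencyTo(bank));
        return true;                           // signify success to our peer
    }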
In the interest of validating my S-NUCA implementation with a
single bank against a regular UCA cache with the same parameters,
I've temporarily set the interconnect latencies to 0 and modified
the internal ports to accept scheduling calls at curTick, as these
normally only allow for scheduled packets to be sent at the next
cycle. This basically works by inserting the packet at the right
position in the transmitList in the exact same way it normally
happens, and then immediately calling sendEvent->process() if the
packet got inserted at the front of the queue. This works well.
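For reference, the tweak amounts to something like this (a sketch based
on my reading of SimpleTimingPort; insertIntoTransmitList() is shorthand
for the existing transmitList insertion logic):

    // Sketch of the internal-port tweak.
    void
    InternalPort::scheduleSend(PacketPtr pkt, Tick when)
    {
        assert(when >= curTick);            // relaxed from "strictly later"

        bool atHead = insertIntoTransmitList(pkt, when);

        // If the packet ended up at the front and is due right now, fire the
        // send event immediately instead of waiting for the next cycle.
        if (atHead && when == curTick)
            sendEvent->process();
    }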
While digging through the codebase to find an explanation for some
of the remaining timing differences I encountered, I found that
the way that a Cache's memside port sends packets to the bus and
how that interacts with the MSHR chain is posing a problem for the
way I'd like my S-NUCA to work.
It basically comes down to the fact that a regular Cache's memside
port, when it successfully sends an MSHR request downstream,
relies on the downstream device to already have processed the
request and to have allocated an MSHR in case of a miss. This is
supported by the bus, which basically takes the packet sent by the
Cache's memside port and directly invokes sendTiming() on the
downstream device. If the downstream device is a cache, this
causes it to perform its whole timingAccess() call, which checks
for a hit or a miss and allocates an MSHR. In other words, when
the cache's memside port receives the return value "true" for its
sendTiming call, it relies on the fact that at that time an MSHR
must have already been allocated downstream if the request missed.
From studying the MSHR code, I understand that this is done in
order to maintain the downstreamPending flags across the MSHR
chain. A cache has no way of knowing whether its downstream device
is going to be another cache or main memory, so it also has no way
of knowing whether the MSHR request will receive a response at
this level (because it might miss in another downstream cache). I
also understand that for this reason, MSHRs are passed down in
senderState, and that upon allocation, the downstreamPending flag
of the "parent" MSHR is set.
In this way, the mere fact of an MSHR getting allocated in a
downstream device will cause the downstreamPending flag on the
current-level MSHR to be set. A regular Cache relies on this
behaviour to determine whether it is going to receive a response
to the MSHR request it just sent out at this level; it can simply
check whether the MSHR's downstreamPending flag was set, because
if the request missed at the downstream device, the downstream
device must have been a cache which must have allocated an MSHR,
which must in turn have caused the downstreamPending flag in
/this/ cache's MSHR to be set:
from Cache::MemSidePort::sendPacket:

    MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);
    // this assumes instant request processing by the peer
    bool success = sendTiming(pkt);
    waitingOnRetry = !success;
    if (waitingOnRetry) {
        DPRINTF(CachePort, "now waiting on a retry\n");
        if (!mshr->isForwardNoResponse()) {
            delete pkt;
        }
    } else {
        // this assumes the mshr->downstreamPending flag to have been
        // correctly set (or correctly remained untouched) by the
        // downstream device at this point
        myCache()->markInService(mshr, pkt);
    }
It's the markInService() call that will check whether the
downstreamPending flag is set. If it isn't set, then no MSHR was
allocated downstream, signifying that it will receive a response.
However, this only works if the call is indeed immediately
processed by the downstream device to check for a hit or a miss.
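(For reference, my understanding of how the flag ends up set in the
first place -- paraphrased from my reading of the MSHR code, so the
exact spot and names may differ:)

    // When the downstream cache misses and allocates its own MSHR, it digs
    // the upstream MSHR out of senderState and marks it pending...
    MSHR *parent = dynamic_cast<MSHR*>(pkt->senderState);
    if (parent != NULL)
        parent->downstreamPending = true;

    // ...which is exactly what the upstream cache's markInService() then
    // tests to decide whether a response will come back to it.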
In order to avoid blocking the entire cache, my S-NUCA
implementation might return success but have the packet queued in
an internal port still, waiting for departure. The upper-level
MSHR then checks its downstreamPending flag, and may incorrectly
conclude that the request won't miss, even though it still might
when the packet is eventually sent from the internal port queue.
So at this point, I'm a bit at a loss as to what my options are. I
tried to find out what the downstreamPending flag is used for, to see
if there's anything I can do about this problem, and I found the
comments below, but I don't understand a word of what is going on
here:
from MSHR::handleSnoop:

    if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
        // Request has not been issued yet, or it's been issued
        // locally but is buffered unissued at some downstream cache
        // which is forwarding us this snoop. Either way, the packet
        // we're snooping logically precedes this MSHR's request, so
        // the snoop has no impact on the MSHR, but must be processed
        // in the standard way by the cache. The only exception is
        // that if we're an L2+ cache buffering an UpgradeReq from a
        // higher-level cache, and the snoop is invalidating, then our
        // buffered upgrades must be converted to read exclusives,
        // since the upper-level cache no longer has a valid copy.
        // That is, even though the upper-level cache got out on its
        // local bus first, some other invalidating transaction
        // reached the global bus before the upgrade did.
        if (pkt->needsExclusive()) {
            targets->replaceUpgrades();
            deferredTargets->replaceUpgrades();
        }
        return false;
    }
This is obviously related to cache coherence and handling snoops
at the upper-level cache, which I guess may occur at any time
while a packet is pending in an internal port downstream in the
S-NUCA, so I suspect things may get hairy moving forward.
Another option I've considered is modifying the retry mechanism to
no longer be an opaque "yes, keep sending me stuff"/"no I'm
blocked, wait for my retry" but instead issue retries for
particular address ranges. This would allow my S-NUCA to
selectively issue retries to address ranges for which the internal
bank is blocked, but then SimpleTimingPort would also have to be
modified to not just push the failed packet back to the front of
the list and wait for an opaque retry, but continue searching down
the transmitList to find any ready packets to address ranges it
hasn't been told are blocked yet. I think it's an interesting
idea, but I imagine there's a whole slew of fairness issues with
that approach.
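Very roughly, I picture something like the following (entirely
hypothetical -- none of these methods exist in M5 today):

    // Entirely hypothetical sketch; none of these methods exist in M5 today.
    struct RangeAwarePort
    {
        // The peer announces that a given address range is blocked...
        virtual void recvRangeBlocked(const Range<Addr> &range) = 0;
        // ...and later retries just that range, so the sender can keep
        // walking its transmitList for packets to other, unblocked ranges
        // instead of stalling behind the failed one.
        virtual void recvRangeRetry(const Range<Addr> &range) = 0;
        virtual ~RangeAwarePort() {}
    };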
On a side note, I've extensively documented all the behaviour I
discussed previously in code, and would be more than willing to
contribute this back to the community. These timing issues turned
out to be very important for my purposes, but were hidden away
behind three levels of calls and an innocuous-looking, entirely
uncommented if (!downstreamPending) check buried somewhere in
MSHR::markInService, so some comments in there about all these
underlying assumptions definitely wouldn't hurt.
Cheers,
-- Jeroen
_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users