Dear Eliot,

This is all invaluable, so thank you for taking the time to message.

This message is just my current thinking, so please let me know if I’ve 
misinterpreted anything.

From what I can now tell, the best approach is to add a request flag to 
mem/request.hh and then issue the request with writeMemTiming from 
memhelpers.hh. Then, as you have done, it should be possible to extend the 
caches to respond to this request (though in the case of fence.t, up to the 
point of unification rather than the point of coherence? It seems adding the 
DST_POU flag to the request would achieve this). Each cache could then visit 
every block, with some added delay depending on the exact modelling. I have 
seen such a thing implemented by functional accesses in BaseCache::memWriteback 
and BaseCache::memInvalidate, but I assume your engine does this via timing 
writebacks on each block. From what I can see, Cache::writebackBlk appears to 
be timing, and any latency from determining which lines are dirty (depending 
on the particular model) could be added to the cycle count.
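
For concreteness, the engine I am picturing on the cache side is roughly the 
following (only a sketch: flushQueue, flushEvent, dirtyScanLatency and the two 
functions are hypothetical additions of mine, while forEachBlk, writebackBlk 
and doWritebacks are the existing BaseCache hooks mentioned above, whose exact 
names may differ between gem5 versions):

    // Sketch of a per-cache timing flush engine (hypothetical additions
    // to BaseCache; existing gem5 hooks are used where noted).
    void
    BaseCache::startTimingFlush()
    {
        // Collect dirty blocks up front, mirroring the walk done by the
        // functional BaseCache::memWriteback(), but defer the writebacks.
        flushQueue.clear();          // hypothetical std::vector<CacheBlk *>
        tags->forEachBlk([this](CacheBlk &blk) {
            if (blk.isValid() && blk.isDirty())
                flushQueue.push_back(&blk);
        });
        // Charge some fixed cost for scanning the dirty bits, then write
        // the blocks back one per flush cycle.
        schedule(flushEvent, clockEdge(Cycles(dirtyScanLatency)));
    }

    void
    BaseCache::processFlushEvent()
    {
        if (flushQueue.empty()) {
            // Every dirty line has been handed to the write buffer; the
            // response to the fence.t request would be triggered here.
            return;
        }
        CacheBlk *blk = flushQueue.back();
        flushQueue.pop_back();

        // writebackBlk() builds a writeback packet and marks the block
        // clean; doWritebacks() pushes it into the write buffer with
        // timing, as the normal eviction path does.
        PacketList writebacks;
        writebacks.push_back(writebackBlk(blk));
        doWritebacks(writebacks, clockEdge(Cycles(1)));

        schedule(flushEvent, clockEdge(Cycles(1)));
    }

Does that roughly match what your engine does, or is it structured 
differently?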

As for the writeback buffer issue, it seems that for any placement of fence.t 
it should be conceptually valid to say that no channel exists across it. 
Therefore the writeback buffer would need to be emptied regardless. Is a 
memory fence able to achieve this, or does it require extending the caches 
further? Then, I guess some concept of worst-case execution time would be 
needed (as you have said, a fixed maximum), as otherwise fence.t in and of 
itself would become a communication channel.
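
Something like the following is what I am picturing for the completion side 
(again only a sketch: writeBufferEmpty() and flushCompleteEvent are 
hypothetical names, and the real structure to query in gem5 would be 
BaseCache::writeBuffer):

    // Sketch only: delay the fence.t response until the write buffer has
    // drained as well as the flushQueue from the sketch above.
    void
    BaseCache::tryCompleteFlush()
    {
        if (!flushQueue.empty() || !writeBufferEmpty()) {
            // Not drained yet; check again next cycle.  For a
            // constant-time fence.t the response would instead be sent
            // at a fixed worst-case tick, independent of when draining
            // actually finishes.
            schedule(flushCompleteEvent, clockEdge(Cycles(1)));
            return;
        }
        // Safe to respond to the original fence.t request here.
    }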

I imagine a basic first implementation could do this functionally, to verify 
that everything that should be flushed is, and then be made more accurate 
afterwards.

At this point I have the instruction decoding working and can flush an 
individual L1 block, so with respect to the caches I just need to extend the 
protocol appropriately. I would appreciate a high-level but slightly more 
detailed explanation of the changes you made (particularly the engine) and 
the functions you called to get your implementation working while keeping it 
timing accurate, assuming that is easier to provide than a potentially quite 
complicated patch.

Thanks again for your support,

Ethan

From: Eliot Moss <m...@cs.umass.edu>
Sent: 14 March 2022 14:15
To: Ethan Bannister <qs18...@bristol.ac.uk>; gem5-users@gem5.org 
<gem5-users@gem5.org>
Subject: Re: [gem5-users] Modelling cache flushing on gem5 (RISC-V)

I just skimmed that paper (not surprised to see Gernot Heiser's name there!)
and I think that, while it would be a little bit of work, it might not be
*too* hard to implement something like fence.t for the caches.  It would be
substantially different from wbinvd.  The latter speaks to the whole cache
system, and I implemented it by a request that flows all the way up to the Point
of Coherence (memory bus) and back down as a new kind of snoop to all the
caches that talk through one or more levels to memory.  Then each cache
essentially has a little engine for writing dirty lines back.  It's that part
that would be useful here - I guess we'd be looking at a variation on it,
triggered in a slightly different way (not by a snoop, but by a different kind
of request).  To get sensible timings you'd need to decide what hardware
mechanisms are available for finding dirty lines.  I assumed they were indexed
in some way such that finding a set with one or more dirty lines had no
substantial overhead.  L1 cache is small enough that we might get by with that
assumption.  Alternatively, assuming each set provides an "at least one dirty
line" bit, and that 64 of the these set bits can be examined by a priority
encoder to give you a set to work on - or indicate that all 64 sets are clean
- then a typical L1 cache would not need many cycles of reading those bits out
to find the relevant sets.

For 64 KB cache, 64 B lines, associativity 2, there are 512 sets, meaning we'd
need to read 8 groups of 64 of these "dirty set" bits.  The actual writing
back would usually take most of the time.
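
In code terms the scan amounts to something like this (just a sketch, with a 
count-trailing-zeros instruction standing in for the priority encoder):

    #include <cstdint>
    #include <vector>

    constexpr int cacheSize = 64 * 1024;                      // 64 KB
    constexpr int lineSize  = 64;                             // 64 B lines
    constexpr int assoc     = 2;
    constexpr int numSets   = cacheSize / (lineSize * assoc); // 512 sets
    constexpr int numWords  = numSets / 64;                   // 8 words of bits

    // Returns the index of some set with at least one dirty line, or -1
    // if every set is clean.  Each 64-bit word is one priority-encoder
    // read, so at most numWords (8) reads are needed per pass.
    int
    findDirtySet(const std::vector<uint64_t> &dirtySetBits)
    {
        for (int w = 0; w < numWords; ++w) {
            if (dirtySetBits[w] != 0)
                return w * 64 + __builtin_ctzll(dirtySetBits[w]);
        }
        return -1;
    }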

Presumably you would need to wait until all the dirty lines make it to L2,
since if the writeback buffers are clogged there might still be a
communication channel there.  Still, by the time a context switch is complete,
those buffers may be guaranteed to have cleared - provided we can make an
argument that there is a fixed maximum amount of time needed for that to
happen.

Anyway, I hope this helps.

Eliot Moss