On Mon, Feb 27, 2012 at 7:38 AM, Stefan Neumann <
[email protected]> wrote:
> Hi guys,
>
> I am using MARSSx86 to do some investigations on cache design and noticed
> some strange behavior in the implementation of the simulator for store
> handling.
>
> Stores are fed into the memory hierarchy during the commit stage, and the
> corresponding ROB and LSQ entries are deallocated immediately afterwards.
> In some situations this will cause the pendingRequest list in the
> cpuController to be flooded with store requests.
> This does not seem right from an architectural point of view, right? All
> memory requests need to be tracked by some kind of hardware structure until
> the data is finally merged into the cache.
> In MARSSx86 this would be the ROB+STQ entry. In case of a cache miss,
> finalization of the request might take some time, and new requests from the
> issueQ must be re-issued when the LD/ST queues are full.
>
I agree that the queue in the cpu controller should be tracked either from
the CPU side or from the cache side. The reason I put 'cache side' is that
CPU controllers are designed to be purely non-architectural structures,
used only to simplify the cache/cpu interface.
In the current implementation we don't track any pending stores in the
cpucontroller's queue, because when a store is committed we update the
'data' in RAM directly and only simulate the timing effect of storing the
data to the caches. So while the store is pending in the cpucontroller's
queue, a load to the same address can still read the latest data from RAM,
and correctness is not broken.
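A minimal sketch of that split, with all names illustrative rather than
taken from the MARSS sources:

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    // Toy model of the scheme described above: the store's data reaches
    // RAM at commit time, while only a data-less timing request sits in
    // the cpucontroller queue.
    static std::unordered_map<uint64_t, uint64_t> ram;  // functional state
    struct TimingRequest { uint64_t addr; };            // carries no data
    static std::deque<TimingRequest> pending;           // cpucontroller queue

    void commit_store(uint64_t addr, uint64_t data) {
        ram[addr] = data;            // functional update happens immediately
        pending.push_back({addr});   // timing request may linger on a miss
        // ROB/STQ entries can be freed here without hurting correctness.
    }

    uint64_t load(uint64_t addr) {
        // Even if a write to addr is still in 'pending', the load already
        // sees the committed value.
        return ram[addr];
    }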
> I did some debugging by dumping the pendingRequest list in the
> cpuController from time to time as I was curious about the purpose and
> functionality of the pendingRequest list.
> For that reason I increased the size to 512 entries (also increased the
> pendingRequest lists of the cacheControllers) and what happens is that in
> some cases the list will fill up with store requests.
>
The cpucontroller queue gets filled up with store requests because
'access_fast_path' (used for fast access to the L1-D cache) currently
works only for loads. We should change this function to support stores so
that we don't clog up the cpu controller queue.
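Conceptually the change is small. An illustrative-only sketch (the real
access_fast_path signature in MARSS is different):

    #include <cstdint>
    #include <unordered_set>

    enum OpType { MEMORY_OP_READ, MEMORY_OP_WRITE };
    static std::unordered_set<uint64_t> l1d_tags;  // lines present in L1-D

    // Returns the hit latency, or -1 if the request must take the regular
    // queued path through the cpucontroller.
    int access_fast_path(uint64_t line_addr, OpType op) {
        // Current behavior would be: if (op == MEMORY_OP_WRITE) return -1;
        if (!l1d_tags.count(line_addr))
            return -1;                 // miss: fall back to the queue
        if (op == MEMORY_OP_WRITE) {
            // A store hit completes here too; its data is already in RAM
            // at commit, so only the timing effect needs to be modeled.
        }
        return 2;                      // e.g. L1-hit latency in cycles
    }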
> I would assume that this list can hold at most STQ_SIZE+LDQ_SIZE (+ some
> entries for icache requests?) entries.
>
> Because the ROB/STQ entries are deallocated and the stores_in_flight
> counter is decremented right after the store request, a subsequent store
> might allocate the same ROB/STQ entry and send a new request of its own,
> while the first store is still in flight due to a miss.
>
> Request{Memory Request: core[0] thread[0] address[0x00011db42bc0]
> robid[*109*] init-cycle[276140] ref-counter[4] op-type[memory_op_write]
> isData[1] ownerUUID[288412] ownerRIP[0x4ca4b2] History[ {+core_0_cont}
> {+L1_D_0} {+L2_0} {+MEM_0} ] Signal[ ooo_0_0-dcache-wakeup] } idx[145]
> cycles[-428] depends[176] waitFor[-1] annuled[0]
>
> Request{Memory Request: core[0] thread[0] address[0x00011db42cf0]
> robid[*109*] init-cycle[276185] ref-counter[1] op-type[memory_op_write]
> isData[1] ownerUUID[288540] ownerRIP[0x4ca4ab] History[ {+core_0_cont} ]
> Signal[ ooo_0_0-dcache-wakeup] } idx[225] cycles[-383] depends[226]
> waitFor[242] annuled[0]
>
> (I have added the robid here for debugging purposes. In the original
> sources the robid is always zero in case of a store request)
>
> I could observe some situations where over 460 store requests were present
> in the pendingRequest list of the cpuController. (size = 512)
> This will happen if, for example, a memset function is called to zero a
> bunch of cachelines inside a loop.
>
> What do you think about this?
> I think it would be a valid scenario if the ROB entry were deallocated
> after the store request, but in that case the STQ entry needs to stay valid
> until the store request is finalized. I am not sure if that's possible, as
> the ROB and LSQ are closely bound together in MARSS, if I interpreted the
> code correctly.
> For now I solved the issue by limiting the allocation of new pending
> requests during the call to MemoryHierarchy::is_cache_available(). I track
> the number of pending loads and stores inside the cpuController class and
> only allow allocation if the store count does not exceed STQ_SIZE and the
> load count does not exceed LDQ_SIZE. If the function returns false, a load
> operation will be re-issued and a store may not commit at that point.
> Though I am not sure if that's a good solution for this.
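> A simplified sketch of that check (sizes and names here are illustrative,
> not the exact change):
>
>     static const int LDQ_SIZE = 48;   // example queue sizes
>     static const int STQ_SIZE = 32;
>
>     struct CPUControllerThrottle {
>         int pending_loads  = 0;   // incremented/decremented as requests
>         int pending_stores = 0;   // are allocated and finalized
>
>         // Called from MemoryHierarchy::is_cache_available(): deny new
>         // requests once the LSQ could no longer track them in hardware.
>         bool is_cache_available(bool is_store) const {
>             return is_store ? pending_stores < STQ_SIZE
>                             : pending_loads  < LDQ_SIZE;
>         }
>     };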
>
To minimize the effect of the cpu controller queue on cache access, we
should start with a modification to access_fast_path to allow stores. As
this cpu controller queue is not a real architectural module, we don't
have to implement any tracking for pending stores. From the CPU side, all
stores that are committed are written to the cache, and subsequent
requests to those cache lines will return the most up-to-date data.
I agree that this is really complicated because of the queuing in the
cpucontroller. Since the cpu is now woken up on a cache miss by a signal
carried in the request, we don't need to track pending requests in the
cpu-controller at all, and we could completely remove this queue.
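A toy version of that wakeup path, with illustrative names (this is the
role the ooo_0_0-dcache-wakeup signal in the dumps above plays):

    #include <cstdint>
    #include <functional>

    // The request itself carries the wakeup callback, so nothing on the
    // controller side has to keep a pending list.
    struct Request {
        uint64_t addr;
        std::function<void(Request*)> wakeup;  // set by the CPU at issue
    };

    void on_miss_serviced(Request* req) {
        // When the miss finally completes, the cache signals the CPU
        // directly through the request's own callback.
        req->wakeup(req);
    }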
- Avadh
> Regards,
> Stefan
>
_______________________________________________
http://www.marss86.org
Marss86-Devel mailing list
[email protected]
https://www.cs.binghamton.edu/mailman/listinfo/marss86-devel