Hi all,

I am trying to learn how to implement cache coherence protocols in gem5 using 
SLICC.
I am currently working on an MSI protocol, similar to the one described in the 
gem5 book.
The protocol passes the random tester for X86 
(`configs/learning_gem5/part3/ruby_test.py`), even when faced with a very large 
workload (4+ cores, 100k+ accesses).
However, it fails when run with the pre-compiled "threads" binary 
(`configs/learning_gem5/part3/simple_ruby.py`), which aborts citing a deadlock.

Inspecting the generated error trace, I find no obvious sign of a deadlock 
(e.g. repeating sequences of messages). This, combined with the fact that the 
random tester finds no issues, leads me to think the error is not caused by, 
for example, improper allocation of the messages to the different networks 
creating circular dependencies. Instead, I find in the error trace that Store 
events are always followed by "SC_Failed" instead of "Done", which I presume 
means "Store Conditional Failed". I take it that the X86 "threads" binary uses 
Load-Link/Store-Conditional to implement some mutex/synchronization.

Consider the following section of the error trace:
```
533000   0        Seq               Begin       >       [0x2b9a8, line 0x2b980] ST
534000: system.caches.controllers0: MSI-cache.sm:1072: Store in state I at line 0x2b980
... *cache0 and directory transition to M* ...
585000   0        Seq           SC_Failed       >       [0x2b9a8, line 0x2b980] 0 cycles
586000   0        Seq               Begin       >       [0x9898, line 0x9880] IFETCH         -- Note this load is to a line separate from the stores
587000   0        Seq                Done       >       [0x9898, line 0x9880] 0 cycles
588000   0        Seq               Begin       >       [0x2b998, line 0x2b980] ST             -- Store to same line as before
589000: system.caches.controllers0: MSI-cache.sm:1072: Store in state M at line 0x2b980
589000   0        Seq           SC_Failed       >       [0x2b998, line 0x2b980] 0 cycles
589000   0    L1Cache           store_hit      M>M      [0x2b980, line 0x2b980]
```

In this short trace we first see a store to line 0x2b980, which is not present 
in the cache. It finishes with the event "SC_Failed", which seems reasonable 
to me given that the store required a coherence transaction. We then see an 
instruction fetch to an unrelated line, which does not evict line 0x2b980. 
Finally we see another store to line 0x2b980, which this time hits in M state, 
yet it is once again followed by SC_Failed instead of Done. I also find it a 
bit odd that SC_Failed is reported before the store_hit event, which is the 
only event triggered when the cache receives a ST event to a line in M state.
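
To check this pattern over the whole trace rather than by eye, a small Python 
script along the following lines could tally how each sequencer request 
completes. It is only a sketch: the regular expression assumes that every 
sequencer record sits on a single line in the format shown above, and the names 
`SEQ_LINE`, `summarize` and `SAMPLE` are made up for illustration and are not 
part of gem5.

```python
import re
from collections import Counter

# Matches sequencer records such as
#   "585000   0   Seq   SC_Failed   >   [0x2b9a8, line 0x2b980] 0 cycles"
# The layout is assumed from the excerpt above, not taken from gem5 sources.
SEQ_LINE = re.compile(
    r"^\s*(?P<tick>\d+)\s+(?P<seq>\d+)\s+Seq\s+"
    r"(?P<event>Begin|Done|SC_Failed)\s+>\s+"
    r"\[(?P<addr>0x[0-9a-fA-F]+), line (?P<line>0x[0-9a-fA-F]+)\]"
    r"\s*(?P<rest>.*)$"
)


def summarize(trace_lines):
    """Count completions keyed by (request type, 'Done' or 'SC_Failed')."""
    pending = {}          # (sequencer id, address) -> request type
    outcomes = Counter()
    for raw in trace_lines:
        m = SEQ_LINE.match(raw)
        if m is None:
            continue      # controller events and DPRINTF lines are skipped
        key = (m["seq"], m["addr"])
        if m["event"] == "Begin":
            # The first token after the address is the request type.
            fields = m["rest"].split()
            pending[key] = fields[0] if fields else "?"
        else:
            outcomes[(pending.pop(key, "?"), m["event"])] += 1
    return outcomes


SAMPLE = [
    "533000   0   Seq       Begin   >   [0x2b9a8, line 0x2b980] ST",
    "585000   0   Seq   SC_Failed   >   [0x2b9a8, line 0x2b980] 0 cycles",
    "586000   0   Seq       Begin   >   [0x9898, line 0x9880] IFETCH",
    "587000   0   Seq        Done   >   [0x9898, line 0x9880] 0 cycles",
    "588000   0   Seq       Begin   >   [0x2b998, line 0x2b980] ST",
    "589000   0   Seq   SC_Failed   >   [0x2b998, line 0x2b980] 0 cycles",
    "589000   0   L1Cache   store_hit   M>M   [0x2b980, line 0x2b980]",
]

for (req_type, outcome), count in sorted(summarize(SAMPLE).items()):
    print(req_type, outcome, count)
```

On the excerpt above this counts two ST requests ending in SC_Failed and one 
IFETCH ending in Done; a single ST ending in Done anywhere in the full trace 
would show that the failure is not unconditional.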

My code for handling the store_hit in M state is as follows:
```
assert(is_valid(cache_entry));
cache.setMRU(cache_entry);
sequencer.writeCallback(in_msg.LineAddress, entry.cache_line, false);
mandatory_in.dequeue(clockEdge());
```

I realise my question thus far is a bit vague, which I apologise for. What I am 
hoping is that someone more knowledgeable than me could help me understand the 
following:
1. Is my interpretation of SC_Failed as "Store Conditional Failed" correct? (I 
thought x86 didn't support LL/SC, so this seems a bit fishy to me...)
2. Am I right in thinking that if stores are always followed by SC_Failed, this 
might cause a deadlock when executing the "threads" binary 
(`tests/test-progs/threads/bin/X86/linux/threads`)?
3. Any suggestions as to why I always get SC_Failed, even though, for example, 
the action for stores hitting in M only invokes setMRU and writeCallback?
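
To make question 2 concrete, here is a toy Python model of the kind of retry 
loop I have in mind. It is not gem5 code and is not taken from the "threads" 
binary; `spin_acquire` and `fails_n_times` are hypothetical names, and the 
attempt budget exists only so that the example terminates.

```python
def spin_acquire(try_conditional_store, max_attempts=1000):
    """Retry a conditional store until it succeeds.

    Returns the number of attempts used, or None if the budget ran out.
    """
    for attempt in range(1, max_attempts + 1):
        if try_conditional_store():
            return attempt
    return None


def fails_n_times(n):
    """Build a hypothetical store that fails n times and then succeeds."""
    remaining = [n]

    def attempt():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return attempt


print(spin_acquire(fails_n_times(2)))  # store succeeds on the third attempt
print(spin_acquire(lambda: False))     # store never succeeds within the budget
```

If the real synchronization code retries in a similar way, a store that is 
always reported as failed would keep the thread from ever making progress.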

Apologies for the lengthy question!

Best Regards,
Theo Olausson