Hi all,

I am trying to run a Linux kernel in FS mode, with a custom-rolled SLICC/Ruby 
directory-based cache coherence protocol, but it seems like the memory 
controller is dropping some requests in rare circumstances -- possibly due to 
it being overwhelmed with requests.

The protocol seems to work fine for a long time but about 90% of the way into 
booting the kernel, around the same time as the "mounting filesystems..." 
message appears, gem5 crashes and reports a deadlock.
Inspecting the trace, it seems that the deadlock occurs during a period of very 
high main memory traffic; the trace looks something like this:
> Directory receives DMA read request for Address 1, sends MEMORY_READ to 
> memory controller
> Directory receives DMA read request for Address 2, sends MEMORY_READ to 
> memory controller
> ...
> Directory receives DMA read request for Address N, sends MEMORY_READ to 
> memory controller
> Directory receives CPU read request for Address A, sends MEMORY_READ to 
> memory controller

After some time, the Directory receives responses for all of the DMA-induced 
requests (Addresses 1..N). However, it never hears back about the MEMORY_READ 
to Address A, and so eventually gem5 calls it a day and reports a deadlock. 
Address A is distinct from Addresses 1..N, so its read should not be affected 
by the requests to the other addresses.
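
To make the "never hears back" observation concrete, here is a minimal sketch of how requests and responses could be paired up mechanically instead of by eye. It rests on assumptions: the line format ("<tick> <event> <hex address>") and the MEMORY_DATA response name are made up for illustration and are not gem5's real trace output, and find_unanswered is a hypothetical helper, so the regular expression would need adapting to whatever the actual trace looks like.

```python
import re

# Hypothetical line format, NOT gem5's real trace output:
#   "<tick> <event> <hex address>"
# MEMORY_DATA is an assumed name for the memory controller's response.
LINE_RE = re.compile(r"^(\d+)\s+(MEMORY_READ|MEMORY_DATA)\s+(0x[0-9a-fA-F]+)$")


def find_unanswered(lines):
    """Return {address: tick of oldest request that never got a response}."""
    pending = {}  # address -> ticks of outstanding requests, oldest first
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that are not memory requests/responses
        tick = int(m.group(1))
        event = m.group(2)
        addr = int(m.group(3), 16)
        if event == "MEMORY_READ":
            pending.setdefault(addr, []).append(tick)
        elif pending.get(addr):
            # A response retires the oldest outstanding request to that address.
            pending[addr].pop(0)
            if not pending[addr]:
                del pending[addr]
    return {addr: ticks[0] for addr, ticks in pending.items()}


# Toy trace mirroring the scenario: two DMA reads are answered,
# the CPU read to "Address A" (0xa000 here) is not.
sample = [
    "100 MEMORY_READ 0x1000",
    "110 MEMORY_READ 0x2000",
    "120 MEMORY_READ 0xa000",
    "300 MEMORY_DATA 0x1000",
    "310 MEMORY_DATA 0x2000",
]
unanswered = find_unanswered(sample)
for addr, tick in unanswered.items():
    print(f"no response for read of {addr:#x} issued at tick {tick}")
```

Run over a full trace, anything left in the result at the point of the deadlock would be a read that the Directory issued but never saw answered.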

I have tried:
* Using the same kernel with one of the example SLICC protocols 
(MOESI_CMP_directory). No error occurred, so the underlying issue must be with 
my protocol.
* Upping the memory size to 8192MB (from 512MB) and increasing the number of 
channels to 4 (from 1). Under this configuration the above issue does not 
occur, and the Linux kernel happily finishes booting. This, combined with the 
fact that it takes so long for any issues to occur, makes me think that my 
protocol is somehow overwhelming the memory controller, causing it to drop the 
request to read Address A. In other words, I am pretty confident that the error 
is not something as simple as forgetting to pop the memory queue.

If anyone has any clues as to what might be going on I would very much 
appreciate your comments.
I was especially wondering about the following:
* Is it even possible for requests to main memory to fail due to, for example, 
network congestion? If so, is there any way to catch this and retry the request?
* (Noob question): Where in gem5 do the main memory requests "go to"? Is there 
a debugging flag I could use to check whether the main memory receives the 
request?

Best,
Theo Olausson
Univ. of Edinburgh
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org