[gem5-users] SLICC/Ruby (Mesh topology): L2 between directory and DRAM with same address ranges as the directory
Hey there folks,

I am trying to add an L2 between the directory and DRAM in an (otherwise flat) SLICC protocol I've been working on, but have been running into some issues. I know that some of the example protocols in src/mem/ruby/protocol/ do have co-located L3s alongside the directories, but to keep the directory's state machine simple I would really like to keep the controllers separate. As far as I've been able to find, there aren't any example protocols which connect a cache directly to memory. If anyone has worked on something similar and has pointers to examples I could have a look at, I would be very grateful!

Some more thorough information about what I'm trying to do and what I've tried so far:

-- Background --

The system I am hoping to create is a mesh of nodes, where each node has a CPU, a private L1, and a pair consisting of a directory and an L2 that is responsible for a subset of the address space. So, if a CPU makes a request that cannot be satisfied by its L1, it sends a message to a directory using the mapAddressToMachine function. Depending on the address of the request, the message will be routed through the mesh to the corresponding directory. So far so good: setting this up has been easy thanks to the Mesh_XY topology and the setup_memory_controllers() function in configs/Ruby/Ruby.py.

My woes come from the fact that instead of connecting the directory to memory, I now want the directory to send its main-memory requests to an L2, and to then have that L2 connected to main memory. The L2 has a simple Valid/Invalid state design, since it effectively just serves as a DRAM cache (i.e. the directory is responsible for upholding SWMR). Unlike a DRAM cache, however, I want the L2 to be co-located with the directory: if the directory at node /n/ makes a request using mapAddressToMachine(..., MachineType:L2Cache), then the target L2 should also be on node /n/.
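The co-location constraint above boils down to the directory and the L2 at each node sharing one and the same address-to-node mapping. A toy sketch of such an interleaved mapping (all names and parameters here are hypothetical illustrations, not the gem5 API; the real logic lives behind mapAddressToMachine and the controllers' address ranges):

```python
# Toy model: interleave cache lines across mesh nodes.
# BLOCK_BITS and NUM_NODES are made-up example parameters.
BLOCK_BITS = 6    # 64-byte cache lines
NUM_NODES = 16    # e.g. a 4x4 mesh

def home_node(addr: int, num_nodes: int = NUM_NODES) -> int:
    """Map a physical address to its home node: line address modulo node count."""
    line = addr >> BLOCK_BITS
    return line % num_nodes
```

If both the Directory and L2Cache machine types are configured with ranges derived from the same function, then mapAddressToMachine(addr, MachineType:Directory) and mapAddressToMachine(addr, MachineType:L2Cache) necessarily land on the same node, which is exactly the co-location property described above.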
-- Approach --

To set this up, I have taken some of the code from setup_memory_controllers() and modified it so that the memory controllers are connected to the L2s, and so that the L2s and the directories have the same addr_ranges. I have tried both generating the addr_ranges manually with the m5.objects.AddrRange() constructor and setting them equal to the addr_ranges of the constructed DRAM controllers, without success. This can be seen in my config file for the protocol:

https://gist.githubusercontent.com/theoxo/56d35e7a38a01155029748199c1ac7c9/raw/fe031542188ecfbfc41a791b91756d975777dae9/gistfile1.txt

-- Problem --

Unfortunately, this doesn't seem to work. I've been testing in SE mode with the "threads" test program, and while it does run successfully for some time, I eventually encounter the following error:

> panic: Tried to read unmapped address 0.
> PC: 0x7890, Instr: ADD_M_R : ldst t1b, DS:[rax]

As far as I understand, this means that my attempt at setting up the addr_ranges is failing? My understanding of gem5 internals is unfortunately quite shallow, so I am struggling to decode more than that from the error message.

Sorry about the long email -- if anyone recognizes any of these issues from similar systems you've configured yourself, or knows of any pointers to example protocols that are at all similar, please do let me know!

Best,
Theo Olausson
Univ. of Edinburgh
___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org
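The "Tried to read unmapped address 0" panic above typically means that no memory controller's address range contains the faulting address. A quick, self-contained way to sanity-check a hand-built set of ranges before handing them to gem5 (a sketch only; real gem5 ranges are AddrRange objects and may also be interleaved):

```python
# Check that a list of (start, size) ranges exactly tiles [0, mem_size):
# no gaps (unmapped addresses) and no overlaps.
def covers(ranges, mem_size):
    """Return True iff the (start, size) ranges exactly tile [0, mem_size)."""
    spans = sorted((start, start + size) for start, size in ranges)
    pos = 0
    for start, end in spans:
        if start != pos:   # gap (start > pos) or overlap (start < pos)
            return False
        pos = end
    return pos == mem_size
```

In the failing run, address 0 is unmapped, so a check like this against the configured dir/L2/memory-controller ranges would flag a gap at the bottom of the address space.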
[gem5-users] Re: SLICC: Main memory overwhelmed by requests?
Thank you (once again) for your helpful answers, Jason! After doing some more experimenting following your suggestions, I've found that increasing the deadlock threshold (by several orders of magnitude) does not make the problem go away, nor does increasing the number of memory channels to 4 on its own. Nonetheless, I suspect your gut feeling that bandwidth problems are to blame still holds, as increasing the DRAM size to 8192MB makes the "deadlock" go away and is accompanied by the following message:

> warn: Physical memory size specified is 8192MB which is greater than 3GB.
> Twice the number of memory controllers would be created.

More memory controllers => less congestion per memory controller -- makes sense to me! I wonder if the underlying issue has more to do with the cache hierarchy of my system (e.g. no L2 cache, only small L1s) than with the protocol itself. Either way, having found a band-aid solution is good enough for my current purposes :)

Thanks again for your help, Jason!

Best,
Theo
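For anyone wanting to reproduce the band-aid described above, the memory size and channel count can be set from the command line of the standard example scripts (option names as found in configs/common/Options.py; adjust the script path and build target to your own setup):

```shell
# Larger memory (triggers the doubled-controller warning above) plus
# more channels; --ruby selects the Ruby memory system.
./build/X86/gem5.opt configs/example/fs.py \
    --ruby \
    --mem-size=8192MB \
    --mem-channels=4
```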
[gem5-users] SLICC: Main memory overwhelmed by requests?
Hi all,

I am trying to run a Linux kernel in FS mode with a custom-rolled SLICC/Ruby directory-based cache coherence protocol, but it seems like the memory controller is dropping some requests in rare circumstances -- possibly due to being overwhelmed with requests.

The protocol seems to work fine for a long time, but about 90% of the way into booting the kernel, around the same time as the "mounting filesystems..." message appears, gem5 crashes and reports a deadlock. Inspecting the trace, it seems that the deadlock occurs during a period of very high main-memory traffic; the trace looks something like this:

> Directory receives DMA read request for Address 1, sends MEMORY_READ to memory controller
> Directory receives DMA read request for Address 2, sends MEMORY_READ to memory controller
> ...
> Directory receives DMA read request for Address N, sends MEMORY_READ to memory controller
> Directory receives CPU read request for Address A, sends MEMORY_READ to memory controller

After some time, the directory receives responses for all of the DMA-induced requests (Addresses 1..N). However, it never hears back about the MEMORY_READ to Address A, so eventually gem5 calls it a day and reports a deadlock. Address A is distinct from Addresses 1..N, so its read should not be affected by the requests to the other addresses.

I have tried:

* Using the same kernel with one of the example SLICC protocols (MOESI_CMP_directory). No error occurred, so the underlying issue must be with my protocol.
* Upping the memory size to 8192MB (from 512MB) and increasing the number of channels to 4 (from 1). Under this configuration the issue does not occur, and the Linux kernel happily finishes booting.

This, combined with the fact that it takes so long for any issues to occur, makes me think that my protocol is somehow overwhelming the memory controller, causing it to drop the request to read Address A.
In other words, I am pretty confident that the error is not something as simple as forgetting to pop the memory queue. If anyone has any clues as to what might be going on, I would very much appreciate your comments. I was especially wondering about the following:

* Is it even possible for requests to main memory to fail due to, for example, network congestion? If so, is there any way to catch this and retry the request?
* (Noob question): Where in gem5 do the main-memory requests "go"? Is there a debug flag I could use to check whether main memory receives the request?

Best,
Theo Olausson
Univ. of Edinburgh
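On the "reports a deadlock" symptom above: Ruby's deadlock detection is essentially an age limit on outstanding requests, so a single response that never arrives is indistinguishable from a true deadlock. A toy model of that check (the threshold value and data layout here are made up for illustration, not the Ruby implementation):

```python
# Toy version of a Sequencer-style deadlock check: any request outstanding
# longer than the threshold trips the "deadlock" panic, even if the real
# cause is a lost or stalled response rather than a protocol cycle.
DEADLOCK_THRESHOLD = 50_000  # ticks; hypothetical value

def check_deadlock(outstanding, now, threshold=DEADLOCK_THRESHOLD):
    """Return the addresses whose requests have been outstanding too long.

    `outstanding` maps address -> tick at which the request was issued.
    """
    return [addr for addr, issued in outstanding.items()
            if now - issued > threshold]
```

This is why raising the deadlock threshold (as tried in the follow-up message) only delays the panic if a response is genuinely never coming back: the aged-out request stays outstanding forever.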
[gem5-users] Re: Stores always cause SC_Failed in Ruby/SLICC protocol
Hi Jason,

Thank you for your very helpful (and prompt) reply! You were right that the SC_Failed was a red herring. After playing around with my protocol a bit more, the issue seems to have been that I was making the callback for load and store hits (e.g. `sequencer.{x}Callback(address, entry, false)`) directly in the mandatoryQueue_in definition, rather than by invoking a transition which then made the callback -- it appears this makes the callback silently fail. What's a bit strange is that callbacks for external hits (e.g. `sequencer.{x}Callback(address, entry, true, {data-source})`) seem to work just fine when declared directly in an in_port rather than as part of an invoked transition... I am not sure whether this is because the mandatoryQueue is a bit special, or because of the "initial access was a miss" flag.

Thank you once again for taking the time to help out a less experienced fisherman :)

Best,
Theo
[gem5-users] Stores always cause SC_Failed in Ruby/SLICC protocol
Hi all,

I am trying to learn how to implement cache coherence protocols in gem5 using SLICC. I am currently working on an MSI protocol, similar to the one described in the gem5 book. The protocol passes the random tester for X86 (`configs/learning_gem5/part3/ruby_test.py`), even when faced with a very large workload (4+ cores, 100k+ accesses). It does not, however, pass the tester which executes the pre-compiled "threads" binary (`configs/learning_gem5/part3/simple_ruby.py`), which reports a deadlock.

Inspecting the generated error trace, I find no obvious reason for a deadlock (e.g. repeating sequences of messages). This, combined with the fact that the random tester is unable to find any issues, leads me to think the error is not caused by, for example, improper allocation of the messages to the different networks causing circular dependencies. Instead, inspecting the error trace I find that Store events are always followed by "SC_Failed" instead of "Done", which I presume means "Store Conditional Failed". I take it that the X86 "threads" binary uses Load-Link/Store-Conditional to implement some mutex/synchronization.

Consider the following section of the error trace:

```
533000  0 Seq Begin     > [0x2b9a8, line 0x2b980] ST
534000: system.caches.controllers0: MSI-cache.sm:1072: Store in state I at line 0x2b980
... *cache0 and directory transition to M* ...
585000  0 Seq SC_Failed > [0x2b9a8, line 0x2b980] 0 cycles
586000  0 Seq Begin     > [0x9898, line 0x9880] IFETCH  -- Note this load is to a line separate from the stores
587000  0 Seq Done      > [0x9898, line 0x9880] 0 cycles
588000  0 Seq Begin     > [0x2b998, line 0x2b980] ST  -- Store to same line as before
589000: system.caches.controllers0: MSI-cache.sm:1072: Store in state M at line 0x2b980
589000  0 Seq SC_Failed > [0x2b998, line 0x2b980] 0 cycles
589000  0 L1Cache store_hit M>M [0x2b980, line 0x2b980]
```

In this short trace we first see a store to line 0x2b980, which is not present in the cache.
This store finishes with the event "SC_Failed", which seems reasonable to me given that the store required a coherence transaction. We then see a load to an unrelated line, which does not evict line 0x2b980. Finally, we see another store to line 0x2b980, which this time hits in M state, yet it is once again followed by SC_Failed instead of Done. I also find it a bit odd that SC_Failed is reported before the store_hit event (which is the only event triggered when the cache receives a ST to a line in M state) is reported as having taken place.

My code for handling the store_hit in M state is as follows:

```
assert(is_valid(cache_entry));
cache.setMRU(cache_entry);
sequencer.writeCallback(in_msg.LineAddress, cache_entry.cache_line, false);
mandatory_in.dequeue(clockEdge());
```

I realise my question thus far is a bit vague, which I apologise for. What I am hoping is that someone more knowledgeable than me could help me understand the following:

1. Is my interpretation of SC_Failed as "Store Conditional Failed" correct? (I thought x86 didn't support LL/SC, so this seems a bit fishy to me...)
2. Am I right in thinking that if stores are always followed by SC_Failed, this might cause a deadlock when executing the "threads" (`tests/test-progs/threads/bin/X86/linux/threads`) binary?
3. Any suggestions as to why I always get SC_Failed, despite stores that hit in M state only invoking setMRU and writeCallback?

Apologies for the lengthy question!

Best Regards,
Theo Olausson
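For readers puzzling over question 1 above, the usual LL/SC contract is: a store-conditional succeeds only if the reservation taken by the matching load-link is still intact when the store executes. A toy model of that contract (not gem5 code; names and structure are invented for illustration):

```python
# Toy LL/SC semantics: one reservation ("monitor") per CPU. On this reading,
# an SC_Failed right after a coherence transaction is expected (the line was
# lost in between), while an SC_Failed on a clean M-state hit is suspicious.
class Monitor:
    """Per-CPU reservation on a single cache line."""
    def __init__(self):
        self.reserved = None

    def load_link(self, line):
        self.reserved = line            # take a reservation on this line

    def lose_line(self, line):          # e.g. invalidation or writeback
        if self.reserved == line:
            self.reserved = None

    def store_conditional(self, line):
        ok = self.reserved == line      # SC fails if the reservation is gone
        self.reserved = None            # reservation is consumed either way
        return ok
```

Under this model, the second store in the trace (hitting in M with no intervening loss of the line) should succeed, which supports the suspicion that the persistent SC_Failed comes from how the protocol reports the hit rather than from genuine reservation loss.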