Re: [m5-dev] Profile Results for Mesh Network
On Mon, 24 Jan 2011, Nilay Vaish wrote:

On Mon, 24 Jan 2011, Steve Reinhardt wrote:

Yes, that's right. So there's probably no big win in trying to further reduce the number of calls to lookup() in Ruby; the possibilities I see for improvement are:

1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a lookup that hits the MRU block should really only take a handful of instructions, while IIRC we were seeing much larger costs for the hash table lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86 has a byte-stream-oriented predecoder so it doesn't do a fetch per instruction anyway and (2) you may have to worry about self-modifying code. Gabe, how many bytes at a time does the x86 predecoder fetch? If it doesn't currently grab a cache line at a time, could it be made to do so, and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth pursuing or should we move on to #3.

Steve

I now understand why the ratio is 2:1. Before every instruction fetch, the data cache is looked up to make sure that it does not contain the cache block. After that the instruction cache is looked up. Similarly, before any data access, the instruction cache is looked up. This is probably for correctly handling self-modifying code.

Steve, we can try caching the MRU cache block. We can also try replacing the hash table with a two-dimensional array indexed using cache set and cache way. There are calls to CacheMemory::isTagPresent() in Sequencer.cc. These calls are made just before calls to setMRU(). I am thinking of folding these calls to isTagPresent() within setMRU(), which calls CacheMemory::findTagInSet() anyway.

-- Nilay

I tried caching the index for the MRU block, so that the hash table need not be looked up.
It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results show that the time taken by CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast, the time of execution increases by 2%. Earlier, when we had the same address being looked up multiple times in succession, this change made sense. Right now, I am uncertain about whether this change improves performance. -- Nilay ___ m5-dev mailing list m5-dev@m5sim.org http://m5sim.org/mailman/listinfo/m5-dev
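The change Nilay describes — remembering the most recently hit block so the next lookup for the same address can skip the hash table — might look roughly like the sketch below. The class and member names are illustrative, not the actual CacheMemory interface.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Entry {
    uint64_t addr;
    int data;
};

class SimpleCache {
  public:
    Entry *lookup(uint64_t addr) {
        // Fast path: the last hit was for this same address, so we can
        // answer without touching the hash table at all.
        if (m_lastIdx >= 0 && m_entries[m_lastIdx].addr == addr)
            return &m_entries[m_lastIdx];
        auto it = m_tagToIdx.find(addr);   // slow path: hash table lookup
        if (it == m_tagToIdx.end())
            return nullptr;
        m_lastIdx = it->second;            // remember the hit index
        return &m_entries[m_lastIdx];
    }
    void insert(uint64_t addr, int data) {
        m_tagToIdx[addr] = (int)m_entries.size();
        m_entries.push_back({addr, data});
    }
  private:
    std::vector<Entry> m_entries;
    std::unordered_map<uint64_t, int> m_tagToIdx;
    int m_lastIdx = -1;   // index of most recently hit entry, -1 if none
};
```

As the thread notes, this only pays off when the same address is looked up repeatedly in succession; otherwise the extra compare is pure overhead, which would explain the m5.fast slowdown.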
Re: [m5-dev] Profile Results for Mesh Network
On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I tried caching the index for the MRU block, so that the hash table need not be looked up. It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results show that the time taken by CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast, the time of execution increases by 2%. Earlier, when we had the same address being looked up multiple times in succession, this change made sense. Right now, I am uncertain about whether this change improves performance.

Hi Nilay, Are you talking about caching the MRU block for the entire cache? I agree that that isn't likely to help much now that we've eliminated the redundant lookup() calls. I was thinking of a conventional set-associative cache lookup (a 2D array, or a 1D array of sets where each set is represented by a linked list) where the MRU block of each set is moved to the head of the list. For an L1 cache in particular, most of the accesses to each set will be to the MRU block within that set, so even though you have a linear search within the set, you will often only have to do one comparison. See accessBlock() in mem/cache/tags/lru.cc for an example of this. And even if you don't hit the MRU block, most L1s have relatively small associativity, so a linear search is not that bad. I have a hard time believing that this wouldn't be substantially faster than a hash table lookup.

For caches that are more highly associative, or L2/L3 caches where the probability of hitting the MRU block is not as high, or for snoop lookups where you are typically searching the whole set to verify that a block is *not* there, then a hash table *may* make more sense. (Though even in that case I wonder whether a whole-cache hash table vs. per-set hash tables is a better idea.) But these should all be much less frequent than L1 hit lookups, so I would think the set-associative lookup would still be faster overall.
The optimal strategy in the long run may be to have both types of lookup implemented and to choose the right one at configuration time based on the cache's associativity and position... we sort of tried to do this in the original M5 code with the LRU and FALRU tags, but didn't really push the concept all the way to figuring out the optimal crossover point.

My mention above about snooping being a bad case for the set-associative linear-search lookup makes me wonder about the double icache/dcache lookup you mentioned... that could be part of the problem, and it's not necessary. There should be a real coherence protocol between the two caches, so that you only have to snoop the other cache in the same situations where you'd have to snoop another core's data cache... i.e., if either cache has an exclusive copy, or is only reading a shared copy, there's no point in doing that second lookup. Is there any notion in the protocol of having different protocol states for the icache and dcache?

Steve
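The per-set MRU search Steve describes (cf. accessBlock() in mem/cache/tags/lru.cc) can be sketched as follows. This is a simplified stand-in, not the actual M5 tags code; it assumes 64-byte lines, a power-of-two number of sets, and derives the set index from the tag bits for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>
#include <vector>

struct Block { uint64_t tag; bool valid; };

class SetAssocTags {
  public:
    // numSets must be a power of two for the index mask below.
    SetAssocTags(int numSets, int assoc)
        : m_sets(numSets), m_setMask(numSets - 1) {
        for (auto &set : m_sets)
            for (int i = 0; i < assoc; ++i)
                set.push_back(Block{0, false});
    }
    // On a hit, move the block to the head of its set (MRU position) and
    // return it; in the common L1 case the first tag compare succeeds.
    Block *accessBlock(uint64_t addr) {
        uint64_t tag = addr >> 6;                 // 64-byte lines assumed
        auto &set = m_sets[tag & m_setMask];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (it->valid && it->tag == tag) {
                set.splice(set.begin(), set, it); // move to MRU head
                return &set.front();
            }
        }
        return nullptr;  // miss: a real cache would allocate the LRU victim
    }
    void insert(uint64_t addr) {
        uint64_t tag = addr >> 6;
        auto &set = m_sets[tag & m_setMask];
        set.back() = Block{tag, true};            // overwrite LRU (tail)
        set.splice(set.begin(), set, std::prev(set.end()));
    }
  private:
    std::vector<std::list<Block>> m_sets;
    uint64_t m_setMask;
};
```

The splice-to-head keeps each set ordered most recently used first, so the linear search usually terminates after one comparison for L1 accesses.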
Re: [m5-dev] Profile Results for Mesh Network
From Steve's response, it looks like I'm jumping in the conversation on the wrong page. To be clear, Nilay, were you optimizing the lookup() calls or trying to reduce the number of times lookup gets called? My MRU comments and keeping things in the SimpleCPU were directed toward the latter.

On Thu, Jan 27, 2011 at 8:29 AM, Korey Sewell ksew...@umich.edu wrote: I tried caching the index for the MRU block, so that the hash table need not be looked up. It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results show that the time taken by CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast, the time of execution increases by 2%.

Hi Nilay, when you did the MRU block caching, did it reduce the # of sendFetch calls? I would think that that # should go down from, say, 140 million to something drastically smaller (...say 50 million)... My next question: why did you choose caching the MRU index rather than the MRU block? Ideally, what could happen is that in the TimingSimpleCPU you could check to see if that block is cached inside the CPU model, and if so you would not have to send a fetch to the memory system at all (thus reducing lookup() calls). Is this what is happening here?

-- Korey
Re: [m5-dev] Profile Results for Mesh Network
On Thu, 27 Jan 2011, Korey Sewell wrote: From Steve's response, it looks like I'm jumping in the conversation on the wrong page. To be clear, Nilay, were you optimizing the lookup() calls or trying to reduce the number of times lookup gets called? My MRU comments and keeping things in the SimpleCPU were directed toward the latter.

I am only trying to reduce the time taken by a CacheMemory::lookup() call.

-- Nilay
Re: [m5-dev] Profile Results for Mesh Network
On Thu, 27 Jan 2011, Steve Reinhardt wrote:

On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I tried caching the index for the MRU block, so that the hash table need not be looked up. It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results show that the time taken by CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast, the time of execution increases by 2%. Earlier, when we had the same address being looked up multiple times in succession, this change made sense. Right now, I am uncertain about whether this change improves performance.

Hi Nilay, Are you talking about caching the MRU block for the entire cache? I agree that that isn't likely to help much now that we've eliminated the redundant lookup() calls.

I had a single MRU block per cache in my experiments.

I was thinking of a conventional set-associative cache lookup (a 2D array, or a 1D array of sets where each set is represented by a linked list) where the MRU block of each set is moved to the head of the list. For an L1 cache in particular, most of the accesses to each set will be to the MRU block within that set, so even though you have a linear search within the set, you will often only have to do one comparison. See accessBlock() in mem/cache/tags/lru.cc for an example of this. And even if you don't hit the MRU block, most L1s have relatively small associativity, so a linear search is not that bad. I have a hard time believing that this wouldn't be substantially faster than a hash table lookup. For caches that are more highly associative, or L2/L3 caches where the probability of hitting the MRU block is not as high, or for snoop lookups where you are typically searching the whole set to verify that a block is *not* there, then a hash table *may* make more sense. (Though even in that case I wonder whether a whole-cache hash table vs. per-set hash tables is a better idea.)
But these should all be much less frequent than L1 hit lookups, so I would think the set-associative lookup would still be faster overall. The optimal strategy in the long run may be to have both types of lookup implemented and to choose the right one at configuration time based on the cache's associativity and position... we sort of tried to do this in the original M5 code with the LRU and FALRU tags, but didn't really push the concept all the way to figuring out the optimal crossover point.

I also have an implementation that performs a linear search of cache sets instead of a hash table lookup. Again, I saw a small improvement. But as you mentioned, when the associativity goes up, linear search will perform worse. Profile results showed that it takes about 100 ns for the linear search and about 125 ns for the hash table lookup, the associativity being 2 for the L1 caches. Maybe someone should independently profile m5.

My mention above about snooping being a bad case for the set-associative linear-search lookup makes me wonder about the double icache/dcache lookup you mentioned... that could be part of the problem, and it's not necessary. There should be a real coherence protocol between the two caches, so that you only have to snoop the other cache in the same situations where you'd have to snoop another core's data cache... i.e., if either cache has an exclusive copy, or is only reading a shared copy, there's no point in doing that second lookup. Is there any notion in the protocol of having different protocol states for the icache and dcache?

Both the icache and the dcache have the same set of states. But now that you have mentioned it, I think it is possible to do away with most of the double lookups. We maintain the invariant that caches under a single controller hold disjoint sets of cache blocks. In this case, the icache and dcache come under the same cache controller. Currently, for an instruction fetch, we check the dcache first. But this is obviously what we should not be doing.
In the common case (unless of course we simulate self-modifying code often enough), the dcache lookup will report a miss and the icache lookup will report a hit. So we should look up the icache first. In case of a hit, we know by our invariant that the dcache does not have the line. If we miss, then we should look up the dcache. We should adopt a similar strategy for loads and stores as well. We can split them under different controllers, but that would mean a lot more work. I will get back to you on this soon.

-- Nilay
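The lookup ordering Nilay proposes, relying on the invariant that the icache and dcache under one controller hold disjoint blocks, can be sketched like this. The names and the std::set stand-ins for the two tag arrays are purely illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

enum class Where { ICache, DCache, Neither };

struct L1Pair {
    std::set<uint64_t> ilines, dlines;  // stand-ins for the two tag arrays

    // Instruction fetch: probe the icache first. A hit there means, by
    // the disjoint-blocks invariant, that the dcache cannot have the
    // line, so the second lookup is skipped in the common case.
    Where findForFetch(uint64_t line) const {
        if (ilines.count(line)) return Where::ICache;  // common case
        if (dlines.count(line)) return Where::DCache;  // rare: e.g. SMC
        return Where::Neither;
    }
    // Loads/stores: same idea with the probe order reversed.
    Where findForData(uint64_t line) const {
        if (dlines.count(line)) return Where::DCache;  // common case
        if (ilines.count(line)) return Where::ICache;
        return Where::Neither;
    }
};
```

Compared with the current dcache-first fetch path, this turns the usual two lookups per fetch into one, which is exactly the 2:1 ratio observed in the profile.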
Re: [m5-dev] Profile Results for Mesh Network
On Thu, Jan 27, 2011 at 4:01 PM, Nilay Vaish ni...@cs.wisc.edu wrote: I also have an implementation that performs a linear search of cache sets instead of a hash table lookup. Again, I saw a small improvement. But as you mentioned, when the associativity goes up, linear search will perform worse. Profile results showed that it takes about 100 ns for the linear search and about 125 ns for the hash table lookup, the associativity being 2 for the L1 caches. Maybe someone should independently profile m5.

That's surprising to me, but I suppose if the hash function is cheap to compute and typically hits in the first bucket then the performance could be similar. I suppose this isn't worth a lot more effort; it's certainly less significant than the double-lookup issue below.

Both the icache and the dcache have the same set of states. But now that you have mentioned it, I think it is possible to do away with most of the double lookups. We maintain the invariant that caches under a single controller hold disjoint sets of cache blocks. In this case, the icache and dcache come under the same cache controller. Currently, for an instruction fetch, we check the dcache first. But this is obviously what we should not be doing. In the common case (unless of course we simulate self-modifying code often enough), the dcache lookup will report a miss and the icache lookup will report a hit. So we should look up the icache first. In case of a hit, we know by our invariant that the dcache does not have the line. If we miss, then we should look up the dcache. We should adopt a similar strategy for loads and stores as well.

Good point... I had forgotten the disjoint blocks invariant. It would seem like it's a total waste of time to look in any other cache than the one you expect to find it in unless there's a miss. I'd hope that's an easy fix.

We can split them under different controllers, but that would mean a lot more work.

Yea, I wasn't suggesting that.

I will get back to you on this soon.

Thanks!
Steve
Re: [m5-dev] Profile Results for Mesh Network
On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming() on the cache port, since those two should be bound as peers. I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and CacheController (?) relationship is working (right now at least), but the relationship between sendTiming and recvTiming is the key concept that connects 2 memory objects, unless things have changed.

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote: I dug more into the code today. There are three paths along which calls are made to RubyPort::M5Port::recvTiming(), which eventually result in calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the total number of memory references (48 million) for all the cpus together. The number of lookup() calls is about 392 million. If we take into account the calls to sendFetch(), then the ratio of the number of lookup() calls to the number of requests pushed into Ruby reduces to 2 to 1, from an earlier estimate of 8 to 1. My question would be why does sendFetch() make calls to recvTiming()?

Some more reading revealed that sendFetch() is calling recvTiming for instruction cache accesses, whereas the other two calls (handleReadPacket and handleWritePacket) are for data cache accesses.

Yes, that's right. So there's probably no big win in trying to further reduce the number of calls to lookup() in Ruby; the possibilities I see for improvement are:

1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a lookup that hits the MRU block should really only take a handful of instructions, while IIRC we were seeing much larger costs for the hash table lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86 has a byte-stream-oriented predecoder so it doesn't do a fetch per instruction anyway and (2) you may have to worry about self-modifying code. Gabe, how many bytes at a time does the x86 predecoder fetch? If it doesn't currently grab a cache line at a time, could it be made to do so, and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth pursuing or should we move on to #3.

Steve
Re: [m5-dev] Profile Results for Mesh Network
On Mon, 24 Jan 2011, Steve Reinhardt wrote:

On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming() on the cache port, since those two should be bound as peers. I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and CacheController (?) relationship is working (right now at least), but the relationship between sendTiming and recvTiming is the key concept that connects 2 memory objects, unless things have changed.

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote: I dug more into the code today. There are three paths along which calls are made to RubyPort::M5Port::recvTiming(), which eventually result in calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the total number of memory references (48 million) for all the cpus together. The number of lookup() calls is about 392 million. If we take into account the calls to sendFetch(), then the ratio of the number of lookup() calls to the number of requests pushed into Ruby reduces to 2 to 1, from an earlier estimate of 8 to 1. My question would be why does sendFetch() make calls to recvTiming()?

Some more reading revealed that sendFetch() is calling recvTiming for instruction cache accesses, whereas the other two calls (handleReadPacket and handleWritePacket) are for data cache accesses.

Yes, that's right. So there's probably no big win in trying to further reduce the number of calls to lookup() in Ruby; the possibilities I see for improvement are:

1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a lookup that hits the MRU block should really only take a handful of instructions, while IIRC we were seeing much larger costs for the hash table lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86 has a byte-stream-oriented predecoder so it doesn't do a fetch per instruction anyway and (2) you may have to worry about self-modifying code. Gabe, how many bytes at a time does the x86 predecoder fetch? If it doesn't currently grab a cache line at a time, could it be made to do so, and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth pursuing or should we move on to #3.

Steve

I now understand why the ratio is 2:1. Before every instruction fetch, the data cache is looked up to make sure that it does not contain the cache block. After that the instruction cache is looked up. Similarly, before any data access, the instruction cache is looked up. This is probably for correctly handling self-modifying code.

Steve, we can try caching the MRU cache block. We can also try replacing the hash table with a two-dimensional array indexed using cache set and cache way. There are calls to CacheMemory::isTagPresent() in Sequencer.cc. These calls are made just before calls to setMRU(). I am thinking of folding these calls to isTagPresent() within setMRU(), which calls CacheMemory::findTagInSet() anyway.

-- Nilay
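Folding the Sequencer's isTagPresent() probe into setMRU(), as Nilay suggests, might look like the sketch below: setMRU() does the tag search once and reports whether the tag was found, so the caller no longer needs a separate probe (and a second findTagInSet()-style walk). The names follow the discussion; the real CacheMemory/Sequencer signatures may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

class CacheSketch {
  public:
    // Returns false instead of requiring a prior isTagPresent() check,
    // so one search replaces the previous two.
    bool setMRU(uint64_t addr) {
        auto it = m_blocks.find(addr);   // single findTagInSet-style search
        if (it == m_blocks.end())
            return false;                // tag absent: caller handles it
        it->second = ++m_clock;          // record last-touch time
        return true;
    }
    void insert(uint64_t addr) { m_blocks[addr] = ++m_clock; }
  private:
    std::unordered_map<uint64_t, uint64_t> m_blocks;  // addr -> last touch
    uint64_t m_clock = 0;
};
```

The usage change on the caller's side is simply `if (cache.setMRU(addr)) { ... }` instead of `if (cache.isTagPresent(addr)) cache.setMRU(addr);`.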
Re: [m5-dev] Profile Results for Mesh Network
Steve, we can try caching the MRU cache block. We can also try replacing the hash table with a two-dimensional array indexed using cache set and cache way.

This should at least show some decent speedup (depending on SMC code). The O3 caches the MRU and, ironically, I had just patched the InOrder model to do the same thing. Templating the code from fetch_impl.hh (in the O3 model) to cache the block may be a good way to get that change done fairly quickly.

-- Korey
Re: [m5-dev] Profile Results for Mesh Network
Quoting Steve Reinhardt ste...@gmail.com: Gabe, how many bytes at a time does the x86 predecoder fetch? If it doesn't currently grab a cache line at a time, could it be made to do so, and do you know if that would cause any issues with SMC?

All of the predecoders expect to receive one MachInst at a time, so that for fixed-width ISAs they naturally get one instruction at a time. For x86 that type doesn't really mean anything other than controlling that width. Currently it's defined to be a uint64_t, but that could be changed pretty easily with hopefully basically no functional effect. Whatever it is would need to have some integer-like properties, though, where it could be shifted, masked, etc. I have it set to a big int to try to minimize trips through the predecoder, the work needed to glue big immediates together, etc.

I heard during a presentation by a pretty senior VMware guy that x86 isn't guaranteed to refetch instruction bytes until a control flow instruction and/or a serializing instruction or something along those lines. A former AMD guy in the room was surprised by that, though, so I'm not sure it's true. It could be one of those things that's supposed to be true, but to get minesweeper to work on Windows 95 they have to refetch anyway. It could also be a difference between Intel and AMD. So the take-away answer is possibly no, possibly yes.

Gabe
Re: [m5-dev] Profile Results for Mesh Network
I dug more into the code today. There are three paths along which calls are made to RubyPort::M5Port::recvTiming(), which eventually result in calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the total number of memory references (48 million) for all the cpus together. The number of lookup() calls is about 392 million. If we take into account the calls to sendFetch(), then the ratio of the number of lookup() calls to the number of requests pushed into Ruby reduces to 2 to 1, from an earlier estimate of 8 to 1. My question would be why does sendFetch() make calls to recvTiming()?

-- Nilay
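As a sanity check on the counts above, the two ratios work out as follows (a quick illustrative calculation, not simulator code; figures are in millions of calls):

```cpp
// 392M lookup() calls against either all Ruby requests (fetches included)
// or only the data requests, reproducing the 2:1 and 8:1 estimates.
long long lookupRatioWithFetches() {
    long long fetches = 140, reads = 30, writes = 18;
    long long requests = fetches + reads + writes;  // 188M Ruby requests
    return 392 / requests;                          // ~2 lookups/request
}

long long lookupRatioWithoutFetches() {
    long long reads = 30, writes = 18;              // ~48M data references
    return 392 / (reads + writes);                  // ~8, earlier estimate
}
```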
Re: [m5-dev] Profile Results for Mesh Network
On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming() on the cache port, since those two should be bound as peers. I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and CacheController (?) relationship is working (right now at least), but the relationship between sendTiming and recvTiming is the key concept that connects 2 memory objects, unless things have changed.

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote: I dug more into the code today. There are three paths along which calls are made to RubyPort::M5Port::recvTiming(), which eventually result in calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the total number of memory references (48 million) for all the cpus together. The number of lookup() calls is about 392 million. If we take into account the calls to sendFetch(), then the ratio of the number of lookup() calls to the number of requests pushed into Ruby reduces to 2 to 1, from an earlier estimate of 8 to 1. My question would be why does sendFetch() make calls to recvTiming()?

Some more reading revealed that sendFetch() is calling recvTiming for instruction cache accesses, whereas the other two calls (handleReadPacket and handleWritePacket) are for data cache accesses.

-- Nilay
Re: [m5-dev] Profile Results for Mesh Network
What's the deal with Histogram::add()? Either it's too slow or it's being called too much, I'd say, unless we're tracking some incredibly vital statistics there. Can you use the call graph part of the profile to find where most of the calls are coming from?

Also, can you look at the stats and verify that the number of calls to CacheMemory::lookup() is not much larger than the total number of CPU memory references?

Another thing to note is that even though PerfectSwitch::wakeup() is not a huge amount of time, the time per call is pretty large, so there may be opportunities for a lot of improvement there.

Thanks, Steve

On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I profiled m5 again, using the following command.

./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py --maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8 --num-dir 8

Results have been copied below. CacheMemory::lookup() still consumes some time but is a lot less than before. PerfectSwitch can be a candidate for redesign. But given that PerfectSwitch is taking only 3% of the time, it would not yield that much gain. Maybe we need to look at these different functions in a more holistic fashion.
-- Nilay

  %   cumulative   self                self    total
 time   seconds   seconds     calls   s/call  s/call  name
 7.29     35.89     35.89  606750284    0.00    0.00  Histogram::add(long long)
 5.84     64.62     28.73  256533483    0.00    0.00  CacheMemory::lookup(Address const&)
 4.56     87.06     22.44  124360139    0.00    0.00  L1Cache_Controller::wakeup()
 3.88    106.18     19.12  121283110    0.00    0.00  RubyPort::M5Port::recvTiming(Packet*)
 3.13    121.60     15.42    6875704    0.00    0.00  PerfectSwitch::wakeup()
 3.00    136.38     14.78   39527382    0.00    0.00  MemoryControl::executeCycle()
 2.95    150.91     14.53   90855686    0.00    0.00  BaseSimpleCPU::preExecute()
 2.68    164.10     13.19  147750111    0.00    0.00  MessageBuffer::enqueue(RefCountingPtr<Message>, long long)
 2.41    175.96     11.86  180281626    0.00    0.00  RubyEventQueueNode::process()
 2.07    186.17     10.21  302054741    0.00    0.00  EventQueue::serviceOne()
 2.03    196.15      9.98  121176116    0.00    0.00  Sequencer::getRequestStatus(RubyRequest const&)

On Tue, 18 Jan 2011, Nilay wrote: Brad, I got the simulation working. It seems to me that you wrote Mesh.py under the assumption that number of cpus = number of L1 controllers = number of L2 controllers (if present) = number of directory controllers. The following options worked after some struggle and some help from Arka:

./build/ALPHA_FS_MESI_CMP_directory/m5.fast ./configs/example/ruby_fs.py --maxtick 20 -n 16 --topology Mesh --mesh-rows 4 --num-dirs 16 --num-l2caches 16

-- Nilay

On Tue, January 18, 2011 10:28 am, Beckmann, Brad wrote: Hi Nilay, My plan is to tackle the functional access support as soon as I check in our current group of outstanding patches. I'm hoping to at least check in the majority of them in the next couple of days. Now that you've completed the CacheMemory access changes, you may want to re-profile GEM5 and make sure the next performance bottleneck is routing network messages in the Perfect Switch. In particular, you'll want to look at rather large (16+ core) systems using a standard Mesh network. If you have any questions on how to do that, Arka may be able to help you out; if not, I can certainly help you.
Assuming the Perfect Switch shows up as a major bottleneck (> 10%), then I would suggest that as the next area you can work on. When looking at possible solutions, don't limit yourself to just changes within the Perfect Switch itself. I suspect that redesigning how destinations are encoded and/or the interface between MessageBuffer dequeues and the PerfectSwitch wakeup will lead to a better solution.

Brad
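Steve's observation that PerfectSwitch::wakeup() has a large per-call cost can be read straight off the gprof numbers quoted above; a quick illustrative calculation:

```cpp
// Per-call cost derived from gprof self-seconds and call counts.
double nsPerCall(double selfSeconds, double calls) {
    return selfSeconds / calls * 1e9;
}
// Histogram::add:        nsPerCall(35.89, 606750284) ~= 59 ns/call
// PerfectSwitch::wakeup: nsPerCall(15.42, 6875704)   ~= 2243 ns/call
// So wakeup() is roughly 40x more expensive per call, even though it is
// only ~3% of total time -- which is why it may still be worth redesigning.
```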
Re: [m5-dev] Profile Results for Mesh Network
What's the deal with Histogram::add()? Either it's too slow or it's being called too much, I'd say, unless we're tracking some incredibly vital statistics there. Can you use the call graph part of the profile to find where most of the calls are coming from?

Don't spend too much time fixing it if you want to. I've added a Histogram stat to the base stats package and it would be better to switch. Nate
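Whichever histogram ends up being used, an add() that is called ~600 million times needs to be nearly free per call. One cheap design (purely illustrative, not the stats-package Histogram Nate mentions) buckets values by the position of their highest set bit, so each call is a few shifts and an increment with no allocation:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

class FastHistogram {
  public:
    void add(uint64_t value) {
        // Bucket index = floor(log2(value)), computed by shifting; O(1)
        // per call and branch-light.
        int bucket = 0;
        while (value >>= 1)
            ++bucket;
        if (bucket >= (int)m_buckets.size())
            bucket = (int)m_buckets.size() - 1;
        ++m_buckets[bucket];
        ++m_count;
    }
    uint64_t count() const { return m_count; }
    uint64_t bucket(int i) const { return m_buckets[i]; }
  private:
    std::array<uint64_t, 32> m_buckets{};  // power-of-two bucket counts
    uint64_t m_count = 0;
};
```

Logarithmic buckets lose precision relative to linear bins, but for latency-style distributions that trade-off is often acceptable when the add path is this hot.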
Re: [m5-dev] Profile Results for Mesh Network
Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the cache is about 256,000,000. So that would be a ratio of 1 : 8.5. I suspect that the reason for this might be that every cache controller performs a lookup for every memory reference.

It seems to me that only one thread was running all the time, which executed mostly on processors 0 and 4. Is the rate at which sim ticks are incremented different from the processor clock rate? The simulation was allowed to run for 200 billion sim ticks while the processor cycles are just 400 million. Given the way the processors behaved during this simulation, I think we should profile a simulation in which all the processors are active most of the time.

-- Nilay

system.cpu0.not_idle_fraction  0.170783    # Percentage of non-idle cycles
system.cpu0.numCycles          35398       # number of cpu cycles simulated
system.cpu0.num_insts          9504409     # Number of instructions executed
system.cpu0.num_refs           2258648     # Number of memory references
system.cpu1.not_idle_fraction  0.002503    # Percentage of non-idle cycles
system.cpu1.numCycles          398441993   # number of cpu cycles simulated
system.cpu1.num_insts          212019      # Number of instructions executed
system.cpu1.num_refs           71465       # Number of memory references

CPUs 2 and 3 behavior is the same as 1.

system.cpu4.not_idle_fraction  0.823388    # Percentage of non-idle cycles
system.cpu4.numCycles          4           # number of cpu cycles simulated
system.cpu4.num_insts          80686910    # Number of instructions executed
system.cpu4.num_refs           28047398    # Number of memory references
system.cpu5.not_idle_fraction  0.000185    # Percentage of non-idle cycles
system.cpu5.numCycles          399900914   # number of cpu cycles simulated
system.cpu5.num_insts          8413        # Number of instructions executed
system.cpu5.num_refs           2862        # Number of memory references

CPUs 6 and 7 also have stats similar to 5.

On Wed, 19 Jan 2011, Steve Reinhardt wrote: What's the deal with Histogram::add()?
Either it's too slow or it's being called too much, I'd say, unless we're tracking some incredibly vital statistics there. Can you use the call graph part of the profile to find where most of the calls are coming from? Also, can you look at the stats and verify that the number of calls to CacheMemory::lookup() is not much larger than the total number of CPU memory references? Another thing to note is that even though PerfectSwitch::wakeup() is not a huge amount of time, the time per call is pretty large, so there may be opportunities for a lot of improvement there. Thanks, Steve On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I profiled m5 again, using the following command. ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py --maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8 --num-dir 8 Results have been copied below. CacheMemory::lookup() still consumes some time but is lot less than before. PrefectSwitch can be a candidate for re-design. But given that, PerfectSwitch is taking only 3% of the time, it would not yield that much gain. May be we need to look at these different functions in a more holistic fashion. 
-- Nilay

  %   cumulative    self               self    total
 time    seconds   seconds      calls  s/call  s/call  name
 7.29      35.89     35.89  606750284    0.00    0.00  Histogram::add(long long)
 5.84      64.62     28.73  256533483    0.00    0.00  CacheMemory::lookup(Address const&)
 4.56      87.06     22.44  124360139    0.00    0.00  L1Cache_Controller::wakeup()
 3.88     106.18     19.12  121283110    0.00    0.00  RubyPort::M5Port::recvTiming(Packet*)
 3.13     121.60     15.42    6875704    0.00    0.00  PerfectSwitch::wakeup()
 3.00     136.38     14.78   39527382    0.00    0.00  MemoryControl::executeCycle()
 2.95     150.91     14.53   90855686    0.00    0.00  BaseSimpleCPU::preExecute()
 2.68     164.10     13.19  147750111    0.00    0.00  MessageBuffer::enqueue(RefCountingPtr<Message>, long long)
 2.41     175.96     11.86  180281626    0.00    0.00  RubyEventQueueNode::process()
 2.07     186.17     10.21  302054741    0.00    0.00  EventQueue::serviceOne()
 2.03     196.15      9.98  121176116    0.00    0.00  Sequencer::getRequestStatus(RubyRequest const&)

___ m5-dev mailing list m5-dev@m5sim.org http://m5sim.org/mailman/listinfo/m5-dev
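[Editorial aside: Steve asks whether CacheMemory::lookup() is called much more often than the CPUs issue memory references. The numbers in this thread can be checked with quick arithmetic; the ~30M reference count is approximate and the 8-controller explanation is Nilay's hypothesis, not an established fact:]

```python
# Back-of-the-envelope check of the lookup-to-reference ratio discussed in
# this thread. Hypothesis from the thread: each of the 8 cache controllers
# performs a lookup for every memory reference.
refs = 30_000_000        # total memory references (approximate, from stats.txt)
lookups = 256_533_483    # calls to CacheMemory::lookup() (from gprof)

ratio = lookups / refs
print(f"observed ratio: 1 : {ratio:.1f}")     # roughly 1 : 8.5

controllers = 8
predicted = refs * controllers                # 240,000,000, close to observed
print(f"predicted lookups for {controllers} controllers: {predicted:,}")
```

The predicted count (240M) lands close to the observed 256M, which is consistent with the per-controller-lookup explanation.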
Re: [m5-dev] Profile Results for Mesh Network
On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote: Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the cache is about 256,000,000, a ratio of roughly 1 : 8.5. I suspect the reason is that every cache controller performs a lookup for every memory reference.

Do you mean that I do a lookup in all 8 caches for each reference? Is this part of an assertion that's checking coherence invariants? It seems like that would be something that we'd want to do only in debug mode (or maybe only when explicitly enabled).

It also seems that only one thread was running the whole time, and it executed mostly on processors 0 and 4. Is the rate at which sim ticks are incremented different from the processor clock rate? The simulation was allowed to run for 200 billion sim ticks while the processor cycle count is just 400 million.

Ticks are typically picoseconds, so assuming a 2 GHz CPU clock that looks right.

Given how the processors behaved during this simulation, I think we should profile a simulation in which all the processors are active most of the time.

I'm not sure that would affect the simulator performance profile that much.

Steve

-- Nilay

system.cpu0.not_idle_fraction      0.170783   # Percentage of non-idle cycles
system.cpu0.numCycles                 35398   # number of cpu cycles simulated
system.cpu0.num_insts               9504409   # Number of instructions executed
system.cpu0.num_refs                2258648   # Number of memory references
system.cpu1.not_idle_fraction      0.002503   # Percentage of non-idle cycles
system.cpu1.numCycles             398441993   # number of cpu cycles simulated
system.cpu1.num_insts                212019   # Number of instructions executed
system.cpu1.num_refs                  71465   # Number of memory references

CPUs 2 and 3 behave the same as CPU 1.
[remaining quoted stats and earlier messages snipped; see the first message in this thread]
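[Editorial aside: Steve's tick/cycle arithmetic can be checked directly. This sketch assumes the convention he states, one sim tick = 1 picosecond, and the 2 GHz CPU clock he mentions:]

```python
# Sanity check of the tick/cycle relationship: with ticks in picoseconds and
# a 2 GHz CPU clock, one cycle is 500 ticks, so 200 billion ticks correspond
# to 400 million CPU cycles, matching the stats in this thread.
tick = 1e-12                                   # one sim tick = 1 picosecond
cpu_clock_hz = 2e9                             # assumed 2 GHz CPU clock
ticks_per_cycle = 1 / (cpu_clock_hz * tick)    # 500 ticks per cycle

sim_ticks = 200e9                              # 200 billion sim ticks
cycles = sim_ticks / ticks_per_cycle           # 400 million cycles
print(f"{ticks_per_cycle:.0f} ticks/cycle -> {cycles:.0f} cycles")
```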
Re: [m5-dev] Profile Results for Mesh Network
Do you mean that I do a lookup in all 8 caches for each reference? Is this part of an assertion that's checking coherence invariants? It seems like that would be something that we'd want to do only in debug mode (or maybe only when explicitly enabled).

Seems like something that we could enable with a Debug Flag. Hmmm. I still need to commit that.

Nate
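[Editorial aside: a minimal sketch of the idea Nate describes, gating the expensive all-caches coherence check behind a debug flag. This is not actual m5 code (m5 is C++); the flag, class, and method names here are hypothetical illustrations:]

```python
# Sketch: gate the expensive coherence-invariant check behind a debug flag,
# so release runs pay only for the single lookup each access actually needs.
RUBY_COHERENCE_CHECK = False  # hypothetical debug flag, off by default

class Cache:
    def __init__(self):
        self.blocks = {}                 # addr -> data

    def is_tag_present(self, addr):
        return addr in self.blocks

    def lookup(self, addr):
        return self.blocks.get(addr)

def access(caches, target, addr):
    if RUBY_COHERENCE_CHECK:
        # Expensive invariant check: the block must not be present in any
        # cache other than the one being accessed.
        for cache in caches:
            if cache is not target and cache.is_tag_present(addr):
                raise AssertionError(f"coherence violation at {addr:#x}")
    return target.lookup(addr)           # the one lookup the access needs
```

With the flag off, an 8-cache system would do one lookup per reference instead of eight, which is the kind of behavior difference behind the 1 : 8.5 ratio reported earlier in the thread.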
Re: [m5-dev] Profile Results for Mesh Network
On Wed, 19 Jan 2011, Steve Reinhardt wrote: On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote: Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the cache is about 256,000,000, a ratio of roughly 1 : 8.5. I suspect the reason is that every cache controller performs a lookup for every memory reference.

Do you mean that I do a lookup in all 8 caches for each reference? Is this part of an assertion that's checking coherence invariants? It seems like that would be something that we'd want to do only in debug mode (or maybe only when explicitly enabled).

I think the number of caches looked up will depend on the coherence protocol (MOESI Hammer in this case). I have also profiled ALPHA FS MOESI Hammer with just one processor; in that case the ratio was 1 : 1.5, which is why I thought that maybe all the caches are being looked up on each reference. Since I am running m5.prof, I would expect that none of the assertion statements are compiled in.

It also seems that only one thread was running the whole time, and it executed mostly on processors 0 and 4. Is the rate at which sim ticks are incremented different from the processor clock rate? The simulation was allowed to run for 200 billion sim ticks while the processor cycle count is just 400 million.

Ticks are typically picoseconds, so assuming a 2 GHz CPU clock that looks right.

Given how the processors behaved during this simulation, I think we should profile a simulation in which all the processors are active most of the time.

I'm not sure that would affect the simulator performance profile that much.

I would expect more coherence traffic and network messages. More active CPUs might also mean that portions of code other than those relating to memory accesses are executed more often.
Steve

-- Nilay

[remaining quoted stats and earlier messages snipped; see the first message in this thread]
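[Editorial aside: the flat profile quoted at the top of the thread also lets us check Steve's remark that PerfectSwitch::wakeup() is costly per call even though its total share of runtime is small. A quick calculation, self seconds divided by call counts, with the numbers copied from the gprof output:]

```python
# Per-call cost derived from the gprof flat profile: self seconds / calls.
# PerfectSwitch::wakeup() runs for microseconds per call, orders of
# magnitude more than the hot but cheap Histogram::add().
profile = {
    "Histogram::add":        (35.89, 606_750_284),
    "CacheMemory::lookup":   (28.73, 256_533_483),
    "PerfectSwitch::wakeup": (15.42, 6_875_704),
}

for name, (self_secs, calls) in profile.items():
    ns_per_call = self_secs / calls * 1e9
    print(f"{name:24s} {ns_per_call:8.0f} ns/call")
```

Histogram::add() costs on the order of tens of nanoseconds per call (it is hot mostly because of its 600M call count), while PerfectSwitch::wakeup() costs over 2 microseconds per call, which is why it looks like a redesign candidate despite its modest 3% share.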