Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish

On Mon, 24 Jan 2011, Nilay Vaish wrote:


On Mon, 24 Jan 2011, Steve Reinhardt wrote:


Yes, that's right.  So there's probably no big win in trying to further
reduce the number of calls to lookup() in Ruby; the possibilities I see for
improvement are:
1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup
on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a
lookup that hits the MRU block should really only take a handful of
instructions, while IIRC we were seeing much larger costs for the hash table
lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86
has a byte-stream-oriented predecoder so it doesn't do a fetch per
instruction anyway and (2) you may have to worry about self-modifying code.
Gabe, how many bytes at a time does the x86 predecoder fetch?  If it
doesn't currently grab a cache line at a time, could it be made to do so,
and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth
pursuing or should we move on to #3.

Steve



I now understand why the ratio is 2:1. Before every instruction fetch, the
data cache is looked up to make sure that it does not contain the cache
block. After that, the instruction cache is looked up. Similarly, before any
data access, the instruction cache is looked up. This is probably for
correctly handling self-modifying code.


Steve, we can try caching the MRU cache block. We can also try replacing the
hash table with a two-dimensional array indexed using cache set and cache way.


There are calls to CacheMemory::isTagPresent() in Sequencer.cc. These calls
are made just before calls to setMRU(). I am thinking of folding these calls
to isTagPresent() into setMRU(), which calls CacheMemory::findTagInSet()
anyway.


--
Nilay



I tried caching the index for the MRU block, so that the hash table need
not be looked up. It is hard to tell whether there is a speedup or not. When
I run m5.prof, the profile results show that the time taken by
CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast,
the time of execution increases by 2%.

Earlier, when we had the same address being looked up multiple times in
succession, this change made sense. Right now, I am uncertain about whether
this change improves performance.
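
For concreteness, the shortcut works roughly like this (a minimal,
self-contained sketch, not the actual CacheMemory code; all the names here
are illustrative):

#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of the MRU-index shortcut; not the real CacheMemory.
struct Index { int set = -1; int way = -1; };

struct CacheSketch {
    std::vector<std::vector<uint64_t>> tags;       // tags[set][way]
    std::unordered_map<uint64_t, Index> tagIndex;  // whole-cache hash table
    uint64_t mruTag = ~0ULL;                       // last tag that hit
    Index mruIdx;                                  // its (set, way)

    // Returns the (set, way) of a hit, or {-1, -1} on a miss.
    Index lookup(uint64_t tag) {
        if (tag == mruTag)             // fast path: skip the hash table
            return mruIdx;
        auto it = tagIndex.find(tag);  // slow path: hash table lookup
        if (it == tagIndex.end())
            return Index();
        mruTag = tag;                  // remember for the next access
        mruIdx = it->second;
        return mruIdx;
    }
};

The extra compare-and-branch on every lookup may be why m5.fast gets
slightly slower even though the profile shows less time in lookup() itself.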

--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Steve Reinhardt
On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote:

 I tried caching the index for the MRU block, so that the hash table need
 not be looked up. It is hard to tell whether there is a speedup or not. When
 I run m5.prof, the profile results show that the time taken by
 CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast,
 the time of execution increases by 2%.

 Earlier, when we had the same address being looked up multiple times in
 succession, this change made sense. Right now, I am uncertain about whether
 this change improves performance.



Hi Nilay,

Are you talking about caching the MRU block for the entire cache?  I agree
that that isn't likely to help much now that we've eliminated the redundant
lookup() calls.  I was thinking of a conventional set-associative cache
lookup (a 2D array, or a 1D array of sets where each set is represented by a
linked list) where the MRU block of each set is moved to the head of the
list.  For an L1 cache in particular, most of the accesses to each set will
be to the MRU block within that set, so even though you have a linear search
within the set, you will often only have to do one comparison.  See
accessBlock() in mem/cache/tags/lru.cc for an example of this.  And even if
you don't hit the MRU block, most L1s have relatively small associativity so
a linear search is not that bad.  I have a hard time believing that this
wouldn't be substantially faster than a hash table lookup.
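
Roughly, I mean something like this (a sketch in the spirit of
accessBlock() in mem/cache/tags/lru.cc, not the actual code; the names and
layout are illustrative):

#include <cstdint>
#include <list>
#include <vector>

// Sketch of an MRU-first set-associative lookup.
struct Block { uint64_t tag = 0; bool valid = false; };

struct LRUTagsSketch {
    std::vector<std::list<Block>> sets;  // each set kept in MRU->LRU order
    unsigned blkShift = 6;               // assumes 64-byte blocks
    unsigned setMask;                    // numSets - 1 (power of two)

    explicit LRUTagsSketch(unsigned numSets)
        : sets(numSets), setMask(numSets - 1) {}

    Block* accessBlock(uint64_t addr) {
        uint64_t tag = addr >> blkShift;  // set bits kept in the tag
        std::list<Block>& set = sets[tag & setMask];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (it->valid && it->tag == tag) {
                // Hit: splice the block to the head so the MRU block is
                // found with a single comparison next time.
                set.splice(set.begin(), set, it);
                return &set.front();
            }
        }
        return nullptr;                   // miss
    }
};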

For caches that are more highly associative, or L2/L3 caches where the
probability of hitting the MRU block is not as high, or for snoop lookups
where you are typically searching the whole set to verify that a block is
*not* there, then a hash table *may* make more sense.  (Though even in that
case I wonder whether a whole-cache hash table vs. per-set hash tables is a
better idea.)  But these should all be much less frequent than L1 hit
lookups, so I would think the set-associative lookup would still be faster
overall.  The optimal strategy in the long run may be to have both types of
lookup implemented and to choose the right one at configuration time based
on the cache's associativity and position... we sort of tried to do this in
the original M5 code with the LRU and FALRU tags, but didn't really push the
concept all the way to figuring out the optimal crossover point.
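
In sketch form (hypothetical class names, and a crossover point that would
have to be measured rather than guessed):

// Hypothetical: choose the tag store when the cache is configured.
struct AbstractTags { virtual ~AbstractTags() {} };
struct LinearSearchTags : AbstractTags { explicit LinearSearchTags(unsigned) {} };
struct HashedTags : AbstractTags { explicit HashedTags(unsigned) {} };

AbstractTags* makeTags(unsigned assoc, bool isL1)
{
    if (isL1 && assoc <= 4)
        return new LinearSearchTags(assoc);  // MRU-first linear scan
    return new HashedTags(assoc);            // hash table lookup
}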

My mention above about snooping being a bad case for the set-associative
linear-search lookup makes me wonder about the double icache/dcache lookup
you mentioned... that could be part of the problem, and it's not necessary.
 There should be a real coherence protocol between the two caches, so that
you only have to snoop the other cache in the same situations where you'd
have to snoop another core's data cache... i.e., if either cache has an
exclusive copy, or is only reading a shared copy, there's no point in doing
that second lookup.  Is there any notion in the protocol of having different
protocol states for the icache and dcache?

Steve


Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Korey Sewell
From Steve's response,
it looks like I'm jumping in the conversation on the wrong page.

To be clear,
Nilay, were you optimizing the lookup() calls or trying to reduce the number
of times lookup() gets called?  My MRU comments and keeping things in the
SimpleCPU were directed toward the latter.

On Thu, Jan 27, 2011 at 8:29 AM, Korey Sewell ksew...@umich.edu wrote:


 I tried caching the index for the MRU block, so that the hash table need
 not be looked up. It is hard to tell whether there is a speedup or not. When
 I run m5.prof, the profile results show that the time taken by
 CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast,
 the time of execution increases by 2%.


 Hi Nilay,
 when you did the MRU block caching, did it reduce the # of sendFetch calls?
 I would think that that # should go down from say 140 million to something
 drastically smaller (...say 50 million)...

 My next question is why you chose to cache the MRU index rather than the
 MRU block. Ideally, what could happen is that in the TimingSimpleCPU you
 could check to see if that block is cached inside the CPU model, and if so
 you would not have to send a fetch to the memory system at all (thus
 reducing lookup() calls). Is this what is happening here?
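
 Something along these lines (an illustrative sketch only; TimingSimpleCPU's
 real fields and fetch path differ, and the buffer would have to be
 invalidated on writes to the line to keep self-modifying code correct):

#include <cstdint>
#include <cstring>

// Sketch of a fetch-line buffer inside the CPU model.
static const unsigned kLineSize = 64;  // assumed cache line size

struct FetchLineBuffer {
    uint64_t lineAddr = ~0ULL;         // base address of the buffered line
    uint8_t bytes[kLineSize];

    // True if the fetch is serviced from the buffer; false means the CPU
    // still has to go through sendFetch() to the memory system.
    bool fetch(uint64_t pc, void* inst, unsigned size) {
        uint64_t base = pc & ~uint64_t(kLineSize - 1);
        if (base != lineAddr || (pc - base) + size > kLineSize)
            return false;
        std::memcpy(inst, &bytes[pc - base], size);
        return true;
    }

    // Called with the data of a completed fetch response.
    void fill(uint64_t base, const uint8_t* line) {
        lineAddr = base;
        std::memcpy(bytes, line, kLineSize);
    }

    // Needed for self-modifying code: drop the line on invalidation.
    void invalidate() { lineAddr = ~0ULL; }
};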

 --
 - Korey




-- 
- Korey


Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish

On Thu, 27 Jan 2011, Korey Sewell wrote:


From Steve's response,
it looks like I'm jumping in the conversation on the wrong page.

To be clear,
Nilay, were you optimizing the lookup() calls or trying to reduce the number
of times lookup() gets called?  My MRU comments and keeping things in the
SimpleCPU were directed toward the latter.


I am only trying to reduce the time taken by a CacheMemory::lookup() call.

--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish

On Thu, 27 Jan 2011, Steve Reinhardt wrote:


On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote:


I tried caching the index for the MRU block, so that the hash table need
not be looked up. It is hard to tell whether there is a speedup or not. When
I run m5.prof, the profile results show that the time taken by
CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast,
the time of execution increases by 2%.

Earlier, when we had the same address being looked up multiple times in
succession, this change made sense. Right now, I am uncertain about whether
this change improves performance.




Hi Nilay,

Are you talking about caching the MRU block for the entire cache?  I agree
that that isn't likely to help much now that we've eliminated the redundant
lookup() calls.


I had a single MRU block per cache in my experiments.


I was thinking of a conventional set-associative cache
lookup (a 2D array, or a 1D array of sets where each set is represented by a
linked list) where the MRU block of each set is moved to the head of the
list.  For an L1 cache in particular, most of the accesses to each set will
be to the MRU block within that set, so even though you have a linear search
within the set, you will often only have to do one comparison.  See
accessBlock() in mem/cache/tags/lru.cc for an example of this.  And even if
you don't hit the MRU block, most L1s have relatively small associativity so
a linear search is not that bad.  I have a hard time believing that this
wouldn't be substantially faster than a hash table lookup.

For caches that are more highly associative, or L2/L3 caches where the
probability of hitting the MRU block is not as high, or for snoop lookups
where you are typically searching the whole set to verify that a block is
*not* there, then a hash table *may* make more sense.  (Though even in that
case I wonder whether a whole-cache hash table vs. per-set hash tables is a
better idea.)  But these should all be much less frequent than L1 hit
lookups, so I would think the set-associative lookup would still be faster
overall.  The optimal strategy in the long run may be to have both types of
lookup implemented and to choose the right one at configuration time based
on the cache's associativity and position... we sort of tried to do this in
the original M5 code with the LRU and FALRU tags, but didn't really push the
concept all the way to figuring out the optimal crossover point.



I also have an implementation that performs a linear search of cache sets
instead of a hash table lookup. Again, I saw a small improvement. But as you
mentioned, as the associativity goes up, the linear search will perform
worse. Profile results showed that it takes about 100 ns for the linear
search and about 125 ns for the hash table lookup, with associativity 2 for
the L1 caches. Maybe someone should independently profile m5.



My mention above about snooping being a bad case for the set-associative
linear-search lookup makes me wonder about the double icache/dcache lookup
you mentioned... that could be part of the problem, and it's not necessary.
There should be a real coherence protocol between the two caches, so that
you only have to snoop the other cache in the same situations where you'd
have to snoop another core's data cache... i.e., if either cache has an
exclusive copy, or is only reading a shared copy, there's no point in doing
that second lookup.  Is there any notion in the protocol of having different
protocol states for the icache and dcache?



Both the icache and the dcache have the same set of states. But now that you
have mentioned it, I think it is possible to do away with most of the double
lookups. We maintain the invariant that caches under a single controller
hold disjoint sets of cache blocks. In this case, the icache and dcache
come under the same cache controller. Currently, for an instruction fetch,
we check the dcache first. But this is obviously what we should not be
doing, because in the common case (unless of course we simulate
self-modifying code often enough), the dcache lookup will report a miss and
the icache lookup will report a hit. So we should look up the icache first.
In case of a hit, we know by our invariant that the dcache does not have the
line. If we miss, then we should look up the dcache. We should adopt a
similar strategy for loads and stores as well.
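
In sketch form, the fetch path would become something like this
(illustrative only; the real Sequencer/CacheMemory interfaces differ):

#include <cstdint>

struct Cache {
    virtual ~Cache() {}
    virtual bool isTagPresent(uint64_t addr) const = 0;
};

// Probe the icache first. By the disjoint-blocks invariant, a hit there
// means the dcache cannot hold the line, so the common case (no
// self-modifying code) costs one lookup instead of two.
bool fetchHit(const Cache& icache, const Cache& dcache, uint64_t addr,
              bool& inDcache)
{
    if (icache.isTagPresent(addr)) {
        inDcache = false;
        return true;
    }
    inDcache = dcache.isTagPresent(addr);  // rare: SMC, or a real miss
    return inDcache;
}
// Loads and stores would mirror this: dcache first, icache only on a miss.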


We can split them under different controllers, but that would mean a lot 
more work.


I will get back to you on this soon.

--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Steve Reinhardt
On Thu, Jan 27, 2011 at 4:01 PM, Nilay Vaish ni...@cs.wisc.edu wrote:



 I also have an implementation that performs a linear search of cache sets
 instead of a hash table lookup. Again, I saw a small improvement. But as you
 mentioned, as the associativity goes up, the linear search will perform
 worse. Profile results showed that it takes about 100 ns for the linear
 search and about 125 ns for the hash table lookup, with associativity 2 for
 the L1 caches. Maybe someone should independently profile m5.


That's surprising to me, but I suppose if the hash function is cheap to
compute and typically hits in the first bucket then the performance could be
similar.  I suppose this isn't worth a lot more effort; it's certainly less
significant than the double-lookup issue below.


 Both the icache and the dcache have the same set of states. But now that you
 have mentioned it, I think it is possible to do away with most of the double
 lookups. We maintain the invariant that caches under a single controller
 hold disjoint sets of cache blocks. In this case, the icache and dcache come
 under the same cache controller. Currently, for an instruction fetch, we
 check the dcache first. But this is obviously what we should not be doing,
 because in the common case (unless of course we simulate self-modifying code
 often enough), the dcache lookup will report a miss and the icache lookup
 will report a hit. So we should look up the icache first. In case of a hit,
 we know by our invariant that the dcache does not have the line. If we miss,
 then we should look up the dcache. We should adopt a similar strategy for
 loads and stores as well.


Good point... I had forgotten the disjoint blocks invariant.  It would seem
like it's a total waste of time to look in any other cache than the one you
expect to find it in unless there's a miss.  I'd hope that's an easy fix.


 We can split them under different controllers, but that would mean a lot
 more work.


Yea, I wasn't suggesting that.


 I will get back to you on this soon.


Thanks!

Steve


Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Steve Reinhardt
On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

 On Sun, 23 Jan 2011, Korey Sewell wrote:

 In sendFetch(), it calls sendTiming() which would then call the recvTiming
 on the cache port since those two should be bound as peers.

 I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and
 CacheController (?) relationship is working (right now at least), but the
 relationship between sendTiming and recvTiming is the key concept that
 connects 2 memory objects unless things have changed.

 On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

 I dug more into the code today. There are three paths along which calls
 are made to RubyPort::M5Port::recvTiming(), which eventually result in
 calls to CacheMemory::lookup().

 1. TimingSimpleCPU::sendFetch() - 140 million
 2. TimingSimpleCPU::handleReadPacket() - 30 million
 3. TimingSimpleCPU::handleWritePacket() - 18 million

 The number of times the last two functions are called is very close to the
 total number of memory references (48 million) for all the cpus together.
 The number of lookup() calls is about 392 million. If we take into account
 the calls to sendFetch(), then the ratio of the number of lookup() calls to
 the number of requests pushed into Ruby reduces to 2 to 1, from an
 earlier estimate of 8 to 1.

 My question would be why does sendFetch() make calls to recvTiming()?


 Some more reading revealed that sendFetch() is calling recvTiming for
 instruction cache accesses, whereas the other two calls (handleReadPacket
 and handleWritePacket) are for data cache accesses.


Yes, that's right.  So there's probably no big win in trying to further
reduce the number of calls to lookup() in Ruby; the possibilities I see for
improvement are:
1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup
on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a
lookup that hits the MRU block should really only take a handful of
instructions, while IIRC we were seeing much larger costs for the hash table
lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86
has a byte-stream-oriented predecoder so it doesn't do a fetch per
instruction anyway and (2) you may have to worry about self-modifying code.
 Gabe, how many bytes at a time does the x86 predecoder fetch?  If it
doesn't currently grab a cache line at a time, could it be made to do so,
and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth
pursuing or should we move on to #3.

Steve


Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Nilay Vaish

On Mon, 24 Jan 2011, Steve Reinhardt wrote:


On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote:


On Sun, 23 Jan 2011, Korey Sewell wrote:

In sendFetch(), it calls sendTiming() which would then call the recvTiming
on the cache port since those two should be bound as peers.

I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and
CacheController (?) relationship is working (right now at least), but the
relationship between sendTiming and recvTiming is the key concept that
connects 2 memory objects unless things have changed.

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

I dug more into the code today. There are three paths along which calls
are made to RubyPort::M5Port::recvTiming(), which eventually result in
calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the
total number of memory references (48 million) for all the cpus together.
The number of lookup() calls is about 392 million. If we take into account
the calls to sendFetch(), then the ratio of the number of lookup() calls to
the number of requests pushed into Ruby reduces to 2 to 1, from an
earlier estimate of 8 to 1.

My question would be why does sendFetch() make calls to recvTiming()?



Some more reading revealed that sendFetch() is calling recvTiming for
instruction cache accesses, whereas the other two calls (handleReadPacket
and handleWritePacket) are for data cache accesses.



Yes, that's right.  So there's probably no big win in trying to further
reduce the number of calls to lookup() in Ruby; the possibilities I see for
improvement are:
1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup
on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a
lookup that hits the MRU block should really only take a handful of
instructions, while IIRC we were seeing much larger costs for the hash table
lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86
has a byte-stream-oriented predecoder so it doesn't do a fetch per
instruction anyway and (2) you may have to worry about self-modifying code.
Gabe, how many bytes at a time does the x86 predecoder fetch?  If it
doesn't currently grab a cache line at a time, could it be made to do so,
and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think that's worth
pursuing or should we move on to #3.

Steve



I now understand why the ratio is 2:1. Before every instruction
fetch, the data cache is looked up to make sure that it does
not contain the cache block. After that, the instruction cache is
looked up. Similarly, before any data access, the instruction cache is
looked up. This is probably for correctly handling self-modifying code.


Steve, we can try caching the MRU cache block. We can also try replacing the
hash table with a two-dimensional array indexed using cache set and cache way.


There are calls to CacheMemory::isTagPresent() in Sequencer.cc. These
calls are made just before calls to setMRU(). I am thinking of folding
these calls to isTagPresent() into setMRU(), which calls
CacheMemory::findTagInSet() anyway.
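
Roughly like this (a sketch with illustrative names, not the real
CacheMemory/Sequencer code):

#include <cstdint>

// Folding the presence check into setMRU() so that findTagInSet() runs
// once per access instead of twice.
struct CacheMemorySketch {
    int findTagInSet(uint64_t set, uint64_t tag) const;  // way, or -1
    void touchReplacementState(uint64_t set, int way);

    // Replaces the old pattern in Sequencer.cc:
    //   if (cache.isTagPresent(addr)) cache.setMRU(addr);
    // Returns false if the tag was absent.
    bool setMRU(uint64_t set, uint64_t tag) {
        int way = findTagInSet(set, tag);
        if (way < 0)
            return false;
        touchReplacementState(set, way);
        return true;
    }
};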


--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Korey Sewell
 Steve, we can try caching MRU cache block. We can also try replacing hash
 table with a two dimensional array indexed using cache set and cache way.

This should at least show some decent speedup (depending on SMC code). The
O3 model caches the MRU block and, ironically, I had just patched the
InOrder model to do the same thing. Templating the code from fetch_impl.hh
(in the O3 model) to cache the block may be a good way to get that change
done fairly quickly.

-- 
- Korey


Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Gabriel Michael Black

Quoting Steve Reinhardt ste...@gmail.com:


 Gabe, how many bytes at a time does the x86 predecoder fetch?  If it
doesn't currently grab a cache line at a time, could it be made to do so,
and do you know if that would cause any issues with SMC?


All of the predecoders expect to receive one MachInst at a time, so
that for fixed-width ISAs they naturally get one instruction at a
time. For x86 that type doesn't really mean anything other than
controlling that width. Currently it's defined to be a uint64_t, but
that could be changed pretty easily with hopefully basically no
functional effect. Whatever it is would need to have some integer-like
properties, though, where it could be shifted, masked, etc. I have it
set to a big int to try to minimize trips through the predecoder, the
work needed to glue big immediates together, etc.


I heard during a presentation by a pretty senior VMware guy that x86
isn't guaranteed to refetch instruction bytes until a control-flow
instruction and/or a serializing instruction or something along those
lines. A former AMD guy in the room was surprised by that, though, so
I'm not sure it's true. It could be one of those things that's
supposed to be true, but to get Minesweeper to work on Windows 95 they
have to refetch anyway. It could also be a difference between Intel
and AMD. So the take-away answer is possibly no, possibly yes.


Gabe


Re: [m5-dev] Profile Results for Mesh Network

2011-01-23 Thread Nilay Vaish
I dug more into the code today. There are three paths along which calls
are made to RubyPort::M5Port::recvTiming(), which eventually result
in calls to CacheMemory::lookup().


1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the
total number of memory references (48 million) for all the cpus together.
The number of lookup() calls is about 392 million. If we take into account
the calls to sendFetch(), then the ratio of the number of lookup() calls to
the number of requests pushed into Ruby reduces to 2 to 1, from
an earlier estimate of 8 to 1.


My question would be why does sendFetch() make calls to recvTiming()?

--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-23 Thread Nilay Vaish

On Sun, 23 Jan 2011, Korey Sewell wrote:


In sendFetch(), it calls sendTiming() which would then call the recvTiming
on the cache port since those two should be bound as peers.

I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and
CacheController (?) relationship is working (right now at least), but the
relationship between sendTiming and recvTiming is the key concept that
connects 2 memory objects unless things have changed.

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish ni...@cs.wisc.edu wrote:


I dug more into the code today. There are three paths along which calls
are made to RubyPort::M5Port::recvTiming(), which eventually result in
calls to CacheMemory::lookup().

1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to the
total number of memory references (48 million) for all the cpus together.
The number of lookup() calls is about 392 million. If we take into account
the calls to sendFetch(), then the ratio of the number of lookup() calls to
the number of requests pushed into Ruby reduces to 2 to 1, from an
earlier estimate of 8 to 1.

My question would be why does sendFetch() make calls to recvTiming()?



Some more reading revealed that sendFetch() is calling recvTiming for
instruction cache accesses, whereas the other two calls (handleReadPacket
and handleWritePacket) are for data cache accesses.


--
Nilay


Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Steve Reinhardt
What's the deal with Histogram::add()?  Either it's too slow or it's being
called too much, I'd say, unless we're tracking some incredibly vital
statistics there.  Can you use the call graph part of the profile to find
where most of the calls are coming from?

Also, can you look at the stats and verify that the number of calls to
CacheMemory::lookup() is not much larger than the total number of CPU memory
references?

Another thing to note is that even though PerfectSwitch::wakeup() is not a
huge amount of time, the time per call is pretty large, so there may be
opportunities for a lot of improvement there.

Thanks,

Steve

On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote:

 I profiled m5 again, using the following command.

 ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
 --maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8
 --num-dir 8

 Results have been copied below. CacheMemory::lookup() still consumes some
 time, but a lot less than before. PerfectSwitch can be a candidate for
 re-design. But given that PerfectSwitch is taking only 3% of the time, it
 would not yield that much gain. Maybe we need to look at these different
 functions in a more holistic fashion.

 --
 Nilay


   %   cumulative   self              self     total
  time   seconds   seconds      calls  s/call   s/call  name
  7.29     35.89    35.89  606750284    0.00     0.00  Histogram::add(long long)
  5.84     64.62    28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const)
  4.56     87.06    22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
  3.88    106.18    19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
  3.13    121.60    15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
  3.00    136.38    14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
  2.95    150.91    14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
  2.68    164.10    13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtrMessage, long long)
  2.41    175.96    11.86  180281626    0.00     0.00  RubyEventQueueNode::process()
  2.07    186.17    10.21  302054741    0.00     0.00  EventQueue::serviceOne()
  2.03    196.15     9.98  121176116    0.00     0.00  Sequencer::getRequestStatus(RubyRequest const)


 On Tue, 18 Jan 2011, Nilay wrote:

  Brad,

 I got the simulation working. It seems to me that you wrote Mesh.py under
 the assumption that number of cpus = number of L1 controllers = number of
 L2 controllers (if present) = number of directory controllers.

 The following options worked after some struggle and some help from Arka -

 ./build/ALPHA_FS_MESI_CMP_directory/m5.fast ./configs/example/ruby_fs.py
 --maxtick 20 -n 16 --topology Mesh --mesh-rows 4 --num-dirs 16
 --num-l2caches 16

 --
 Nilay


 On Tue, January 18, 2011 10:28 am, Beckmann, Brad wrote:

 Hi Nilay,

 My plan is to tackle the functional access support as soon as I check in
 our current group of outstanding patches.  I'm hoping to at least check in
 the majority of them in the next couple of days.  Now that you've
 completed the CacheMemory access changes, you may want to re-profile GEM5
 and make sure the next performance bottleneck is routing network messages
 in the Perfect Switch.  In particular, you'll want to look at rather large
 (16+ core) systems using a standard Mesh network.  If you have any
 questions on how to do that, Arka may be able to help you out; if not, I
 can certainly help you.  Assuming the Perfect Switch shows up as a major
 bottleneck (> 10%), then I would suggest that as the next area you can
 work on.  When looking at possible solutions, don't limit yourself to just
 changes within the Perfect Switch itself.  I suspect that redesigning how
 destinations are encoded and/or the interface between MessageBuffer
 dequeues and the PerfectSwitch wakeup will lead to a better solution.

 Brad




Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread nathan binkert
 What's the deal with Histogram::add()?  Either it's too slow or it's being
 called too much, I'd say, unless we're tracking some incredibly vital
 statistics there.  Can you use the call graph part of the profile to find
 where most of the calls are coming from?

Don't spend too much time fixing it, even if you want to.  I've added a
Histogram stat to the base stats package, and it would be better to
switch.

  Nate


Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Nilay Vaish
Some more data from the same simulation, this time from m5out/stats.txt.
The number of memory references is slightly more than 30,000,000, whereas
the number of lookups in the cache is about 256,000,000. So that would be
a ratio of 1 : 8.5. I suspect that the reason for this might be that every
cache controller performs a lookup for every memory reference.


It seems to me that only one thread was running the whole time, executing
mostly on processors 0 and 4. Is the rate at which sim ticks are
incremented different from the processor clock rate? The simulation was
allowed to run for 200 billion sim ticks while the processor cycle count
is just 400 million.


Given the way the processors behaved during this simulation, I think we 
should profile a simulation in which all the processors are active most of 
the time.


--
Nilay


system.cpu0.not_idle_fraction     0.170783   # Percentage of non-idle cycles
system.cpu0.numCycles                35398   # number of cpu cycles simulated
system.cpu0.num_insts              9504409   # Number of instructions executed
system.cpu0.num_refs               2258648   # Number of memory references

system.cpu1.not_idle_fraction     0.002503   # Percentage of non-idle cycles
system.cpu1.numCycles            398441993   # number of cpu cycles simulated
system.cpu1.num_insts               212019   # Number of instructions executed
system.cpu1.num_refs                 71465   # Number of memory references

CPUs 2 and 3 behavior is same as 1.

system.cpu4.not_idle_fraction     0.823388   # Percentage of non-idle cycles
system.cpu4.numCycles                    4   # number of cpu cycles simulated
system.cpu4.num_insts             80686910   # Number of instructions executed
system.cpu4.num_refs              28047398   # Number of memory references

system.cpu5.not_idle_fraction     0.000185   # Percentage of non-idle cycles
system.cpu5.numCycles            399900914   # number of cpu cycles simulated
system.cpu5.num_insts                 8413   # Number of instructions executed
system.cpu5.num_refs                  2862   # Number of memory references

CPUs 6 and 7 also have stats similar to 5.



On Wed, 19 Jan 2011, Steve Reinhardt wrote:


What's the deal with Histogram::add()?  Either it's too slow or it's being
called too much, I'd say, unless we're tracking some incredibly vital
statistics there.  Can you use the call graph part of the profile to find
where most of the calls are coming from?

Also, can you look at the stats and verify that the number of calls to
CacheMemory::lookup() is not much larger than the total number of CPU memory
references?

Another thing to note is that even though PerfectSwitch::wakeup() is not a
huge amount of time, the time per call is pretty large, so there may be
opportunities for a lot of improvement there.

Thanks,

Steve

On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote:


I profiled m5 again, using the following command.

./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
--maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8
--num-dir 8

Results have been copied below. CacheMemory::lookup() still consumes some
time, but a lot less than before. PerfectSwitch can be a candidate for
re-design. But given that PerfectSwitch is taking only 3% of the time, it
would not yield that much gain. Maybe we need to look at these different
functions in a more holistic fashion.

--
Nilay


  %   cumulative   self              self     total
 time   seconds   seconds      calls  s/call   s/call  name
 7.29     35.89    35.89  606750284    0.00     0.00  Histogram::add(long long)
 5.84     64.62    28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const)
 4.56     87.06    22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
 3.88    106.18    19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
 3.13    121.60    15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
 3.00    136.38    14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
 2.95    150.91    14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
 2.68    164.10    13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtrMessage, long long)
 2.41    175.96    11.86  180281626    0.00     0.00  RubyEventQueueNode::process()
 2.07    186.17    10.21  302054741    0.00     0.00  EventQueue::serviceOne()
 2.03    196.15     9.98  121176116    0.00     0.00  Sequencer::getRequestStatus(RubyRequest const)





Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Steve Reinhardt
On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote:

 Some more data from the same simulation, this time from m5out/stats.txt.
 The number of memory references is slightly more than 30,000,000, whereas
 the number of lookups in the cache is about 256,000,000. So that would be a
 ratio of 1 : 8.5. I suspect that the reason for this might be that every
 cache controller performs a lookup for every memory reference.


Do you mean that we do a lookup in all 8 caches for each reference?  Is this
part of an assertion that's checking coherence invariants?  It seems like
that would be something that we'd want to only do in debug mode (or maybe
only when explicitly enabled).


 It seems to me that only one thread was running the whole time, executing
 mostly on processors 0 and 4. Is the rate at which sim ticks are incremented
 different from the processor clock rate? The simulation was allowed to run
 for 200 billion sim ticks while the processor cycle count is just 400 million.


Ticks are typically picoseconds, so assuming a 2 GHz CPU clock (500 ticks
per cycle), 400 million cycles works out to exactly 200 billion ticks, so
that looks right.


 Given the way the processors behaved during this simulation, I think we
 should profile a simulation in which all the processors are active most of
 the time.


I'm not sure that would affect the simulator performance profile that much.

Steve



 --
 Nilay


 system.cpu0.not_idle_fraction     0.170783   # Percentage of non-idle cycles
 system.cpu0.numCycles                35398   # number of cpu cycles simulated
 system.cpu0.num_insts              9504409   # Number of instructions executed
 system.cpu0.num_refs               2258648   # Number of memory references

 system.cpu1.not_idle_fraction     0.002503   # Percentage of non-idle cycles
 system.cpu1.numCycles            398441993   # number of cpu cycles simulated
 system.cpu1.num_insts               212019   # Number of instructions executed
 system.cpu1.num_refs                 71465   # Number of memory references

 CPUs 2 and 3 behavior is same as 1.

 system.cpu4.not_idle_fraction     0.823388   # Percentage of non-idle cycles
 system.cpu4.numCycles                    4   # number of cpu cycles simulated
 system.cpu4.num_insts             80686910   # Number of instructions executed
 system.cpu4.num_refs              28047398   # Number of memory references

 system.cpu5.not_idle_fraction     0.000185   # Percentage of non-idle cycles
 system.cpu5.numCycles            399900914   # number of cpu cycles simulated
 system.cpu5.num_insts                 8413   # Number of instructions executed
 system.cpu5.num_refs                  2862   # Number of memory references

 CPUs 6 and 7 also have stats similar to 5.




 On Wed, 19 Jan 2011, Steve Reinhardt wrote:

  What's the deal with Histogram::add()?  Either it's too slow or it's being
 called too much, I'd say, unless we're tracking some incredibly vital
 statistics there.  Can you use the call graph part of the profile to find
 where most of the calls are coming from?

 Also, can you look at the stats and verify that the number of calls to
 CacheMemory::lookup() is not much larger than the total number of CPU
 memory
 references?

 Another thing to note is that even though PerfectSwitch::wakeup() is not a
 huge amount of time, the time per call is pretty large, so there may be
 opportunities for a lot of improvement there.

 Thanks,

 Steve

 On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote:

  I profiled m5 again, using the following command.

 ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
 --maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8
 --num-dir 8

 Results have been copied below. CacheMemory::lookup() still consumes some
 time, but a lot less than before. PerfectSwitch can be a candidate for
 re-design. But given that PerfectSwitch is taking only 3% of the time, it
 would not yield that much gain. Maybe we need to look at these different
 functions in a more holistic fashion.

 --
 Nilay


   %   cumulative   self              self     total
  time   seconds   seconds      calls  s/call   s/call  name
  7.29     35.89    35.89  606750284    0.00     0.00  Histogram::add(long long)
  5.84     64.62    28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const)
  4.56     87.06    22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
  3.88    106.18    19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
  3.13    121.60    15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
  3.00    136.38    14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
  2.95    150.91    14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
  2.68    164.10    13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtrMessage, long long)
  2.41

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread nathan binkert
 Do you mean that I do a lookup in all 8 caches for each reference?  Is this
 part of an assertion that's checking coherence invariants?  It seems like
 that would be something that we'd want to only do in debug mode (or maybe
 only when explicitly enabled).

Seems like something that we could enable with a Debug Flag.  Hmmm.  I
still need to commit that.

  Nate


Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Nilay Vaish

On Wed, 19 Jan 2011, Steve Reinhardt wrote:


On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote:


Some more data from the same simulation, this time from m5out/stats.txt.
The number of memory references is slightly more than 30,000,000, whereas
the number of lookups in the cache is about 256,000,000. So that would be a
ratio of 1 : 8.5. I suspect that the reason for this might be that every
cache controller performs a lookup for every memory reference.



Do you mean that we do a lookup in all 8 caches for each reference?  Is this
part of an assertion that's checking coherence invariants?  It seems like
that would be something that we'd want to only do in debug mode (or maybe
only when explicitly enabled).


I think the number of caches looked up will depend on the coherence protocol
(MOESI_hammer in this case). I have also profiled ALPHA_FS_MOESI_hammer
with just one processor. In that case the ratio was 1:1.5; that is
why I thought that maybe all the caches are being looked up on each
reference.


Since I am running m5.prof, I would expect that none of the assertion 
statements are being compiled in.






It seems to me that only one thread was running the whole time, executing
mostly on processors 0 and 4. Is the rate at which sim ticks are incremented
different from the processor clock rate? The simulation was allowed to run
for 200 billion sim ticks while the processor cycle count is just 400 million.



Ticks are typically picoseconds, so assuming a 2 GHz CPU clock (500 ticks
per cycle), 400 million cycles works out to exactly 200 billion ticks, so
that looks right.



Given the way the processors behaved during this simulation, I think we
should profile a simulation in which all the processors are active most of
the time.



I'm not sure that would affect the simulator performance profile that much.


I would expect more coherence traffic and more network messages. More active
CPUs might also mean that portions of code other than those relating to
memory accesses are executed more often.




Steve




--
Nilay


system.cpu0.not_idle_fraction     0.170783   # Percentage of non-idle cycles
system.cpu0.numCycles                35398   # number of cpu cycles simulated
system.cpu0.num_insts              9504409   # Number of instructions executed
system.cpu0.num_refs               2258648   # Number of memory references

system.cpu1.not_idle_fraction     0.002503   # Percentage of non-idle cycles
system.cpu1.numCycles            398441993   # number of cpu cycles simulated
system.cpu1.num_insts               212019   # Number of instructions executed
system.cpu1.num_refs                 71465   # Number of memory references

CPUs 2 and 3 behavior is same as 1.

system.cpu4.not_idle_fraction     0.823388   # Percentage of non-idle cycles
system.cpu4.numCycles                    4   # number of cpu cycles simulated
system.cpu4.num_insts             80686910   # Number of instructions executed
system.cpu4.num_refs              28047398   # Number of memory references

system.cpu5.not_idle_fraction     0.000185   # Percentage of non-idle cycles
system.cpu5.numCycles            399900914   # number of cpu cycles simulated
system.cpu5.num_insts                 8413   # Number of instructions executed
system.cpu5.num_refs                  2862   # Number of memory references

CPUs 6 and 7 also have stats similar to 5.




On Wed, 19 Jan 2011, Steve Reinhardt wrote:

 What's the deal with Histogram::add()?  Either it's too slow or it's being

called too much, I'd say, unless we're tracking some incredibly vital
statistics there.  Can you use the call graph part of the profile to find
where most of the calls are coming from?

Also, can you look at the stats and verify that the number of calls to
CacheMemory::lookup() is not much larger than the total number of CPU
memory
references?

Another thing to note is that even though PerfectSwitch::wakeup() is not a
huge amount of time, the time per call is pretty large, so there may be
opportunities for a lot of improvement there.

Thanks,

Steve

On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish ni...@cs.wisc.edu wrote:

 I profiled m5 again, using the following command.


./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
--maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8
--num-dir 8

Results have been copied below. CacheMemory::lookup() still consumes some
time, but a lot less than before. PerfectSwitch can be a candidate for
re-design. But given that PerfectSwitch is taking only 3% of the time, it
would not yield that much gain. Maybe we need to look at these different
functions in a more holistic fashion.

--
Nilay


  %   cumulative   self              self     total
 time   seconds   seconds      calls  s/call   s/call  name
 7.29     35.89    35.89  606750284    0.00     0.00  Histogram::add(long long)
 5.84     64.62    28.73  256533483    0.00     0.00