On Wed, 19 Jan 2011, Steve Reinhardt wrote:

On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish <ni...@cs.wisc.edu> wrote:

Some more data from the same simulation, this time from m5out/stats.txt.
The number of memory references is slightly more than 30,000,000, whereas the
number of lookups in the cache is about 256,000,000. So that would be a ratio
of about 1:8.5. I suspect that the reason for this might be that every cache
controller performs a lookup for every memory reference.
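As a rough sanity check on that number: 256,000,000 lookups / 30,000,000
references is about 8.5, which is close to the 8 L1 controllers in this
configuration, so one lookup per controller per reference would account for
almost all of it.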


Do you mean that I do a lookup in all 8 caches for each reference?  Is this
part of an assertion that's checking coherence invariants?  It seems like
that would be something that we'd want to only do in debug mode (or maybe
only when explicitly enabled).

I think the number of caches looked up will depend on the coherence protocol (MOESI Hammer in this case). I have also profiled ALPHA FS MOESI Hammer with just one processor. In that case the ratio was 1:1.5, which is why I thought that perhaps all the caches are being looked up on each reference.

Since I am running m5.prof, I would expect that none of the assertion statements are compiled in.
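Just to make that assumption explicit, here is a minimal sketch (not the
actual Ruby code; the helper name is made up for illustration) of how
assert-style invariant checks drop out of a build that defines NDEBUG, which
is what the prof/fast targets are assumed to do here:

    #include <cassert>

    // Hypothetical helper that would walk every cache and verify a
    // coherence invariant in a real simulator.
    static void check_coherence_invariant() { /* expensive walk over all caches */ }

    void on_memory_reference()
    {
    #ifndef NDEBUG
        check_coherence_invariant();   // only compiled in when NDEBUG is not defined
    #endif
        assert(1 + 1 == 2);            // assert() itself expands to nothing under NDEBUG
    }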



It seems to me that only one thread was running the whole time, and it executed
mostly on processors 0 and 4. Is the rate at which sim ticks are incremented
different from the processor clock rate? The simulation was allowed to run
for 200 billion sim ticks, while the processor cycle count is just 400 million.


Ticks are typically picoseconds, so assuming a 2 GHz CPU clock that looks
right.
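Checking the arithmetic: at 1 tick = 1 ps, 200 billion ticks is 0.2 s of
simulated time, and 0.2 s at 2 GHz is 0.2 * 2e9 = 400 million cycles, which
matches the numCycles values below.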


Given the way the processors behaved during this simulation, I think we
should profile a simulation in which all the processors are active most of
the time.


I'm not sure that would affect the simulator performance profile that much.

I would expect more coherence traffic and more network messages. More active CPUs might also mean that portions of the code other than those relating to memory accesses are executed more often.


Steve



--
Nilay


system.cpu0.not_idle_fraction                0.170783 # Percentage of non-idle cycles
system.cpu0.numCycles                       399995398 # number of cpu cycles simulated
system.cpu0.num_insts                         9504409 # Number of instructions executed
system.cpu0.num_refs                          2258648 # Number of memory references

system.cpu1.not_idle_fraction                0.002503 # Percentage of non-idle cycles
system.cpu1.numCycles                       398441993 # number of cpu cycles simulated
system.cpu1.num_insts                          212019 # Number of instructions executed
system.cpu1.num_refs                            71465 # Number of memory references

The behavior of CPUs 2 and 3 is the same as that of CPU 1.

system.cpu4.not_idle_fraction                0.823388 # Percentage of non-idle cycles
system.cpu4.numCycles                       400000000 # number of cpu cycles simulated
system.cpu4.num_insts                        80686910 # Number of instructions executed
system.cpu4.num_refs                         28047398 # Number of memory references

system.cpu5.not_idle_fraction                0.000185 # Percentage of non-idle cycles
system.cpu5.numCycles                       399900914 # number of cpu cycles simulated
system.cpu5.num_insts                            8413 # Number of instructions executed
system.cpu5.num_refs                             2862 # Number of memory references

CPUs 6 and 7 also have stats similar to CPU 5.
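Adding up num_refs across all eight CPUs (and assuming CPUs 2/3 and 6/7 really
do mirror CPUs 1 and 5), 2258648 + 3*71465 + 28047398 + 3*2862 comes to
roughly 30.5 million, consistent with the "slightly more than 30,000,000"
memory references mentioned above.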




On Wed, 19 Jan 2011, Steve Reinhardt wrote:

 What's the deal with Histogram::add()?  Either it's too slow or it's being
called too much, I'd say, unless we're tracking some incredibly vital
statistics there.  Can you use the call graph part of the profile to find
where most of the calls are coming from?
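In case it helps, gprof can print just the call-graph section with -q (and
just the flat profile with -p); assuming the gmon.out from the run is in the
current directory, something like

  gprof -q ./build/ALPHA_FS_MOESI_hammer/m5.prof gmon.out > callgraph.txt

should list the callers of Histogram::add() along with their call counts.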

Also, can you look at the stats and verify that the number of calls to
CacheMemory::lookup() is not much larger than the total number of CPU
memory
references?

Another thing to note is that even though PerfectSwitch::wakeup() is not a
huge amount of time, the time per call is pretty large, so there may be
opportunities for a lot of improvement there.
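For a rough sense of scale from the flat profile below: PerfectSwitch::wakeup()
is 15.42 s over 6,875,704 calls, i.e. about 2.2 microseconds per call, whereas
Histogram::add() is 35.89 s over 606,750,284 calls, i.e. about 59 nanoseconds
per call; so one is expensive per call and the other mostly by sheer call
volume.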

Thanks,

Steve

On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish <ni...@cs.wisc.edu> wrote:

 I profiled m5 again, using the following command.

./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py --maxtick 200000000000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8 --num-dir 8

Results have been copied below. CacheMemory::lookup() still consumes some
time, but a lot less than before. PerfectSwitch could be a candidate for
redesign, but given that it is taking only 3% of the time, that would not
yield much gain. Maybe we need to look at these different functions in a more
holistic fashion.
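One more data point in that direction: the eleven functions listed below
together account for only about 40% of the total time (7.29 + 5.84 + 4.56 +
3.88 + 3.13 + 3.00 + 2.95 + 2.68 + 2.41 + 2.07 + 2.03 = 39.84), so even
eliminating any single one of them entirely would buy at most a few percent
overall.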

--
Nilay


 %   cumulative   self              self     total
 time   seconds   seconds      calls  s/call   s/call  name
 7.29     35.89     35.89  606750284    0.00     0.00  Histogram::add(long long)
 5.84     64.62     28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const&)
 4.56     87.06     22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
 3.88    106.18     19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
 3.13    121.60     15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
 3.00    136.38     14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
 2.95    150.91     14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
 2.68    164.10     13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtr<Message>, long long)
 2.41    175.96     11.86  180281626    0.00     0.00  RubyEventQueueNode::process()
 2.07    186.17     10.21  302054741    0.00     0.00  EventQueue::serviceOne()
 2.03    196.15      9.98  121176116    0.00     0.00  Sequencer::getRequestStatus(RubyRequest const&)


 _______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev

