Stephane Eranian wrote:
Stephane,
On Wed, May 09, 2007 at 11:34:59AM +0200, Stéphane Zuckerman wrote:
We're trying to measure L2 data misses on a Xeon Woodcrest (a dual-CPU,
dual-core machine).
We've tried different hardware counters, namely:
- LAST_LEVEL_CACHE_MISSES, which, the documentation says, is equivalent
to L2_RQSTS:I_STATE (invalid cachelines), but doesn't count the hardware
prefetches
- L2_RQSTS:MESI, and various combinations between M/E/S/I options, as
well as PREFETCH combined with the mask SELF.
We're trying to measure the L2 data misses accurately, but we're getting
inconsistent results. How should we proceed?
If you are using pfmon, please provide the exact command line.
We haven't inserted probes directly in the code yet. We do use pfmon,
indeed.
Here's what we've done:
pfmon -e LAST_LEVEL_CACHE_MISSES ./program
pfmon -e L2_RQSTS:I_STATE:ANY
pfmon -e L2_RQSTS:PREFETCH (which returns 0)
We also tried L2_RQSTS:MESI, and each flag separately.
Also, you
need to define exactly what you mean by inconsistent. There are always
some fluctuations from one run to the next.
Of course, we didn't expect that the results would always be the same.
We're trying to make a correlation between cache accesses/misses and TLB
accesses/misses on this machine.
Here's the code we're trying to monitor:
double cache_loop(double * tmp, unsigned long size){
    double acc = 0.0;   /* start from zero so the sum is well defined */
    unsigned long i;
    for(i = 0; i < size; i++){
        acc += tmp[i];
    }
    return acc;
}
Here, the array tmp is allocated with either mmap or malloc
(depending on the tests we wanted to perform). Before calling this
function, we initialize the array so as to force page allocation, then
we run a first loop around this function to warm up the cache.
The real test comes afterwards, where we wrap a loop around the function call.
Basically, we have:
<code>
/* size is a variable parameter we set when calling our program;
 * its value varies from "in-cache" to "out-of-cache"
 */
for (i=0; i < size; i++)
    tmp[i] = 1;
for (i=0; i < 10; i++)
    res = cache_loop(tmp,size);
/* some more code which isn't relevant here */
for (i=0; i < ITERATIONS; i++)
    res = cache_loop(tmp,size);
</code>
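For anyone who wants to reproduce this, here is a minimal self-contained sketch of the test described above. The helper names (alloc_array, free_array, run_test), the use_mmap flag and the WARMUP_RUNS constant are illustrative and not taken from our actual program; the mmap flags are an assumption about how an anonymous private mapping would typically be requested on Linux.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

#define WARMUP_RUNS 10           /* same warm-up count as in the snippet above */

/* Sum the array; acc starts at 0 so the result is well defined. */
static double cache_loop(const double *tmp, unsigned long size)
{
    double acc = 0.0;
    for (unsigned long i = 0; i < size; i++)
        acc += tmp[i];
    return acc;
}

/* Allocate `size` doubles with mmap (use_mmap != 0) or malloc.
 * Returns NULL on failure. Hypothetical helper. */
static double *alloc_array(unsigned long size, int use_mmap)
{
    size_t bytes = size * sizeof(double);
    if (use_mmap) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }
    return malloc(bytes);
}

/* Release memory obtained from alloc_array with the same use_mmap value. */
static void free_array(double *tmp, unsigned long size, int use_mmap)
{
    if (use_mmap)
        munmap(tmp, size * sizeof(double));
    else
        free(tmp);
}

/* Touch every element (forces page allocation), warm the cache,
 * then run the measured loop. Returns the last sum. */
static double run_test(unsigned long size, unsigned long iterations, int use_mmap)
{
    double *tmp = alloc_array(size, use_mmap);
    assert(tmp != NULL);
    volatile double res = 0.0;   /* every result is stored, never discarded */

    for (unsigned long i = 0; i < size; i++)
        tmp[i] = 1;
    for (int i = 0; i < WARMUP_RUNS; i++)
        res = cache_loop(tmp, size);
    for (unsigned long i = 0; i < iterations; i++)
        res = cache_loop(tmp, size);

    free_array(tmp, size, use_mmap);
    return res;
}
```

A caller would pick size to land in or out of the cache, e.g. run_test(size, ITERATIONS, 1) for the mmap variant. With aggressive optimization the compiler may still merge the repeated sums, so it is worth checking the generated code before trusting the counters.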
We would like to monitor the L2 cache misses (hence the use of the
LAST_LEVEL_CACHE_MISSES event with pfmon).
As for what I deem "inconsistent":
We have 4 MB of L2 cache per processor on this computer, and when
monitoring the whole program with pfmon (no
--trigger-code-{start|stop}-address used) we get a few orders of
magnitude fewer cache misses than we predicted (according to the size of
the data we allocated).
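To make the kind of prediction meant here concrete, a rough sketch follows. It assumes 64-byte cache lines and a purely sequential sweep, and it assumes that once the array no longer fits in the 4 MB L2, every line misses on every pass. That is a deliberate simplification (it ignores hardware prefetching and anything else sharing the cache), and the function name and constants are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u                    /* assumed cache-line size */
#define L2_BYTES   (4u * 1024u * 1024u)   /* 4 MB of L2, as stated above */

/* Naive estimate of L2 line misses for `passes` sequential sweeps over
 * `size` doubles. If the array fits in L2, assume no misses after warm-up. */
static uint64_t expected_l2_misses(uint64_t size, uint64_t passes)
{
    uint64_t bytes = size * sizeof(double);
    if (bytes <= L2_BYTES)
        return 0;

    /* Round up to whole cache lines. */
    uint64_t lines = (bytes + LINE_BYTES - 1) / LINE_BYTES;

    /* Guard against overflow of the product below. */
    assert(passes == 0 || lines <= UINT64_MAX / passes);
    return lines * passes;
}
```

Under these assumptions, an 8 MB array (1048576 doubles) swept 100 times covers 131072 lines per pass, so the estimate is about 13 million misses.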
We know that we have to monitor the hardware prefetches in a separate
run, but even then, the figures don't match our estimates.
We suspect that out-of-order execution can interfere with our
expectations, but it seems highly unlikely that it can change our
results that much.
We have run out of ideas to explain this, hence the question about how to
measure cache misses accurately (with pfmon for a start, but even
in-code probes would do, of course).
--
Stéphane Zuckerman