Stephane Eranian wrote:
Stephane,

On Wed, May 09, 2007 at 11:34:59AM +0200, Stéphane Zuckerman wrote:
We're trying to measure L2D MISSES on a Xeon Woodcrest (a dual cpu, dual core machine).

We've tried different hardware counters, namely:

- LAST_LEVEL_CACHE_MISSES, which, according to the documentation, is equivalent to L2_RQSTS:I_STATE (invalid cache lines) but doesn't count hardware prefetches

- L2_RQSTS:MESI, and various combinations of the M/E/S/I options, as well as PREFETCH combined with the SELF mask.

We're trying to measure L2 data misses accurately, but we're getting inconsistent results. How should we proceed?


If you are using pfmon, please provide exact command line.

We haven't inserted probes directly in the code yet. We are indeed using pfmon.

Here's what we've run:
pfmon -e LAST_LEVEL_CACHE_MISSES ./program
pfmon -e L2_RQSTS:I_STATE:ANY
pfmon -e L2_RQSTS:PREFETCH (which returns 0)

We also tried L2_RQSTS:MESI, and each flag separately.


Also, you need to define exactly what you mean by inconsistent. There are always some fluctuations from one run to the next.

Of course, we didn't expect the results to always be the same. We're trying to establish a correlation between cache accesses/misses and TLB accesses/misses on this machine.

Here's the code we're trying to monitor:

double cache_loop(double * tmp, unsigned long size){
  double acc = 0.0;  /* must be initialized before accumulating */
  unsigned long i;
  for(i = 0; i < size; i++){
    acc += tmp[i];
  }
  return acc;
}


Here, the array tmp is allocated with either mmap or malloc (depending on the test we want to perform). Before calling this function, we initialize the array to force page allocation, then we call the function in a first loop to warm up the cache.
The real test comes afterwards, where we wrap another loop around the function call.

Basically, we have:

<code>
/* size is a variable parameter we set when calling our program
 * its value varies from "in-cache" to "out-of-cache"
 */
for (i=0; i < size; i++)
        tmp[i] = 1;
for (i=0; i < 10; i++)
        res = cache_loop(tmp,size);

/* some more code which isn't relevant here */

for (i=0; i < ITERATIONS; i++)
        res = cache_loop(tmp,size);
</code>
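
The allocation step described above is not shown in the snippet. As a minimal sketch of the mmap variant, assuming an anonymous private mapping on Linux, it could look like the following. The helper names alloc_touched and free_touched are hypothetical and do not come from the original program.

```c
#define _GNU_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper, not from the original post: map room for `n`
 * doubles anonymously and write every element, so that each page is
 * actually allocated before any measurement starts. */
double *alloc_touched(size_t n)
{
    size_t i;
    double *p = mmap(NULL, n * sizeof *p, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    for (i = 0; i < n; i++)
        p[i] = 1;            /* same initial value as in the snippet above */
    return p;
}

/* Release a buffer obtained from alloc_touched(). */
void free_touched(double *p, size_t n)
{
    if (p != NULL)
        munmap(p, n * sizeof *p);
}
```

A caller would use `tmp = alloc_touched(size);` before the warm-up loop and `free_touched(tmp, size);` at the end. The malloc variant would only differ in the allocation and release calls.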

We would like to monitor the L2 cache misses (hence the use of the LAST_LEVEL_CACHE_MISSES event with pfmon).

As for what I deem "inconsistent":
We have 4 MB of L2 cache on this computer for each processor, and when monitoring the whole program with pfmon (no --trigger-code-{start|stop}-address used) we get a few orders of magnitude fewer cache misses than we predicted (according to the size of the data we allocated).
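
To illustrate the kind of prediction meant here, a naive estimate can be sketched as follows. This is only an illustration, not necessarily the calculation used for the figures above. It assumes 64-byte cache lines, the 4 MB L2 mentioned in the text, a purely sequential sweep, and no hardware prefetching. The function name estimate_l2_misses is hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines (not stated in the original post). */
#define CACHE_LINE_BYTES 64ULL
/* 4 MB of L2, as stated in the post. */
#define L2_BYTES (4ULL * 1024ULL * 1024ULL)

/* Naive estimate of L2 line misses for `passes` sequential sweeps over an
 * array of `n_doubles` doubles. It ignores hardware prefetching and assumes
 * that an array larger than the L2 misses on every line in every pass,
 * while an array that fits misses nothing once the cache is warm. */
unsigned long long estimate_l2_misses(unsigned long n_doubles,
                                      unsigned long passes)
{
    unsigned long long bytes = (unsigned long long)n_doubles * sizeof(double);
    unsigned long long lines =
        (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;

    if (bytes <= L2_BYTES)
        return 0;            /* working set fits in L2 after warm-up */
    return lines * passes;   /* one miss per line per pass */
}
```

Under these assumptions, an 8 MB array (1048576 doubles) swept 10 times gives 131072 lines per pass, so about 1.3 million misses. A measured count far below such a figure would point to the prefetcher bringing lines in ahead of the demand loads, which this model deliberately leaves out.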

We know that we have to monitor the hardware prefetches in a separate run, but even then, the figures don't match our estimates.

We suspect that out-of-order execution can tamper with our expectations, but it seems highly unlikely that it could change our results that much.

We ran out of ideas to explain this, hence the question about how to measure cache misses accurately (with pfmon for a start, but even in-code probes would do, of course).

--
Stéphane Zuckerman
_______________________________________________
perfmon mailing list
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/