Hi,

Please keep the mailing list in copy when replying, so that other users can
follow the discussion.

Answers to your questions appear below.

A1. The number of streams depends on the memory-level parallelism (MLP) you
want to achieve. A higher MLP puts more pressure on the cache subsystem, but
it also ensures that the cache blocks of the array are visited more
frequently, which reduces the probability of having them evicted.

A2. The variable ensures that you do something with each array element, so
that the compiler does not optimize the code by removing the array access (if
you do nothing useful with the loaded value, the compiler might drop the
instruction).
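
As a rough illustration of A2, here is a minimal C sketch of one way to keep
the loads alive. The names chase_and_keep and sink are hypothetical, and the
volatile store is just one common technique, not necessarily what our
microbenchmark does.

```c
#include <stddef.h>

/* Hypothetical sink: a volatile store is a side effect the compiler must keep. */
static volatile size_t sink;

/* Follow `steps` links starting at index `start`.
 * Each load feeds the next one, and the final value is stored to `sink`,
 * so the compiler cannot remove the array accesses as dead code. */
size_t chase_and_keep(const size_t *a, size_t start, size_t steps)
{
    size_t idx = start;
    for (size_t i = 0; i < steps; i++)
        idx = a[idx];   /* the loaded value is the next index */
    sink = idx;         /* observable use of the last loaded value */
    return idx;
}
```

Without the store to sink (or some other use of the returned value), an
optimizing compiler would be free to drop the whole loop.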

A3. You should measure the LLC misses and LLC accesses using performance
counters with OProfile or VTune. The hit ratio is then
1 - (LLC misses / LLC accesses).

A4. Checking the assembly ensures that no other useless instructions are
executed during the traversal of the array.

Regards,
-Stavros

On May 25, 2012, at 3:14 PM, suixiufeng wrote:

Thank you for your reply.
I have several questions:
1) How many streams should I have?
2) What is the purpose of the variable in each stream?
3) How do I ensure that the microbenchmark has a hit ratio close to 100% in
the LLC?
4) WHY? "Look at the assembly code to ensure that the body of the loop has as 
many assembly instructions as the number of streams you want to have".
Thank you very much!

2012/5/25 Volos Stavros <[email protected]<mailto:[email protected]>>
Hi,

Thanks for your interest.

Our microbenchmark traverses an array of a given size (depending on the cache
size you want to pollute).

The access pattern depends on the value of each accessed array element. For
example, in an array with A[2] = 7 and A[7] = 40, the access pattern starting
from element 2 is 2->7->40.
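
The example above can be sketched in C roughly as follows. The function name
trace_walk and the output buffer are only for illustration and are not part
of our microbenchmark.

```c
#include <stddef.h>

/* Record the first `n` indices visited when the walk starts at `start`.
 * The value stored in each element is the index of the next element. */
void trace_walk(const size_t *a, size_t start, size_t n, size_t *out)
{
    size_t idx = start;
    for (size_t i = 0; i < n; i++) {
        out[i] = idx;   /* index accessed at this step */
        idx = a[idx];   /* the element's value selects the next index */
    }
}
```

With A[2] = 7 and A[7] = 40, calling trace_walk(A, 2, 3, out) fills out with
2, 7, 40.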

Now, the tricky part is to find how to initialize the array so we make sure 
that each cache block of the array is
re-accessed after accessing the rest of the cache blocks. At the same time, we 
want our access pattern to
be random (not captured by the existing on-chip prefetchers) so as to ensure 
that the accesses miss in the
L1 and L2 caches.

Initialization:
a) Initialize the array with 1, 2, 3, 4, ..., 0 (element i holds i+1, and the
last element holds 0).
b) For every element of the array, choose a random element and swap their
values.
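
A minimal C sketch of steps a) and b), under a few assumptions: the array
holds plain size_t indices (no padding to the cache-block size), rand() is
used as the random source, and the name init_chain is hypothetical. Note that
the swaps produce a random permutation that can split into several cycles,
which is why the fairness check mentioned further down matters.

```c
#include <stddef.h>
#include <stdlib.h>

/* Initialize `a` (length n, n >= 1) as described in steps a) and b).
 * Assumption: rand() is good enough here. It is slightly biased under
 * the modulo, and RAND_MAX may be too small for very large arrays. */
void init_chain(size_t *a, size_t n, unsigned seed)
{
    srand(seed);

    /* a) 1, 2, 3, ..., 0: each element points to the next one. */
    for (size_t i = 0; i < n; i++)
        a[i] = (i + 1) % n;

    /* b) swap every element's value with that of a random element. */
    for (size_t i = 0; i < n; i++) {
        size_t j = (size_t)rand() % n;
        size_t tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

Because the values are only swapped, every index from 0 to n-1 still appears
exactly once in the array after initialization.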

Traversal: Initiate as many streams as the MLP you want to achieve (to ensure
that cache blocks are not evicted by the application). In each stream, assign
the accessed element to a variable. Look at the assembly code to ensure that
the body of the loop has as many assembly instructions as the number of
streams you want to have.
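
A rough C sketch of such a traversal, assuming four streams. NSTREAMS and
traverse_streams are hypothetical names, and the stream count is only an
example value, not a recommendation.

```c
#include <stddef.h>

#define NSTREAMS 4   /* assumed MLP target, chosen only for illustration */

/* Advance NSTREAMS independent pointer chases by `steps` links each.
 * pos[] holds the starting index of every stream on entry and the
 * final index on return, so the loads have an observable result. */
void traverse_streams(const size_t *a, size_t pos[NSTREAMS], size_t steps)
{
    /* One variable per stream, so the chains do not depend on each other. */
    size_t s0 = pos[0], s1 = pos[1], s2 = pos[2], s3 = pos[3];

    for (size_t i = 0; i < steps; i++) {
        s0 = a[s0];   /* ideally one load instruction per stream */
        s1 = a[s1];
        s2 = a[s2];
        s3 = a[s3];
    }

    pos[0] = s0;
    pos[1] = s1;
    pos[2] = s2;
    pos[3] = s3;
}
```

Compiling with something like gcc -O2 -S lets you inspect the loop body. You
would expect one load per stream plus the loop counter and branch, although
the exact output depends on the compiler and target.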

Before running any experiment, make sure that your access pattern is fair (all 
cache blocks are accessed before
accessing a cache block for the second time) and that your microbenchmark has a 
hit ratio close to 100% in the LLC
when running the application of your interest.
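
One possible way to check fairness in C, assuming the array holds a
permutation of the indices 0..n-1. The name is_fair is hypothetical. The
check works on array elements; if several elements share a cache block, you
would need to adapt it to count blocks instead.

```c
#include <stddef.h>

/* Return 1 if the walk starting at index 0 visits all n elements
 * (n >= 1) before coming back to index 0, and 0 otherwise. */
int is_fair(const size_t *a, size_t n)
{
    size_t idx = 0;
    size_t len = 0;

    do {
        idx = a[idx];   /* follow one link */
        len++;
    } while (idx != 0 && len <= n);   /* stop at 0 or after more than n links */

    return idx == 0 && len == n;
}
```

If the check fails, the chain has split into shorter cycles, and one option
is to re-run the initialization with a different random seed.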

Regards,
-Stavros.

________________________________________
From: suixiufeng [[email protected]<mailto:[email protected]>]
Sent: Saturday, May 19, 2012 9:06 AM
To: [email protected]<mailto:[email protected]>
Subject: The cache-polluting threads

Hi,
   You perform a cache sensitivity analysis by dedicating two cores to 
cache-polluting threads. I want to know how to write the polluter threads. 
Would you please give me an example? Thank you!

