Hi,

Please include the mailing list when you reply so that other users can follow the discussion.
Answers to your questions appear below.

A1. The number of streams depends on the MLP (memory-level parallelism) you want to achieve. A higher MLP puts more pressure on the cache subsystem, but it ensures that the cache blocks of the array are visited more frequently, which reduces the probability of their being evicted.

A2. The variable ensures that you do something with each array element, so that the compiler does not optimize the code by removing the array access (if you do nothing useful with the loaded value, the compiler might drop the instruction).

A3. You should measure the LLC misses and LLC accesses using performance counters, with a tool such as OProfile or VTune.

A4. To ensure that no other useless instructions are executed during the traversal of the array.

Regards,
-Stavros

On May 25, 2012, at 3:14 PM, suixiufeng wrote:

Thank you for your reply. I have several questions:
1) How many streams should I have?
2) What is the purpose of the variable in each stream?
3) How can I ensure that the microbenchmark has a hit ratio close to 100% in the LLC?
4) Why? "Look at the assembly code to ensure that the body of the loop has as many assembly instructions as the number of streams you want to have".

Thank you very much!

2012/5/25 Volos Stavros <[email protected]>

Hi,

Thanks for your interest.

Our microbenchmark traverses an array of a given size (depending on the cache size you want to pollute). The access pattern depends on the value of each accessed array element. For example, in an array with A[2] = 7 and A[7] = 40, the access pattern starting from element 2 is 2->7->40.

Now, the tricky part is to find how to initialize the array so that each cache block of the array is re-accessed only after the rest of the cache blocks have been accessed. At the same time, we want the access pattern to be random (not captured by the existing on-chip prefetchers) so as to ensure that the accesses miss in the L1 and L2 caches.
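
The 2->7->40 example describes what is usually called a pointer chase: the value loaded from one element is the index of the next access. The following is only a minimal sketch of that idea, not the authors' code. The function name `chase` and its parameters are hypothetical, and the element type `size_t` is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Follow the chain start -> a[start] -> a[a[start]] -> ... for `steps` hops.
 * Each load depends on the result of the previous one, so the next address
 * is not known until the current load completes. */
size_t chase(const size_t *a, size_t n, size_t start, size_t steps)
{
    assert(start < n);
    size_t idx = start;
    for (size_t s = 0; s < steps; s++) {
        idx = a[idx];      /* the loaded value is the next index */
        assert(idx < n);   /* every stored value must be a valid index */
    }
    return idx;            /* returning idx keeps the loads observable */
}
```

With A[2] = 7 and A[7] = 40, `chase(A, n, 2, 2)` visits elements 2 and 7 and returns 40.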
Initialization:
a) Initialize the array with 1, 2, 3, 4, ..., 0.
b) For every element of the array, choose a random element and swap their values.

Traversal:
Initiate as many streams as the MLP you want to achieve (to ensure that cache blocks are not evicted by the application). In each stream you can set a variable. Look at the assembly code to ensure that the body of the loop has as many assembly instructions as the number of streams you want to have.

Before running any experiment, make sure that your access pattern is fair (all cache blocks are accessed before any cache block is accessed for the second time) and that your microbenchmark has a hit ratio close to 100% in the LLC when running the application of your interest.

Regards,
-Stavros.

________________________________________
From: suixiufeng [[email protected]]
Sent: Saturday, May 19, 2012 9:06 AM
To: [email protected]
Subject: The cache-polluting threads

Hi,

You perform a cache sensitivity analysis by dedicating two cores to cache-polluting threads. I want to know how to write the polluter threads. Would you please give me an example?

Thank you!
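
The initialization, fairness check, and multi-stream traversal described in the thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' microbenchmark. All function names are hypothetical, four streams and their starting points are arbitrary choices, and `rand()` stands in for whatever random source was actually used. Random swaps as in step (b) always give a permutation but can split it into several shorter cycles, so the sketch also includes Sattolo's algorithm (a swapped-in alternative, not mentioned in the thread) which always produces a single cycle. A real polluter would probably also space elements one per cache block; that is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Steps (a) and (b) as described: fill with 1, 2, ..., n-1, 0, then swap
 * each element's value with that of a randomly chosen element. */
void init_as_described(size_t *a, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        a[i] = (i + 1) % n;
    for (size_t i = 0; i < n; i++) {
        size_t j = (size_t)rand() % n;
        size_t tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}

/* Alternative, not from the thread: Sattolo's algorithm. Starting from the
 * identity, it yields a permutation that is one cycle of length n. */
void init_single_cycle(size_t *a, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        a[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j is in [0, i-1] */
        size_t tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}

/* Fairness check: hops needed for the chain from `start` to return to
 * `start`. The pattern visits every element once per lap only if this is n. */
size_t cycle_length(const size_t *a, size_t n, size_t start)
{
    size_t idx = start, len = 0;
    do {
        idx = a[idx];
        len++;
    } while (idx != start && len <= n);
    return len;
}

/* Four independent chains advanced in the same loop body. The loads of
 * different chains do not depend on each other, so they can overlap.
 * The starting points are an assumption made for illustration. */
size_t traverse4(const size_t *a, size_t n, size_t hops)
{
    assert(n > 0);
    size_t s0 = 0, s1 = n / 4, s2 = n / 2, s3 = (3 * n) / 4;
    for (size_t k = 0; k < hops; k++) {
        s0 = a[s0];
        s1 = a[s1];
        s2 = a[s2];
        s3 = a[s3];
    }
    /* Using the results keeps the compiler from removing the loads (A2). */
    return s0 + s1 + s2 + s3;
}
```

If `cycle_length` returns less than n after `init_as_described`, one option is to re-run the initialization with a different seed, another is to use `init_single_cycle`. As suggested in the thread, inspecting the generated assembly is still the way to confirm that the loop body contains only the expected loads.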
