Re: [Valgrind-users] Valgrind results for multicore machines

Josef Weidendorfer Fri, 20 Feb 2009 03:25:45 -0800

On Friday 20 February 2009, Chaitali Gupta wrote:
> I am trying to do cache profiling for a multi-threaded program written in C.
> I am using cachegrind to do so. Though the output is really descriptive, I
> could not find the results of L1 and L2 cache misses for different cores. I
> am running the multi-threaded program in a dual core machine. How do I get   
> to know the L1 and L2 data cache misses incurred in each core ?


Callgrind outputs cache simulation results by thread with 
"--separate-threads=yes" (use "--simulate-cache=yes" in addition).

Note that Valgrind serializes threads, switching between them after, AFAIK,
a "work slice" of 100,000 executed basic (super?) blocks. Thus, the
interleaving of work among threads under Valgrind says nothing about how it
would have happened in reality, be it by time-sharing on one core or by
running threads simultaneously on multiple cores.

The conclusion: the data separated by thread does not say much in principle.
However, it *can* be useful if you know that you use static work partitioning
(e.g. OpenMP without dynamic/guided scheduling), and the code executed in
each thread is more or less the same, only with different data: then one can
expect that, because the cache characteristics of the threads are probably
similar, Valgrind approximates reality in some way.

And of course, cachegrind/callgrind simulate just one cache hierarchy, even
for multithreaded code. However, multicores nowadays often have a shared LLC
(last level cache). So you should get some idea of the real LLC behavior
when you look at cachegrind's L2 results.

For anything better, one definitely needs a simulated time for each of the
threads, and a way for Valgrind to influence thread scheduling dynamically
(note that Valgrind currently leaves all scheduling decisions to the kernel).

> Any suggestion would be highly appreciated.

What are you after? I would usually first check that the sequential version
runs well, and only then move on to parallelization. Cachegrind/callgrind is
(currently) quite limited for the latter. Why not use OProfile (or a similar
tool) to check for load balancing? This is probably not satisfying
if you would like to analyse e.g. data sharing behaviour among cores. For
that, simulation would be useful, but cachegrind/callgrind currently
are not useful for it.

Josef

> 
> Regards
> Chaitali