On 11/09/2011 11:53 AM, Francois Berenger wrote:
On 11/09/2011 07:21 PM, Pascal wrote:
I have more problems with L2 cache miss events and memory bandwidth. A
quad core means 4 times the bandwidth necessary for a single process...
If your code is already a bit greedy, the scale-up is not good.
I never went down to this level of optimization.
Are you using valgrind to detect cache miss events?
No, I am not sure valgrind can cope with multithreaded applications correctly.
In this particular case my code runs faster on an Intel Q9400@2.67GHz
with 800MHz DDR2 than on an Intel Q9505@2.83GHz with 667MHz DDR2. Also
I have a nice scale-up on a 4*12 Opteron system (each CPU has 2
dual-channel memory buses) but not on my standard quad core. If I get my
hands on an i7-920 equipped with triple-channel DDR3, the program should
run much faster despite the same CPU clock.
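
As a rough back-of-the-envelope check on the comparison above, theoretical
peak memory bandwidth is transfers per second, times 8 bytes per 64-bit
transfer, times the number of channels. The sketch below assumes that
"800MHz DDR2" and "667MHz DDR2" mean DDR2-800 and DDR2-667 running in
dual-channel mode, which the message does not confirm; peak_bandwidth_gbs
is a hypothetical helper name.

```c
/* Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
 * mega_transfers_per_s: memory data rate in MT/s (e.g. 800 for DDR2-800).
 * channels: number of 64-bit memory channels (assumed, not measured). */
double peak_bandwidth_gbs(double mega_transfers_per_s, int channels)
{
    const double bytes_per_transfer = 8.0;  /* one 64-bit channel */
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9;
}
```

Under those assumptions, peak_bandwidth_gbs(800, 2) gives 12.8 GB/s and
peak_bandwidth_gbs(667, 2) gives about 10.7 GB/s, in both cases shared by
all four cores. Sustained bandwidth in practice is lower than this peak.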
Then I used perf[1] and oprofile[2] on Linux.
Have a look here for the whole story:
<http://blog.debroglie.net/2011/10/25/cpu-starvation/>
After gprof, usually I am done with optimization.
I would prefer to change my algorithm and would be afraid
of introducing optimizations that are architecture-dependent
into my software.
When I spot a bottleneck, changing the algorithm is my first reaction:
caching calculations, more efficient algorithms...
But once I had to do some manual loop tiling. It's kind of a change of
algorithm, as the size of a temporary variable changes as well, but the
number of operations remains the same. The code with the loop tiling is
~20% faster, only due to better use of the CPU cache.
<http://blog.debroglie.net/2011/10/28/loop-tiling/>
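
For readers who have not seen the technique, here is a minimal sketch of
loop tiling on a square matrix transpose. It is a generic illustration and
not the code from the blog post above; the function names and the tile
size TILE = 32 are assumptions chosen for the example, and the best tile
size depends on the cache of the target machine.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune it for the target cache */

/* Naive transpose of an n x n row-major matrix. dst is written with a
 * stride of n elements, so for large n consecutive stores land on
 * different cache lines. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: exactly the same n*n copies, but performed one
 * TILE x TILE block at a time so that the lines touched in src and dst
 * are more likely to still be in cache when they are reused. */
void transpose_tiled(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block so n need not be a multiple of TILE. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions do the same number of operations and produce the same
result; only the order of memory accesses differs. Whether the tiled
version is actually faster, and by how much, has to be measured on the
machine in question.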
[1] http://kernel.org/ (the package name should be perf-util or similar)
[2] http://oprofile.sourceforge.net/
Pascal