Stavros:

I am retired and do not have funds similar to when I worked at HP, so
my testing is somewhat restricted to modern x86 (AMD & Intel, ~$1000
US) machines and the (Linux) gcc compiler.  I have done a lot of
performance tuning and measurement on Judy for the next version.
Right now I am using an Intel E8600 (6 MB L2 cache) processor clocked
at 4.2 GHz with 16 GB of RAM as a development machine.  In the past I
have always been frustrated by "flaws/quirks/bugs" in
processor/compiler/OS design.  I have been able to improve the
performance of Judy a lot by focusing primarily on cache-line fills.

> advantage of prefetching / speculative / lookahead fetching, deep
> pipelines (avoiding branches), etc.

To my great delight, the Core 2 family (and probably the latest AMD
parts), together with the gcc compiler, have made attention to these
details unnecessary, or at least minor compared to attention to
cache-line fills.  In the past, I needed to pay a lot of attention to
"avoiding branches".  I don't know how, but the E8600 does not seem to
suffer from "branches spoiling the pipeline" (probably a call-return
stack).  Therefore, I no longer write code to avoid branching (and
variable shifts).  Released Judy (1.0.4) is paranoid about these
things.

I believe the frontiers in performance (for problems that have large
working sets) are now in two areas (software):

1) Minimize cache-line fills (on reads only) -- now more than ever.

2) OS handling of TLB misses.  User code has very limited control
   over this.

To give you some perspective:  I recently wrote a routine for use in
Judy that takes the integer (floor) log (base 2) of a 32/64-bit
number.  It has a lot of branches and shifts.  The test routine tries
all the numbers in 32 bits (~4 billion).  The test program ran in
less than 10 seconds.  I summed and printed the total results so the
compiler could not optimize out the loop.  2.4 ns (10 sec / 2^32) to
execute dozens of instructions was very impressive to me.  Intel
claims up to 3 instructions per clock, and in my case that is ~4
clocks per nanosecond, so up to 12 instructions per nanosecond.

Now contrast this with another test case:  I have another test program
that creates a bitmap ~4 billion bits in size.  It fills the bitmap
with pseudo-random bits and reads them back in a different (random)
order.  I then plot the speed vs. the number of bits set.  The sizes
and speeds of the processor's different caches show up very clearly,
as does the speed of the RAM as the working set increases.  When the
accessed bit (including calculating the word offset and mask) is in
the first-level cache (I recall Core 2 has 2 levels), the access time
is about 4 ns.  However, after subtracting out about 3 ns of overhead
for generating the random number, it is less than 1 ns.  In contrast,
when the access includes a cache-line fill, the access time is ~70 ns.
That corresponds to an effective working set of 512 MB of RAM.  When
the working set is increased to 8 GB of RAM (2^36 bits), the access
time increases to ~125 ns.  I assume this increase is caused by more
TLB misses being handled by the OS (Ubuntu Linux 2.6.24-23 kernel).

In the past, I have plotted the random-access (read) time of RAM vs.
the working-set size on many different machines and OSes.  It was
very revealing how the OS/hardware manages TLB misses.

Well, by now I have probably put you to sleep with all this data.
However, if I sparked your interest, you are welcome to all the test
code (in C).

Thanks for your interest,

Doug

 
Doug Baskins <[email protected]>




________________________________
From: Stavros Macrakis <[email protected]>
To: Doug Baskins <[email protected]>
Sent: Monday, February 16, 2009 2:52:27 AM
Subject: Efficiency on modern processors

Doug,

Thanks for your work on Judy.  I found your article "A 10-minute 
description..." very interesting as a description of your efficiency 
philosophy.  That article focusses on cache-line fills as a critical bottleneck 
in modern architectures.  I was wondering if you'd written up thoughts on other 
aspects of optimizing for modern architectures, notably taking advantage of 
prefetching / speculative / lookahead fetching, deep pipelines (avoiding 
branches), etc.  Or are these too sensitive to compiler differences?

            -s
_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel
