On Sun, Feb 15, 2009 at 11:25 PM, Doug Baskins <[email protected]> wrote:
[snip]
> To my great delight, the Core 2 family (and probably latest AMD),
> together with the gcc compiler have made attention to these details
> unnecessary or very small compared to attention to cache-line fills. In
> the past, I needed to pay a lot of attention to "avoiding branches". I
> don't know how, but the E8600 does not seem to suffer from "branches
> spoiling the pipeline". (Probably a call-return stack) Therefore, I do
> not write code to avoid branching (and variable shifts) anymore.
> Released Judy (1.0.4) is paranoid about these things.

Prediction eliminates branch stalls very effectively. Newer processors
(like the i7) also have simultaneous multithreading, so when a thread of
execution is waiting on an unspeculated memory read the latency can be
hidden by letting another thread run.

But those aren't the only interesting processors out there: the new
low-power Intel Atom is an in-order core, so there you do have to worry
about branch-caused stalls. http://en.wikipedia.org/wiki/Larrabee_(GPU)
is likewise in-order, and there are many modern and fast DSPs where
branching is bad. (A toy example of the kind of branch avoidance I mean
is at the end of this mail.)

> I believe the frontiers in performance (for problems that have large
> working sets) are now in two areas (software):
>
> 1) Minimize cache-line fills (on reads only) -- now more than ever.
>
> 2) OS handling of TLB misses. I (user code) have very limited control
> of this.

One area you might want to consider is worst-case complexity. In
realtime situations it's important that the code not suddenly take much
longer to run in some rare cases. I've often found that smart designs
which behave well in the worst case are also surprisingly good on
average, so maybe considering that problem will inspire you. Complexity
also matters because of the risk of denial-of-service attacks (e.g.,
someone figures out that an app uses a particular hash function and then
sends it lots of colliding requests).

As far as the TLB goes, Linux has a facility to offer large pages to
processes (Google "HugeTLB"), but due to the complexity of dealing with
it, nothing outside of scientific computing seems to use it. It would be
nice if something in a library could use large pages under the hood. (A
sketch of one way to do that is at the end of this mail.)

> To give you some perspective: I recently wrote a routine to use in
> Judy that takes the integer (floor) log (base 2) of a 32/64-bit number.
> It has a lot of branches and shifts. The test routine will try all the
> numbers in 32 bits (~4 billion). The test program ran in less than 10
> seconds. I summed and printed the total results so the compiler could
> not optimize out the loop. 2.4 ns (10 sec / 2^32) to execute dozens of
> instructions was very impressive to me. Intel claims up to 3
> instructions per clock, and in my case that is 4 clocks per nanosecond,
> so up to 12 instructions per nanosecond.

An alternative perspective is that now that computers are, say, 10x
faster than they used to be, I'm working on problems at least 10x larger
and expecting them to still be fast. This expectation especially fails
when I find that much software has complexity worse than O(N).

In any case, just some thoughts. Thanks for all your work on libjudy.
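
P.S. To make the in-order-core point concrete, here is a minimal toy
sketch of the sort of branch-avoidance transformation I mean (my own
names, not Judy code; it assumes gcc's arithmetic right shift on signed
integers, and that a - b does not overflow):

    #include <stdint.h>

    /* Branchy version: fine on a big out-of-order core with good
     * prediction, but each mispredict stalls an in-order pipeline. */
    static int32_t max_branchy(int32_t a, int32_t b)
    {
        return (a > b) ? a : b;
    }

    /* Branch-free version using a sign mask: (a - b) >> 31 is all-ones
     * when a < b and all-zeros otherwise, so the mask selects b or a.
     * (Assumes arithmetic right shift on signed ints, as gcc does,
     * and that a - b does not overflow.) */
    static int32_t max_branchless(int32_t a, int32_t b)
    {
        int32_t diff = a - b;
        int32_t mask = diff >> 31;   /* 0 if a >= b, -1 if a < b */
        return a - (diff & mask);    /* a, or a - (a - b) == b */
    }

On a Core 2 the branchy version is usually just as fast; on an
Atom/Larrabee-style in-order pipeline (or a DSP) the mask trick can win.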
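
And a sketch of what "large pages under the hood" might look like. This
assumes a kernel and libc new enough to expose MAP_HUGETLB and some
pages reserved in /proc/sys/vm/nr_hugepages; older kernels need a file
in a mounted hugetlbfs instead, and alloc_maybe_huge is a name I made
up, not a real API:

    #define _GNU_SOURCE          /* for MAP_HUGETLB on glibc */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Try to grab a huge-page-backed anonymous region; fall back to
     * ordinary pages if no huge pages are available.  A library could
     * do this internally and the caller would never know. */
    static void *alloc_maybe_huge(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)   /* e.g. nr_hugepages is zero */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    int main(void)
    {
        size_t len = 2UL << 20;           /* one 2 MiB huge page */
        void *p = alloc_maybe_huge(len);
        if (!p) { perror("mmap"); return 1; }
        memset(p, 0, len);                /* touch it so it is really mapped */
        munmap(p, len);
        return 0;
    }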
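
Finally, a guess at the shape of your log2 experiment. floor_log2 here
is my reconstruction of "branches and shifts", not your actual routine;
summing and printing the results keeps the compiler from optimizing the
loop away, as you describe:

    #include <stdio.h>
    #include <stdint.h>

    /* One branch-and-shift floor(log2(x)); returns 0 for x == 0. */
    static unsigned floor_log2(uint32_t x)
    {
        unsigned r = 0;
        if (x >= 1u << 16) { x >>= 16; r += 16; }
        if (x >= 1u << 8)  { x >>= 8;  r += 8; }
        if (x >= 1u << 4)  { x >>= 4;  r += 4; }
        if (x >= 1u << 2)  { x >>= 2;  r += 2; }
        if (x >= 1u << 1)  {           r += 1; }
        return r;
    }

    int main(void)
    {
        uint64_t sum = 0;
        uint32_t x = 0;
        do {                      /* all 2^32 values, wrapping once */
            sum += floor_log2(x);
        } while (++x != 0);
        /* Print the sum so the loop cannot be optimized out. */
        printf("sum = %llu\n", (unsigned long long)sum);
        return 0;
    }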
