On Sun, Feb 15, 2009 at 11:25 PM, Doug Baskins <[email protected]> wrote:
[snip]
> To my great delight, the Core 2 family (and probably latest AMD),
> together with the gcc compiler have made attention to these details
> unnecessary or very small compared to attention to cache-line fills. In
> the past, I needed to pay a lot of attention to "avoiding branches". I
> don't know how, but the E8600 does not seem to suffer from "branches
> spoiling the pipeline". (Probably a call-return stack) Therefore, I do
> not write code to avoid branching (and variable shifts) anymore.
> Released Judy (1.0.4) is paranoid about these things.

Prediction eliminates branch stalls very effectively. Newer processors
(like the i7) also have simultaneous multithreading, so when a thread of
execution is waiting on an unspeculated memory read the latency can be
hidden by letting another thread run.

But those aren't the only interesting processors out there: the new
low-power Intel Atom is an in-order core, so there you do have to worry
about branch-caused stalls. http://en.wikipedia.org/wiki/Larrabee_(GPU)
is likewise in-order, and there are many modern and fast DSPs where
branching is bad. (A toy example of the kind of branch avoidance I mean
is at the end of this mail.)

> I believe the frontiers in performance (for problems that have large
> working sets) are now in two areas (software):
>
> 1) Minimize cache-line fills (on reads only) -- now more than ever.
>
> 2) OS handling of TLB misses. I (user code) have very limited control
> of this.

One area you might want to consider is worst-case complexity. In
realtime situations it's important that the code not suddenly take much
longer to run in some rare cases. I've often found that smart designs
which behave well in the worst case are also surprisingly good on
average, so maybe considering that problem will inspire you. Complexity
also matters because of the risk of denial-of-service attacks (e.g.,
someone figures out that an app uses a particular hash function and then
sends it lots of colliding requests).

As far as the TLB goes, Linux has a facility to offer large pages to
processes (Google "HugeTLB"), but due to the complexity of dealing with
it, nothing outside of scientific computing seems to use it. It would be
nice if something in a library could use large pages under the hood. (A
sketch of one way to do that is at the end of this mail.)

> To give you some perspective: I recently wrote a routine to use in
> Judy that takes the integer (floor) log (base 2) of a 32/64-bit number.
> It has a lot of branches and shifts. The test routine will try all the
> numbers in 32 bits (~4 billion). The test program ran in less than 10
> seconds. I summed and printed the total results so the compiler could
> not optimize out the loop. 2.4 ns (10 sec / 2^32) to execute dozens of
> instructions was very impressive to me. Intel claims up to 3
> instructions per clock, and in my case that is 4 clocks per nanosecond,
> so up to 12 instructions per nanosecond.

An alternative perspective is that now that computers are, say, 10x
faster than they used to be, I'm working on problems at least 10x larger
and expecting them to still be fast. This expectation especially fails
when I find that much software has complexity worse than O(N).

In any case, just some thoughts. Thanks for all your work on libjudy.
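
P.S. To make the in-order-core point concrete, here is a minimal toy
sketch of the sort of branch-avoidance transformation I mean (my own
names, not Judy code; it assumes gcc's arithmetic right shift on signed
integers, and that a - b does not overflow):

    #include <stdint.h>

    /* Branchy version: fine on a big out-of-order core with good
     * prediction, but each mispredict stalls an in-order pipeline. */
    static int32_t max_branchy(int32_t a, int32_t b)
    {
        return (a > b) ? a : b;
    }

    /* Branch-free version using a sign mask: (a - b) >> 31 is all-ones
     * when a < b and all-zeros otherwise, so the mask selects b or a.
     * (Assumes arithmetic right shift on signed ints, as gcc does,
     * and that a - b does not overflow.) */
    static int32_t max_branchless(int32_t a, int32_t b)
    {
        int32_t diff = a - b;
        int32_t mask = diff >> 31;   /* 0 if a >= b, -1 if a < b */
        return a - (diff & mask);    /* a, or a - (a - b) == b */
    }

On a Core 2 the branchy version is usually just as fast; on an
Atom/Larrabee-style in-order pipeline (or a DSP) the mask trick can win.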
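
And a sketch of what "large pages under the hood" might look like. This
assumes a kernel and libc new enough to expose MAP_HUGETLB and some
pages reserved in /proc/sys/vm/nr_hugepages; older kernels need a file
in a mounted hugetlbfs instead, and alloc_maybe_huge is a name I made
up, not a real API:

    #define _GNU_SOURCE          /* for MAP_HUGETLB on glibc */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Try to grab a huge-page-backed anonymous region; fall back to
     * ordinary pages if no huge pages are available.  A library could
     * do this internally and the caller would never know. */
    static void *alloc_maybe_huge(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)   /* e.g. nr_hugepages is zero */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    int main(void)
    {
        size_t len = 2UL << 20;           /* one 2 MiB huge page */
        void *p = alloc_maybe_huge(len);
        if (!p) { perror("mmap"); return 1; }
        memset(p, 0, len);                /* touch it so it is really mapped */
        munmap(p, len);
        return 0;
    }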
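
Finally, a guess at the shape of your log2 experiment. floor_log2 here
is my reconstruction of "branches and shifts", not your actual routine;
summing and printing the results keeps the compiler from optimizing the
loop away, as you describe:

    #include <stdio.h>
    #include <stdint.h>

    /* One branch-and-shift floor(log2(x)); returns 0 for x == 0. */
    static unsigned floor_log2(uint32_t x)
    {
        unsigned r = 0;
        if (x >= 1u << 16) { x >>= 16; r += 16; }
        if (x >= 1u << 8)  { x >>= 8;  r += 8; }
        if (x >= 1u << 4)  { x >>= 4;  r += 4; }
        if (x >= 1u << 2)  { x >>= 2;  r += 2; }
        if (x >= 1u << 1)  {           r += 1; }
        return r;
    }

    int main(void)
    {
        uint64_t sum = 0;
        uint32_t x = 0;
        do {                      /* all 2^32 values, wrapping once */
            sum += floor_log2(x);
        } while (++x != 0);
        /* Print the sum so the loop cannot be optimized out. */
        printf("sum = %llu\n", (unsigned long long)sum);
        return 0;
    }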
