On 10/11/2012, at 11:27 PM, Doug Baskins wrote:

> 
> > No but it can lead to a spill. Just consider the *previous* conditional 
> > branch.
> 
> Not sure what you mean by this.

Well, when you're loading a pipeline it comes from the cache.
A jump will require two cache lines: the code before the jump and the target.
If you have a conditional branch followed by two streams of code both 
with jumps in them, that's 4 cache lines and two pipelines to load.

The algorithm for loading cache lines and instruction pipelines has to be
very brain dead because it's implemented in hardware. Even though a
common jump target should reduce the number of cache lines to be
loaded (on average), the algorithm would have to be fairly sophisticated
to detect this: looking ahead it sees a conditional branch, then
two streams ending in jump instructions .. so perhaps it simply stops
filling the pipelines at the jump instructions until after the conditional
is actually evaluated and one of the predicted lookahead streams
can be discarded.

If it didn't do something like this it could clutter the cache with a lot
of prefetched cache lines, most of which would certainly be discarded
(after all, only one branch of a conditional will be taken).

No real idea, just guessing :)

> I have not tested it, but is a byte search (Intel) faster in 64bit mode than 
> 32bit mode?

No idea, but typically the exact same code used to run a lot slower in 32 bit
mode than in 64 bit mode. That is, the same source compiled as 64 bit code,
using 64 bit registers and addresses, is a lot faster, even when it doesn't
use any of the extra registers.

> If you have experience with a "fast" AMD chip, please let me know.

Nope. I run a Macbook Pro which is a core-2 duo.
I have no idea what the chips in my Rackspace slice are,
but it seems to run at the same speed as my Mac.

I'm using a laptop because I don't have enough electricity
to run a fast desktop (I live on a yacht and rely on solar and wind power).

> I do not know.   I still feel there is some secret magic in branch-prediction
> that even icc (Intel's compiler) doesn't know about.  Gcc is 
> certainly clueless.  And so am I.  BTW, I have written versions of 
> JudyGet.c that use gotos to "fool" the compiler into "not taking the branch"
> on the path that is 1% likely -- it did not help the performance.  I.e. the
> "drop thru case" is not the "magic", or it needs something more than that.

Right. Which may be why gcc doesn't use the branch prediction hints I give it.

> 
> > Today, ARM is king.
> 
> That happened fast, 

Yes indeed. Nokia and Microsoft got wiped out overnight by Apple,
and now Samsung has overtaken Apple. People go on about "mobile"
but that is not the real issue. The big issue is power consumption and
heat dissipation.

Of course .. it always has been .. I remember computer rooms the
size of a house .. and even bigger rooms full of air-conditioning plant :)
We've got to watch out for exploding phones now (lithium batteries
catch fire if they're overheated, particularly if they get wet).

So whilst Intel is struggling to get 8 cores running and 16 GPU cores
in a chip .. NVidia's latest GPU has 3072 double precision floating
point units.

Caching is WRONG. It's only useful for sequential processing.
For parallel processing, caching is the worst thing you can do.

NVidia does this right. You have slow memory, so you do NOT
cache it to improve performance. You just schedule threads
10 times faster than the memory. So the Kepler is designed
to obtain optimum performance with about 10,000 threads.
Oversaturation. You select a thread when its memory is ready.

This is the big change I think: whilst Intel chased compatibility,
making more and more complex chips, it has run into a roadblock.
It's an old-school company and it cannot adapt. Radically different
new technology will wipe it out. 

Of course I can't say when or what but the whole concept
of desktop processors is now dead. Complexity, caching,
branch prediction, etc .. it all costs HEAT. And heat kills
performance. It's basic quantum mechanics ;)


> do you have suggestions on how to develop for it?  

Nope. You can develop for iOS on a Mac.
No idea how to develop code for an Android device.

> Clang:  I would simply like to make predictable changes to the code and
> get predictable results.  I.e. I want to know the "mystery" and how to 
> program for it.   Perhaps the compilers will do the job in the future and
> make that a wasted effort.  However, I have been waiting since 2008
> (the E8600 processor is when I stumbled into the "mystery").

It's very likely not even the chip designers understand how the 
branch prediction and cache prefetching works, just as no one
these days understands how almost any part of any software works :)

As you may know, ATLAS (the auto-tuned high performance linear algebra
library) solves this problem with 10 hours or more of brute force trial
and error configuration. It may be the only way to go. Don't predict: measure.

BTW: From Wikipedia:

Future

The Maxwell architecture, the successor to Kepler, will have for the first time 
an integrated ARM CPU of its own (project Denver);[21] This will make Maxwell 
GPU more independent from the main CPU according to Nvidia's CEO - Jen Hsun 
Huang.[22] Maxwell was announced in September 2010[23] and is expected to be 
released in 2013. After Maxwell, the next architecture is code-named 
Einstein.[24] Regarding the Einstein architecture, Nvidia has only revealed its 
name without any further information.

--
john skaller
[email protected]
http://felix-lang.org




_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel