Re: Try faster JudyGet.c for judy-1.0.5

john skaller Sat, 10 Nov 2012 02:07:15 -0800

On 10/11/2012, at 2:50 PM, Doug Baskins wrote:
> 
> The 32 bit Judy needs more extensive modifications to help performance.
> Judy1 - 32 bit needs a deletion of the Leaf1 (already done in 64 bit).
> This would get rid of the huge "mountain" in the performance curve with
> that object.  (I have no explanation of why a linear search of small
> number of byte objects takes so long).


Byte addressing on i32 is implemented by a very complex word
addressing with masking and special code to ensure atomic behaviour
is not lost. The data bus is 32 bits. Addressing bytes should be avoided
at all costs. Odd addresses should be avoided.

It works, but it is slow. It's faster to use aligned word addressing for
a byte array, except for the last byte.

Not sure if this helps.

The 64 bit chips have a completely different architecture.
Perhaps they handle this kind of thing better, since
they have to handle the 2 and 4 byte values the same way:
64 bit bus. So probably more pins and better technology,
with a lucky side effect of making the 8 bit values faster
to handle as well.

> I have found that -O2 to be the overall best with both gcc and icc.  -O3
> is NOT a good choice.  Also, I have been misled far to often by thinking
> that -Os is a good choice.  The code differences (*.s) seem to be
> branches to common exit code rather than leaving it in-line.  Does a
> "jmp" instruction sometimes incur a mis-predict?  I wouldn't think so.

No but it can lead to a spill. Just consider the *previous* conditional
branch.

Would be interesting to compare the i5 to an AMD chip,
since the Intel chips have that rubbish hyperthreading which
no doubt affects the cache in a bad way. AMD doesn't.

> 
> The development of a new and much faster Judy has be hampered greatly by
> the "mystery".  I have seen greater than 2X changes in performance in
> areas where the source code has NOT changed.

I have seen this too. I have a test function, namely Ackermann's
function, which I use to compare Felix code with C code.

Felix does not use C if() { ... } else { ... } or for loops or while loops,
it uses exclusively unconditional and conditional gotos.
[I.e. it generals C code as "close to assembler as possible"]

At one stage, Felix trashed C utterly; 50% faster than the C code.
Today, C is 30% faster than Felix. Same C code. Same C code
generated by Felix. Seem gcc has either lost the ability
to do proper control flow analysis, or it has gained the ability
to optimise some C constructions better.

Either way, it still make an utter mess of the optimisation,
failing to do really basic stuff like get rid of the need to
recursively call the function via the ABI compliant calling
protocol (which is a basic optimisation that should be used
everywhere in a translation unit: especially for recursive
functions).

In any case, do not forget that the i86 arch is a dying architecture.
It was always an extremely poorly designed heap of rubbish.
Pity Apple didn't kill the IBM PC much earlier and we'd have the vastly
superior Motorola technology. Pity IBM was again so stupid as
to charge way too much for the PowerPC, leading to Apple
dumping it.

Today, ARM is king.

The bottom line here is that working for a long time for a vastly superior
performance may miss the boat.

> 
> Then I discovered that VERY small changes in the code that did linear
> searches could have a dramatic change in performance.  This suggested
> that branch prediction was the culprit.  But, I have not been able to
> figure out how branch prediction works or not.  I would like to have
> some way to "tell" the compiler when a branch (or not) is going to
> happen with greater than 99% probabity.

You can with gcc. I do it in Felix. use

__builtin_expect

What this is "supposed" to do is help the compiler re-organise code blocks
so "more likely" code is the "drop through" case. I have not noticed this
doing anything useful though :)

> I have come to a very very general conclusion.  Making the generated
> code smaller (# size *.o) by design (not by compiler option) seems to
> generally to help performance.  However, there are exceptions that
> simply "fly in the face of reality".  See the best and worst of the 4
> different tests -- 32/64 bit Judy1 and JudyL.  There are 18 variations
> of each.  Two different compilers (icc, gcc), with and without "popcnt"
> and 3 compiler options -Os, -O2 and -O3.  Obviously, there is a
> "mystery" yet to be solved.

Then you should add clang to the mix. It's another, less mature,
compiler, but I am using it exclusively now on OSX (but with
GNU libraries).


--
john skaller
[email protected]
http://felix-lang.org




------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_d2d_nov
_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel

Re: Try faster JudyGet.c for judy-1.0.5

Reply via email to