Toni:

I am happy to hear that Word_t X n sized structures solved the problem.
I am working on a mostly re-written Judy with the promise of being
faster.  This was a bug reported previously; it only occurred on
machines that can have int16_t-aligned pointers (such as Motorola 68K
processors).  x86 machines can also handle that kind of alignment,
but the code ran much slower, so I am surprised there is an x86
compiler that would generate code to do that.  (It is strictly illegal
on RISC machines, and may not be a performance problem on Core 2 Duo
machines -- I haven't tested that.)

There has been discussion about "cache-aligned" data structures as
a performance enhancement used in Judy.  In 2002 I wrote a special
version of malloc(3) that did cache-aligned allocations.  To my great
amazement, it helped only on older processors (older than circa 2001).
It did not help the performance of Judy on a Pentium III (750 MHz).
Since I did not want to be in the malloc business, I elected to
support Judy with the standard malloc.  I made it very easy to use Judy
with any malloc you desire through the "JudyMalloc.c" interface.  Doug
Lea's "dlmalloc" proved to be the best malloc I tested and performed
very well (though it had a bug he has since fixed -- every malloc I
tested had the same bug).  I feel his malloc is probably about the best
that can be done with malloc.  Profiles of Judy show malloc() to be a
very small part of Judy's performance.
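For illustration, a cache-aligned allocator behind an interface shaped
like JudyMalloc.c might look like the sketch below.  The function names
and Word_t sizing follow my recollection of that file; the 64-byte line
size and the use of posix_memalign() are assumptions made here -- the
shipped JudyMalloc.c simply forwards to malloc()/free().

```c
/* Hypothetical cache-aligned variant of the JudyMalloc.c interface.
 * The shipped JudyMalloc.c simply forwards to malloc()/free(); the
 * 64-byte line size and posix_memalign() below are assumptions made
 * for this sketch. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

typedef uintptr_t Word_t;      /* Judy's machine-word type */
#define CACHE_LINE 64          /* assumed cache-line size in bytes */

/* Allocate Words machine words, aligned to a cache-line boundary.
 * Judy signals allocation failure with a return of 0. */
Word_t JudyMalloc(Word_t Words)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, Words * sizeof(Word_t)) != 0)
        return 0;
    return (Word_t)p;
}

void JudyFree(void *PWord, Word_t Words)
{
    (void)Words;               /* size hint is unused by free() */
    free(PWord);
}
```

Swapping in a different allocator (dlmalloc, say) means changing only
these two functions; the rest of Judy never calls malloc() directly.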

Sean Barrett <http://www.nothings.org/computer/judy/> did a performance
evaluation of Judy with a hashing method several years ago.  The results
were flawed for several reasons.  1) Tests were done on a Pentium III,
which has several flaws of its own.  2) A 512MB machine is far too
small to see Judy do anything remarkable in performance.  3) And this
is probably the biggest one -- he did it on a Windows 98 machine, with
all the flaws in its malloc.  Judy was tested on machines that had as
much as 256 gigabytes of RAM (HP Superdome).  Judy was designed to
scale to machines much larger than that amount of RAM (petabytes).
His performance graphs did not have the usual shape (at 2-4 million
Indexes) that Judy has on a PIII machine -- suggesting malloc was
getting into the act with its usual "bug(s)".

I have to agree with you about the readability of Judy.  Alan did his
best to obscure the code -- perhaps unintentionally.  I feel only
someone who already knows the design (in his head) can even begin
reading the code.  Several people have told me they just cannot read the
code.  At the time, HP was very paranoid that somebody would copy
Judy, so I did not object strongly enough -- a decision I have
regretted ever since.  (We did not know it would eventually be made
open source.)  I have re-written the next version to be as readable as
I can make it.  You are welcome to a copy, but it is barely runnable
at this point.  I also have older versions in better shape and closer
to the released version in design.  It still has a few confounding
macros.

The performance of Judy1 and JudyL on a modern machine (Core 2 Duo) is
about the same (in-cache) until you get to very large populations,
where the bit-mapped Index structures in Judy1 overwhelm the memory
inefficiency of JudyL, which carries a Word_t of overhead in every
Index plus an additional cache-line fill called the "Value".
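To make that difference concrete, here is a back-of-envelope sketch.
The byte counts are my simplifying assumptions for a 64-bit machine,
not measurements of the real structures:

```c
/* Rough per-index memory cost, 64-bit machine.  The 32-byte/256-index
 * bitmap figure and the Word_t-plus-Value accounting are simplifying
 * assumptions for illustration only. */

/* Judy1 bit-mapped leaf: 256 possible indexes covered by a
   256-bit (32-byte) map -- one bit per index when dense. */
double judy1_bytes_per_index(void) { return 32.0 / 256.0; }

/* JudyL: roughly one Word_t of index overhead plus a Word_t Value
   per index -- and the Value costs a separate cache-line fill
   on top of the space. */
double judyl_bytes_per_index(void) { return 8.0 + 8.0; }
```

At a dense population that is about 0.125 bytes per Index for Judy1
against 16 for JudyL -- a two-orders-of-magnitude gap that only shows
up once the population is large enough to make the bitmaps dense.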

You asked whether the bit instructions in almost all modern processors
would help the C-coded versions of Judy.  I removed the processor-
specific code in Judy when it did not measure a significant performance
improvement (I think in the released version).  So that avenue is
probably a waste of time.  Even the macros in Judy.h (e.g. JLG(), J1T())
do not help the performance of Judy on a Core 2 Duo processor.
(I think they finally got the "call return stack" right.)  No processor
designer can leave that enhancement out of a future design and still
compete with the performance of Intel's design.  So, in future Judys,
macros used merely to save a subroutine call will be avoided in the
code.  (I measure a subroutine call to be about 10 nanoseconds on that
processor.)  It was in the hundreds of nanoseconds on processors with a
poor (or no) implementation of a "call return stack".
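The pattern those macros follow is just a wrapper around a subroutine
call.  The sketch below mimics the shape with an invented stand-in
lookup (it is not the real Judy API); the point is that on a processor
with a good call-return stack, the macro is a naming convenience, not
a speedup over calling the function directly:

```c
/* The Judy.h macros (JLG(), J1T(), ...) wrap an underlying
 * subroutine call.  This sketch mimics that shape with an invented
 * stand-in lookup; it is not the real Judy API. */
#include <stddef.h>

static long table[] = { 2, 3, 5, 7, 11 };   /* stand-in "array" */

/* the subroutine the macro wraps */
long *lookup(long *base, size_t n, long key)
{
    for (size_t i = 0; i < n; i++)
        if (base[i] == key)
            return &base[i];
    return NULL;
}

/* macro in the spirit of JLG(): assigns the result pointer */
#define TLG(PV, Base, N, Key)  ((PV) = lookup((Base), (N), (Key)))

/* tiny self-check: the macro and the direct call agree */
int tlg_selftest(void)
{
    long *pv;
    TLG(pv, table, 5, 7);
    return pv == lookup(table, 5, 7) && pv != NULL && *pv == 7;
}
```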

Toni, over the years I have studied thousands of performance graphs of
Judy.  My conclusions have remained very stable for several years.  The
one that always sticks in my mind is something Bob Gobeille said (a new
Judy Team member in 2000):  "It's all about cache-line fills stupid".
Today, the performance opportunities in Judy are in the "last branch"
design.  The "Bit-mapped Branch" is a two- or three-cache-line-fill
design that is used just above the final Leaf level.  Changing the leaf
design to a 2-cache-line-fill design and making it very large (>16K)
would eliminate the need for a Bit-mapped Branch.  (There are three
branch designs used in Judy, called Linear, Bit-mapped, and
Uncompressed.)  Linear and Uncompressed are 1-cache-line-fill designs;
the only improvement possible would be to make the Uncompressed bigger,
but that is for another year.
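One step through a bit-mapped branch can be sketched as follows (the
struct and field names are invented for illustration, not taken from
the Judy sources): a 256-bit map records which next-byte subexpanses
are populated, and a population count of the bits below the target byte
locates its slot in a packed pointer array.  The two pointer chases --
the branch node, then the slot array -- are where the extra cache-line
fills come from:

```c
/* Sketch of a bit-mapped branch lookup.  A 256-bit map marks which of
 * the 256 possible next-byte subexpanses are populated; counting the
 * set bits below the target gives its position in a packed pointer
 * array.  Names are invented for this sketch. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t bitmap[4];   /* 256 bits, one per subexpanse */
    void   **slots;       /* packed array: one entry per set bit */
} BitmapBranch;

/* portable popcount of a 64-bit word */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Return the populated slot for byte `digit`, or NULL if absent. */
void *branch_lookup(const BitmapBranch *b, uint8_t digit)
{
    uint64_t word = b->bitmap[digit >> 6];
    uint64_t bit  = 1ULL << (digit & 63);
    if (!(word & bit))
        return NULL;
    int slot = 0;
    for (int i = 0; i < (digit >> 6); i++)
        slot += popcount64(b->bitmap[i]);
    slot += popcount64(word & (bit - 1));
    return b->slots[slot];
}

/* tiny self-check: populate digits 3 and 200, probe an empty one */
int branch_selftest(void)
{
    static void *ptrs[2] = { (void *)0x1, (void *)0x2 };
    BitmapBranch b = { {0}, ptrs };
    b.bitmap[3 >> 6]   |= 1ULL << (3 & 63);
    b.bitmap[200 >> 6] |= 1ULL << (200 & 63);
    return branch_lookup(&b, 3)   == ptrs[0]
        && branch_lookup(&b, 200) == ptrs[1]
        && branch_lookup(&b, 77)  == NULL;
}
```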

Lastly, in machines since the PIII, a cache-line fill is a bit tricky to
calculate.  Apparently the chip designers have done a "look-ahead" read
of the adjacent cache-line when a fill (a memory read) is requested.
This means that the size of the cache-line appears bigger than it really
is to software.  Well, this is just a guess based on measurements I
have done.  I first noticed this on a PIII back in 2001.  Its cache
line appeared much larger than the advertised 32 bytes.

In a Core 2 Duo processor, the cache line "appears" to be about 256-512
bytes in size.  That is the size at which a binary search starts to be
faster than a linear search of a sorted array of (64) Word_t-sized
numbers.  At about that size, the "foot race" (the number of
instructions executed) in the linear search starts to dominate, and
its performance falls behind.
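The two contestants in that foot race, over a sorted array of
Word_t-sized numbers, can be sketched as below.  The crossover point
(around 32-64 words on the machines discussed above) is an empirical
matter; nothing in the code itself reveals it:

```c
/* Linear vs. binary search of a sorted array of Word_t numbers.
 * The linear scan executes more instructions but touches memory
 * sequentially; the binary search does O(log n) comparisons but
 * jumps around.  Which wins depends on the effective line size. */
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Word_t;

/* return index of key, or -1 if absent; stops early past key */
ptrdiff_t linear_search(const Word_t *a, size_t n, Word_t key)
{
    for (size_t i = 0; i < n && a[i] <= key; i++)
        if (a[i] == key)
            return (ptrdiff_t)i;
    return -1;
}

ptrdiff_t binary_search(const Word_t *a, size_t n, Word_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)       lo = mid + 1;
        else if (a[mid] > key)  hi = mid;
        else                    return (ptrdiff_t)mid;
    }
    return -1;
}
```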

If you are interested in improving performance of Judy, please consult
with me about your ideas -- in advance.  I can save you "barking up the
wrong tree" time.  I have done it many times.  My latest fiasco was
something I called LSCA.  It was a method to do a faster search of a
sorted data structure of Indexes.  The insert time became a problem
as the Leaf became large.  I thought I could solve that problem with
some sort of a circular buffer with a moving start.

Thanks for your interest

Doug Baskins
