>> Bear in mind that JudyNL is "just" an array-of-array... of JudyL, where
>> the top level keys are key lengths, and each successive subarray carries
>> four successive bytes of the keys.

> It had better be 8 bytes on a 32 bit machine!

Right, sorry...  I was just being casual.  And, did you mean 8 bytes on
a 64-bit machine?  :-)

>> JudyNL sorts indexes by length first rather than lexicographically
>> (left to right).

> Yep, that sucks.

In general the problem of how to EFFICIENTLY sort variable-length keys
lexicographically, where the length (end) of each key is not marked
within the key itself, is a very hard one.  Somehow you must associate a
length with each key, while not having it affect the sort order except
in the sense that a pure substring comes before any of its superstrings.

If you know how libJudy and other radix trees work, they don't replicate
common leading portions (bits) of similar keys, which saves a lot of
storage space, and often some time too.  (Compare with hashing, where
entire keys are stored in every node for synonym disambiguation.)  The
paper I wrote on this subject long ago talked about tricky ways to
modify the JudyL nodes to know when any given key ended within the node,
without looking for a bit/byte pattern within the key itself.

Ya know, I just thought of a new (maybe) idea.  What if, similar to COBS
or base-64 but without actually modifying/encoding the key, some pattern
within the key (appended to the end of every key) conditionally marked
the end of the key?  Choose some byte or multi-byte value likely to be
rare in the real world -- although in general, a priori, there's no such
value, you might as well just use a single null byte.  Anyway upon
seeing this pattern you need some way (native within the tree, can't
build it on top of vanilla JudyL for example) to check if it's truly a
terminator or just an accident.

Hmm...  But since EVERY key has SOME termination (length), it's like you
end up needing a whole other "length dictionary" tree to somehow
reference when checking for a possible termination.  Already
inefficient, and I can't begin to guess what the lookup itself might
consist of.

As long as I'm rambling, some of this is coming back to me now.  Recall
that a key NOT terminating in a given JudyL subarray level has
associated a value that's a pointer to a sub-subarray for all keys
sharing the same prefix through this point.  However for each 4-byte or
8-byte word (portion of keys) that "continues through" this JudyL level
(superstrings), there can be myriad other keys (substrings) that
terminate at this level (length) with, say, 1-3 or 1-7 null padding
bytes appended.  Each of those is unique at this level, but how do you
distinguish, say in a 4-byte word, the value 0x11121300 (key ends after
0x13 byte) from the same byte pattern that's the leading part (all four
bytes) of a longer key?  You only get one value = subarray pointer from
JudyL.

Now remember how since libJudy objects are always at least 4 bytes in
size (or was it actually 8, even on a 32-bit word machine), the 2 or
more least significant bits of any pointer can hold secret information
about where the pointer points.  (These bits are masked off before
dereferencing the pointer.)  I think my notion was that, first if there
was any unused bit pattern (not sure there was), you could tell "key(s)
end here" from "superstrings only" in the value word.  In the first
case, the pointer would go to a new object, not a JudyL array, that
would describe the key(s) terminating here, plus point to a subsidiary
JudyL array with the superstrings.

You know, this could be constructed using unmodified JudyL arrays, but
adding some new data structure too that holds "leaf" information for
leaves when they terminate.  Also similar to JudySL, it would help a lot
to shortcut long tail ends of unique keys/substrings (remainders).

Got all that?  :-)

Anyway in principle this is not an important problem, because different
users want to sort different kinds of keys in many different ways,
including handling multibyte characters and embedded escape codes.  In
practice, though, we all know how useful plain old lexicographical
sorting often is.

>> ...perhaps any given user or application might want only one or a set
>> of single-size index arrays (fixed-depth trees), like Judy2L.

> Yes, but the user can more or less easily implement that.  At least,
> given JudyL I can implement it easily :)

Right, that's what Judy2L did, it was just a "two word deep tree."  I
should have those sources too if anyone cares.

Cheers,
Alan

------------------------------------------------------------------------------
Flow-based real-time traffic analytics software. Cisco certified tool.
Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer
Customize your own dashboards, set traffic alerts and generate reports.
Network behavioral analysis & security monitoring. All-in-one tool.
http://pubads.g.doubleclick.net/gampad/clk?id=126839071&iu=/4140/ostg.clktrk
_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel

Reply via email to