>> Bear in mind that JudyNL is "just" an array-of-array... of JudyL, where >> the top level keys are key lengths, and each successive subarray carries >> four successive bytes of the keys.
> It had better be 8 bytes on a 32 bit machine! Right, sorry... I was just being casual. And, did you mean 8 bytes on a 64-bit machine? :-) >> JudyNL sorts indexes by length first rather than lexicographically >> (left to right). > Yep, that sucks. In general the problem of how to EFFICIENTLY sort variable-length keys lexicographically, where the length (end) of each key is not marked within the key itself, is a very hard one. Somehow you must associate a length with each key, while not having it affect the sort order except in the sense that a pure substring comes before any of its superstrings. If you know how libJudy and other radix trees work, they don't replicate common leading portions (bits) of similar keys, which saves a lot of storage space, and often some time too. (Compare with hashing, where entire keys are stored in every node for synonym disambiguation.) The paper I wrote on this subject long ago talked about tricky ways to modify the JudyL nodes to know when any given key ended within the node, without looking for a bit/byte pattern within the key itself. Ya know, I just thought of a new (maybe) idea. What if, similar to COBS or base-64 but without actually modifying/encoding the key, some pattern within the key (appended to the end of every key) conditionally marked the end of the key? Choose some byte or multi-byte value likely to be rare in the real world -- although in general, a priori, there's no such value, you might as well just use a single null byte. Anyway upon seeing this pattern you need some way (native within the tree, can't build it on top of vanilla JudyL for example) to check if it's truly a terminator or just an accident. Hmm... But since EVERY key has SOME termination (length), it's like you end up needing a whole other "length dictionary" tree to somehow reference when checking for a possible termination. Already inefficient, and I can't begin to guess what the lookup itself might consist of. As long as I'm rambling, some of this is coming back to me now. Recall that a key NOT terminating in a given JudyL subarray level has associated a value that's a pointer to a sub-subarray for all keys sharing the same prefix through this point. However for each 4-byte or 8-byte word (portion of keys) that "continues through" this JudyL level (superstrings), there can be myriad other keys (substrings) that terminate at this level (length) with, say, 1-3 or 1-7 null padding bytes appended. Each of those is unique at this level, but how do you distinguish, say in a 4-byte word, the value 0x11121300 (key ends after 0x13 byte) from the same byte pattern that's the leading part (all four bytes) of a longer key? You only get one value = subarray pointer from JudyL. Now remember how since libJudy objects are always at least 4 bytes in size (or was it actually 8, even on a 32-bit word machine), the 2 or more least significant bits of any pointer can hold secret information about where the pointer points. (These bits are masked off before dereferencing the pointer.) I think my notion was that, first if there was any unused bit pattern (not sure there was), you could tell "key(s) end here" from "superstrings only" in the value word. In the first case, the pointer would go to a new object, not a JudyL array, that would describe the key(s) terminating here, plus point to a subsidiary JudyL array with the superstrings. You know, this could be constructed using unmodified JudyL arrays, but adding some new data structure too that holds "leaf" information for leaves when they terminate. Also similar to JudySL, it would help a lot to shortcut long tail ends of unique keys/substrings (remainders). Got all that? :-) Anyway in principle this is not an important problem, because different users want to sort different kinds of keys in many different ways, including handling multibyte characters and embedded escape codes. In practice, though, we all know how useful plain old lexicographical sorting often is. >> ...perhaps any given user or application might want only one or a set >> of single-size index arrays (fixed-depth trees), like Judy2L. > Yes, but the user can more or less easily implement that. At least, > given JudyL I can implement it easily :) Right, that's what Judy2L did, it was just a "two word deep tree." I should have those sources too if anyone cares. Cheers, Alan ------------------------------------------------------------------------------ Flow-based real-time traffic analytics software. Cisco certified tool. Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer Customize your own dashboards, set traffic alerts and generate reports. Network behavioral analysis & security monitoring. All-in-one tool. http://pubads.g.doubleclick.net/gampad/clk?id=126839071&iu=/4140/ostg.clktrk _______________________________________________ Judy-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/judy-devel
