> Yes, lexicographic sorting is needed.

OK, it sounds like in your application it's valuable for some reason
you didn't state.
> This is necessary for multi-field keys in databases (the null
> character can occur within the key).

OK, but don't confuse length-associated (so the variable-length key can
contain any byte value) with lexicographic sorting. The latter is only
important if for some reason you want to use First/Next (or the
converse) to visit the keys in lexicographic order, rather than in some
other order.

> Sorting by length, as in the JudyHS function, cannot be used.

OK.

> The big advantage of the Judy arrays is the lexicographic ordering.

For some people it's a complete don't-care; for others libJudy isn't
good enough because it's not "real" sorting (multibyte, varying keys,
etc.; see sort(1)). But for some applications a simple sorted order
(other than by length) does have some value.

> Otherwise you can simply use hashing, which is faster.

Not always! Hashing is hard to beat given a simple/fast hash algorithm
with good spectral properties (short synonym chains) for the particular
sparse data being saved. The problem with hashing is that it's
"fragile" in real-world applications where the programmer doesn't
understand it very well, doesn't tune it to match the data, or (most
often) doesn't keep it maintained as the dataset grows. I've personally
sped up old (buried and forgotten) hashing code by 10x just by retuning
a few parameters.

>> Storing variable-length, length-associated keys sorted
>> lexicographically is actually a fairly hard problem.

> I don't know the internals of the Judy arrays, but I think you can
> store the common prefixes, as is done now, and store the suffixes as
> a variable-length array (the length as a varbyte plus the byte
> array).

Well, I wish I could point you to the old paper where I put a lot of
thought into this. Consider that any possible substring/superstring, of
any length, is possible. While "decoding" the next 4/8 bytes of the key
string (including any binary byte values, in the general case), you
must be able to efficiently distinguish (in the 4-byte case) all of
these possibilities:

- the string has only 1 more byte (256 possible values)
- the string has 2 more bytes (256^2 possible values)
- the string has 3 more bytes (256^3 possible values)
- the string has 4 or more bytes => each of these leads to 1 or more
  substring values

You can't do this with the present JudySL code... You must dig into the
internals and add some new kind of tree node that tells you how many,
and which, strings terminate at this level, along with all the
superstrings that go deeper. Say you store:

    abcde
    abcdef
    abcdefxyz

After decoding "abcd" through the first JudyL array, you can't just
point to another JudyL array for the rest of the string(s). Something
has to sort by remaining length (1, 2, 3, 4+) first and lead you off to
the child trees for each length. And you have to do this while
maintaining lexicographic order too: you can't have "abcdf" (a shorter
key) appear before "abcdef" (a longer key) while walking the list.

Simply listing N (large) substrings in lexicographic order is very
inefficient (in both space and CPU time) if there are more of them than
fit in 1-2 cache lines! And common substrings are problematic too.

I did dream up the most efficient (space- and time-wise) data structure
I could think of to solve this problem, but I've forgotten it, and it's
buried in the "variable size keys" paper I wrote over 10 years ago.

Cheers,
Alan
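
A minimal sketch of the First/Next walk discussed above, using the
stock JudySL macros from <Judy.h> (JSLI, JSLF, JSLN, JSLFA) and the
abcde/abcdef/abcdefxyz keys from the example. Note that JudySL keys are
null-terminated strings, which is exactly why it cannot hold the
embedded-null multi-field keys that started this thread:

    /* walk.c -- insert three keys in scrambled order, then walk them
     * with First/Next; they come back in lexicographic order:
     * abcde, abcdef, abcdefxyz.  Build with: cc walk.c -lJudy */
    #include <stdio.h>
    #include <string.h>
    #include <Judy.h>

    #define MAXKEY 32

    int main(void)
    {
        Pvoid_t PJArray = (Pvoid_t) NULL;     /* empty JudySL array */
        PWord_t PValue;                       /* pointer to value slot */
        Word_t  Bytes;                        /* for JSLFA */
        uint8_t Index[MAXKEY];
        const char *keys[] = { "abcdefxyz", "abcde", "abcdef" };
        int i;

        for (i = 0; i < 3; i++) {
            strcpy((char *) Index, keys[i]);
            JSLI(PValue, PJArray, Index);     /* insert (or find) key */
            *PValue = i;                      /* any word-sized value */
        }

        Index[0] = '\0';                      /* start below all keys */
        JSLF(PValue, PJArray, Index);         /* first key >= "" */
        while (PValue != NULL) {
            printf("%s\n", (char *) Index);   /* lexicographic order */
            JSLN(PValue, PJArray, Index);     /* next key in order */
        }

        JSLFA(Bytes, PJArray);                /* free the whole array */
        return 0;
    }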
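
And the length-associated counterpart for contrast: JudyHS takes an
explicit (pointer, length) pair, so keys may contain any byte value,
including '\0', but it is hash-based, so there is no First/Next and no
ordered walk of any kind:

    /* hs.c -- JudyHS stores arbitrary byte strings with an explicit
     * length, but offers only exact-match lookup, no ordered walk.
     * Build with: cc hs.c -lJudy */
    #include <stdio.h>
    #include <Judy.h>

    int main(void)
    {
        Pvoid_t PJHS = (Pvoid_t) NULL;        /* empty JudyHS array */
        PWord_t PValue;
        Word_t  Bytes;
        uint8_t key[] = { 'a', '\0', 'b' };   /* embedded null is OK */

        JHSI(PValue, PJHS, key, sizeof(key)); /* insert (ptr, length) */
        *PValue = 42;

        JHSG(PValue, PJHS, key, sizeof(key)); /* exact-match lookup */
        if (PValue != NULL)
            printf("found, value = %lu\n", (unsigned long) *PValue);

        JHSFA(Bytes, PJHS);                   /* free the array */
        return 0;
    }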
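
Purely as an illustration of the "terminate here vs. go deeper" node
described above -- nothing like this exists in libJudy, and the struct
and field names are hypothetical:

    /* Hypothetical node shape, NOT part of libJudy: after decoding
     * "abcd" through a JudyL, a node like this separates keys by how
     * many bytes remain.  Grouping by length alone breaks
     * lexicographic order (it would put "abcdf" before "abcdef"), so
     * a walk would have to merge across all four groups byte by byte,
     * which is exactly the hard part described above. */
    #include <Judy.h>

    struct suffix_node {
        Pvoid_t end_in_1;  /* keys with exactly 1 byte left  (256 values)   */
        Pvoid_t end_in_2;  /* keys with exactly 2 bytes left (256^2 values) */
        Pvoid_t end_in_3;  /* keys with exactly 3 bytes left (256^3 values) */
        Pvoid_t go_deeper; /* JudyL on the next 4 bytes -> child nodes      */
    };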
