John M et al,

> You don't need anything as complicated as utf8 for this. You can use
> COBS (constant overhead byte stuffing) to remove NULLs...
>
> http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing
OK, I got around to looking at this webpage. After some study I see what they are doing, although I don't think they explained (or at least summarized) it as simply as they could have. I'd say that blocks of bytes terminated by a single null (or any one special byte value of your choice) are converted to blocks with an initial length byte (1-255) whose count includes the length byte itself, plus a few special cases. The special cases are a run of 254 or more non-null bytes, which gets length byte 0xFF and no implied null, and a short final block without a trailing null. But curiously, sequences of successive nulls are not special, because each extra 0x00 is simply converted to 0x01: a block of length 1 (the length byte alone), where the trailing null is assumed upon decoding. Is that about right?

OK, I don't think this works all that well for JudyNL, which is similar to JudySL but with nulls (any bits) allowed within key values (sort keys by length first using JudyL). The reason is that if you must switch from null-terminated (the single null byte at the end of a C string) to length-associated (meaning sort by length first using JudyL), COBS doesn't buy you much.

Now the previous paragraph assumes it's just a C string with a single null byte at the end. If it's actually a random byte string with any number of null bytes in it (the real purpose of JudyNL), then how do you terminate that string without an associated length? Nothing within the key itself can mark the end, UNLESS you encode it first using COBS or base-64 or whatever; then you can stick another null (now unambiguous) at the end.

I suppose one difference is that a key encoded by any means to hide nulls could be stored as a JudySL "string" without having to sort keys by length first (using JudyL). However, I think "natural" (lexicographical) sorting of the keys still gets lost, because the inserted length bytes land "at random" relative to the original key bytes; in which case why not just use JudyNL rather than encoding/decoding?
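To make the above concrete, here is a minimal Python sketch of COBS as I understand it from the Wikipedia page, plus a tiny check of the sort-order concern. The function names (`cobs_encode`, `cobs_decode`, `sort_orders`) are made up for illustration and are not part of Judy or any library. The encoder is a simplified variant: when the input ends exactly on a full 254-byte block it emits one extra 0x01 that some reference encoders omit, but it still round-trips.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so that the result contains no 0x00 bytes."""
    out = bytearray()
    block = bytearray()  # non-null bytes collected for the current block
    for b in data:
        if b == 0:
            # A null closes the block: length byte counts itself,
            # and the null is implied on decoding.
            out.append(len(block) + 1)
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:
                # Full block: length byte 0xFF, no implied null.
                out.append(0xFF)
                out += block
                block.clear()
    # Short final block, with no trailing null implied.
    out.append(len(block) + 1)
    out += block
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Invert cobs_encode."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0:
            raise ValueError("encoded data must not contain 0x00")
        block = enc[i + 1 : i + code]
        if len(block) != code - 1:
            raise ValueError("truncated block")
        out += block
        i += code
        # Every block except a full (0xFF) one or the last one
        # stands for its bytes followed by a null.
        if code != 0xFF and i < len(enc):
            out.append(0)
    return bytes(out)


def sort_orders(keys):
    """Return (keys sorted raw, keys sorted by their COBS encoding)."""
    raw = sorted(keys)
    via_cobs = [cobs_decode(e) for e in sorted(cobs_encode(k) for k in keys)]
    return raw, via_cobs


if __name__ == "__main__":
    raw, via_cobs = sort_orders([b"\x05", b"\x02\x02"])
    print(raw)       # plain lexicographic order of the original keys
    print(via_cobs)  # order obtained by sorting the encoded keys
```

With the two keys in the usage lines, the leading length byte (0x02 versus 0x03) decides the encoded comparison before any original key byte is looked at, which is the sense in which lexicographic order is not preserved.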
Am I missing something?

Thanks,
Alan Silverstein

_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel
