John M et al,

> You don't need anything as complicated as utf8 for this.  You can use
> COBS (Consistent Overhead Byte Stuffing) to remove NULLs...
>
> http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing

OK I got around to looking at this webpage.  After some study I see what
they are doing, although I don't think they explained (or at least
summarized) it as simply as they could have.  I'd say that blocks of
bytes terminated by single nulls (or by any one other special byte value
of your choice) are converted to blocks with an initial length byte
(1-255) whose count includes the length byte itself; plus various
special cases.

Special cases include a run of 254 or more non-null bytes with no
terminating null (the length byte tops out at 255), and short end blocks
without trailing nulls.  But curiously, sequences of successive nulls
are not special, because each extra 0x00 is simply converted to 0x01 = a
block of length 1 including the length byte, where the trailing null
byte is assumed upon decoding.
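
To make the summary above concrete, here is a minimal Python sketch of
the scheme as I understand it from the Wikipedia page.  The function
names cobs_encode and cobs_decode are my own, and this is an
illustration under that reading, not a reference implementation.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so that the output contains no 0x00 bytes."""
    out = bytearray()
    block = bytearray()   # non-null bytes collected for the current block
    need_final = True     # does a closing block still have to be written?
    for byte in data:
        need_final = True
        if byte == 0:
            # A null ends the block: length byte counts itself plus the data.
            out.append(len(block) + 1)
            out += block
            block.clear()
        else:
            block.append(byte)
            if len(block) == 254:
                # Full block: 0xFF means 254 data bytes and no implied null.
                out.append(0xFF)
                out += block
                block.clear()
                need_final = False
    if need_final:
        # Short end block; no null is implied after the last block.
        out.append(len(block) + 1)
        out += block
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Reverse cobs_encode; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == 0:
            raise ValueError("null byte inside COBS data")
        block = encoded[i + 1 : i + code]
        if len(block) != code - 1:
            raise ValueError("truncated COBS block")
        out += block
        i += code
        # Every block except a full (0xFF) one or the last implies a null.
        if code != 0xFF and i < len(encoded):
            out.append(0)
    return bytes(out)


# Successive nulls each become a single 0x01 block.
print(cobs_encode(b"\x11\x00\x00\x00").hex(" "))   # 02 11 01 01 01
```

Note that the run of three nulls needs no special handling: each one
just closes an (empty) block.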

Is that about right?

OK, I don't think this works all that well for JudyNL, which is similar
to JudySL but with nulls (any bits) allowed within key values (sort keys
by length first using JudyL).  The reason is that if you must switch
from null-terminated (the single null byte at the end of a C string)
to length-associated (meaning sort by length first using JudyL), the
encoding doesn't buy you much.
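
For reference, the "sort by length first" ordering can be sketched in
Python as below.  This only models the resulting key order; the name
judynl_order is hypothetical and nothing here is the Judy API.

```python
def judynl_order(keys):
    """Order byte-string keys by length first, then by byte value.

    Models a JudyL indexed by key length whose values hold the keys of
    that length in sorted order.  Nulls inside keys need no escaping.
    """
    return sorted(keys, key=lambda k: (len(k), k))


keys = [b"b", b"a\x00", b"ab", b"\x00", b"a"]
print(judynl_order(keys))   # [b'\x00', b'a', b'b', b'a\x00', b'ab']
print(sorted(keys))         # [b'\x00', b'a', b'a\x00', b'ab', b'b']
```

The second line shows plain lexicographic order for comparison; the two
differ as soon as keys of different lengths share a prefix.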

Now the previous paragraph assumes it's just a C string with a single
null byte at the end.  If it's actually a random byte string with any
number of null bytes in it (the real purpose of JudyNL), then how do you
terminate that string without an associated length?  Nothing within the
key itself can mark the end, UNLESS you encode it first using COBS or
base-64 or whatever, then you can stick another null (now unambiguous)
at the end.
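
A small sketch of that "encode, then terminate" idea, using base-64 from
the Python standard library because it needs no extra code (COBS would
serve the same purpose).  The helper names frame and unframe_all are
made up for this illustration.

```python
import base64


def frame(key: bytes) -> bytes:
    """Encode a key so it holds no nulls, then append a null terminator."""
    return base64.b64encode(key) + b"\x00"


def unframe_all(stream: bytes) -> list:
    """Split a concatenation of framed keys and decode each one."""
    parts = stream.split(b"\x00")
    # The last terminator leaves one empty trailing piece; drop it.
    return [base64.b64decode(p) for p in parts[:-1]]


keys = [b"", b"\x00\x00", b"a\x00b"]
stream = b"".join(frame(k) for k in keys)
print(unframe_all(stream) == keys)   # True
```

Each null in the stream is now unambiguous: it can only be a terminator,
never key content.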

I suppose one difference is that a key encoded by any means to hide
nulls could be stored as a JudySL "string" without having to sort keys
by length first (using JudyL).  However I think "natural"
(lexicographical) sorting of the keys still gets lost as length bytes
are inserted "at random" compared with original keys; in which case why
not just use JudyNL rather than encoding/decoding?  Am I missing
something?
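
Here is a tiny example of the ordering loss I mean, assuming the COBS
layout described earlier.  The helper cobs_short is my own shorthand and
only handles keys whose null-free runs are shorter than 254 bytes.

```python
def cobs_short(key: bytes) -> bytes:
    """COBS-style encoding for short keys (every run under 254 bytes)."""
    pieces = key.split(b"\x00")
    assert all(len(p) < 254 for p in pieces), "long runs not handled here"
    # Each piece gets a length byte that counts itself plus the piece.
    return b"".join(bytes([len(p) + 1]) + p for p in pieces)


a = b"\x01\x02"   # encodes to 03 01 02
b = b"\x02"       # encodes to 02 02
print(a < b)                           # True: original keys sort a, b
print(cobs_short(a) < cobs_short(b))   # False: encoded keys sort b, a
```

The leading length byte dominates the comparison, so the encoded keys no
longer sort in the order of the originals.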

Thanks,
Alan Silverstein

_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel
