If you want a really fast alternate encoding, you could encode any
Unicode code point in at most 3 bytes.  Use the high bit of each byte
as a "continuation" bit and the lower 7 bits as the data.

ASCII gets passed through unchanged.

For code points between U+0080 and U+3FFF, split the value into its
high 7 bits and low 7 bits.  Set the high bit on the first byte and
follow it with a second byte holding the low 7 bits with the high bit
cleared (so it looks like ASCII).

For the rest of Unicode, U+4000 through U+10FFFF, split the value
into three 7-bit values, set the high bit on the first two bytes, and
leave the high bit cleared on the last byte.

So if you have a code point with binary value 00xx xxxx xyyy yyyy,
encode it as 1xxx xxxx followed by 0yyy yyyy.

And a code point with binary value 000x xxxx xxyy yyyy yzzz zzzz is
encoded as 1xxx xxxx 1yyy yyyy 0zzz zzzz.
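
In Java, a minimal sketch of this scheme could look like the following
(the class and method names are just illustrative; EOF handling and
surrogate rejection are left to the caller):

    import java.io.*;

    class Vlq7 {
        // Encode one code point (0..0x10FFFF) as 1-3 bytes: 7 data bits
        // per byte, high bit set on all but the last byte.
        static void writeCodePoint(OutputStream out, int cp) throws IOException {
            if (cp < 0x80) {                  // U+0000..U+007F: plain ASCII byte
                out.write(cp);
            } else if (cp < 0x4000) {         // U+0080..U+3FFF: two bytes
                out.write(0x80 | (cp >> 7));  // high 7 bits, continuation bit set
                out.write(cp & 0x7F);         // low 7 bits, high bit clear
            } else {                          // U+4000..U+10FFFF: three bytes
                out.write(0x80 | (cp >> 14));
                out.write(0x80 | ((cp >> 7) & 0x7F));
                out.write(cp & 0x7F);
            }
        }

        // Inverse: accumulate 7 bits per byte until a byte with the high
        // bit clear terminates the sequence.  (EOF handling omitted.)
        static int readCodePoint(InputStream in) throws IOException {
            int cp = 0, b;
            do {
                b = in.read();
                cp = (cp << 7) | (b & 0x7F);
            } while ((b & 0x80) != 0);
            return cp;
        }
    }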

This is essentially the encoding used for tags in Abstract Syntax
Notation One (ASN.1), which has been around for more than 20 years,
so there should be no IP claims on it.

Mike



Kannan Goundan wrote:
Thanks to everyone for the detailed responses.  I definitely
appreciate the feedback on the broader issue (even though my question
was very narrow).

I should clarify my use case a little.  I'm creating a generic data
serialization format similar to Google Protocol Buffers and Apache
Thrift.  Other than Unicode strings, the format supports many other
data types -- all of which are serialized in a custom format.  Some
data types will contain a lot of string data while others will contain
very little.  As is the case with other tools in this area, standard
compression techniques can be applied to the entire payload as a
separate pass (e.g. gzip).

I can see how there are benefits to using one of the standard
encodings.  However, at this point, my goals are basically fast
serialization/deserialization and small size.  I might eventually see
the error in my ways (and feel like an idiot for ignoring your
advice), but in the interest of not wasting your time any more than I
already have, I should mention that suggestions to stick to a standard
encoding will fall on mostly deaf ears.

For my current use case, I don't need random access into the
serialized data, so I don't see a need to make the space-usage
compromises that UTF-8 and UTF-16 make.  A more compact UTF-8-like
encoding will get you ASCII with one byte, the first 1/4 of the BMP
with two bytes, and everything else with three bytes.  A more compact
UTF-16-like format gets the BMP in 2 bytes (minus some PUA) and
everything else in 3.  Maybe not huge savings, but if you're of the
opinion that sticking to a standard doesn't buy you anything... :-)
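
For what it's worth, here is one possible reading of that UTF-16-like
variant as a sketch.  The 0xE000..0xEFFF lead range is an arbitrary
choice for illustration (that reserved slice is the "minus some PUA"
part); a decoder would treat any 2-byte unit in that range as a lead:

    import java.io.*;

    class Compact16 {
        // BMP code points as two big-endian bytes; supplementary code
        // points as a lead unit from a 4096-value PUA slice (0xE000..
        // 0xEFFF, chosen arbitrarily here) carrying 12 bits, plus one
        // trail byte for the remaining 8 bits of (cp - 0x10000).
        static void writeCodePoint(OutputStream out, int cp) throws IOException {
            if (cp <= 0xFFFF) {               // BMP: 2 bytes (input must not
                out.write(cp >> 8);           // contain the reserved slice)
                out.write(cp & 0xFF);
            } else {                          // U+10000..U+10FFFF: 3 bytes
                int v = cp - 0x10000;         // 20 bits: 0x00000..0xFFFFF
                int lead = 0xE000 | (v >> 8); // 0xE000..0xEFFF
                out.write(lead >> 8);
                out.write(lead & 0xFF);
                out.write(v & 0xFF);          // low 8 bits as the trail byte
            }
        }
    }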

I'll definitely take a closer look at SCSU.  Hopefully the encoding
speed is good enough.  Most of the other serialization tools just
blast out UTF-8, making them very fast on strings that contain mostly
ASCII.  I hope SCSU doesn't get me killed in ASCII-only encoding
benchmarks (http://wiki.github.com/eishay/jvm-serializers/).  I really
do like the idea of making my format less ASCII-biased, though.  And,
like I said before, I don't care much about sticking to a standard
encoding -- if stock SCSU ends up being too slow or complex, I might
still be able to use techniques from SCSU in a custom encoding.
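
On the ASCII-benchmark worry: in SCSU's initial single-byte mode, the
bytes 0x20-0x7F (plus NUL, tab, LF, and CR) stand for themselves, so a
pure-ASCII string is already valid SCSU byte-for-byte.  A hypothetical
encoder could exploit that with a fast path like this (encodeFullScsu
is a placeholder, not a real implementation):

    class ScsuFastPath {
        // Pure-ASCII fast path: the identity mapping is already legal
        // SCSU in the initial single-byte mode, so just copy the chars.
        static byte[] encode(String s) {
            byte[] out = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                boolean literal = (c >= 0x20 && c <= 0x7F)
                        || c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0D;
                if (!literal) {
                    return encodeFullScsu(s); // fall back to a real encoder
                }
                out[i] = (byte) c;
            }
            return out;
        }

        // Placeholder: a real implementation would run the full SCSU
        // algorithm (window selection, mode switches, etc.).
        static byte[] encodeFullScsu(String s) {
            throw new UnsupportedOperationException("full SCSU encoder not shown");
        }
    }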

(Philippe: when I said I needed 20 bits, I meant that I needed 20 bits
for the stuff after the BMP.  I fully intend for my encoding to handle
every Unicode codepoint, minus surrogates.)

Thanks again, everyone.
-- Kannan

On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <asm...@ix.netcom.com> wrote:
On 6/2/2010 12:25 AM, Kannan Goundan wrote:
On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <asm...@ix.netcom.com> wrote:

Why not use SCSU?

You get the small size and the encoder/decoder aren't that
complicated.

Hmm... I had skimmed the SCSU document a few days ago.  At the time it
seemed a bit more complicated than I wanted.  What's nice about UTF-8
and UTF-16-like encodings is that the space usage is predictable.

But maybe I'll take a closer look.  If a simple SCSU encoder can do
better than more "standard" encodings 99% of the time, then maybe it's
worth it...


It will, because it's designed to compress commonly used characters.

Start with the existing sample code and optimize it. Many features of SCSU
are optional; using them gives slightly better compression, but you don't
have to use them, and the result is still legal SCSU. Sometimes leaving out
a feature can make your encoder a tad simpler, although I found that you
can stay pretty fast and still get decent compression.

A./




