I'm trying to come up with a compact encoding for Unicode strings for
data serialization purposes.  The goals are fast read/write and small
size.

The plan:
1. BMP code points (U+0000-U+FFFF, minus surrogates) are encoded as two bytes.
2. Non-BMP code points are encoded as three bytes:
- The first two bytes hold a code point from the UTF-16 surrogate
range (11 bits of data).
- The next byte provides an additional 8 bits of data (the full bit
budget is tallied below).
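
To spell out the bit budget (these counts follow directly from the
ranges above):

    surrogate leads:   0xD800..0xDFFF     = 2,048     = 2^11  (11 bits)
    trailing byte:     0x00..0xFF         = 256       = 2^8   ( 8 bits)
    3-byte capacity:   2^11 * 2^8 = 2^19  = 524,288 sequences
    non-BMP range:     U+10000..U+10FFFF  = 1,048,576 = 2^20  (20 bits)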

Unfortunately, this doesn't quite work: it only gives me 19 bits to
encode non-BMP code points, and I need 20.  To solve this problem,
I'm planning on stealing a bit of code space from the BMP's
private-use area.  If I did, then:
- I'd get the extra bit needed to encode non-BMP code points in 3 bytes.
- The stolen private-use code points would now have to be encoded
using 3 bytes themselves (sketched below).
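
For concreteness, here's a rough Python sketch of the scheme.  Note
that stealing exactly 2,048 code points doesn't quite close the gap,
because the stolen code points themselves need 3-byte encodings
(2,048 surrogate leads + 2,048 stolen leads give exactly 2^20
sequences, but we need 2^20 + 2,048).  The sketch therefore steals
4,096 code points; the particular block (U+E000-U+EFFF, chosen so the
lead range is contiguous with the surrogates) and the big-endian byte
order are placeholder choices, not settled parts of the design:

    import struct

    STOLEN_LO, STOLEN_HI = 0xE000, 0xEFFF  # stolen PUA block (4,096 code points)
    LEAD_BASE = 0xD800                     # leads: surrogates + stolen block
    NON_BMP_COUNT = 0x100000               # size of U+10000..U+10FFFF

    def encode(text):
        # Assumes well-formed input (no lone surrogates).
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp >= 0x10000:
                # Non-BMP: indices 0 .. 0xFFFFF.
                index = cp - 0x10000
            elif STOLEN_LO <= cp <= STOLEN_HI:
                # Stolen PUA code points go right after the non-BMP block.
                index = NON_BMP_COUNT + (cp - STOLEN_LO)
            else:
                # Ordinary BMP code point: two bytes, big-endian.
                out += struct.pack('>H', cp)
                continue
            # Three bytes: a 16-bit lead unit plus one trailing byte.
            out += struct.pack('>HB', LEAD_BASE + (index >> 8), index & 0xFF)
        return bytes(out)

    def decode(data):
        chars, i = [], 0
        while i < len(data):
            (unit,) = struct.unpack_from('>H', data, i)
            i += 2
            if LEAD_BASE <= unit <= STOLEN_HI:
                # Lead unit: pull in the trailing byte, rebuild the index.
                index = ((unit - LEAD_BASE) << 8) | data[i]
                i += 1
                if index < NON_BMP_COUNT:
                    cp = 0x10000 + index
                else:
                    cp = STOLEN_LO + (index - NON_BMP_COUNT)
            else:
                cp = unit
            chars.append(chr(cp))
        return ''.join(chars)

    assert decode(encode('abc \U0001F600 \uE123')) == 'abc \U0001F600 \uE123'

The assert at the end round-trips an ASCII run, a non-BMP code point,
and a stolen private-use code point.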

I chose the private-use area because I assumed it would be the least
commonly used, so having these code points take 3 bytes instead of 2
wouldn't be a big deal.  Does this sound reasonable?  Would people
suggest a different section of the BMP to steal from, or a different
encoding altogether?

Thanks for reading.
-- Kannan

P.S. I actually have two encodings.  One is similar to UTF-8 in that
it's ASCII-biased.  The encoding described above is intended for
non-ASCII-biased data.  The programmer selects which encoding to use
based on what he thinks the data looks like.

