I'm trying to come up with a compact encoding of Unicode strings for data serialization. The goals are fast read/write and small size.
The plan:

1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
2. Non-BMP code points are encoded as three bytes:
   - The first two bytes are code points from the BMP's UTF-16 surrogate range (11 bits of data).
   - The next byte provides an additional 8 bits of data.

Unfortunately, this doesn't quite work: it only gives me 19 bits to encode non-BMP code points, but I need 20 bits.

To solve this problem, I'm planning on stealing a bit of code space from the BMP's private-use area. If I did, then:

- I'd get the bits needed to encode non-BMP code points in 3 bytes.
- The stolen private-use code points would themselves have to be encoded using 3 bytes.

I chose the private-use area because I assumed it would be the least commonly used, so having these code points require 3 bytes instead of 2 isn't a big deal. (A rough sketch of the encoder/decoder is at the end of this message.)

Does this sound reasonable? Would you suggest a different section of the BMP to steal from, or a different encoding altogether?

Thanks for reading.

-- Kannan

P.S. I actually have two encodings. One is similar to UTF-8 in that it's ASCII-biased. The encoding described above is intended for non-ASCII-biased data. The programmer selects which encoding to use based on what the data is expected to look like.
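P.P.S. Here's a rough Python sketch of the non-ASCII-biased encoding, in case it makes the idea concrete. The byte order (big-endian), the exact stolen block, and the payload layout are all placeholder choices, not decisions I've settled on. One detail I had to pin down to make it round-trip: 4096 lead values times 256 trailing bytes is exactly 2^20 payloads, which the non-BMP range uses up completely, so the sketch steals slightly more than one bit's worth (2304 code points, 0xE000-0xE8FF) to leave room for the stolen code points themselves.

    # Sketch of the non-ASCII-biased encoding. Assumes well-formed input
    # (no lone surrogates). All constants are placeholder choices.
    SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF  # 2048 surrogate lead values
    STOLEN_LO, STOLEN_HI = 0xE000, 0xE8FF        # 2304 stolen PUA lead values
    STOLEN_BASE = 0x100000                       # payloads >= this encode stolen points

    def encode(s: str) -> bytes:
        out = bytearray()
        for ch in s:
            cp = ord(ch)
            if cp >= 0x10000:                    # non-BMP: 3 bytes
                payload = cp - 0x10000           # a 20-bit value
            elif STOLEN_LO <= cp <= STOLEN_HI:   # stolen PUA point: 3 bytes
                payload = STOLEN_BASE + (cp - STOLEN_LO)
            else:                                # ordinary BMP point: 2 bytes
                out += cp.to_bytes(2, "big")
                continue
            lead = SURROGATE_LO + (payload >> 8) # 16-bit lead in 0xD800-0xE8FF
            out += lead.to_bytes(2, "big")
            out.append(payload & 0xFF)           # trailing byte: low 8 bits
        return bytes(out)

    def decode(data: bytes) -> str:
        out = []
        i = 0
        while i < len(data):
            unit = int.from_bytes(data[i:i + 2], "big")
            i += 2
            if SURROGATE_LO <= unit <= STOLEN_HI:    # lead of a 3-byte form
                payload = ((unit - SURROGATE_LO) << 8) | data[i]
                i += 1
                if payload >= STOLEN_BASE:           # a stolen PUA point
                    out.append(chr(STOLEN_LO + (payload - STOLEN_BASE)))
                else:                                # a non-BMP point
                    out.append(chr(0x10000 + payload))
            else:                                    # ordinary BMP point
                out.append(chr(unit))
        return "".join(out)

For example, encode("\U00010000") produces D8 00 00 (three bytes) and encode("\uE000") produces E8 00 00, while ordinary BMP text stays at two bytes per character.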