On Tue Nov 6 15:25:44 2007, Tomasz Sterna wrote:
On Tue, 06-11-2007 at 14:56 +0000, Dave Cridland wrote:
> I'm not following something. So encode the octets #x00 #x01 #x02
> #x5D #x3E, and tell me what you get.
Like this:
Binary <-> Encoded
0x00 <-> 0xC4, 0x80
0x01 <-> 0xC4, 0x81
...
Ah, okay - so you're adding 0x100 to these. I thought this would
yield 3-octet characters, hence my confusion.
0x20 <-> 0x20
0x21 <-> 0x21
..
0x7F <-> 0x7F
0x80 <-> 0xC2, 0x80
..
0xFF <-> 0xC3, 0xBF
Right.
> I get three bytes that are not legal in a CDATA section, followed by
> a sequence of bytes which decode (via UTF-8) to "]]>", which in turn
> would end the CDATA section.
Good point.
We either transfer this chunk using &...; escaping, or just transcode
the 0x3E or 0x5D bytes to a 2-byte UTF-8 character. (Maybe '>' to '»' :)
Or add 0x100 again. (I checked this time, 0x5D encodes to 0xC5 0x9D).
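
To make the mapping discussed so far concrete, here is a minimal Python sketch. The names encode_shifted and decode_shifted are made up for illustration, and it assumes that every octet below 0x20 is shifted by 0x100 (the table above elides that range), along with 0x5D and 0x3E. It is a sketch of the idea, not a proposed wire format.

```python
def encode_shifted(data: bytes) -> bytes:
    """Map each octet to one code point, then UTF-8 encode the result."""
    chars = []
    for b in data:
        # Assumption: octets below 0x20, plus ']' (0x5D) and '>' (0x3E),
        # are shifted up by 0x100 so they become two-octet characters.
        if b < 0x20 or b in (0x5D, 0x3E):
            chars.append(chr(b + 0x100))
        else:
            # 0x20-0x7F stay single-octet; 0x80-0xFF become U+0080-U+00FF.
            chars.append(chr(b))
    return "".join(chars).encode("utf-8")


def decode_shifted(encoded: bytes) -> bytes:
    """Reverse encode_shifted by undoing the 0x100 shift where present."""
    out = bytearray()
    for ch in encoded.decode("utf-8"):
        cp = ord(ch)
        if cp >= 0x100:
            cp -= 0x100
        if cp > 0xFF:
            raise ValueError(f"unexpected code point U+{ord(ch):04X}")
        out.append(cp)
    return bytes(out)


sample = bytes([0x00, 0x01, 0x02, 0x5D, 0x3E])
print(encode_shifted(sample).hex(" "))
```

With this mapping the encoded form never contains the octets for "]" or ">", so "]]>" cannot appear inside the CDATA section.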
However, using this technique, truly random data will expand by
roughly 60.5%. Base64 beats this, at only 33%. By my count, there are
only 101 octets that are legal as single-byte UTF-8 and that we can
safely allow in CDATA sections, so that leaves 155 that are
double-byte.
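
The arithmetic behind those figures can be checked with a few lines of Python. The function name expansion is just for illustration; it assumes uniformly random input, so each of the 256 octet values is equally likely, and it takes the 101/155 split above as given.

```python
def expansion(single: int, double: int) -> float:
    """Average growth when `single` octet values cost one output octet
    and `double` octet values cost two, for uniformly random input."""
    total = single + double
    return (single + 2 * double) / total - 1.0


# 101 single-octet and 155 double-octet values, as counted above.
print(f"shifted encoding: {expansion(101, 155):.1%}")

# Base64 emits 4 output octets for every 3 input octets.
print(f"base64:           {4 / 3 - 1:.1%}")
```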
Base64 operates by encoding 6 bits into an alphabet of 64 symbols;
encoding 7 bits needs an alphabet of 2^7, or 128 symbols, and would
give us growth of 14.2% - we don't have 128 symbols to play with,
though. We could choose an additional 17 double-octet symbols, in
which case we'd see growth of 20.5% overall. Slightly better than
base64.
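
For comparing alphabets of different sizes, a small helper makes the relationship explicit. This is only a sketch: growth and its parameters are hypothetical names, and the average octets per symbol has to be supplied by whoever picks the alphabet.

```python
def growth(bits_per_symbol: int, avg_octets_per_symbol: float = 1.0) -> float:
    """Size increase when every `bits_per_symbol` input bits become one
    symbol that costs `avg_octets_per_symbol` octets on the wire."""
    return 8 * avg_octets_per_symbol / bits_per_symbol - 1.0


# Base64: 6 bits per symbol, every symbol is a single octet.
print(f"6-bit, all single-octet: {growth(6):.1%}")

# Ideal 7-bit alphabet: 128 symbols, every symbol is a single octet.
print(f"7-bit, all single-octet: {growth(7):.1%}")
```

Passing an average above 1.0 models an alphabet in which some of the symbols need two octets.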
So we'd encode each 7 bits using an alphabet of #x9 | #xA | #xD |
[#x20-#x3D] | [#x3F-#x5C] | [#x5E-#x111], which would then be UTF-8
encoded, and be roughly 90% of the size of base64.
However, I think you need to factor in the overhead that no
encoder/decoder library exists for this, and each individual
implementation would have to code one (or wait for someone else to
do so).
Dave.
--
Dave Cridland - mailto:[EMAIL PROTECTED] - xmpp:[EMAIL PROTECTED]
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade