On Tue Nov  6 15:25:44 2007, Tomasz Sterna wrote:
On Tue, 06-11-2007 at 14:56 +0000, Dave Cridland wrote:
> I'm not following something. So encode the octets #x00 #x01 #x02
> #x5D #x3E, and tell me what you get.

Like this:

Binary <-> Encoded
0x00 <-> 0xC4, 0x80
0x01 <-> 0xC4, 0x81
...

Ah, okay - so you're adding 0x100 to these. I thought this would yield 3-octet characters, hence my confusion.


0x20 <-> 0x20
0x21 <-> 0x21
..
0x7F <-> 0x7F
0x80 <-> 0xC2, 0x80
..
0xFF <-> 0xC3, 0xBF


Right.
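
As a rough sketch of the mapping in the table above (not code from either poster): each octet becomes one Unicode code point, which is then UTF-8 encoded. The table elides the range between 0x01 and 0x20, so shifting every octet below 0x20 by 0x100 is an assumption here, and the names `encode` and `decode` are hypothetical.

```python
def encode(data: bytes) -> bytes:
    """Map each octet to a code point, then UTF-8 encode the result."""
    # Assumption: all octets below 0x20 are shifted by 0x100 (the table
    # only shows 0x00 and 0x01 explicitly). Everything else maps to the
    # code point with the same value.
    chars = (chr(b + 0x100) if b < 0x20 else chr(b) for b in data)
    return "".join(chars).encode("utf-8")


def decode(encoded: bytes) -> bytes:
    """Reverse the mapping: code points at or above U+0100 lose the 0x100 offset."""
    text = encoded.decode("utf-8")
    return bytes(ord(c) - 0x100 if ord(c) >= 0x100 else ord(c) for c in text)


print(encode(b"\x00").hex())  # c480
print(encode(b"\xff").hex())  # c3bf
```

Octets 0x20 to 0x7F stay one byte on the wire; everything else becomes a two-byte UTF-8 sequence.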



> I get three bytes that are not legal in a CDATA section, followed by
> a sequence of bytes which decode (via UTF-8) to "]]>", which in turn
> would end the CDATA section.

Good point.
We either transfer this chunk using &...; escaping, or just transcode 0x3E
or 0x5D bytes to a 2-byte UTF-8 character. (Maybe '>' to '»' :)


Or add 0x100 again. (I checked this time, 0x5D encodes to 0xC5 0x9D).
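
A minimal sketch of that variant, assuming both 0x5D and 0x3E are shifted by 0x100 along with the control octets (shifting either one alone would already prevent "]]>" from appearing). The `SHIFTED` set and the function names are hypothetical, and the sample octets are only illustrative.

```python
# Assumption: control octets plus ']' (0x5D) and '>' (0x3E) get the 0x100 offset.
SHIFTED = set(range(0x20)) | {0x3E, 0x5D}


def encode(data: bytes) -> bytes:
    """Shift the unsafe octets by 0x100, then UTF-8 encode the code points."""
    chars = (chr(b + 0x100 if b in SHIFTED else b) for b in data)
    return "".join(chars).encode("utf-8")


def decode(encoded: bytes) -> bytes:
    """Undo the offset for any code point at or above U+0100."""
    code_points = map(ord, encoded.decode("utf-8"))
    return bytes(cp - 0x100 if cp >= 0x100 else cp for cp in code_points)


sample = bytes([0x00, 0x01, 0x02, 0x5D, 0x5D, 0x3E])
wire = encode(sample)
print(wire.hex())           # every octet in the sample is now two bytes
print(b"]]>" in wire)       # False: the CDATA terminator cannot appear
print(decode(wire) == sample)
```

Because the shifted code points all lie in U+0100 to U+015D, they cannot collide with the unshifted range U+0020 to U+00FF, so decoding stays unambiguous.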

However, using this technique, truly random data will expand by roughly 60.5%. Base64 beats this, at only 33%. By my count, there are only 101 octets that are legal single-byte UTF-8 octets that we can allow safely in CDATA sections, so that leaves 155 that are double-byte.
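
The 60.5% figure follows from those two counts, assuming every octet value is equally likely. A small sketch of the arithmetic, with the counts 101 and 155 taken from the paragraph above and the function name `expansion` made up for illustration:

```python
def expansion(single_byte: int, double_byte: int) -> float:
    """Average growth when some octet values cost 1 byte and the rest cost 2."""
    total_values = single_byte + double_byte
    average_cost = (single_byte * 1 + double_byte * 2) / total_values
    return average_cost - 1


# 101 single-byte and 155 double-byte octet values, as counted in the message.
print(f"{expansion(101, 155):.1%}")  # 60.5%
```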

Base64 operates by encoding 6 bits into an alphabet of 64 symbols; encoding 7 bits needs an alphabet of 2^7, or 128 symbols, and would give us growth of 14.2% - we don't have 128 symbols to play with, though. We could choose an additional 17 double-octet symbols, in which case we'd see growth of 20.5% overall. Slightly better than base64.
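
For comparison, the ideal growth of a scheme that packs a fixed number of input bits into each output octet can be sketched as below. This covers only the pure single-octet cases (6 and 7 bits per symbol), not the mixed alphabet with double-octet symbols; the helper name `growth` is hypothetical, and the base64 check uses Python's standard `base64` module.

```python
import base64


def growth(bits_per_symbol: int) -> float:
    """Growth when each output octet carries only bits_per_symbol bits of input."""
    return 8 / bits_per_symbol - 1


print(f"base64 (6 bits per octet): {growth(6):.1%}")
print(f"7 bits per octet:          {growth(7):.1%}")

# Cross-check the 6-bit case against a real base64 encoder.
sample = bytes(range(256)) * 3            # 768 octets, a multiple of 3, so no padding
encoded = base64.b64encode(sample)
print(len(sample), "->", len(encoded))    # 768 -> 1024
```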

So we'd encode each 7 bits using an alphabet of #x9 | #xA | #xD | [#x20-#x3D] | [#x3F-#x5C] | [#x5E-#x111], which would then be UTF-8 encoded, and be roughly 90% of the size of base64.

However, I think you need to factor in the overhead that no encoder/decoder library exists for this, and each individual implementation would have to code one (or wait for someone else to do so).

Dave.
--
Dave Cridland - mailto:[EMAIL PROTECTED] - xmpp:[EMAIL PROTECTED]
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
