Title: Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)

Philippe Verdy wrote:

> What Lars wants has a name: it's a
> "transfer-encoding-syntax", to allow
> transporting any code unit sequences into a more restricted
> environment.
> This is not a new thing, but this is not specified by Unicode.

Good. It is a known thing, which also means we can draw on previous experience with transfer-encoding syntaxes: for example, what the security implications are and how they can be dealt with.


> But note that any occurence of U+EE80 to U+EEFF in the
> original NON-UTF-8
> "text" are escaped, despite they are valid Unicode. However,
> choosing U+EE80
> to U+EEFF is not a problem because these PUAs are very unlikely to be
> present in valid source texts, in absence of a prior PUA-agreement.

And it would be no problem at all if new codepoints were assigned for this purpose.


> Remember that this is only a Transform-Encoding-Syntax, not a
> new encoding.
> It does not require ANY new codepoint allocation by Unicode!

But that does not mean there would be no benefit in doing so. Escape characters are always a pain, like your example of """. OK, the next step is to assign a new codepoint for this purpose. SBCSs had little room, the need was not recognised early enough, and even if it had been, people would have used the escape character anyway, simply because they liked the way it displayed. With (fewer than) 255 glyphs to choose from, people were bound to use them all. But Unicode has A LOT of codepoints, so it makes sense to do something like this.

At some point, someone thought of mapping the bytes of invalid sequences to codepoints. They didn't know what to call them (or perhaps called them replacement characters), but the UTC thought such codepoints shouldn't be assigned. But if we call it a "Transfer-Encoding-Syntax" instead of a "conversion", then they should be called "escape characters" instead of "replacement characters". And for the first time in history, you would have an escaping method with more than one escape character. Very efficient. Very compact. Very straightforward. And Unicode is the one encoding that has both enough codepoints to afford it and, at the same time, more need for it than any other encoding.
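To make the idea concrete, here is a minimal sketch of such a scheme, under the assumptions discussed in this thread: each undecodable byte (which in UTF-8 can only be 0x80-0xFF, since every ASCII byte is valid on its own) is mapped to one of the 128 escape codepoints U+EE80-U+EEFF, and any of those codepoints already present in valid input is itself escaped byte-by-byte so the result still roundtrips. The function names and the exact block U+EE80-U+EEFF are illustrative, not a specified codec.

```python
ESC_BASE = 0xEE80  # assumed escape block U+EE80..U+EEFF, 128 codepoints

def encode_bytes(data: bytes) -> str:
    """Turn arbitrary bytes into a Unicode string; invalid bytes become escapes."""
    out = []
    i = 0
    while i < len(data):
        # Try to decode the longest valid UTF-8 sequence starting at i.
        for length in (4, 3, 2, 1):
            try:
                ch = data[i:i + length].decode('utf-8')
            except UnicodeDecodeError:
                continue
            # Accept only a single decoded character that is not itself
            # one of the escape codepoints (those must be re-escaped).
            if len(ch) == 1 and not (0xEE80 <= ord(ch) <= 0xEEFF):
                out.append(ch)
                i += length
                break
        else:
            # Invalid byte, or a raw byte of a literal U+EE80..U+EEFF:
            # map the byte 0x80+n to the escape codepoint U+EE80+n.
            out.append(chr(ESC_BASE + (data[i] - 0x80)))
            i += 1
    return ''.join(out)

def decode_string(text: str) -> bytes:
    """Invert encode_bytes: escapes become raw bytes, the rest is UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(0x80 + (cp - ESC_BASE))
        else:
            out.extend(ch.encode('utf-8'))
    return bytes(out)
```

For example, `b'valid \xff text'` encodes to `'valid \uEEFF text'` and decodes back exactly, and a text that legitimately contains U+EE80 survives the roundtrip because its own UTF-8 bytes get escaped on the way in and reassembled on the way out.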

One can compare this with MBCSs and say the same thing could have been done there but wasn't. But actually there was less need for it: many SBCSs have no unassigned codepoints, and MBCSs were too busy with their own problems to worry about cross-compatibility at this level. But Unicode has learned a lot from the mistakes made there, and can be better in every respect. Shouldn't it be?

Anyway, if a very good Transfer-Encoding-Syntax is devised, the UTC could recognise that everyone would benefit from it. If that means assigning 128 codepoints, then that is the price. And one can hardly say it has nothing to do with Unicode: it uses Unicode for transport, and Unicode itself can benefit from it.


Lars
