Title: RE: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk wrote:
> But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
> awkward way which would happen to exclude those subsequences of
> non-characters which would form a valid UTF-8 fragment.
NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16 was never a goal. Nor was UTF-16 -> NOT-UTF-8 -> UTF-16, or NOT-UTF-16 -> UTF-8 -> NOT-UTF-16.

UTF-16 -> UTF-8 -> UTF-16 is preserved and that keeps the goals of UTF intact.

The goal, BTW, is: NOT-UTF-8 -> UTF-16 -> NOT-UTF-8.
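
To make the directions concrete, here is a minimal sketch in Python. It is only an analogy: Python's built-in "surrogateescape" error handler stands in for the NOT-UTF-8 escape scheme, and it escapes each invalid byte as a lone surrogate in U+DC80..U+DCFF, which is not necessarily the set of code points discussed in this thread. The function names are made up for the illustration.

```python
# Assumption: "surrogateescape" is a stand-in for the escape scheme discussed
# here; it maps each undecodable byte 0xNN to the lone surrogate U+DCNN.

def not_utf8_to_utf16(raw: bytes) -> bytes:
    """Decode possibly invalid UTF-8, escaping bad bytes, and emit UTF-16LE."""
    text = raw.decode("utf-8", errors="surrogateescape")
    # "surrogatepass" lets the escape code points through as single code units.
    return text.encode("utf-16-le", errors="surrogatepass")


def utf16_to_not_utf8(units: bytes) -> bytes:
    """Reverse step: turn escape code points back into the original bytes."""
    text = units.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


# NOT-UTF-8 -> UTF-16 -> NOT-UTF-8: the invalid bytes survive.
broken = b"abc\xff\xfe"
assert utf16_to_not_utf8(not_utf8_to_utf16(broken)) == broken

# Valid UTF-8 is converted as usual, so UTF-8 <-> UTF-16 is unaffected.
valid = "h\u00e9llo".encode("utf-8")
assert not_utf8_to_utf16(valid) == "h\u00e9llo".encode("utf-16-le")

# The reverse direction is not preserved: two escape code points whose bytes
# happen to form valid UTF-8 (C3 A9) come back as the single character U+00E9.
escaped = "\udcc3\udca9".encode("utf-16-le", errors="surrogatepass")
assert not_utf8_to_utf16(utf16_to_not_utf8(escaped)) == "\u00e9".encode("utf-16-le")
```

The last assertion is the case Marcin describes: it only arises when the 16-bit side already contains escape code points that did not come from escaping.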

> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings? Argument for
> allowing them: otherwise they are completely useless at all, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.
My opinion:
It should allow them and process them usefully. Furthermore, what 'usefully' means should not be left for developers to discover. It should be researched, described and, in the end, even standardized. IMHO, the UTC should consider leading this process, even if it does not end with anything standardized in the Unicode Standard.

Validation should be completely separated from processing. IMHO.
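
A rough sketch of what I mean by that separation, in Python. The function names are hypothetical; the only fixed part is the set of 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane). The processing step makes no checks, and the caller decides whether and when to validate.

```python
from typing import List


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> List[int]:
    """Validation: report the indices of noncharacters, change nothing."""
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


def reverse_text(text: str) -> str:
    """Processing (hypothetical example): works on any code points, no checks."""
    return text[::-1]


s = "a\ufdd0b\uffff"
print(find_noncharacters(s))      # indices the caller may want to act on
print(len(reverse_text(s)))       # processing still handles the whole string
```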


Lars
