Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Hans Åberg via Unicode Wed, 17 May 2017 14:10:51 -0700

> On 17 May 2017, at 22:36, Doug Ewell via Unicode <[email protected]> wrote:
> 
> Hans Åberg wrote:
> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored. 
> 
> I have always argued strongly against this idea, and always will.
> 
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.


Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32.
No Unicode extensions are needed, since Unicode has no say in what happens to
what it considers invalid.
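
As a rough illustration of the idea, here is a minimal sketch using Python's
"surrogateescape" error handler (PEP 383). It maps each undecodable byte 0xNN
to the lone surrogate U+DCNN, which is not a valid scalar value and so not
valid UTF-32. The helper names and the sample file name are made up for the
example; this is one existing mechanism of this kind, not necessarily the
exact scheme meant above.

```python
def bytes_to_code_points(raw: bytes) -> list[int]:
    """Decode bytes as UTF-8; each ill-formed byte 0xNN becomes U+DCNN."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return [ord(ch) for ch in text]


def code_points_to_bytes(code_points: list[int]) -> bytes:
    """Inverse mapping: U+DC80..U+DCFF turn back into the original bytes."""
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical file name: valid UTF-8 ("café-") followed by two stray bytes.
raw = b"caf\xc3\xa9-\xff\xfe.txt"
cps = bytes_to_code_points(raw)

# The two invalid bytes show up as lone surrogates; the rest decodes normally.
print([hex(cp) for cp in cps])

# The original octet sequence is restored exactly.
assert code_points_to_bytes(cps) == raw
```

Because a UTF-8 decoder already rejects encoded surrogates, the bytes of such
a sequence are escaped one by one as well, so the round trip from bytes to
code points and back should be lossless for any input.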

> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like. 

The latter is complicated, so I am told that is not what one does, with some
exceptions. Also, one may end up with a file in an unknown encoding, say
imported remotely, and then the OS cannot deal with it.
