> On 17 May 2017, at 22:36, Doug Ewell via Unicode <[email protected]> wrote:
>
> Hans Åberg wrote:
>
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored.
>
> I have always argued strongly against this idea, and always will.
>
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.

Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. No Unicode extensions are needed, as Unicode has no say about what happens with what it considers invalid.

> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like.

The latter is complicated, so I am told that is not what one does, with some exceptions. Also, one may end up with a file in an unknown encoding, say imported remotely, and then the OS cannot deal with it.
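
As a rough illustration of mapping invalid UTF-8 to invalid UTF-32 so that the original octets can be restored, here is a minimal sketch using Python's "surrogateescape" error handler (PEP 383). It is one existing scheme of this kind, not necessarily the one intended above: each undecodable byte 0xNN becomes the lone surrogate U+DCNN, which is not a valid UTF-32 code unit. The helper names and the sample file name are made up for the example.

```python
def bytes_to_code_points(raw):
    """Decode a file name as UTF-8; each undecodable byte 0xNN becomes
    the lone surrogate U+DCNN, which is invalid in UTF-32."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return [ord(ch) for ch in text]


def code_points_to_bytes(code_points):
    """Inverse mapping: lone surrogates U+DC80..U+DCFF turn back into
    the original bytes, everything else is encoded as normal UTF-8."""
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical file name: valid UTF-8 ("café-") followed by two stray bytes.
name = b"caf\xc3\xa9-\xff\xfe.txt"

code_points = bytes_to_code_points(name)
restored = code_points_to_bytes(code_points)

# The stray bytes show up as code points in the surrogate range.
escaped = [hex(cp) for cp in code_points if 0xDC80 <= cp <= 0xDCFF]
print(escaped)
print(restored == name)
```

The valid part of the name decodes as usual, and only the stray bytes land in the surrogate range, so a program that treats the result as strict UTF-32 will reject it while the file system layer can still recover the exact octets.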

