> Date: Wed, 12 Sep 2018 01:41:03 +0200
> Cc: unicode Unicode Discussion <[email protected]>,
>  Richard Wordingham <[email protected]>,
>  Hans Aberg <[email protected]>
> From: Philippe Verdy via Unicode <[email protected]>
>
> The only safe way to represent arbitrary bytes within strings when they
> are not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a
> "UTF-8-like" private extension of UTF-8 (that extension is still not
> UTF-8!).
>
> This is what Java does for representing U+0000 by (0xC0,0x80) in the
> compiled bytecode, or via the C/C++ interface for JNI when converting the
> Java string buffer into a C/C++ string terminated by a NUL byte (which is
> not part of the Java string content itself). That special sequence,
> however, is really exposed in the Java API as a true unsigned 16-bit code
> unit (char) with value 0x0000, and a valid single code point.
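For the record, the quoted Java behavior is easy to observe with nothing
but the standard library: DataOutputStream.writeUTF emits Modified UTF-8,
as do class files and JNI, whereas String.getBytes produces standard
UTF-8.  A minimal sketch (the class name is mine):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;

  public class ModifiedUtf8Demo {
      public static void main(String[] args) throws IOException {
          String s = "A\u0000B";  // a NUL embedded in the string

          // Standard UTF-8: U+0000 encodes as a single 0x00 byte.
          byte[] standard = s.getBytes(StandardCharsets.UTF_8);

          // Modified UTF-8: U+0000 encodes as the overlong pair
          // 0xC0 0x80, so the encoded form never contains a 0x00
          // byte that C code would mistake for a terminator.
          ByteArrayOutputStream buf = new ByteArrayOutputStream();
          new DataOutputStream(buf).writeUTF(s);
          byte[] modified = buf.toByteArray();

          System.out.println(hex(standard));  // 41 00 42
          System.out.println(hex(modified));  // 00 04 41 c0 80 42
                                              // (the first 2 bytes are
                                              // writeUTF's length prefix)
      }

      private static String hex(byte[] bytes) {
          StringBuilder sb = new StringBuilder();
          for (byte b : bytes)
              sb.append(String.format("%02x ", b));
          return sb.toString().trim();
      }
  }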
That's more or less what Emacs does.

> But both schemes (a) or (b) would be useful in editors allowing to edit
> arbitrary binary files as if they were plain text, even if they contain
> null bytes or invalid UTF-8 sequences (it's up to these editors to find a
> way to represent these bytes distinctively, and a way to enter/change
> them reliably).

The experience in Emacs is that no serious text editor can afford to
decide that it doesn't support these use cases.  Even if editing binary
files is out of scope, there will always be text files whose encoding is
unknowable or gets guessed/decided wrong, files with mixed encodings, etc.
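To make that concrete, here is a minimal sketch of one such scheme, again
in Java since the quoted message used Java as its example.  It round-trips
arbitrary bytes through a string by escaping each byte that is not valid
UTF-8 as a lone low surrogate in U+DC80..U+DCFF, the same idea as Python's
"surrogateescape" error handler.  This is an illustration only, not how
Emacs actually represents raw bytes internally; all names are mine:

  import java.io.ByteArrayOutputStream;
  import java.nio.charset.StandardCharsets;

  public class RawByteRoundTrip {
      // Decode: valid UTF-8 decodes normally; each offending byte
      // becomes the char (0xDC00 | byte).  Invalid bytes are always
      // >= 0x80, and the validator below rejects encoded surrogates,
      // so the escapes can never collide with real characters.
      static String decode(byte[] in) {
          StringBuilder out = new StringBuilder();
          for (int i = 0; i < in.length; ) {
              int len = validSequenceLength(in, i);
              if (len > 0) {
                  out.append(new String(in, i, len, StandardCharsets.UTF_8));
                  i += len;
              } else {
                  out.append((char) (0xDC00 | (in[i] & 0xFF)));
                  i++;
              }
          }
          return out.toString();
      }

      // Encode: escaped chars turn back into their original bytes;
      // everything else is written as ordinary UTF-8.  (Assumes the
      // string contains no stray surrogates outside the escape range.)
      static byte[] encode(String s) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          for (int i = 0; i < s.length(); ) {
              char c = s.charAt(i);
              if (c >= 0xDC80 && c <= 0xDCFF) {
                  out.write(c & 0xFF);        // restore the raw byte
                  i++;
              } else {
                  int cp = s.codePointAt(i);
                  byte[] b = new String(Character.toChars(cp))
                          .getBytes(StandardCharsets.UTF_8);
                  out.write(b, 0, b.length);
                  i += Character.charCount(cp);
              }
          }
          return out.toByteArray();
      }

      // Length (1..4) of a valid UTF-8 sequence at position i, or 0 if
      // no valid sequence starts there.
      static int validSequenceLength(byte[] b, int i) {
          int b0 = b[i] & 0xFF, len;
          if (b0 < 0x80) return 1;
          else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
          else if (b0 >= 0xE0 && b0 <= 0xEF) len = 3;
          else if (b0 >= 0xF0 && b0 <= 0xF4) len = 4;
          else return 0;                          // 0x80..0xC1, 0xF5..0xFF
          if (i + len > b.length) return 0;       // truncated sequence
          for (int k = 1; k < len; k++) {
              int bk = b[i + k] & 0xFF;
              if (bk < 0x80 || bk > 0xBF) return 0;
          }
          int b1 = b[i + 1] & 0xFF;
          if (b0 == 0xE0 && b1 < 0xA0) return 0;  // overlong 3-byte form
          if (b0 == 0xED && b1 > 0x9F) return 0;  // UTF-16 surrogate range
          if (b0 == 0xF0 && b1 < 0x90) return 0;  // overlong 4-byte form
          if (b0 == 0xF4 && b1 > 0x8F) return 0;  // above U+10FFFF
          return len;
      }

      public static void main(String[] args) {
          byte[] raw = { 'a', (byte) 0xC0, (byte) 0x80, 'b', (byte) 0xFF };
          String s = decode(raw);  // the 3 invalid bytes get escaped
          System.out.println(java.util.Arrays.equals(raw, encode(s)));  // true
      }
  }

The point is the invariant: decode followed by encode reproduces the
original bytes exactly, while the editor can still treat the buffer as a
sequence of characters and display the escaped bytes however it likes.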

