On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii <e...@gnu.org> wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode <unicode@unicode.org>
> >
> > * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> > 2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs. AFAIU, it is different
> from all the 3 you describe above. In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence. IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values, and the problems these extensions are designed to
solve, are a worthwhile topic to cover, but I think it's a slightly
adjacent topic rather than the topic of the document.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the
Rust standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
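
To make the WTF-8 property above concrete, here is a minimal sketch in
Rust. It is not the standard library's actual implementation, and the
function name encode_wtf8 is a made-up helper for illustration. It
takes 16-bit units that may contain unpaired surrogates and produces
bytes: well-formed UTF-16 comes out as ordinary UTF-8, while each
unpaired surrogate becomes a three-byte sequence that a strict UTF-8
validator rejects.

```rust
use std::char::decode_utf16;

/// Hypothetical helper (not a std API): encode possibly ill-formed
/// UTF-16 as WTF-8 bytes.
fn encode_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len() * 3);
    for item in decode_utf16(units.iter().copied()) {
        match item {
            // BMP scalar values and properly paired surrogates decode to
            // a char and are written as ordinary UTF-8.
            Ok(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // An unpaired surrogate is written with the three-byte UTF-8
            // bit layout, which strict UTF-8 does not allow.
            Err(e) => {
                let s: u16 = e.unpaired_surrogate();
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}

fn main() {
    // A path that is valid UTF-16 yields bytes that are valid UTF-8.
    let valid: Vec<u16> = "C:\\tmp\\\u{1D11E}.txt".encode_utf16().collect();
    let bytes = encode_wtf8(&valid);
    assert!(std::str::from_utf8(&bytes).is_ok());

    // A lone high surrogate is still representable, but the result is
    // not valid UTF-8.
    let ill: [u16; 3] = [0x0061, 0xD800, 0x0062];
    let bytes = encode_wtf8(&ill);
    assert_eq!(bytes, [0x61, 0xED, 0xA0, 0x80, 0x62]);
    assert!(std::str::from_utf8(&bytes).is_err());
}
```

The sketch leans on std::char::decode_utf16 to do the surrogate
pairing, so a valid pair is always emitted as one four-byte sequence
and never as two three-byte ones.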