On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
* The Grapheme Cluster Model seems to have a couple of disadvantages that are not mentioned:

  1) The subunit of string is also a string (a short string conforming to particular constraints). There's a need for *another*, more atomic mechanism for examining the internals of the grapheme cluster string. (There's a code sketch of this after these comments.)

  2) The way an arbitrary string is divided into units when iterating over it changes when the program is executed on a newer version of the language runtime that is aware of newly-assigned code points from a newer version of Unicode.

* The Python 3.3 model section mentions memory usage cliffs as a disadvantage but doesn't mention the associated performance cliffs. It would be good to also mention that when a string manipulation causes the storage to expand or contract, there's a performance impact that's not apparent from the nature of the operation if the programmer's intuition is based on the assumption of dealing with UTF-32.

* The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, by DOM text node storage in Gecko, (I believe, but am not 100% sure) by V8 and, optionally, by HotSpot (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A). That is, text has UTF-16 semantics, but if the high half of every code unit in a string is zero, only the lower half is stored. (There's a code sketch of this storage scheme below.) This has properties analogous to the Python 3.3 model, except non-BMP text doesn't expand to UTF-32 but uses UTF-16 surrogate pairs.

* I think the fact that systems that chose UTF-16 or UTF-32 have implemented models that try to save storage by omitting leading zeros, gaining complexity and performance cliffs as a result, is a strong indication that UTF-8 should be recommended for newly-designed systems that don't suffer from a forceful legacy need to expose UTF-16 or UTF-32 semantics.

* I suggest splitting the "UTF-8 model" into three substantially different models:

  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No UTF-8-related operations are performed when ingesting byte-oriented data. Byte buffers and text buffers are type-wise ambiguous. Only iterating over byte data by code point gives the data the UTF-8 interpretation. Unless the data is cleaned up as a side effect of such iteration, malformed sequences in input survive into output.

  2) UTF-8 without full trust in the ability to retain validity (the model of the UTF-8-using C++ parts of Gecko; I believe this to be the most common UTF-8 model for C and C++, but I don't have evidence to back this up): When data is ingested with text semantics, it is converted to UTF-8. For data that's supposed to already be in UTF-8, this means replacing malformed sequences with the REPLACEMENT CHARACTER, so the data is valid UTF-8 right after input. However, iteration by code point doesn't trust the ability of other code to retain UTF-8 validity perfectly and has "else" branches in order not to blow up if invalid UTF-8 creeps into the system.

  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers have a different type in the type system than byte buffers. To go from a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data has been tagged as valid UTF-8, the validity is trusted completely, so iteration by code point does not have "else" branches for malformed sequences. If data that the type system indicates to be valid UTF-8 wasn't actually valid, it would be nasal demon time. The language has a default "safe" side and an opt-in "unsafe" side. The unsafe side is for performing low-level operations in a way where the responsibility of upholding invariants is moved from the compiler to the programmer. It's impossible to violate the UTF-8 validity invariant using the safe part of the language. (There's a code sketch of this model below as well.)
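To make the grapheme cluster point concrete, here's a rough Rust sketch. It assumes the unicode_segmentation crate linked at the end of this message; the example string is made up and only for illustration.

  // Cargo.toml: unicode-segmentation = "1"
  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      // 'e' + COMBINING ACUTE ACCENT, followed by 'a'
      let s = "e\u{301}a";
      for cluster in s.graphemes(true) {
          // `cluster` is itself a &str, not an atomic unit; inspecting
          // its internals requires dropping down to a second, more
          // atomic iterator (here: iteration by code point).
          let code_points: Vec<char> = cluster.chars().collect();
          println!("{:?} -> {:?}", cluster, code_points);
      }
      // Note: how `s` divides into clusters can change when the
      // segmentation data is updated for a newer Unicode version,
      // which is the second disadvantage mentioned above.
  }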
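And here's a rough sketch of the UTF-16/Latin1 storage idea. This is only an illustration of the concept, not how SpiderMonkey, Gecko, V8 or HotSpot actually implement it; the type and method names are made up.

  // Semantics are always UTF-16; storage drops the high byte when
  // every code unit in the string has a zero high byte.
  enum CompactString {
      Latin1(Vec<u8>),  // every UTF-16 code unit was <= 0xFF
      Utf16(Vec<u16>),  // at least one code unit needs the high byte
  }

  impl CompactString {
      fn from_utf16_units(units: &[u16]) -> CompactString {
          if units.iter().all(|&u| u <= 0xFF) {
              CompactString::Latin1(units.iter().map(|&u| u as u8).collect())
          } else {
              CompactString::Utf16(units.to_vec())
          }
      }

      // The length in UTF-16 code units is the same either way; only
      // the storage differs. Non-BMP characters stay as surrogate
      // pairs rather than expanding to UTF-32.
      fn len_in_code_units(&self) -> usize {
          match self {
              CompactString::Latin1(bytes) => bytes.len(),
              CompactString::Utf16(units) => units.len(),
          }
      }
  }

  fn main() {
      let ascii = CompactString::from_utf16_units(&[0x0068, 0x0069]); // "hi"
      let astral = CompactString::from_utf16_units(&[0xD83D, 0xDE00]); // one surrogate pair
      assert_eq!(ascii.len_in_code_units(), 2);
      assert_eq!(astral.len_in_code_units(), 2);
  }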
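For completeness, here's a minimal sketch of what the type-system-tagged model looks like in Rust itself, using only standard-library calls:

  fn main() {
      let bytes: &[u8] = b"na\xC3\xAFve"; // valid UTF-8 for "naïve"

      // The safe way from &[u8] to &str checks validity instead of
      // letting invalid data acquire the string type. (A lossy
      // conversion that substitutes U+FFFD at the ingestion boundary
      // is also available as String::from_utf8_lossy.)
      let s: &str = std::str::from_utf8(bytes).expect("not valid UTF-8");

      // Iteration by code point needs no "else" branch for malformed
      // sequences, because the &str type guarantees validity.
      for c in s.chars() {
          print!("U+{:04X} ", c as u32);
      }
      println!();

      // Viewing the guaranteed-valid buffer as read-only bytes is free.
      assert_eq!(s.as_bytes(), bytes);

      // The opt-in unsafe side moves the burden of upholding the
      // invariant to the programmer; calling this with invalid bytes
      // would be undefined behavior.
      let s2: &str = unsafe { std::str::from_utf8_unchecked(bytes) };
      assert_eq!(s, s2);
  }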
* After working with different string models, I'd recommend the Rust model for newly-designed programming languages. (Not because I work for Mozilla but because I believe Rust's way of dealing with Unicode is the best I've seen.) Rust's standard library provides Unicode-version-independent iteration over strings: by code unit and by code point. Iteration by extended grapheme cluster is provided by a library that's easy to include due to the nature of Rust package management (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8 buffer as a read-only byte buffer has zero run-time cost and allows for maximally fast guaranteed-valid-UTF-8 output.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/