Mark
On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli <[email protected]> wrote:

> On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (
> [email protected]) wrote:
>
> > Because of performance and storage considerations, you need to consider
> > the possible internal data structures when you are looking at something
> > as low-level as strings. But most of the 'model's in the document are
> > only really distinguished by API, only the "Code Point model" discussions
> > are segmented by internal storage, as with "Code Point Model: UTF-32"
>
> I guess my gripe with the presentation of that document is that it
> perpetuates the problem of confusing "unicode characters" (or integers, or
> scalar values) and their *encoding* (how to represent these integers as
> byte sequences), which is a source of endless confusion among programmers.
>
> This confusion is easily lifted once you explain that there exist certain
> integers, the scalar values, which are your actual characters and then you
> have different ways of encoding your characters; one can then explain that
> a surrogate is not a character per se, it's a hack and there's no point in
> indexing them except if you want trouble.
>
> This may also suggest another taxonomy of classification for the APIs,
> those in which you work directly with the character data (the scalar
> values) and those in which you work with an encoding of the actual
> character data (e.g. a JavaScript string).
>

Thanks for the feedback. It is worth adding a discussion of the issues, perhaps something like:

A code-point-based API takes and returns int32's, although only a small subset of the values are valid code points, namely 0x0..0x10FFFF. (In practice some APIs may support returning -1 to signal an error or termination, such as before or after the end of a string.)

A surrogate code point is one in U+D800..U+DFFF; these reflect a range of special code units used in pairs in UTF-16 for representing code points above U+FFFF.
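
As a rough illustration of the code-point-based API described above, here is a minimal Python sketch over a list of 16-bit code units. The names `code_point_at`, `is_code_point`, `is_surrogate`, and `END` are hypothetical and chosen only for this example; the -1 sentinel follows the convention mentioned in the text.

```python
# Sketch only: all names here are hypothetical, not taken from any real API.
# `units` is a sequence of 16-bit code units that is NOT assumed to be
# well-formed UTF-16.

MAX_CODE_POINT = 0x10FFFF
END = -1  # sentinel for "before the start or after the end of the string"


def is_code_point(n: int) -> bool:
    """True if n lies in the code point range 0x0..0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(cp: int) -> bool:
    """True if cp is a surrogate code point, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def code_point_at(units, i: int) -> int:
    """Return the code point that starts at index i, or END if i is out of range.

    A lead surrogate followed by a trail surrogate is combined into one
    supplementary code point. Any other surrogate (unpaired, or a trail
    surrogate addressed directly) is returned unchanged as a surrogate
    code point.
    """
    if i < 0 or i >= len(units):
        return END
    lead = units[i]
    if 0xD800 <= lead <= 0xDBFF and i + 1 < len(units):
        trail = units[i + 1]
        if 0xDC00 <= trail <= 0xDFFF:
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
    return lead
```

Note that this accessor can leak surrogate code points to the caller, so it is a code-point API and not a scalar-value API.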
A scalar value is a code point that is not a surrogate. A scalar-value API for immutable strings requires that no surrogate code points are ever returned. In practice, the main advantage of that API is that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked surrogate code point is relatively harmless: Unicode properties are devised so that clients can essentially treat such code points as (permanently) unassigned characters.

Warning: an iterator should *never* avoid returning surrogate code points by skipping them; that can cause security problems; see https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences and https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters.

There are two main choices for a scalar-value API:

1. Guarantee that the storage never contains surrogates. This is the simplest model.
2. Substitute U+FFFD for surrogates when the API returns code points. This can be done where #1 is not feasible, such as where the API is a shim on top of a (perhaps large) IO buffer of 16-bit code units that are not guaranteed to be UTF-16. The cost is extra tests on every code point access.

> > In reality, most APIs are not even going to be in terms of code points:
> > they will return int32's.
>
> That reality depends on your programming language. If the latter supports
> type abstraction you can define an abstract type for scalar values (whose
> implementation may simply be an integer). If you always go through the
> constructor to create these "integers" you can maintain the invariant that
> a value of this type is an integer in the ranges [0x0000;0xD7FF] and
> [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you
> feed your "character" data to other processes like UTF-X encoders: it
> guarantees the correctness of their outputs regardless of what the
> programmer does.
>

If the programming language provides for such a primitive datatype, that is possible.
That would mean at a minimum that casting/converting to that datatype from other numerical datatypes would require bounds-checking and throwing an exception for values outside of [0x0000..0xD7FF] and [0xE000..0x10FFFF]. Most commonly used programming languages that I know of don't support that for primitives; the API would have to use a class, which would be so very painful for performance/storage. If you (or others) know of languages that do have such a cheap primitive datatype, that would be worth mentioning!

> Best,
>
> Daniel
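
To make the two ideas in this thread concrete, here is a minimal Python sketch, under stated assumptions, of (a) a scalar-value type whose constructor bounds-checks and raises for anything outside the two ranges, and (b) a shim iterator over 16-bit code units that substitutes U+FFFD for unpaired surrogates instead of skipping them. `ScalarValue` and `scalar_values` are hypothetical names invented for this example. Python has no cheap primitive of this kind, so the class illustrates only the invariant, not the performance or storage characteristics.

```python
# Sketch only: ScalarValue and scalar_values are hypothetical names.

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


class ScalarValue:
    """An integer guaranteed to be in [0x0000, 0xD7FF] or [0xE000, 0x10FFFF]."""

    __slots__ = ("value",)

    def __init__(self, value: int):
        # Bounds check on every construction; reject surrogates and
        # anything outside the code point range.
        if not (0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF):
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
        self.value = value

    def encode_utf8(self) -> bytes:
        # Cannot fail: the invariant rules out surrogates and out-of-range values.
        return chr(self.value).encode("utf-8")


def scalar_values(units):
    """Yield ScalarValue objects from 16-bit code units that may be ill-formed.

    Well-formed surrogate pairs are combined; every unpaired surrogate is
    replaced by U+FFFD. Nothing is ever skipped.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            u = REPLACEMENT  # the extra per-access test mentioned above
        yield ScalarValue(u)
```

For example, the code units 0x41, 0xD800, 0x42 yield U+0041, U+FFFD, U+0042, so the output has the same number of elements as the input and the ill-formed unit stays visible. On the closing question: Rust's `char` is one example of a primitive type defined as a Unicode scalar value, with checked conversion from integers via `char::from_u32`.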

