Mark
On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli <[email protected]> wrote:

> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode
> ([email protected]) wrote:
>
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> >    simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code points.
> >    This can be done where #1 is not feasible, such as where the API is
> >    a shim on top of a (perhaps large) IO buffer of 16-bit code units
> >    that are not guaranteed to be UTF-16. The cost is extra tests on
> >    every code point access.
>
> I'm not sure 2. really makes sense in practice: it would mean you can't
> access scalar values which need surrogates to be encoded.

Let me clear that up; I meant that "the underlying storage never contains something that would need to be represented as a surrogate code point." Of course, UTF-16 does need surrogate code units. What #1 would be excluding in the case of UTF-16 would be unpaired surrogates. That is, suppose the underlying storage is UTF-16 code units that don't satisfy #1.

0061 D83D DC7D 0061 D83D

A code point API would return for those a sequence of 4 values, the last of which would be a surrogate code point.

00000061, 0001F47D, 00000061, 0000D83D

A scalar value API would return for those also 4 values, but since we aren't in #1, it would need to remap.

00000061, 0001F47D, 00000061, 0000FFFD

> Also regarding 1. you can always define an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.
> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.

I agree that iteration is a very common case. But quite often implementations need to have at least opaque indexes (as discussed).

> > If the programming language provides for such a primitive datatype, that is
> > possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF
> > 0xE000..0x10FFFF].
>
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.

If it is a real datatype, with strong guarantees that it *never* contains values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion from a number will require checking. And in my experience, without a strong guarantee the datatype is in practice pretty useless.

> The OCaml Uchar module does this. This is the interface:
>
> https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli
>
> which defines the type t as abstract and here is the implementation:
>
> https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml
>
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
>
> Note that it would perfectly be possible to adopt a similar approach in C
> via a typedef though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).

That's the C motto: "requiring a 'bit more' discipline from programmers"

> Best,
>
> Daniel

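To make the difference between the two APIs concrete, here is a minimal Python sketch run on the five code units from the example above (0061 D83D DC7D 0061 D83D). The function names (is_scalar_value, code_points, scalar_values) are made up for illustration and do not come from any particular library. It assumes the storage is a plain list of 16-bit integers that is not guaranteed to be well-formed UTF-16.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_scalar_value(cp):
    """True for values in [0x0000..0xD7FF] or [0xE000..0x10FFFF]."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def code_points(units):
    """Code point API: unpaired surrogates are returned unchanged."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if 0xD800 <= u <= 0xDBFF and has_trail:
            # Lead + trail surrogate pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            out.append(u)
            i += 1
    return out


def scalar_values(units):
    """Scalar value API (choice #2): remap surrogate code points to U+FFFD."""
    return [cp if is_scalar_value(cp) else REPLACEMENT for cp in code_points(units)]


units = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]
print(", ".join(f"{cp:08X}" for cp in code_points(units)))
print(", ".join(f"{cp:08X}" for cp in scalar_values(units)))
```

The extra is_scalar_value test on every returned value is the per-access cost mentioned for choice #2. Under choice #1 the check would happen once, when the storage is filled.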
