Re: Unicode String Models

Daniel Bünzli via Unicode Wed, 03 Oct 2018 06:04:06 -0700

On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode ([email protected]) 
wrote:


> There are two main choices for a scalar-value API:
>  
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure 2. really makes sense in practice: it would mean you can't access 
scalar values that need surrogates to be encoded. 

Also regarding 1. you can always define an API that has this property 
regardless of the actual storage; it's only that your indexing operations might 
be costly, as they do not map directly to the underlying storage array.
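
As a minimal sketch of that point, assume the storage is an OCaml int array 
of 16-bit code units that is not guaranteed to be well-formed UTF-16. The 
names is_high, is_low, decode_at and nth_scalar are hypothetical, and 
replacing only *lone* surrogates by U+FFFD (while decoding well-formed pairs) 
is one possible reading of the substitution idea, not something the quoted 
text spells out. The API only ever returns scalar values, but indexing is a 
linear scan:

```ocaml
(* Assumption: every element of [units] is in the range 0x0000..0xFFFF. *)
let is_high u = u >= 0xD800 && u <= 0xDBFF
let is_low u = u >= 0xDC00 && u <= 0xDFFF

(* Decode the scalar value that starts at code unit index [i].
   Returns the scalar value and the number of code units consumed.
   A lone surrogate is replaced by U+FFFD (Uchar.rep). *)
let decode_at (units : int array) (i : int) : Uchar.t * int =
  let u = units.(i) in
  if is_high u && i + 1 < Array.length units && is_low units.(i + 1) then
    let lo = units.(i + 1) in
    let cp = 0x10000 + ((u - 0xD800) lsl 10) + (lo - 0xDC00) in
    (Uchar.of_int cp, 2)
  else if is_high u || is_low u then (Uchar.rep, 1)
  else (Uchar.of_int u, 1)

(* Return the [n]-th scalar value (0-based). This is O(n): scalar indexes
   do not map directly to indexes in the underlying code unit array. *)
let nth_scalar (units : int array) (n : int) : Uchar.t =
  let rec go i k =
    if i >= Array.length units then invalid_arg "nth_scalar"
    else
      let (c, len) = decode_at units i in
      if k = 0 then c else go (i + len) (k - 1)
  in
  if n < 0 then invalid_arg "nth_scalar" else go 0 n
```

With such a view, callers never observe a surrogate, whatever the buffer 
contains; the price is the scan on each indexed access.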

That being said, I don't think direct indexing/iterating over Unicode text is 
such an interesting operation, because of the normalization/segmentation 
issues. If your API provides such indexes, I only see them as useful ways to 
define substrings. APIs that identify/iterate boundaries (and thus 
substrings) are more interesting due to the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF]. 

Yes. But note that in practice, if you are in 1. above, you usually perform 
this check only at the point of decoding, where you are already performing a 
lot of other checks. Once done, you no longer need to check anything as long 
as the operations you perform on the values preserve the invariant. Also, 
converting back to an integer if you need one is a no-op: it's the identity 
function. 
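
For illustration, here is a small sketch using the OCaml standard library's 
Uchar module (the helper name scalar_of_int is hypothetical). The range check 
happens once, when an integer enters the type; afterwards operations such as 
Uchar.succ preserve the invariant, and Uchar.to_int just hands the integer 
back:

```ocaml
(* Check once at the boundary: only scalar values become a Uchar.t. *)
let scalar_of_int (n : int) : Uchar.t option =
  if Uchar.is_valid n then Some (Uchar.of_int n) else None

(* Uchar.succ stays within scalar values: it skips the surrogate range. *)
let after_d7ff : int = Uchar.to_int (Uchar.succ (Uchar.of_int 0xD7FF))

(* Converting back to an integer needs no check. *)
let as_int (u : Uchar.t) : int = Uchar.to_int u
```

Here scalar_of_int 0xD800 and scalar_of_int 0x110000 both yield None, and 
after_d7ff is 0xE000.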

The OCaml Uchar module does this. This is the interface: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli

which defines the type t as abstract and here is the implementation: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml

which defines type t = int. This means values of this type are *unboxed* 
OCaml integers (and will be stored as such in, say, an OCaml array). However, 
since the module system enforces type abstraction, the only way of creating 
such values is to use the constants or the constructors (e.g. of_int), which 
all maintain the scalar value invariant (if you disregard the unsafe_* 
functions). 
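
The pattern can be sketched in a few lines. This is a simplified, 
hypothetical module named Scalar, written in the spirit of Uchar and not a 
copy of the stdlib source. The signature hides the fact that t is an int, so 
clients can only obtain values through of_int, which enforces the invariant:

```ocaml
module Scalar : sig
  type t                       (* abstract: clients cannot see that t = int *)
  val is_valid : int -> bool
  val of_int : int -> t        (* raises Invalid_argument if not a scalar value *)
  val to_int : t -> int        (* identity, no check needed *)
end = struct
  type t = int                 (* unboxed representation *)

  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  let of_int i = if is_valid i then i else invalid_arg "Scalar.of_int"
  let to_int u = u
end
```

Outside the module, an expression such as (0xD800 : Scalar.t) is rejected by 
the type checker, so the invariant cannot be bypassed by accident.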

Note that it would be perfectly possible to adopt a similar approach in C via 
a typedef, though given C's rather loose type system a little more discipline 
would be required from the programmer (always go through the constructor 
functions to create values of the type).

Best, 

Daniel
