Re: Unicode String Models

Daniel Bünzli via Unicode Tue, 02 Oct 2018 11:35:20 -0700

On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode ([email protected]) 
wrote:


> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates 
the problem of confusing "unicode characters" (or integers, or scalar values) 
with their *encoding* (how to represent these integers as byte sequences), which 
is a source of endless confusion among programmers. 

This confusion is easily lifted once you explain that there exist certain 
integers, the scalar values, which are your actual characters, and that you 
then have different ways of encoding your characters; one can then explain that 
a surrogate is not a character per se, it's a hack, and there's no point in 
indexing them unless you want trouble.
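
A minimal sketch of that explanation, in Python (the language and the sample 
value U+1F600 are choices made for this illustration, not something taken from 
the document under discussion). One scalar value is encoded three ways; the 
surrogates show up only as an artifact of the UTF-16 encoding, and decoding any 
of the three gives back the same integer.

```python
# One scalar value: a single integer, chosen here purely as an example.
scalar = 0x1F600
text = chr(scalar)

# Three encodings of the same character data.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# Split the UTF-16 bytes into 16-bit code units to expose the surrogate pair.
utf16_units = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]

print(utf8.hex())                     # f09f9880
print([hex(u) for u in utf16_units])  # ['0xd83d', '0xde00']
print(utf32.hex())                    # 0001f600

# Decoding any of the encodings recovers the same scalar value.
decoded = [
    ord(utf8.decode("utf-8")),
    ord(utf16.decode("utf-16-be")),
    ord(utf32.decode("utf-32-be")),
]
print([hex(d) for d in decoded])      # ['0x1f600', '0x1f600', '0x1f600']
```

Neither 0xD83D nor 0xDE00 means anything by itself; only the pair, decoded, 
yields the character.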

This may also suggest another taxonomy for the APIs: those in which you work 
directly with the character data (the scalar values) and those in which you 
work with an encoding of the actual character data (e.g. a JavaScript string).
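
To make the two kinds of API concrete, here is a rough Python sketch (the 
sample string and the helper `is_surrogate` are hypothetical, introduced only 
for this illustration). The first view exposes the scalar values; the second 
exposes UTF-16 code units, which is what indexing and `.length` operate on in a 
JavaScript string.

```python
# Hypothetical sample: "a" followed by U+1F600.
text = "a\U0001F600"

# View 1: the character data itself. Python's str is indexed by code point;
# for this string every code point is a scalar value.
scalar_values = [ord(c) for c in text]

# View 2: an encoding of that data, here UTF-16 code units.
raw = text.encode("utf-16-le")
code_units = [
    int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)
]

def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF

print(len(scalar_values))           # 2
print(len(code_units))              # 3
print(is_surrogate(code_units[1]))  # True
```

Indexing the second view at position 1 lands on half of a character, which is 
the kind of trouble the first view cannot produce.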

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's. 

That reality depends on your programming language. If the latter supports type 
abstraction you can define an abstract type for scalar values (whose 
implementation may simply be an integer). If you always go through the 
constructor to create these "integers" you can maintain the invariant that a 
value of this type is an integer in the ranges [0x0000;0xD7FF] and 
[0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed 
your "character" data to other processes like UTF-X encoders: it guarantees the 
correctness of their outputs regardless of what the programmer does.
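
A sketch of the idea, assuming Python for illustration (the names 
`ScalarValue`, `to_int` and `encode_utf8` are made up here). Python does not 
enforce abstraction the way a language with real abstract types does, so the 
invariant below holds only by convention, as long as values are created through 
the constructor.

```python
class ScalarValue:
    """An integer known to be in [0x0000;0xD7FF] or [0xE000;0x10FFFF]."""

    __slots__ = ("_value",)

    def __init__(self, value: int) -> None:
        # The constructor is the only place where the range is checked.
        if not (0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF):
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
        self._value = value

    def to_int(self) -> int:
        return self._value


def encode_utf8(u: ScalarValue) -> bytes:
    """Encode one scalar value as UTF-8.

    No range or surrogate check is needed: the type's invariant already
    rules out invalid input.
    """
    n = u.to_int()
    if n < 0x80:
        return bytes([n])
    if n < 0x800:
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n < 0x10000:
        return bytes([
            0xE0 | (n >> 12),
            0x80 | ((n >> 6) & 0x3F),
            0x80 | (n & 0x3F),
        ])
    return bytes([
        0xF0 | (n >> 18),
        0x80 | ((n >> 12) & 0x3F),
        0x80 | ((n >> 6) & 0x3F),
        0x80 | (n & 0x3F),
    ])


print(encode_utf8(ScalarValue(0x1F600)).hex())  # f09f9880

try:
    ScalarValue(0xD800)  # a surrogate is rejected at construction time
except ValueError as err:
    print(err)
```

The encoder stays free of validation code because every `ScalarValue` it 
receives has already passed the check once, at creation.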

Best, 

Daniel
