Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii  wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values, and the problems these extensions are designed to solve,
are a worthwhile topic to cover, but it's a slightly adjacent topic
rather than the same topic as the document's.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the Rust
standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.
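The difference can be made concrete: WTF-8 generalizes UTF-8's three-byte pattern to lone surrogates, which strict UTF-8 rejects. A minimal sketch (the encode3 helper is illustrative, not Rust's internal code):

```rust
// Encode a code point in the range U+0800..=U+FFFF -- including lone
// surrogates, which strict UTF-8 forbids -- using UTF-8's three-byte
// pattern. This generalization is what WTF-8 adds on top of UTF-8.
fn encode3(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | ((cp >> 12) as u8),
        0x80 | (((cp >> 6) & 0x3F) as u8),
        0x80 | ((cp & 0x3F) as u8),
    ]
}

fn main() {
    // A lone high surrogate U+D800, legal among a Windows file name's
    // 16-bit units, gets a byte representation in WTF-8...
    let wtf8 = encode3(0xD800);
    assert_eq!(wtf8, [0xED, 0xA0, 0x80]);
    // ...but strict UTF-8 validation rejects exactly those bytes:
    assert!(std::str::from_utf8(&wtf8).is_err());
    // A real scalar value encodes identically in UTF-8 and WTF-8:
    assert_eq!(&encode3(0x20AC)[..], "€".as_bytes());
}
```

So paths that are valid UTF-16 on the Windows side come out as valid UTF-8 bytes, and only ill-formed paths need the generalized sequences.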

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Asmus Freytag via Unicode

  
  
On 9/11/2018 5:02 PM, Andrew Glass via Unicode wrote:

> On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE, is a different matter. We need to work through the solution for Tai Tham. I'm opposed to a generic and broad relaxation of virama constraints in USE as that would have impact on many scripts that currently have no requirement for virama after vowels. I'm not opposed to a new Indic Syllabic Category that has virama-like features and is allowed to follow a vowel. If we establish such a property for Tai Tham, we can consider on a case-by-case basis if any virama characters would be better served by the new property—including Brahmi.

That approach would make sense.

There are other applications besides rendering that have a need
to control where a Virama can appear, and for those there is also
a benefit to having such alternate contexts captured by a
dedicated property.

A./


Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Wed, 12 Sep 2018 00:13:52 +0200
> Cc: unicode@unicode.org
> From: Hans Åberg via Unicode 
> 
> It might be useful to represent non-UTF-8 bytes as Unicode code points. One 
> way might be to use a codepoint to indicate high bit set followed by the byte 
> value with its high bit set to 0, that is, truncated into the ASCII range. 
> For example, U+0080 looks like it is not in use, though I could not verify 
> this.

You must use a codepoint that is not defined by Unicode, and never
will be.  That is what Emacs does: it extends the Unicode codepoint space
beyond 0x10FFFF.


RE: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Andrew Glass via Unicode


On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a 
need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, 
which does go to USE is a different matter. We need to work through the 
solution for Tai Tham. I'm opposed to a generic and broad relaxation of virama 
constraints in USE as that would have impact on many scripts that currently 
have no requirement for virama after vowels. I'm not opposed to a new Indic 
Syllabic Category that has virama-like features and is allowed to follow a 
vowel. If we establish such a property for Tai Tham, we can consider on a 
case-by-case basis if any virama characters would be better served by the new 
property—including Brahmi.

Cheers,

Andrew


-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Tuesday, September 11, 2018 4:27 PM
To: unicode@unicode.org
Subject: Re: Tamil Brahmi Short Mid Vowels

On Wed, 29 Aug 2018 21:42:57 +
Andrew Glass via Unicode  wrote:

> Thank you Richard and Shriramana for bringing up this interesting 
> problem.
> 
> I agree we need to fix this. I don’t want to fix this with a font hack 
> or change to USE cluster rules or properties. I think the right place 
> to fix this is in the encoding. This might be either a new character 
> for Tamil Brahmi Puḷḷi — as Shriramana has proposed
> (L2/12-226 <https://www.unicode.org/L2/L2012/12226-brahmi-two-tamil-char.pdf>)
> — or separate characters for Tamil Brahmi Short E and Tamil 
> Brahmi Short O in independent and dependent forms (4 characters 
> total). I’m inclined to think that a visible virama, Tamil Brahmi 
> Puḷḷi, is the right approach.

While this would work, please remember that refusing to allow a virama after a 
vowel also makes USE inappropriate for Khmer and Tai Tham, which use 
H+consonant rather than consonant+H for subscript final consonants.

Richard. 




Re: Unicode String Models

2018-09-11 Thread Philippe Verdy via Unicode
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes each).

The only safe way to represent arbitrary bytes within strings, when they are
not valid UTF-8, is to use invalid UTF-8 sequences, i.e. by using a
"UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!).
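Both halves of that statement are easy to verify with a short Rust check:

```rust
fn main() {
    // U+00F8..U+00FF are ordinary scalar values with two-byte encodings...
    assert_eq!("ø".as_bytes(), b"\xC3\xB8"); // U+00F8
    assert_eq!("ÿ".as_bytes(), b"\xC3\xBF"); // U+00FF
    // ...while the raw byte values 0xF8..=0xFF can never appear anywhere
    // in well-formed UTF-8:
    for b in 0xF8u8..=0xFF {
        assert!(std::str::from_utf8(&[b]).is_err());
    }
}
```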

This is what Java does for representing U+0000 by (0xC0, 0x80) in the
compiled bytecode, or via the C/C++ interface for JNI when converting the
Java string buffer into a C/C++ string terminated by a NUL byte (not part
of the Java string content itself). That special sequence however is really
exposed in the Java API as a true unsigned 16-bit code unit (char) with
value 0x0000, a valid single code point.
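The key property of the (0xC0, 0x80) sequence is that it is an overlong encoding of U+0000, so standard UTF-8 validators reject it and it can never be confused with a real NUL byte. A quick check:

```rust
fn main() {
    // Java's "modified UTF-8" sequence for U+0000 is overlong, hence
    // invalid as standard UTF-8:
    assert!(std::str::from_utf8(&[0xC0u8, 0x80]).is_err());
    // Standard UTF-8 encodes U+0000 as the single byte 0x00:
    assert_eq!("\u{0}".as_bytes(), b"\x00");
}
```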

The same can be done for re-encoding each invalid byte in non-UTF-8-conforming
texts using a "UTF-8-like" scheme (still compatible with plain UTF-8 for
every valid UTF-8 text). You may either:
  * (a) encode each invalid byte separately (using two bytes for each), or
encode them in groups of 3 bits (represented using bytes 0xF8..0xFF),
which then needs 3 bytes per group; or
  * (b) encode a private starter (e.g. 0xFF), followed by a byte for the
length of the raw-byte sequence that follows, and then the raw-byte
sequence of that length without any re-encoding. This will never be confused
with other valid code points (however, this scheme may no longer be directly
indexable from arbitrary random positions, unlike scheme (a), which may be
marginally longer).
Both schemes (a) and (b) would be useful in editors that allow editing
arbitrary binary files as if they were plain text, even if they contain
null bytes or invalid UTF-8 sequences (it's up to these editors to find a
way to represent these bytes distinctively, and a way to enter/change them
reliably).

There's also a possibility of extension if the backing store uses UTF-16,
as all code units 0x0000..0xFFFF are used, but one scheme is possible by
using unpaired surrogates (notably a low surrogate NOT preceded by a high
surrogate: the low surrogate already has 10 useful bits that can store any
raw byte value in its lowest bits). This scheme allows indexing from random
positions and reliable sequential traversal in both directions (backward or
forward)...

... But the presence of such an extension of UTF-16 means that all the
implementation code handling standard text has to detect unpaired
surrogates, and can no longer assume that a low surrogate necessarily has a
high surrogate encoded just before it: this must be tested, and that previous
position may be before the buffer start, causing a possible buffer overrun
in the backward direction (so the code will also need to know the start
position of the buffer and check it, or know the index, which cannot be
negative), possibly exposing unrelated data and causing security
risks, unless the backing store always adds a leading "guard" code unit set
arbitrarily to 0x0000.
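The unpaired-low-surrogate scheme described here is essentially what Python's PEP 383 "surrogateescape" error handler does (it maps undecodable bytes 0x80..0xFF to lone surrogates U+DC80..U+DCFF). An outline in Rust, with hypothetical helper names:

```rust
// Sketch of the unpaired-low-surrogate scheme over a UTF-16 backing
// store: each raw byte 0x80..=0xFF that failed to decode is stored as
// the lone low surrogate 0xDC00 + byte. Helper names are illustrative.
fn escape_byte(b: u8) -> u16 {
    assert!(b >= 0x80, "only non-ASCII bytes need escaping");
    0xDC00 | u16::from(b) // lands in 0xDC80..=0xDCFF
}

fn unescape(unit: u16, prev_is_high: bool) -> Option<u8> {
    // A low surrogate is an escaped raw byte only when it is UNPAIRED,
    // i.e. not preceded by a high surrogate -- this is exactly the check
    // the surrounding text says all text-handling code must now perform.
    if (0xDC80..=0xDCFF).contains(&unit) && !prev_is_high {
        Some((unit & 0xFF) as u8)
    } else {
        None
    }
}

fn main() {
    let unit = escape_byte(0xC3);
    assert_eq!(unit, 0xDCC3);
    assert_eq!(unescape(unit, false), Some(0xC3)); // stray byte recovered
    assert_eq!(unescape(unit, true), None);        // part of a real pair
}
```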





Le mer. 12 sept. 2018 à 00:48, J Decker via Unicode  a
écrit :

>
>
> On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <
> unicode@unicode.org> wrote:
>
>>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
>> unicode@unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans Åberg via Unicode  wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>>
> it's used for character 0x400: 0xD0 0x80 (or 0x8000: 0xE8 0x80 0x80;
> I'm probably off a bit in the leading byte).
> UTF-8 can represent every value from 0 to 0x10FFFF (which is all defined
> code points); early variants can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> without one bit off... 0xFF would be the lead-in)
>
> 0xF8-0xFF are unused byte values; but those can all be encoded into UTF-8.
>


Re: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Richard Wordingham via Unicode
On Wed, 29 Aug 2018 21:42:57 +
Andrew Glass via Unicode  wrote:

> Thank you Richard and Shriramana for bringing up this interesting
> problem.
> 
> I agree we need to fix this. I don’t want to fix this with a font
> hack or change to USE cluster rules or properties. I think the right
> place to fix this is in the encoding. This might be either a new
> character for Tamil Brahmi Puḷḷi — as Shriramana has proposed
> (L2/12-226)
> — or separate characters for Tamil Brahmi Short E and Tamil Brahmi
> Short O in independent and dependent forms (4 characters total). I’m
> inclined to think that a visible virama, Tamil Brahmi Puḷḷi, is the
> right approach.

While this would work, please remember that refusing to allow a virama
after a vowel also makes USE inappropriate for Khmer and Tai Tham,
which use H+consonant rather than consonant+H for subscript final
consonants.

Richard. 



Re: Unicode String Models

2018-09-11 Thread J Decker via Unicode
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode 
wrote:

>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode  wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a codepoint to indicate high bit set followed by
> the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.
>
>
it's used for character 0x400: 0xD0 0x80 (or 0x8000: 0xE8 0x80 0x80;
I'm probably off a bit in the leading byte).
UTF-8 can represent every value from 0 to 0x10FFFF (which is all defined
code points); early variants can support up to U+7FFFFFFF...
and there's enough bits to carry the pattern forward to support 36 bits or
42 bits... (the last one breaking the standard a bit by allowing a byte
without one bit off... 0xFF would be the lead-in)

0xF8-0xFF are unused byte values; but those can all be encoded into UTF-8.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode 
>  wrote:
> 
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode  wrote:
> 
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>> changing the editor encoding while typing.
> 
> Rather like some of the old Unicode list archives, which are just
> concatenations of a month's emails, with all sorts of 8-bit encodings
> and stretches of base64.

It might be useful to represent non-UTF-8 bytes as Unicode code points. One way 
might be to use a codepoint to indicate high bit set followed by the byte value 
with its high bit set to 0, that is, truncated into the ASCII range. For 
example, U+0080 looks like it is not in use, though I could not verify this.
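A sketch of this proposal (the U+0080 marker is the proposal's own example, not an established convention; the escape helper is hypothetical):

```rust
// Represent a stray non-UTF-8 byte as a marker code point followed by
// the byte truncated into the ASCII range, as proposed above.
fn escape(b: u8) -> [char; 2] {
    assert!(b >= 0x80, "only non-ASCII bytes need escaping");
    ['\u{0080}', (b & 0x7F) as char] // high bit recorded by the marker
}

fn main() {
    // The stray Latin-1 byte 0xC3 becomes U+0080 followed by 0x43 ('C'):
    let [marker, low] = escape(0xC3);
    assert_eq!(marker, '\u{0080}');
    assert_eq!(low, 'C');
}
```

Note that this scheme is only unambiguous if a literal U+0080 can never otherwise occur in the text, which is the open question raised in the replies.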




Re: Unicode String Models

2018-09-11 Thread Richard Wordingham via Unicode
On Tue, 11 Sep 2018 21:10:03 +0200
Hans Åberg via Unicode  wrote:

> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> LaTeX files with sections in different Cyrillic and Latin encodings,
> changing the editor encoding while typing.

Rather like some of the old Unicode list archives, which are just
concatenations of a month's emails, with all sorts of 8-bit encodings
and stretches of base64.

Richard.



Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 20:40, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivo...@hsivonen.fi,
>> unicode@unicode.org
>> 
>> If one encounters a file with mixed encodings, it is good to be able to view 
>> its contents and then convert it, as I see one can do in Emacs.
> 
> Yes.  And mixed encodings is not the only use case: it may well happen
> that the initial attempt to decode the file uses incorrect assumption
> about the encoding, for some reason.
> 
> In addition, it is important that changing some portion of the file,
> then saving the modified text will never change any part that the user
> didn't touch, as will happen if invalid sequences are rejected at
> input time and replaced with something else.

Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files 
with sections in different Cyrillic and Latin encodings, changing the editor 
encoding while typing.





Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg 
> Date: Tue, 11 Sep 2018 20:14:30 +0200
> Cc: hsivo...@hsivonen.fi,
>  unicode@unicode.org
> 
> If one encounters a file with mixed encodings, it is good to be able to view 
> its contents and then convert it, as I see one can do in Emacs.

Yes.  And mixed encodings are not the only use case: it may well happen
that the initial attempt to decode the file uses an incorrect assumption
about the encoding, for some reason.

In addition, it is important that changing some portion of the file,
then saving the modified text will never change any part that the user
didn't touch, as will happen if invalid sequences are rejected at
input time and replaced with something else.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 19:21, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen ,
>> unicode@unicode.org
>> 
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte sequence.  IOW, Emacs's internal representation
>>> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
>>> This allows mixing stray bytes and valid text in the same buffer,
>>> without risking lossy conversions (such as those one gets under model
>>> 2 above).
>> 
>> Can you give a reference detailing this format?
> 
> There's no formal description as English text, if that's what you
> meant.  The comments, macros and functions in the files
> src/character.[ch] in the Emacs source tree tell most of that story,
> albeit indirectly, and some additional info can be found in the
> section "Text Representation" of the Emacs Lisp Reference manual.

OK. If one encounters a file with mixed encodings, it is good to be able to 
view its contents and then convert it, as I see one can do in Emacs.





Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg 
> Date: Tue, 11 Sep 2018 19:13:28 +0200
> Cc: Henri Sivonen ,
>  unicode@unicode.org
> 
> > In Emacs, each raw byte belonging
> > to a byte sequence which is invalid under UTF-8 is represented as a
> > special multibyte sequence.  IOW, Emacs's internal representation
> > extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> > This allows mixing stray bytes and valid text in the same buffer,
> > without risking lossy conversions (such as those one gets under model
> > 2 above).
> 
> Can you give a reference detailing this format?

There's no formal description as English text, if that's what you
meant.  The comments, macros and functions in the files
src/character.[ch] in the Emacs source tree tell most of that story,
albeit indirectly, and some additional info can be found in the
section "Text Representation" of the Emacs Lisp Reference manual.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode  
> wrote:
> 
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

Can you give a reference detailing this format?





Re: Unicode String Models

2018-09-11 Thread Mark Davis ☕️ via Unicode
These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.

Mark


On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode@unicode.org> wrote:

> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
>


Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode 
> 
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
> 
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
> 
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
> 
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.

There's another model, the one used by Emacs.  AFAIU, it is different
from all the 3 you describe above.  In Emacs, each raw byte belonging
to a byte sequence which is invalid under UTF-8 is represented as a
special multibyte sequence.  IOW, Emacs's internal representation
extends UTF-8 with multibyte sequences it uses to represent raw bytes.
This allows mixing stray bytes and valid text in the same buffer,
without risking lossy conversions (such as those one gets under model
2 above).
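The general idea can be outlined as follows (the RAW_BYTE_BASE offset is illustrative only; Emacs's actual internal character values differ):

```rust
// Stray bytes that fail UTF-8 decoding are mapped to "characters" above
// the Unicode maximum 0x10FFFF, so they round-trip without colliding
// with any scalar value. The offset here is an illustrative assumption.
const RAW_BYTE_BASE: u32 = 0x110000; // first value past the Unicode range

fn decode_lossless(input: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            // The remainder is entirely valid UTF-8: take its code points.
            Ok(s) => {
                out.extend(s.chars().map(|c| c as u32));
                break;
            }
            // A malformed sequence starts at valid_up_to(): keep the valid
            // prefix, then store the stray byte above the Unicode range so
            // it survives instead of becoming U+FFFD.
            Err(e) => {
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.extend(
                    std::str::from_utf8(valid).unwrap().chars().map(|c| c as u32),
                );
                out.push(RAW_BYTE_BASE + u32::from(bad[0]));
                rest = &bad[1..];
            }
        }
    }
    out
}

fn main() {
    // A Latin-1 byte (0xE9, "é") embedded in otherwise valid UTF-8:
    let decoded = decode_lossless(&[b'a', 0xE9, b'b']);
    assert_eq!(decoded, [0x61, RAW_BYTE_BASE + 0xE9, 0x62]);
}
```

Re-encoding on save is the mirror image: values below RAW_BYTE_BASE are written as UTF-8, values at or above it are written back as the original raw byte, so untouched regions of a file survive byte-for-byte.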


Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
 wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models 
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

* The Grapheme Cluster Model seems to have a couple of disadvantages
that are not mentioned:
  1) The subunit of a string is also a string (a short string conforming
to particular constraints). There's a need for *another*, more atomic
mechanism for examining the internals of the grapheme cluster string.
  2) The way an arbitrary string is divided into units when iterating
over it changes when the program is executed on a newer version of the
language runtime that is aware of newly-assigned codepoints from a
newer version of Unicode.

 * The Python 3.3 model mentions the disadvantages of memory usage
cliffs but doesn't mention the associated performance cliffs. It would
be good to also mention that when a string manipulation causes the
storage to expand or contract, there's a performance impact that's not
apparent from the nature of the operation if the programmer's
intuition works on the assumption that the programmer is dealing with
UTF-32.

 * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
text node storage in Gecko, (I believe but am not 100% sure) V8 and,
optionally, HotSpot
(https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
That is, text has UTF-16 semantics, but if the high half of every code
unit in a string is zero, only the lower half is stored. This has
properties analogous to the Python 3.3 model, except non-BMP doesn't
expand to UTF-32 but uses UTF-16 surrogate pairs.
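The storage decision can be sketched as follows (the Storage type is hypothetical, not any engine's actual layout):

```rust
// Sketch of the UTF-16/Latin-1 trick: strings whose UTF-16 code units
// all fit in one byte drop the zero high halves. Type names are
// illustrative, not taken from SpiderMonkey, V8, or HotSpot.
enum Storage {
    Latin1(Vec<u8>),
    Utf16(Vec<u16>),
}

fn store(units: &[u16]) -> Storage {
    if units.iter().all(|&u| u < 0x100) {
        // Every high half is zero: keep only the low bytes.
        Storage::Latin1(units.iter().map(|&u| u as u8).collect())
    } else {
        Storage::Utf16(units.to_vec())
    }
}

fn byte_len(s: &Storage) -> usize {
    match s {
        Storage::Latin1(v) => v.len(),
        Storage::Utf16(v) => v.len() * 2,
    }
}

fn main() {
    let narrow: Vec<u16> = "café".encode_utf16().collect(); // é is U+00E9
    let wide: Vec<u16> = "猫".encode_utf16().collect(); // U+732B
    assert_eq!(byte_len(&store(&narrow)), 4); // one byte per unit
    assert_eq!(byte_len(&store(&wide)), 2); // full 16-bit units
}
```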

 * I think the fact that systems that chose UTF-16 or UTF-32 have
implemented models that try to save storage by omitting leading zeros
and gaining complexity and performance cliffs as a result is a strong
indication that UTF-8 should be recommended for newly-designed systems
that don't suffer from a forceful legacy need to expose UTF-16 or
UTF-32 semantics.

 * I suggest splitting the "UTF-8 model" into three substantially
different models:

 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
UTF-8-related operations are performed when ingesting byte-oriented
data. Byte buffers and text buffers are type-wise ambiguous. Only
iterating over byte data by code point gives the data the UTF-8
interpretation. Unless the data is cleaned up as a side effect of such
iteration, malformed sequences in input survive into output.

 2) UTF-8 without full trust in ability to retain validity (the model
of the UTF-8-using C++ parts of Gecko; I believe this to be the most
common UTF-8 model for C and C++, but I don't have evidence to back
this up): When data is ingested with text semantics, it is converted
to UTF-8. For data that's supposed to already be in UTF-8, this means
replacing malformed sequences with the REPLACEMENT CHARACTER, so the
data is valid UTF-8 right after input. However, iteration by code
point doesn't trust ability of other code to retain UTF-8 validity
perfectly and has "else" branches in order not to blow up if invalid
UTF-8 creeps into the system.
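The ingestion step of model 2 corresponds to a lossy conversion (shown here with Rust's from_utf8_lossy for brevity; Gecko's actual C++ converters differ):

```rust
fn main() {
    // Supposed UTF-8 with a malformed byte in the middle: on ingestion,
    // the malformed sequence becomes U+FFFD REPLACEMENT CHARACTER, so
    // the buffer is valid UTF-8 right after input.
    let bytes = [b'a', 0xFF, b'b'];
    let cleaned = String::from_utf8_lossy(&bytes);
    assert_eq!(cleaned, "a\u{FFFD}b");
}
```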

 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
have a different type in the type system than byte buffers. To go from
a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
has been tagged as valid UTF-8, the validity is trusted completely so
that iteration by code point does not have "else" branches for
malformed sequences. If data that the type system indicates to be
valid UTF-8 wasn't actually valid, it would be nasal demon time. The
language has a default "safe" side and an opt-in "unsafe" side. The
unsafe side is for performing low-level operations in a way where the
responsibility of upholding invariants is moved from the compiler to
the programmer. It's impossible to violate the UTF-8 validity
invariant using the safe part of the language.
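Model 3 maps directly onto Rust's standard library, where the check happens exactly once at the type boundary:

```rust
fn main() {
    // The type system separates byte buffers (Vec<u8>) from UTF-8 buffers
    // (String). Crossing the boundary requires a validity check...
    let bytes = vec![0xE2, 0x82, 0xAC]; // the UTF-8 bytes of "€"
    let s: String = String::from_utf8(bytes).expect("checked exactly once here");
    // ...after which iteration by code point needs no "else" branch for
    // malformed sequences:
    assert_eq!(s.chars().next(), Some('€'));
    // Invalid bytes are rejected at the boundary instead of creeping in:
    assert!(String::from_utf8(vec![0xC0, 0x80]).is_err());
}
```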

 * After working with different string models, I'd recommend the Rust
model for newly-designed programming languages. (Not because I work
for Mozilla but because I believe Rust's way of dealing with Unicode
is the best I've seen.) Rust's standard library provides Unicode
version-independent iterations over strings: by code unit and by code
point. Iteration by extended grapheme cluster is provided by a library
that's easy to include due to the nature of Rust package management
(https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
buffer as a read-only byte buffer has zero run-time cost and allows
for maximally fast guaranteed-valid-UTF-8 output.
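For instance:

```rust
fn main() {
    let s = "naïve";
    // Unicode-version-independent iteration by code unit and code point:
    assert_eq!(s.len(), 6); // 6 UTF-8 bytes
    assert_eq!(s.chars().count(), 5); // 5 scalar values
    // Viewing the UTF-8 buffer as read-only bytes is free: no copy,
    // no re-validation.
    let bytes: &[u8] = s.as_bytes();
    assert_eq!(&bytes[..3], b"na\xC3"); // the encoding of ï starts at byte 2
}
```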

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/