Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 09:36:44 -0400
John W Kennedy  wrote:

> Just a reminder that in Apple’s Swift a “Character” is anything that
> looks like a character, including a letter with any theoretically
> unlimited stack of diacritics, a flag, or a skin-toned emoji, and all
> Swift functions working with characters, strings, and substrings
> count characters in this way. There is an underlying store that is,
> for historic reasons, UTF-16, and that can be accessed, but so can
> UTF-8 and UTF-32.

Can the individual Unicode characters be accessed one by one, e.g. for
searching for vowels or other such 'diacritics'?  Or would one only
have access to the code units?

Could one easily search for a subjoined consonant, e.g. COENG RO in
Khmer, where the two constituent characters would be in adjacent
extended grapheme clusters?

Richard.
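
For comparison only, and in Python rather than Swift: where the string
type is indexed by code point instead of by grapheme cluster, a search
for COENG RO (U+17D2 KHMER SIGN COENG followed by U+179A KHMER LETTER
RO) is an ordinary substring search, wherever the cluster boundaries
fall. This is a minimal sketch; the sample text and the name
find_coeng_ro are illustrative, and it says nothing about what Swift
itself offers.

```python
import unicodedata

COENG_RO = "\u17d2\u179a"  # KHMER SIGN COENG + KHMER LETTER RO

def find_coeng_ro(text):
    """Return the code point index of the first COENG RO, or -1."""
    return text.find(COENG_RO)

# Illustrative sample: KHMER LETTER KA + COENG + RO.
sample = "\u1780\u17d2\u179a"
index = find_coeng_ro(sample)
names = [unicodedata.name(ch) for ch in sample]
print(index, names)
```

The search ignores cluster boundaries entirely, which is the behaviour
the question asks about.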




Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Richard Wordingham via Unicode
On Sat, 26 Aug 2017 21:52:19 +0200
Philippe Verdy via Unicode  wrote:

> 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  

> Of course SHY in this use is not suitable, but who knows if one will
> not need this to split in two parts what would be otherwise a single
> cluster (possibly reordered by canonical reordering if one needs to
> split between two Indic matras: this would suggest there's a need for
> a new "empty base consonant" for that Indic script, but SHY (U+00AD)
> should probably not have the correct effect if it also inserts an
> undesired line break opportunity, independently of how the glyph
> which would be rendered and the position (first or second line) where
> it would be rendered if the linebreak is honored).

I am confused as to what conceivable case you have in mind.  An example
would help.  I wonder if I'm misunderstanding what you mean by
'canonical reordering'.  Do you mean the order of codepoints, or the
arrangement of glyphs?  CGJ is available to preserve a specific
ordering of codepoints, though it is completely redundant in most Indic
scripts.

It is a fact that aksharas do get split between lines in manuscripts,
undesirable though it may be.  In a transcription intended to preserve
a division into lines, one would probably use NBSP at such a point,
and worry less about attempting to preserve the structure of the
line-broken akshara.  It seems that Unicode only supports word
boundaries and their absence where they provide or prohibit line
breaks.

Richard.


Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 22:07:57 +0100
> From: Richard Wordingham via Unicode 
> 
> > We are miscommunicating.  My point was that programming for MS-Windows
> > needs a good understanding of what the UTF-16 surrogates are, and in
> > what MS-Windows APIs/library functions they can and cannot be used.
> > Without this understanding, one cannot figure out why the likes of
> > iswspace and iswupper only support the BMP, and what APIs to use to
> > lift this limitation.  Likewise with display-related APIs, used to
> > display Unicode text.
> 
> > If you don't teach UTF-16 including these details, the programmers
> > will feel lost when they meet with these complications.
> 
> So what's new compared to UTF-8?

Who said this is new?  I said this needs to be _taught_, or else
people will be ignorant about these subtleties.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Sat, 26 Aug 2017 21:20:45 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 26 Aug 2017 18:52:03 +0100
> > From: Richard Wordingham via Unicode 

> We are miscommunicating.  My point was that programming for MS-Windows
> needs a good understanding of what the UTF-16 surrogates are, and in
> what MS-Windows APIs/library functions they can and cannot be used.
> Without this understanding, one cannot figure out why the likes of
> iswspace and iswupper only support the BMP, and what APIs to use to
> lift this limitation.  Likewise with display-related APIs, used to
> display Unicode text.

> If you don't teach UTF-16 including these details, the programmers
> will feel lost when they meet with these complications.

So what's new compared to UTF-8?  The problem would be a misconception
that MSVC's wchar_t supported Unicode - or has that been fixed
recently?  The neutral message is to avoid wchar_t where possible.

C++11 and C11's char32_t ought to have fixed the problem.

Functions iswspace() and iswlower() are not stable; one really has to
replace them by the project's UCD routines.  For example, when the
locale is a Unicode locale with the obvious wchar_t representations, the
value of iswlower(0x13A0) recently changed from non-zero to zero, as
U+13A0 changed from gc=Lo to gc=Lu.  I don't think iswupper() is any
stabler.
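
The underlying property change can be seen without any C library, by
asking two versions of the UCD about the same code point. The sketch
below uses Python's unicodedata module, which ships a frozen Unicode
3.2.0 database alongside the current one. It only illustrates the
change of General_Category for U+13A0; it does not reproduce the
behaviour of any particular iswlower() implementation.

```python
import unicodedata

CHEROKEE_A = "\u13a0"  # CHEROKEE LETTER A

# Category under the frozen Unicode 3.2.0 snapshot shipped with Python.
old_category = unicodedata.ucd_3_2_0.category(CHEROKEE_A)
# Category under the UCD version this interpreter was built with.
new_category = unicodedata.category(CHEROKEE_A)

print(unicodedata.ucd_3_2_0.unidata_version, old_category)
print(unicodedata.unidata_version, new_category)
```

On an interpreter built against Unicode 8.0 or later, the two lines
should report Lo and Lu respectively.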

Richard.


Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Philippe Verdy via Unicode
2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

>
> I'm wondering if there are any cases where a SHY _should_ go between a
> Latin letter and diacritic.  I can't think of any.
>

In standard Latin orthography you would not expect it, normally, but there
will be cases where this will still occur at random places between long
spans of letters.

However I did NOT suggest (like you are doing here) using SHY between a
Latin letter and any diacritic.

But maybe you've been confused by the fact I took the example of free
insertion of SHY controls in alphabetic scripts in comparison to the free
insertion of joiner controls (not the same thing) between Indic letters
(including vowel matras or subjoined consonants that are encoded as
combining characters but are not really "diacritics").

Of course SHY in this use is not suitable, but who knows if one will not
need this to split in two parts what would be otherwise a single cluster
(possibly reordered by canonical reordering if one needs to split between
two Indic matras: this would suggest there's a need for a new "empty base
consonant" for that Indic script, but SHY (U+00AD) should probably not
have the correct effect if it also inserts an undesired line break
opportunity, independently of how the glyph which would be rendered and the
position (first or second line) where it would be rendered if the linebreak
is honored).

If one wants an empty base letter to combine with the diacritic after it,
I think it should be NBSP (U+00A0) to avoid the interpretation as a
"defective" cluster using an implied glyph such as the dotted circle (but
NBSP also has its own problems, notably for collation where it would
collate like a space instead of being ignorable at primary level: this can
be fixed however quite easily in collation tailorings, using collation
elements made with "NBSP+combining matra").


Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 01:24:36 +0200
Philippe Verdy via Unicode  wrote:

> 2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  
> 
> > Fortunately, there is no good evidence that the occurrence
> > of multiple distinct left matras is anything but a typing error,
> > though I can easily see how it might be used as a lexicographical
> > convention on the fuzzy edge of plain text.
> >
> > In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> > U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> > https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> > ),
> > but I'm not sure what the legitimate encodings of the example word
> > കോോോ (typed here as ) are.
 
> Even if there were typing errors, the input method should either
> signal it visually to the user (using canonical reordering), or the
> user could still cancel this reordering (e.g. CTRL+Z for undoing it)
> and the input method could still fix it and maintain the order by
> then inserting combining joiners automatically even if the user did
> not enter them directly.

I don't see how any of ZWJ, ZWNJ and CGJ would help multiple
distinct left matras or repeated 2-part vowels. You might argue for
insertion of U+25CC as a base consonant, along with the ability to
delete just it.

> The joiners should better be removed transparently by the text editor
> without requiring the user to perform complex selections or pressing
> BACKSPACE multiple times, as I don't see any use of these joiners at
> end of graphemes, or multiple joiners in a sequence.

I believe  has a rôle in some Arabic script writing systems,
and possibly in other cursive Semitic scripts, such as Mongolian.
 is required at some syllable boundaries, and it is nice
to have ZWNJ honoured in the sequence , which is composed of two
extended grapheme clusters,  and .  This latter,
of course, is no more than one would require of good Latin typography
that works well with an English spell-checker - I would expect 'caecum'
to have a ligature but not 'sundae'.

> Even for Latin, one can freely enter SHY controls at any place within
> words, even if they are not at correct syllabic separations: this will
> impact the rendering if there are linebreaks, but this is done on
> purpose, and still easy to correct if this was made by error (a spell
> checker could also help locate these uncommon errors in existing
> texts but would not automatically correct them without instruction
> given by the user and a user can also choose to ignore/discard these
> signals and store the text as is).

Now that brings to mind some interesting cases -  and .  I'm not sure where the
handling should go, but Firefox handles the former reasonably.  My one
gripe is that I don't know how to tell the system that a rendered soft
hyphen is invisible.  Some typographers claim that the glyph for the
soft hyphen (i.e. the glyph for U+00AD) should be used when it becomes
manifest.  I haven't found any cases where a line break should go
between a left matra and a base consonant, but I wouldn't be surprised
to encounter an example in a manuscript in a phonetically ordered
script.  (They are far from unknown in Thai, but that's probably due
to software deficiencies.)  TUS treats the rendering of soft hyphens as
beyond its scope except for line-breaking - the rules are
language-dependent and beyond the scope of Unicode.  I don't know if
CLDR handles rendering around line-breaking soft hyphens.

I'm wondering if there are any cases where a SHY _should_ go between
a Latin letter and diacritic.  I can't think of any.

Richard.



Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode 
> 
> > > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > > units are bigger.  
> 
> > Not really, since UTF-8 doesn't have surrogates.
> 
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
> trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
> and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
> uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
> of the few systems that comes close to allowing them the dignity of
> integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
> 0x80 to 0xFF.
> 
> I well remember when Unicode regular expressions were required to
> allow one to search for lone surrogates, but there was no such concept
> of looking for isolated ill-associated bytes in Unicode 8-bit strings.
> 
> The point is that if one understands how UTF-8 works, UTF-16 is a
> system that works using a subset of the same principles, and one should
> therefore understand how UTF-16 works, until one comes to the weird and
> dubious concept of surrogate points having properties.  I believe the
> latter concept is of value only in code that lacks the concept of
> gibberish.  In UTF-8, the distinction between code unit value and
> Unicode scalar value is very clear; in UTF-16, it is muddied by the
> concept of 'codepoint'.

We are miscommunicating.  My point was that programming for MS-Windows
needs a good understanding of what the UTF-16 surrogates are, and in
what MS-Windows APIs/library functions they can and cannot be used.
Without this understanding, one cannot figure out why the likes of
iswspace and iswupper only support the BMP, and what APIs to use to
lift this limitation.  Likewise with display-related APIs, used to
display Unicode text.

If you don't teach UTF-16 including these details, the programmers
will feel lost when they meet with these complications.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Sat, 26 Aug 2017 18:55:25 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode 

> > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > units are bigger.  

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
0x80 to 0xFF.
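
The three counts can be checked mechanically. The sketch below
classifies every byte value by the role it can play in well-formed
UTF-8 and tallies the classes; the function name and the class labels
are illustrative, not terms from the standard.

```python
def utf8_byte_class(byte):
    """Classify a byte value 0..255 by its role in well-formed UTF-8."""
    if byte <= 0x7F:
        return "ascii"      # a complete one-byte sequence
    if byte <= 0xBF:
        return "trailing"   # continuation byte 0x80..0xBF
    if 0xC2 <= byte <= 0xF4:
        return "leading"    # first byte of a 2-, 3- or 4-byte sequence
    return "invalid"        # 0xC0, 0xC1, 0xF5..0xFF never occur

counts = {}
for byte in range(256):
    kind = utf8_byte_class(byte)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```

The tallies are 64 trailing, 51 leading and 13 invalid, i.e. 115
"surrogates" plus the 13 excluded values, with the remaining 128 being
ASCII.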

I well remember when Unicode regular expressions were required to
allow one to search for lone surrogates, but there was no such concept
of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties.  I believe the
latter concept is of value only in code that lacks the concept of
gibberish.  In UTF-8, the distinction between code unit value and
Unicode scalar value is very clear; in UTF-16, it is muddied by the
concept of 'codepoint'.

Richard.



Re: Unicode education in Schools

2017-08-26 Thread Eli Zaretskii via Unicode
> Date: Sat, 26 Aug 2017 16:09:33 +0100
> From: Richard Wordingham via Unicode 
> 
> > > Just steer them away from UTF-16!  
> > 
> > Which will leave them entirely unprepared for the MS-Windows Unicode
> > programming, something they of course will never need in their
> > careers.
> 
> It shouldn't.  UTF-16 works just like UTF-8, except that the code units
> are bigger.

Not really, since UTF-8 doesn't have surrogates.


Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 09:36:00 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Fri, 25 Aug 2017 00:23:40 +0100
> > From: Richard Wordingham via Unicode 
> > 
> > On Thu, 24 Aug 2017 17:17:10 +
> > Andre Schappo via Unicode  wrote:
> >   
> > > So, I consider it important to familiarise students with SMP
> > > characters as well as BMP characters. Then when they develop
> > > software they will, at the start, be thinking beyond ASCII and
> > > Unicode BMP characters.  
> > 
> > Just steer them away from UTF-16!  
> 
> Which will leave them entirely unprepared for the MS-Windows Unicode
> programming, something they of course will never need in their
> careers.

It shouldn't.  UTF-16 works just like UTF-8, except that the code units
are bigger.  The problem is that accidentally ignoring the difference
between UTF-16 and UCS-2 takes longer to be detected, and therefore
correcting the error may be very difficult.  Ignoring the difference
between ASCII (or an 8-bit coding) and UTF-8 shows up very quickly, and
therefore is less difficult to fix, for less is broken by the obvious
correction.
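
A rough illustration of why the two mistakes surface at different
times: the sketch counts code units per character in each encoding. The
helper name code_units and the sample characters are illustrative.

```python
def code_units(text):
    """Return (UTF-8 byte count, UTF-16 code unit count) for text."""
    utf8_units = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf8_units, utf16_units

for label, text in [("ASCII", "a"),
                    ("Latin-1 range", "\u00e9"),
                    ("BMP, Malayalam", "\u0d4b"),
                    ("supplementary", "\U0001F600")]:
    print(label, len(text), code_units(text))
```

Any character outside ASCII already takes more than one UTF-8 code
unit, so code that assumes one byte per character breaks on the first
accented letter. In UTF-16 the count stays at one until a supplementary
character turns up, which test data may never contain.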

Richard.



Re: Unicode education in Schools

2017-08-26 Thread Richard Wordingham via Unicode
On Fri, 25 Aug 2017 12:57:37 +0100 (BST)
William_J_G Overington via Unicode  wrote:

> UTF-16 is very useful. I use it in my research project.

> If the byte content of a UTF-16 file is displayed in a hexadecimal
> display then for all plane 0 characters the byte content of the
> character codes are thereby displayed directly.

But only plane 0.
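
A small sketch of the difference, assuming big-endian UTF-16 without a
byte order mark (in a little-endian file the two bytes of each code
unit appear swapped). The helper name utf16be_hex is illustrative.

```python
def utf16be_hex(text):
    """Hex dump of text encoded as big-endian UTF-16, without a BOM."""
    return text.encode("utf-16-be").hex()

bmp_dump = utf16be_hex("\u0d4b")        # MALAYALAM VOWEL SIGN OO
smp_dump = utf16be_hex("\U0001F600")    # a plane 1 character
print(bmp_dump, smp_dump)
```

The plane 0 character dumps as 0d4b, its own code point, while U+1F600
dumps as the surrogate pair d83d de00, from which the code point has to
be computed.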

How tedious (and expensive) would it be to obtain a licence to convert,
and freely share, the UCD to UTF-8 or UTF-16?  The code charts might
have to be a separate issue because of the fonts.

> Also, all characters that can be encoded in Unicode can be stored in
> a UTF-16 file.

Or UTF-8.  UTF-32 support is a bit limited.

Richard.


Re: Unicode education in Schools

2017-08-26 Thread Norbert Lindenberg via Unicode
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert


> On Aug 24, 2017, at 22:14, Peter Constable via Unicode wrote:
> 
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it 
> managed to progress beyond that?
> 
> Peter
> 
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner 
> via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List 
> Subject: Fwd: Unicode education in Schools
> 
> -- Forwarded message -
> From: David Starner 
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham 
> 
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode 
>  wrote:
> 
> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
> 
> Richard.
> 
> Steer them away from reinventing the wheel. If they use Java, use Java 
> strings. If they're using GTK, use strings compatible with GTK. If they're 
> writing JavaScript, use JavaScript strings. There's basically no system 
> without Unicode strings, or where they would be better off reinventing the
> wheel.
>