Re: Unicode education in Schools
On Fri, 25 Aug 2017 09:36:44 -0400 John W Kennedy wrote:

> Just a reminder that in Apple’s Swift a “Character” is anything that looks like a character, including a letter with any theoretically unlimited stack of diacritics, a flag, or a skin-toned emoji, and all Swift functions working with characters, strings, and substrings count characters in this way. There is an underlying store that is, for historic reasons, UTF-16, and that can be accessed, but so can UTF-8 and UTF-32.

Can the individual Unicode characters be accessed one by one, e.g. for searching for vowels or other such 'diacritics'? Or would one only have access to the code units? Could one easily search for a subjoined consonant, e.g. COENG RO in Khmer, where the two constituent characters would be in adjacent extended grapheme clusters?

Richard.
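For comparison only (this says nothing about what Swift offers), here is a minimal Python sketch of the kind of search asked about above. Python strings are sequences of code points, so a COENG RO pair can be found directly even though its two characters belong to adjacent extended grapheme clusters. The helper names `find_coeng_ro` and `combining_marks` and the sample word are illustrative assumptions, not anything from the thread.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG
RO = "\u179a"     # KHMER LETTER RO


def find_coeng_ro(text):
    """Return the code point index of every COENG RO pair in text."""
    needle = COENG + RO
    hits = []
    start = text.find(needle)
    while start != -1:
        hits.append(start)
        start = text.find(needle, start + 1)
    return hits


def combining_marks(text):
    """Return the names of all characters whose general category is M*."""
    return [unicodedata.name(ch) for ch in text
            if unicodedata.category(ch).startswith("M")]


# Hypothetical sample: KA, COENG, RO, VOWEL SIGN U, NGO
sample = "\u1780\u17d2\u179a\u17bb\u1784"
print(find_coeng_ro(sample))
print(combining_marks(sample))
```

The search works at the code point level and ignores cluster boundaries entirely, which is the behaviour the question is probing for.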
Re: Character Sequences of Uncertain Rendering (was: Version linking?)
On Sat, 26 Aug 2017 21:52:19 +0200 Philippe Verdy via Unicode wrote:

> 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <unicode@unicode.org>:

> Of course SHY in this use is not suitable, but who knows if one will not need this to split in two parts what would be otherwise a single cluster (possibly reordered by canonical reordering if one needs to split between two Indic matras: this would suggest there's a need for a new "empty base consonant" for that Indic script, but SHY (U+00AD) should probably not have the correct effect if it also inserts an undesired line break opportunity, independently of how the glyph which would be rendered and the position (first or second line) where it would be rendered if the linebreak is honored).

I am confused as to what conceivable case you have in mind. An example would help.

I wonder if I'm misunderstanding what you mean by 'canonical reordering'. Do you mean the order of codepoints, or the arrangement of glyphs? CGJ is available to preserve a specific ordering of codepoints, though it is completely redundant in most Indic scripts.

It is a fact that aksharas do get split between lines in manuscripts, undesirable though it may be. In a transcription intended to preserve a division into lines, one would probably use NBSP at such a point, and worry less about attempting to preserve the structure of the line-broken akshara. It seems that Unicode only supports word boundaries and their absence where they provide or prohibit line breaks.

Richard.
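To make the codepoint-order reading of 'canonical reordering' concrete, here is a small Python sketch using the standard `unicodedata` module. It uses Latin combining dots because they have non-zero combining classes; the base letter and the helper name `codepoints` are arbitrary choices for illustration. The last line checks that a Malayalam matra has combining class 0, which is why CGJ has nothing to protect in such sequences.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220
CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, canonical combining class 0


def codepoints(text):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


typed = "q" + DOT_ABOVE + DOT_BELOW
protected = "q" + DOT_ABOVE + CGJ + DOT_BELOW

# Normalization sorts adjacent marks by combining class, so the dots swap.
print(codepoints(unicodedata.normalize("NFD", typed)))
# CGJ has class 0 and so blocks the reordering across it.
print(codepoints(unicodedata.normalize("NFD", protected)))
# MALAYALAM VOWEL SIGN OO: class 0, never reordered.
print(unicodedata.combining("\u0d4b"))
```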
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 22:07:57 +0100
> From: Richard Wordingham via Unicode
>
> > We are miscommunicating. My point was that programming for MS-Windows needs a good understanding of what the UTF-16 surrogates are, and in what MS-Windows APIs/library functions they can and cannot be used. Without this understanding, one cannot figure out why the likes of iswspace and iswupper only support the BMP, and what APIs to use to lift this limitation. Likewise with display-related APIs, used to display Unicode text.
> >
> > If you don't teach UTF-16 including these details, the programmers will feel lost when they meet with these complications.
>
> So what's new compared to UTF-8?

Who said this is new? I said this needs to be _taught_, or else people will be ignorant about these subtleties.
Re: Unicode education in Schools
On Sat, 26 Aug 2017 21:20:45 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Sat, 26 Aug 2017 18:52:03 +0100
> > From: Richard Wordingham via Unicode

> We are miscommunicating. My point was that programming for MS-Windows needs a good understanding of what the UTF-16 surrogates are, and in what MS-Windows APIs/library functions they can and cannot be used. Without this understanding, one cannot figure out why the likes of iswspace and iswupper only support the BMP, and what APIs to use to lift this limitation. Likewise with display-related APIs, used to display Unicode text.

> If you don't teach UTF-16 including these details, the programmers will feel lost when they meet with these complications.

So what's new compared to UTF-8?

The problem would be a misconception that MSVC's wchar_t supported Unicode - or has that been fixed recently? The neutral message is to avoid wchar_t where possible. C++11 and C11's char32_t ought to have fixed the problem.

Functions iswspace() and iswlower() are not stable; one really has to replace them by the project's UCD routines. For example, when the locale is a Unicode locale with the obvious wchar_t representations, the value of iswlower(0x13A0) recently changed from non-zero to zero, as U+13A0 changed from gc=Lo to gc=Lu. I don't think iswupper() is any more stable.

Richard.
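The property change mentioned above can be observed from Python, which ships both a frozen Unicode 3.2.0 database and the database current for the interpreter. This is only a sketch of the general point that classification results depend on the UCD version in use; it does not reproduce the behaviour of any particular C library's iswlower().

```python
import unicodedata

cherokee_a = "\u13a0"  # CHEROKEE LETTER A

# Frozen Unicode 3.2.0 data that Python keeps for IDNA support.
old = unicodedata.ucd_3_2_0.category(cherokee_a)
# Data bundled with this interpreter; the version varies by Python release.
new = unicodedata.category(cherokee_a)

print("Unicode 3.2.0:", old)
print("Unicode " + unicodedata.unidata_version + ":", new)
# str.islower()/isupper() follow the bundled data, so they shift with it too.
print("islower:", cherokee_a.islower(), "isupper:", cherokee_a.isupper())
```

On an interpreter bundling Unicode 8.0 or later, the second category should differ from the first, which is the instability being described.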
Re: Character Sequences of Uncertain Rendering (was: Version linking?)
2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <unicode@unicode.org>:

> I'm wondering if there are any cases where a SHY _should_ go between a Latin letter and diacritic. I can't think of any.

In standard Latin orthography you would not expect it, normally, but there will be cases where this will still occur at random places between long spans of letters. However I did NOT suggest (like you are doing here) using SHY between a Latin letter and any diacritic. But maybe you've been confused by the fact that I took the example of free insertion of SHY controls in alphabetic scripts in comparison to the free insertion of joiner controls (not the same thing) between Indic letters (including vowel matras or subjoined consonants that are encoded as combining characters but are not really "diacritics").

Of course SHY in this use is not suitable, but who knows if one will not need this to split in two parts what would be otherwise a single cluster (possibly reordered by canonical reordering if one needs to split between two Indic matras: this would suggest there's a need for a new "empty base consonant" for that Indic script, but SHY (U+00AD) should probably not have the correct effect if it also inserts an undesired line break opportunity, independently of how the glyph which would be rendered and the position (first or second line) where it would be rendered if the linebreak is honored).

If one wants an empty base letter to combine with the diacritic after it, I think it should be NBSP (U+00A0) to avoid the interpretation as a "defective" cluster using an implied glyph such as the dotted circle (but NBSP also has its own problems, notably for collation where it would collate like a space instead of being ignorable at primary level: this can be fixed however quite easily in collation tailorings, using collation elements made with "NBSP+combining matra").
Character Sequences of Uncertain Rendering (was: Version linking?)
On Fri, 25 Aug 2017 01:24:36 +0200 Philippe Verdy via Unicode wrote:

> 2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <unicode@unicode.org>:
>
> > Fortunately, there is no good evidence that the occurrence of multiple distinct left matras is anything but a typing error, though I can easily see how it might be used as a lexicographical convention on the fuzzy edge of plain text.
> >
> > In a similar vein, in Malayalam, we get repeats of the 2-part vowel U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html ), but I'm not sure what the legitimate encodings of the example word കോോോ (typed here as ) are.

> Even if there were typing errors, the input method should either signal it visually to the user (using canonical reordering), or the user could still cancel this reordering (e.g. CTRL+Z for undoing it) and the input method could still fix it and maintain the order by then inserting combining joiners automatically even if the user did not enter them directly.

I don't see how any of ZWJ, ZWNJ and CGJ would help multiple distinct left matras or repeated 2-part vowels. You might argue for insertion of U+25CC as a base consonant, along with the ability to delete just it.

> The joiners should better be removed transparently by the text editor without requiring the user to perform complex selections or pressing BACKSPACE multiple times, as I don't see any use of these joiners at end of graphemes, or multiple joiners in a sequence.

I believe has a rôle in some Arabic script writing systems, and possibly in other cursive Semitic scripts, such as Mongolian. is required at some syllable boundaries, and it is nice to have ZWNJ honoured in the sequence , which is composed of two extended grapheme clusters, and .

This latter, of course, is no more than one would require of good Latin typography that works well with an English spell-checker - I would expect 'caecum' to have a ligature but not 'sundae'.

> Even for Latin, one can freely enter SHY controls at any place within words, even if they are not at correct syllabic separations: this will impact the rendering if there are linebreaks, but this is done on purpose, and still easy to correct if this was made by error (a spell checker could also help locate these uncommon errors in existing texts but would not automatically correct them without instruction given by the user and a user can also choose to ignore/discard these signals and store the text as is).

Now that brings to mind some interesting cases - and . I'm not sure where the handling should go, but Firefox handles the former reasonably. My one gripe is that I don't know how to tell the system that a rendered soft hyphen is invisible. Some typographers claim that the glyph for the soft hyphen (i.e. the glyph for U+00AD) should be used when it becomes manifest.

I haven't found any cases where a line break should go between a left matra and a base consonant, but I wouldn't be surprised to encounter an example in a manuscript in a phonetically ordered script. (They are far from unknown in Thai, but that's probably due to software deficiencies.)

TUS treats the rendering of soft hyphens as beyond its scope except for line-breaking - the rules are language-dependent and beyond the scope of Unicode. I don't know if CLDR handles rendering around line-breaking soft hyphens.

I'm wondering if there are any cases where a SHY _should_ go between a Latin letter and diacritic. I can't think of any.

Richard.
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode
>
> > > It shouldn't. UTF-16 works just like UTF-8, except that the code units are bigger.
>
> > Not really, since UTF-8 doesn't have surrogates.
>
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64 trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4, and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13 uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one of the few systems that comes close to allowing them the dignity of integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units 0x80 to 0xFF.
>
> I well remember when Unicode regular expressions were required to allow one to search for lone surrogates, but there was no such concept of looking for isolated ill-associated bytes in Unicode 8-bit strings.
>
> The point is that if one understands how UTF-8 works, UTF-16 is a system that works using a subset of the same principles, and one should therefore understand how UTF-16 works, until one comes to the weird and dubious concept of surrogate points having properties. I believe the latter concept is of value only in code that lacks the concept of gibberish. In UTF-8, the distinction between code unit value and Unicode scalar value is very clear; in UTF-16, it is muddied by the concept of 'codepoint'.

We are miscommunicating. My point was that programming for MS-Windows needs a good understanding of what the UTF-16 surrogates are, and in what MS-Windows APIs/library functions they can and cannot be used. Without this understanding, one cannot figure out why the likes of iswspace and iswupper only support the BMP, and what APIs to use to lift this limitation. Likewise with display-related APIs, used to display Unicode text.

If you don't teach UTF-16 including these details, the programmers will feel lost when they meet with these complications.
Re: Unicode education in Schools
On Sat, 26 Aug 2017 18:55:25 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode

> > It shouldn't. UTF-16 works just like UTF-8, except that the code units are bigger.

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64 trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4, and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13 uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one of the few systems that comes close to allowing them the dignity of integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units 0x80 to 0xFF.

I well remember when Unicode regular expressions were required to allow one to search for lone surrogates, but there was no such concept of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a system that works using a subset of the same principles, and one should therefore understand how UTF-16 works, until one comes to the weird and dubious concept of surrogate points having properties. I believe the latter concept is of value only in code that lacks the concept of gibberish. In UTF-8, the distinction between code unit value and Unicode scalar value is very clear; in UTF-16, it is muddied by the concept of 'codepoint'.

Richard.
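The byte counts in the message above can be checked mechanically. The following Python sketch classifies all 256 byte values using the ranges given in the text, then cross-checks the "never valid" set by encoding every Unicode scalar value and recording which bytes actually occur. The class labels and the function name `classify` are illustrative choices.

```python
from collections import Counter


def classify(byte):
    """Classify a byte value by the role it can play in well-formed UTF-8."""
    if byte < 0x80:
        return "ascii"
    if byte <= 0xBF:
        return "trail"   # continuation bytes 0x80..0xBF
    if 0xC2 <= byte <= 0xF4:
        return "lead"    # first byte of a 2-, 3- or 4-byte sequence
    return "never"       # 0xC0, 0xC1, 0xF5..0xFF


counts = Counter(classify(b) for b in range(256))
print(dict(counts))

# Cross-check: collect every byte used when encoding all scalar values
# (all code points except the UTF-16 surrogate range D800..DFFF).
seen = set()
for cp in range(0x110000):
    if not 0xD800 <= cp <= 0xDFFF:
        seen.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - seen)
print([hex(b) for b in unused])
```

The full scan over the code space takes a moment but needs no external data, so it serves as an independent check on the ranges.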
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 16:09:33 +0100
> From: Richard Wordingham via Unicode
>
> > > Just steer them away from UTF-16!
> >
> > Which will leave them entirely unprepared for the MS-Windows Unicode programming, something they of course will never need in their careers.
>
> It shouldn't. UTF-16 works just like UTF-8, except that the code units are bigger.

Not really, since UTF-8 doesn't have surrogates.
Re: Unicode education in Schools
On Fri, 25 Aug 2017 09:36:00 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Fri, 25 Aug 2017 00:23:40 +0100
> > From: Richard Wordingham via Unicode
> >
> > On Thu, 24 Aug 2017 17:17:10 + Andre Schappo via Unicode wrote:
> >
> > > So, I consider it important to familiarise students with SMP characters as well as BMP characters. Then when they develop software they will, at the start, be thinking beyond ASCII and Unicode BMP characters.
> >
> > Just steer them away from UTF-16!
>
> Which will leave them entirely unprepared for the MS-Windows Unicode programming, something they of course will never need in their careers.

It shouldn't. UTF-16 works just like UTF-8, except that the code units are bigger. The problem is that accidentally ignoring the difference between UTF-16 and UCS-2 takes longer to be detected, and therefore correcting the error may be very difficult. Ignoring the difference between ASCII (or an 8-bit coding) and UTF-8 shows up very quickly, and therefore is less difficult to fix, for less is broken by the obvious correction.

Richard.
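A short Python sketch may illustrate why the two mistakes are detected at such different speeds. The function name `utf16_units` and the sample strings are assumptions chosen for the illustration. Treating UTF-16 as UCS-2 gives correct answers for any BMP-only test data and goes wrong only once a supplementary-plane character appears, whereas treating UTF-8 as an 8-bit encoding already garbles the first accented Latin letter.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


bmp_text = "caf\u00e9"      # BMP only: one code unit per character
smp_text = "\U00010400"     # DESERET CAPITAL LETTER LONG I, plane 1

# A UCS-2 assumption ("one unit == one character") holds here...
print(len(bmp_text), len(utf16_units(bmp_text)))
# ...and silently breaks here: one character, two surrogate code units.
print(len(smp_text), [hex(u) for u in utf16_units(smp_text)])

# By contrast, misreading UTF-8 as Latin-1 is visible at once.
print(bmp_text.encode("utf-8").decode("latin-1"))
```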
Re: Unicode education in Schools
On Fri, 25 Aug 2017 12:57:37 +0100 (BST) William_J_G Overington via Unicode wrote:

> UTF-16 is very useful. I use it in my research project.
> If the byte content of a UTF-16 file is displayed in a hexadecimal display then for all plane 0 characters the byte content of the character codes are thereby displayed directly.

But only plane 0. How tedious (and expensive) would it be to obtain a licence to convert, and freely share, the UCD to UTF-8 or UTF-16? The code charts might have to be a separate issue because of the fonts.

> Also, all characters that can be encoded in Unicode can be stored in a UTF-16 file.

Or UTF-8. UTF-32 support is a bit limited.

Richard.
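The "but only plane 0" caveat is easy to see in a hex dump. The Python sketch below assumes big-endian byte order without a BOM so that each 16-bit unit reads left to right; the helper name `utf16be_hex` and the sample characters are illustrative.

```python
def utf16be_hex(text):
    """Hex-dump text as UTF-16BE, one group per 16-bit code unit."""
    data = text.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex().upper()
                    for i in range(0, len(data), 2))


# Plane 0: the dump shows the code points U+0041 and U+20AC directly.
print(utf16be_hex("A\u20ac"))
# Plane 1: U+1F600 appears as a surrogate pair, not as its code point.
print(utf16be_hex("\U0001f600"))
# UTF-32BE shows the code point directly for every plane.
print("\U0001f600".encode("utf-32-be").hex().upper())
```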
Re: Unicode education in Schools
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert

> On Aug 24, 2017, at 22:14, Peter Constable via Unicode wrote:
>
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List
> Subject: Fwd: Unicode education in Schools
>
> -- Forwarded message -
> From: David Starner
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode wrote:
>
> Just steer them away from UTF-16! (And vigorously prohibit the very concept of UCS-2).
>
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java strings. If they're using GTK, use strings compatible with GTK. If they're writing JavaScript, use JavaScript strings. There's basically no system without Unicode strings or that they would be better off rewriting the wheel.