Re: Unicode education in Schools
So how do you think it matters if the characters are in the BMP or SMP?
Re: Unicode education in Schools
Because there are many systems that can now handle BMP characters but cannot handle SMP characters.

One example is systems that use MySQL utf8 (3-byte encoding) and have not yet been updated to utf8mb4 (4-byte encoding).

So, I consider it important to familiarise students with SMP characters as well as BMP characters. Then when they develop software they will, at the start, be thinking beyond ASCII and Unicode BMP characters.

André Schappo

> On 24 Aug 2017, at 17:45, Shriramana Sharma wrote:
>
> So how do you think it matters if the characters are in the BMP or SMP?
Re: Unicode education in Schools
2017-08-24 19:17 GMT+02:00 Andre Schappo via Unicode:

> Because there are many systems that can now handle BMP characters but
> cannot handle SMP characters.
>
> One example being systems that use mysql utf8 (3 byte encoding) and have
> not yet updated to utf8mb4 (4 byte encoding)

MySQL's utf8 is known to cause severe problems, notably on wikis installed with it by default: the presence of any non-BMP character in the edited text (SMP characters such as emoji are now very frequent and available on almost all modern smartphones) will cause its **silent** truncation when the text is uploaded to the server and saved to the database, even if an unsaved preview was correct. You will see the truncation only when the page is loaded again.

MySQL's "utf8" should have been dropped long ago and replaced by utf8mb4, or set up so that data sent to a "utf8"-encoded database causes an SQL error that cannot be silently ignored with truncation (or at least it should only cause the non-BMP characters to be filtered out, without silently deleting everything that follows).

This is an old, severe bug in MySQL (on the server itself, in the connection protocol, or in the internal filters used by the MySQL client library) that has caused many severe security issues (such as discarded logs or to-do lists, loss of pending commercial transactions such as lists of payments to be processed by a bank, truncated bills sent to customers, loss of a contact address or name, broken delivery addresses for customers, missing items in a delivered box, and products lost in the middle of their routing).

This is a demonstration that not signaling encoding errors to an application, or not clearly specifying that an API may raise encoding exceptions that must be caught and must not be ignored by applications, can hurt. Even if you use "utf8mb4", encoding errors are still possible and must not be ignored, as the final result will be unpredictable.
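To make the "signal the error instead of truncating" point concrete, here is a minimal Python sketch. It does not talk to MySQL at all; `require_bmp_only` is a hypothetical guard that an application could run before handing text to a store limited to 3-byte UTF-8 sequences, so that a non-BMP character produces a visible exception rather than a silently shortened value.

```python
def require_bmp_only(text: str) -> str:
    """Return `text` unchanged, or raise if it contains a non-BMP character.

    Hypothetical guard for a store limited to 3-byte UTF-8 sequences:
    every code point above U+FFFF needs 4 bytes in UTF-8.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"U+{ord(ch):04X} at index {index} needs 4 bytes in UTF-8"
            )
    return text


# BMP-only text passes through untouched.
require_bmp_only("caf\u00e9")

# Text with an emoji is rejected loudly instead of losing its tail.
try:
    require_bmp_only("ok \U0001F615 this tail must not be lost")
except ValueError as error:
    print("rejected:", error)
```

Whether rejecting, filtering, or migrating to utf8mb4 is the right reaction depends on the application; the sketch only shows that the failure can be made explicit.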
Re: Unicode education in Schools
On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:

> Because there are many systems that can now handle BMP characters but cannot handle SMP characters.
>
> One example being systems that use mysql utf8 (3 byte encoding) and have not yet updated to utf8mb4 (4 byte encoding)
>
> So, I consider it important to familiarise students with SMP characters as well as BMP characters. Then when they develop software they will, at the start, be thinking beyond ASCII and Unicode BMP characters.

The thinking "beyond BMP" part only comes in when you work in encoding forms where the BMP uses a different number of code units than the SMP (or any other non-BMP "page"). This is true for both UTF-8 and UTF-16 but not if you work in UTF-32 or in scalar values (as in the posted exercise).

The trick with using emoji in this lesson is that the descriptions and images are meaningful to any English speaker, so it gets the student to learn about character names.

The same exercise would be more of a challenge for students whose native tongue is not English.

A./

> André Schappo
>
> On 24 Aug 2017, at 17:45, Shriramana Sharma wrote:
>
>> So how do you think it matters if the characters are in the BMP or SMP?
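The code-unit point can be checked with a short Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not taken from the posted exercise; the sketch simply counts how many code units one character needs in each encoding form.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


bmp_char = "\u20ac"      # U+20AC EURO SIGN, in the BMP
smp_char = "\U0001F615"  # U+1F615 CONFUSED FACE, in the SMP

print("BMP:", code_units(bmp_char))
print("SMP:", code_units(smp_char))
```

For the BMP character the counts are 3, 1 and 1; for the SMP character they are 4, 2 and 1. Only the UTF-32 count is the same for both planes.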
Re: Unicode education in Schools
On Thu, 24 Aug 2017 17:17:10 + Andre Schappo via Unicode wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16! (And vigorously prohibit the very concept of UCS-2).

Richard.
Re: Unicode education in Schools
Strings in Java and JavaScript are basically the same: they are arbitrary sequences of 16-bit code units, not restricted to text with valid UTF-16 encoding. The differences are in the set of access methods, but both are normally immutable, and both allow (but do not enforce) substrings to share their backing store between distinct instances.

The same applies to C/C++ "wide strings" when their code units are larger than 1 byte, but C/C++ do not make them immutable, except through dedicated classes, which transiently allow setting their content through constructors. C/C++ wide strings also exist with several signed and unsigned code unit types, whereas Java has only unsigned 16-bit code units in its "char", and JavaScript has no "char" type but only the "Number" type, with valid-range restrictions applied when constructing String instances from code units or from code point values.

JavaScript should soon have a new numeric type (provisionally named "BigInt", an arbitrary-precision integer whose constants are suffixed by "n"; there will be no implicit promotion from/to Number but only explicit conversions by checked constructors) and new 64-bit element types for mutable buffers (which matter only for the range checks of their write accessors, using "Number" 64-bit floating-point values or the newer "BigInt" values).

There are similar designs in Perl, PHP, and most languages: Unicode support and conformance when using these types for valid text is implemented only by libraries, in their standard text API or in their I/O APIs. These take immutable strings or mutable buffers as parameters, and return sharable but immutable string instances or a mutable buffer (referenced on input or allocated internally). But these APIs are not restricted to handling valid Unicode text, and allow their strings to be used with any other encoding.

With immutable strings implemented as classes, the backing store is normally not directly accessible, even by reference: you can only reference the class instance, which internally references the backing store. That store is implemented using mutable buffers and an internal encoding which may differ from the one exposed by the string class. It may use compression techniques on demand, and implicit atomization of the most frequently used string values, notably the empty string, strings representing a single character with an 8-bit code point value, and strings containing any repetition of the same code point value: these values need no internally allocated buffer for their backing store, so such instances are allocated very fast and do not stress the garbage collector when they are no longer used.

When Unicode text handling is supported by their exposed methods, the Unicode validation rules are not necessarily checked everywhere, so it is still possible to have strings or buffers containing a single unpaired surrogate value. The backing store may also allow storing code units outside the ranges used by valid UTF-16 or valid UTF-32 (the backing stores are virtualized and could be on disk and swapped on demand with reusable buffers from a pool).

2017-08-25 2:17 GMT+02:00 David Starner via Unicode:

> ---------- Forwarded message ----------
> From: David Starner
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <unicode@unicode.org> wrote:
>
>> Just steer them away from UTF-16! (And vigorously prohibit the very
>> concept of UCS-2).
>>
>> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the
> wheel.
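The remark that string types can hold an unpaired surrogate is easy to observe. The sketch below uses Python rather than Java or JavaScript (an assumption made only to keep one language for the examples in this thread): a Python `str` can also contain a lone surrogate, and it is the strict UTF-8 encoder, not the string type, that rejects it. The helper name `is_well_formed` is illustrative.

```python
def is_well_formed(text: str) -> bool:
    """True if `text` can be encoded as strict UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


lone_surrogate = "\ud83d"         # high surrogate without its partner
complete_character = "\U0001F615"  # U+1F615, a valid scalar value

print(is_well_formed(lone_surrogate))      # the string exists, but is not valid text
print(is_well_formed(complete_character))
```

The first call reports False and the second True: validation happens at the encoding boundary, which is exactly where an ignored error would lose data.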
RE: Unicode education in Schools
I thought JavaScript had a UCS-2 understanding of Unicode strings. Has it managed to progress beyond that?

Peter

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner via Unicode
Sent: Thursday, August 24, 2017 5:18 PM
To: Unicode Mailing List
Subject: Fwd: Unicode education in Schools

---------- Forwarded message ----------
From: David Starner <prosfil...@gmail.com>
Date: Thu, Aug 24, 2017, 6:16 PM
Subject: Re: Unicode education in Schools
To: Richard Wordingham <richard.wording...@ntlworld.com>

On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <unicode@unicode.org> wrote:

Just steer them away from UTF-16! (And vigorously prohibit the very concept of UCS-2).

Richard.

Steer them away from reinventing the wheel. If they use Java, use Java strings. If they're using GTK, use strings compatible with GTK. If they're writing JavaScript, use JavaScript strings. There's basically no system without Unicode strings or that they would be better off rewriting the wheel.
RE: Unicode education in Schools
IIUC the limitation seems to be only that functions such as "charAt" do not recognize that surrogates aren't valid characters: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt via https://stackoverflow.com/a/8716157/1503120. This is a problem of many 32-bit char based toolkits too and doesn't (can't?) have an efficient solution for SMP without counting the surrogates (and checking them). Right?
RE: Unicode education in Schools
Use String.prototype.codePointAt() etc.

On Aug 24, 2017, 10:42 PM -0700, Shriramana Sharma via Unicode wrote:

> IIUC the limitation seems to be only that functions such as "charAt" do not
> recognize that surrogates aren't valid characters:
>
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
> via https://stackoverflow.com/a/8716157/1503120.
>
> This is a problem of many 32-bit char based toolkits too and doesn't (can't?)
> have an efficient solution for SMP without counting the surrogates (and
> checking them). Right?
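The difference between indexing by code unit and reading a code point can be modelled outside JavaScript. The following Python sketch is only an illustration of the idea: `utf16_code_units` and `code_point_at` are hypothetical helpers that mimic, respectively, what a charAt-style accessor sees and what a codePointAt-style accessor returns for a well-formed surrogate pair.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def code_point_at(units: list, index: int) -> int:
    """Combine a well-formed surrogate pair starting at `index`, if any."""
    unit = units[index]
    if 0xD800 <= unit <= 0xDBFF and index + 1 < len(units):
        following = units[index + 1]
        if 0xDC00 <= following <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (following - 0xDC00)
    return unit  # BMP code unit or unpaired surrogate, returned as-is


units = utf16_code_units("a\U0001F615")
print([f"{u:04X}" for u in units])       # the three code units
print(f"{code_point_at(units, 1):X}")    # the code point starting at index 1
```

Indexing by code unit yields the lone values D83D and DE15, while `code_point_at(units, 1)` recovers 1F615. Note that the index is still a code-unit index, so finding the n-th character still requires walking the string.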
Re: Unicode education in Schools
> Date: Fri, 25 Aug 2017 00:23:40 +0100
> From: Richard Wordingham via Unicode
>
> On Thu, 24 Aug 2017 17:17:10 +
> Andre Schappo via Unicode wrote:
>
> > So, I consider it important to familiarise students with SMP
> > characters as well as BMP characters. Then when they develop software
> > they will, at the start, be thinking beyond ASCII and Unicode BMP
> > characters.
>
> Just steer them away from UTF-16!

Which will leave them entirely unprepared for MS-Windows Unicode programming, something they of course will never need in their careers.
Re: Unicode education in Schools
Mark (https://twitter.com/mark_e_davis)

On Thu, Aug 24, 2017 at 11:01 PM, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

> On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:
>
>> Because there are many systems that can now handle BMP characters but
>> cannot handle SMP characters.
>>
>> One example being systems that use mysql utf8 (3 byte encoding) and have
>> not yet updated to utf8mb4 (4 byte encoding)
>>
>> So, I consider it important to familiarise students with SMP characters
>> as well as BMP characters. Then when they develop software they will, at
>> the start, be thinking beyond ASCII and Unicode BMP characters.
>
> The thinking "beyond BMP" part only comes in when you work in encoding
> forms where the BMP uses a different number of code units than the SMP (or
> any other non-BMP "page"). This is true for both utf8 and utf16 but not if
> you work in utf32 or in scalar values (as in the posted exercise).
>
> The trick with using emoji in this lesson is that the descriptions and
> images are meaningful to any English speaker, so it gets the student to
> learn about character names.
>
> The same exercise would be more of a challenge for students whose native
> tongue is not English.

The trick with using emoji...

True. For emoji names it would be better to use the CLDR names with non-anglophone audiences, since those names are available in a number of languages, e.g.
http://www.unicode.org/cldr/charts/31/annotations/romance.html#😕
(that was last release's version; next release will have improvements...)

> A./
>
>> André Schappo
>>
>> On 24 Aug 2017, at 17:45, Shriramana Sharma wrote:
>>>
>>> So how do you think it matters if the characters are in the BMP or SMP?
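For the character-name part of such an exercise, a small Python sketch can look names up from the Unicode Character Database bundled with the interpreter. The helper name `describe` is illustrative. These are the formal English names only; the localized CLDR names mentioned above are not part of Python's standard library and would need separate data.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using the formal Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


print(describe("\U0001F615"))  # the emoji used in the CLDR link above
print(describe("\u20ac"))
```

The first line printed is "U+1F615 CONFUSED FACE", which is the kind of description an English-speaking student can match to the image.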
Re: Unicode education in Schools
Richard Wordingham wrote:

> Just steer them away from UTF-16! (And vigorously prohibit the very concept
> of UCS-2).

UTF-16 is very useful. I use it in my research project.

If the byte content of a UTF-16 file is displayed in a hexadecimal display, then for all plane 0 characters the byte content of the character codes is thereby displayed directly.

Also, all characters that can be encoded in Unicode can be stored in a UTF-16 file.

William Overington

Friday 25 August 2017

Original message
From : unicode@unicode.org
Date : 2017/08/25 - 00:23 (GMTST)
To : unicode@unicode.org
Subject : Re: Unicode education in Schools

On Thu, 24 Aug 2017 17:17:10 + Andre Schappo via Unicode wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16! (And vigorously prohibit the very concept of UCS-2).

Richard.
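The hex-display observation can be reproduced with a few lines of Python. The sketch assumes big-endian UTF-16 without a byte order mark (with little-endian data the two bytes of each unit appear swapped), and `utf16be_hex` is an illustrative helper name.

```python
def utf16be_hex(text: str) -> str:
    """Hex dump of `text` in big-endian UTF-16, one group per code unit."""
    data = text.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex() for i in range(0, len(data), 2))


print(utf16be_hex("A\u20ac"))      # two plane 0 characters
print(utf16be_hex("\U0001F615"))   # one supplementary character
```

The first dump reads "0041 20ac", which is the two code points as they stand. The second reads "d83d de15", a surrogate pair that has to be decoded before U+1F615 becomes visible.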
Re: Unicode education in Schools
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert

> On Aug 24, 2017, at 22:14, Peter Constable via Unicode wrote:
>
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it
> managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner
> via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List
> Subject: Fwd: Unicode education in Schools
>
> ---------- Forwarded message ----------
> From: David Starner
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode wrote:
>
> Just steer them away from UTF-16! (And vigorously prohibit the very
> concept of UCS-2).
>
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the wheel.
Re: Unicode education in Schools
On Fri, 25 Aug 2017 12:57:37 +0100 (BST) William_J_G Overington via Unicode wrote:

> UTF-16 is very useful. I use it in my research project.

> If the byte content of a UTF-16 file is displayed in a hexadecimal
> display then for all plane 0 characters the byte content of the
> character codes are thereby displayed directly.

But only plane 0. How tedious (and expensive) would it be to obtain a licence to convert, and freely share, the UCD to UTF-8 or UTF-16? The code charts might have to be a separate issue because of the fonts.

> Also, all characters that can be encoded in Unicode can be stored in
> a UTF-16 file.

Or UTF-8. UTF-32 support is a bit limited.

Richard.
Re: Unicode education in Schools
On Fri, 25 Aug 2017 09:36:00 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Fri, 25 Aug 2017 00:23:40 +0100
> > From: Richard Wordingham via Unicode
> >
> > On Thu, 24 Aug 2017 17:17:10 +
> > Andre Schappo via Unicode wrote:
> >
> > > So, I consider it important to familiarise students with SMP
> > > characters as well as BMP characters. Then when they develop
> > > software they will, at the start, be thinking beyond ASCII and
> > > Unicode BMP characters.
> >
> > Just steer them away from UTF-16!
>
> Which will leave them entirely unprepared for the MS-Windows Unicode
> programming, something they of course will never need in their
> careers.

It shouldn't. UTF-16 works just like UTF-8, except that the code units are bigger. The problem is that accidentally ignoring the difference between UTF-16 and UCS-2 takes longer to be detected, and therefore correcting the error may be very difficult. Ignoring the difference between ASCII (or an 8-bit coding) and UTF-8 shows up very quickly, and therefore is less difficult to fix, for less is broken by the obvious correction.

Richard.
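The claim about how late the UTF-16/UCS-2 confusion shows up can be illustrated with a short Python sketch. The helper names and sample strings are illustrative; the sketch compares the number of characters (code points) in a string with the number of code units a naive "one unit is one character" program would count.

```python
def count_utf8_units(text: str) -> int:
    """Number of 8-bit code units (bytes) in the UTF-8 form of `text`."""
    return len(text.encode("utf-8"))


def count_utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 form of `text`."""
    return len(text.encode("utf-16-le")) // 2


samples = ["plain", "caf\u00e9", "a\U0001F615"]
for text in samples:
    print(len(text), count_utf8_units(text), count_utf16_units(text))
```

With "café" the UTF-8 count (5) already disagrees with the 4 characters, so a byte-equals-character assumption fails on everyday Western European text. The UTF-16 count stays correct until a supplementary character appears: "a" followed by U+1F615 is 2 characters but 3 UTF-16 code units.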
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 16:09:33 +0100
> From: Richard Wordingham via Unicode
>
> > > Just steer them away from UTF-16!
> >
> > Which will leave them entirely unprepared for the MS-Windows Unicode
> > programming, something they of course will never need in their
> > careers.
>
> It shouldn't. UTF-16 works just like UTF-8, except that the code units
> are bigger.

Not really, since UTF-8 doesn't have surrogates.
Re: Unicode education in Schools
On Sat, 26 Aug 2017 18:55:25 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode

> > It shouldn't. UTF-16 works just like UTF-8, except that the code
> > units are bigger.

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64 trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4, and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13 uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one of the few systems that comes close to allowing them the dignity of integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units 0x80 to 0xFF.

I well remember when Unicode regular expressions were required to allow one to search for lone surrogates, but there was no such concept of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a system that works using a subset of the same principles, and one should therefore understand how UTF-16 works, until one comes to the weird and dubious concept of surrogate points having properties. I believe the latter concept is of value only in code that lacks the concept of gibberish. In UTF-8, the distinction between code unit value and Unicode scalar value is very clear; in UTF-16, it is muddied by the concept of 'codepoint'.

Richard.
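The byte counts quoted above (64 trailing, 51 leading, 13 never used) can be verified by brute force. The Python sketch below encodes every Unicode scalar value once and classifies the bytes 0x80 to 0xFF by whether they ever occur and by their top two bits; the variable names are illustrative, and "leading"/"trailing" follow the message's analogy rather than any official UTF-8 terminology.

```python
# Encode every Unicode scalar value (all code points except D800..DFFF).
scalar_values = (c for c in range(0x110000) if not 0xD800 <= c <= 0xDFFF)
encoded = "".join(map(chr, scalar_values)).encode("utf-8")
used = set(encoded)  # every byte value that occurs in well-formed UTF-8

high_bytes = range(0x80, 0x100)
# Continuation bytes have the bit pattern 10xxxxxx.
trailing = [b for b in high_bytes if b in used and (b & 0xC0) == 0x80]
# Bytes that start a multi-byte sequence have the bit pattern 11xxxxxx.
leading = [b for b in high_bytes if b in used and (b & 0xC0) == 0xC0]
# Bytes that never occur in any well-formed UTF-8 sequence.
unused = [b for b in high_bytes if b not in used]

print(len(trailing), len(leading), len(unused))
print([f"{b:02X}" for b in unused])
```

The three counts come out as 64, 51 and 13, and the unused list is C0, C1 and F5 through FF, matching the figures in the message.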
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode
>
> > > It shouldn't. UTF-16 works just like UTF-8, except that the code
> > > units are bigger.
>
> > Not really, since UTF-8 doesn't have surrogates.
>
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
> trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
> and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
> uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
> of the few systems that comes close to allowing them the dignity of
> integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
> 0x80 to 0xFF.
>
> I well remembered when Unicode regular expressions were required to
> allow one to search for lone surrogates, but there was no such concept
> of looking for isolated ill-associated bytes in Unicode 8-bit strings.
>
> The point is that if one understands how UTF-8 works, UTF-16 is a
> system that works using a subset of the same principles, and one should
> therefore understand how UTF-16 works, until one comes to the weird and
> dubious concept of surrogate points having properties. I believe the
> latter concept is of value only in code that lacks the concept of
> gibberish. In UTF-8, the distinction between code unit value and
> Unicode scalar value is very clear; in UTF-16, it is muddied by the
> concept of 'codepoint'.

We are miscommunicating. My point was that programming for MS-Windows needs a good understanding of what the UTF-16 surrogates are, and in what MS-Windows APIs/library functions they can and cannot be used. Without this understanding, one cannot figure out why the likes of iswspace and iswupper only support the BMP, and what APIs to use to lift this limitation. Likewise with display-related APIs, used to display Unicode text.

If you don't teach UTF-16 including these details, the programmers will feel lost when they meet with these complications.
Re: Unicode education in Schools
On Sat, 26 Aug 2017 21:20:45 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Sat, 26 Aug 2017 18:52:03 +0100
> > From: Richard Wordingham via Unicode

> We are miscommunicating. My point was that programming for MS-Windows
> needs a good understanding of what the UTF-16 surrogates are, and in
> what MS-Windows APIs/library functions they can and cannot be used.
> Without this understanding, one cannot figure out why the likes of
> iswspace and iswupper only support the BMP, and what APIs to use to
> lift this limitation. Likewise with display-related APIs, used to
> display Unicode text.

> If you don't teach UTF-16 including these details, the programmers
> will feel lost when they meet with these complications.

So what's new compared to UTF-8? The problem would be a misconception that MSVC's wchar_t supported Unicode - or has that been fixed recently? The neutral message is to avoid wchar_t where possible. C++11 and C11's char32_t ought to have fixed the problem.

Functions iswspace() and iswlower() are not stable; one really has to replace them with the project's UCD routines. For example, when the locale is a Unicode locale with the obvious wchar_t representations, the value of iswlower(0x13A0) recently changed from non-zero to zero, as U+13A0 changed from gc=Lo to gc=Lu. I don't think iswupper() is any stabler.

Richard.
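Both points in this exchange can be modelled in Python, with the caveat that this is not the MS-Windows or C library API, only an illustration using the interpreter's bundled Unicode Character Database. The sketch shows (a) that a classifier fed one 16-bit unit at a time only ever sees surrogates, never the supplementary letter, and (b) that the category of U+13A0 depends on which UCD version is in use. The helper `utf16_halves` and the choice of U+10400 as the sample letter are assumptions made for the example.

```python
import struct
import unicodedata


def utf16_halves(ch: str) -> tuple:
    """Split one supplementary character into its two UTF-16 code units."""
    return struct.unpack("<2H", ch.encode("utf-16-le"))


deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

# Classifying the whole code point finds an uppercase letter ...
whole_category = unicodedata.category(deseret)
# ... but each 16-bit half on its own is just a surrogate (category Cs).
half_categories = [unicodedata.category(chr(u)) for u in utf16_halves(deseret)]
print(whole_category, half_categories)

# The category of U+13A0 CHEROKEE LETTER A depends on the UCD version in use.
cherokee_a = "\u13a0"
print(unicodedata.unidata_version, unicodedata.category(cherokee_a))
```

The last line prints whatever the local interpreter's data says, which is the point: a result that differs between UCD versions is not something a project can treat as stable unless it pins its own data.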
Re: Unicode education in Schools
> Date: Sat, 26 Aug 2017 22:07:57 +0100
> From: Richard Wordingham via Unicode
>
> > We are miscommunicating. My point was that programming for MS-Windows
> > needs a good understanding of what the UTF-16 surrogates are, and in
> > what MS-Windows APIs/library functions they can and cannot be used.
> > Without this understanding, one cannot figure out why the likes of
> > iswspace and iswupper only support the BMP, and what APIs to use to
> > lift this limitation. Likewise with display-related APIs, used to
> > display Unicode text.
>
> > If you don't teach UTF-16 including these details, the programmers
> > will feel lost when they meet with these complications.
>
> So what's new compared to UTF-8?

Who said this is new? I said this needs to be _taught_, or else people will be ignorant about these subtleties.
Re: Unicode education in Schools
On Fri, 25 Aug 2017 09:36:44 -0400 John W Kennedy wrote:

> Just a reminder that in Apple’s Swift a “Character” is anything that
> looks like a character, including a letter with any theoretically
> unlimited stack of diacritics, a flag, or a skin-toned emoji, and all
> Swift functions working with characters, strings, and substrings
> count characters in this way. There is an underlying store that is,
> for historic reasons, UTF-16, and that can be accessed, but so can
> UTF-8 and UTF-32.

Can the individual Unicode characters be accessed one by one, e.g. for searching for vowels or other such 'diacritics'? Or would one only have access to the code units? Could one easily search for a subjoined consonant, e.g. COENG RO in Khmer, where the two constituent characters would be in adjacent extended grapheme clusters?

Richard.
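The kind of search being asked about can be shown at the code point level. The sketch below is Python, not Swift, so it says nothing about what Swift's API offers; it only illustrates that when a string is exposed as a sequence of code points, COENG followed by RO can be found regardless of where grapheme cluster boundaries fall. The helper `find_all` and the sample word are illustrative.

```python
import unicodedata

COENG_RO = "\u17d2\u179a"  # KHMER SIGN COENG followed by KHMER LETTER RO


def find_all(text: str, needle: str) -> list:
    """Return every code point index at which `needle` starts in `text`."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + 1)
    return positions


word = "\u1780\u17d2\u179a"  # KA + COENG + RO
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(find_all(word, COENG_RO))
```

The search reports index 1, the position of COENG, even though the two characters of the needle belong to different user-perceived characters under the clustering described in the question.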