Software designed for UCS-2 only, without real UTF-16 support, is still in use today.
For example, MySQL with its broken "UTF-8" encoding, which in fact encodes supplementary characters as two separate 16-bit code units for surrogates, each one blindly encoded as a 3-byte sequence that would be ill-formed in standard UTF-8; it also does not reject invalid pairs of surrogates, and offers no collation support for supplementary characters.

In this case, other software will break silently on these sequences. For example, MediaWiki, when installed with a MySQL backend server whose datastore was created with this broken "UTF-8", will silently discard any text starting at the first supplementary character found in the wikitext. This is not a problem in MediaWiki itself: MediaWiki simply does NOT support a MySQL server installed with its "UTF-8" datastore; it only supports MySQL if the storage encoding declared for the database was "binary". (In that case there is no collation support in MySQL: texts are just arbitrary sequences of bytes, and internationalization is performed in the client software, here MediaWiki with its PHP, ICU, or Lua libraries, and other tools written in Perl and other languages.)

Note that this does not affect Wikimedia wikis, because they were initially installed correctly with the binary encoding in MySQL, and Wikimedia wikis now use another database engine with native UTF-8 support and full coverage of the UCS. Other wikis using MediaWiki will need to upgrade their MySQL version if they want to keep it for administrative reasons (rather than convert their datastore to the binary encoding again).

Software running with only UCS-2 is exposed to risks similar to the one seen in MediaWiki on incorrect MySQL installations, where any user may edit a page to insert any supplementary character (supplementary sinograms, emojis, Gothic letters, supplementary symbols...)
which will look correct when previewed and when parsed, will be accepted silently by MySQL, but will then be silently truncated because of the encoding error: when the data is reloaded from MySQL, part of it is unexpectedly discarded.

How should we react to these risks of data loss or truncation? Throwing an exception, or just returning an error, is in fact more dangerous than replacing the ill-formed sequences with one or more U+FFFD: that way we preserve as much as possible. In any case, software should be able to run some tests against its datastore to check that it handles the encoding correctly. This could be done when starting the software, emitting log messages when the backend does not support the encoding: all that is needed is to send a single supplementary character to the remote datastore, in a junk table or field, and then retrieve it immediately in another transaction to make sure it is preserved.

Similar tests can check whether the remote datastore also preserves the encoding form, or "normalizes" it, or alters it (such alteration could happen with a leading BOM, and other silent alterations could be made to NULs and trailing spaces if the datastore uses fixed-length rather than variable-length text fields). Similar tests could check the maximum accepted length: a VARCHAR(256) in a binary-encoded database will not always store 256 Unicode characters, whereas in a database encoded with non-broken UTF-8 it should store 256 code points independently of their values, even if their UTF-8 encoding takes up to 1024 bytes.

2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode <[email protected]>:

> On Mon, 15 May 2017 21:38:26 +0000
> David Starner via Unicode <[email protected]> wrote:
>
> > > and the fact is that handling surrogates (which is what proponents
> > > of UTF-8 or UCS-4 usually focus on) is no more complicated than
> > > handling combining characters, which you have to do anyway.
>
> > Not necessarily; you can legally process Unicode text without worrying
> > about combining characters, whereas you cannot process UTF-16 without
> > handling surrogates.
>
> The problem with surrogates is inadequate testing. They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered. It's not always obvious that code is designed for UCS-2
> rather than UTF-16.
>
> Richard.
>
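To make the broken "UTF-8" scheme described at the top concrete, here is a minimal sketch (in Python, with a hypothetical helper name `cesu8_encode`) of an encoder that does what the mail describes: split a supplementary character into a surrogate pair and blindly encode each 16-bit surrogate as a 3-byte sequence, which is ill-formed in standard UTF-8.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit (including surrogates) as if it were
    an independent code point -- the ill-formed scheme described above,
    not valid UTF-8. Hypothetical helper for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
        else:
            # Split into a surrogate pair, encode each half as 3 bytes.
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            for unit in (hi, lo):
                out += bytes([0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)])
    return bytes(out)

s = "\U0001F600"                   # an emoji, i.e. a supplementary character
print(s.encode("utf-8").hex())     # f09f9880     (4 bytes, well-formed)
print(cesu8_encode(s).hex())       # eda0bdedb880 (6 bytes, ill-formed)
```

A standard UTF-8 decoder must reject the 6-byte form, which is why software that forwards it blindly can lose data downstream.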

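The startup self-test suggested in the reply (write one supplementary character to a junk table, read it back in another transaction, compare) can be sketched as follows. This uses sqlite3 from the Python standard library as a stand-in backend; against a real MySQL server you would use its connector instead, and the table and function names here are made up for the example.

```python
import os
import sqlite3
import tempfile

PROBE = "\U00010348"   # GOTHIC LETTER HWAIR (U+10348), a supplementary character

def backend_preserves_supplementary(db_path: str) -> bool:
    """Round-trip one supplementary character through a junk table and
    report whether the backend returned it unchanged."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS _enc_probe (v TEXT)")
    con.execute("DELETE FROM _enc_probe")
    con.execute("INSERT INTO _enc_probe VALUES (?)", (PROBE,))
    con.commit()
    con.close()
    # A fresh connection simulates "retrieve it in another transaction".
    con = sqlite3.connect(db_path)
    (got,) = con.execute("SELECT v FROM _enc_probe").fetchone()
    con.close()
    # Any truncation, replacement, or re-encoding makes this comparison fail.
    return got == PROBE

path = os.path.join(tempfile.mkdtemp(), "probe.db")
print(backend_preserves_supplementary(path))   # True when preserved
```

A real deployment would run this at startup and log a warning instead of printing, as the reply suggests.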

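The length arithmetic at the end of the reply (256 code points versus up to 1024 bytes) can be checked directly; a small sketch, using an emoji as a representative supplementary character:

```python
# 256 supplementary code points occupy 4 bytes each in well-formed UTF-8,
# so a column whose length is counted in bytes needs 1024 bytes where a
# column counted in code points needs a declared length of only 256.
s = "\U0001F600" * 256             # 256 supplementary code points

print(len(s))                      # 256 code points
print(len(s.encode("utf-8")))      # 1024 bytes (4 bytes per code point)
print(len(s.encode("utf-16-le")))  # 1024 bytes too: 2 units of 2 bytes each
```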