RE: Support for non-BMP characters

Marc Durdin Wed, 25 Apr 2012 02:53:28 -0700

Yes, but this means that regexes with SMP characters don’t work (e.g. [𝒜-𝒵]), 
character counts return code units rather than characters, etc.  So you have to 
reimplement string.length, string.charCodeAt, etc, if you don’t want to deal 
with surrogate pairs by hand everywhere (I reckon you’ve got better things to 
be spending your time on).
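
To give a rough idea of what that reimplementation looks like, here is a minimal sketch in TypeScript. The helper names codePointLength and codePointAtIndex are made up for illustration and are not part of any library; the sketch assumes well-formed UTF-16 and simply counts an unpaired surrogate as one unit.

```typescript
// Hypothetical helpers, for illustration only.

// Count code points rather than UTF-16 code units.
function codePointLength(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate is one code point.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate
      }
    }
    count++;
  }
  return count;
}

// Return the code point that starts at code-unit index i.
function codePointAtIndex(s: string, i: number): number {
  const hi = s.charCodeAt(i);
  if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
    const lo = s.charCodeAt(i + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
    }
  }
  return hi;
}

const scriptA = "\ud835\udc9c"; // U+1D49C MATHEMATICAL SCRIPT CAPITAL A
console.log(scriptA.length);                            // 2 (code units)
console.log(codePointLength(scriptA));                  // 1 (code point)
console.log(codePointAtIndex(scriptA, 0).toString(16)); // 1d49c
```

Note that the index passed to codePointAtIndex is still a code-unit index, so callers have to step by two over each surrogate pair.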


http://dheeb.files.wordpress.com/2011/07/gbu.pdf “Unicode Support Shootout - 
The Good, the Bad & the (mostly) Ugly”  by Tom Christiansen has a great summary 
of some of the issues with relying on JavaScript’s internal string manipulation 
(unfortunately can’t find a better working link at present – the official 
training.perl.com site seems to be down).  Actually, that presentation is a 
fantastic place to start for understanding many of the limitations of various 
programming languages’ support for Unicode – if you haven’t read it, I’d urge 
you to go read it now.

Marc

From: Szelp, A. Sz. [mailto:[email protected]]
Sent: Wednesday, 25 April 2012 7:28 PM
To: Marc Durdin
Cc: David Starner; Unicode Mailing List
Subject: Re: Support for non-BMP characters

Shouldn't it be technically possible to store Supplementary Plane characters in 
UTF-16 / UCS-2 as well? Isn't that what Surrogate Pairs are for?
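
For reference, the surrogate-pair arithmetic defined for UTF-16 can be sketched as follows. The function name toSurrogatePair is hypothetical and used only for illustration.

```typescript
// Hypothetical helper: split a supplementary-plane code point
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = codePoint - 0x10000;          // 20-bit value
  const highUnit = 0xd800 + (offset >> 10);    // top 10 bits
  const lowUnit = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [highUnit, lowUnit];
}

const pair = toSurrogatePair(0x1d49c); // U+1D49C
console.log(pair[0].toString(16), pair[1].toString(16)); // d835 dc9c
```

Software that treats each 16-bit unit as a complete character (UCS-2 behaviour) will see these as two separate values rather than one character.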

Sz
On Wed, Apr 25, 2012 at 11:09, Marc Durdin <[email protected]> wrote:
Probably the most egregious example I know of is JavaScript.  As far as I know, 
JavaScript still only groks UCS-2.  I'd love to be wrong.

Marc

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of David Starner
Sent: Wednesday, 25 April 2012 6:32 PM
To: Unicode Mailing List
Subject: Support for non-BMP characters

It's been ten years since the first non-BMP characters were encoded.
How are they working in your neck of the woods? There are a lot of places where 
they're working just fine, but I was looking at MySQL's support. It has had 
support for UCS-2 and UTF-8 limited to the BMP for a long time; now in MySQL 5.5 
there's utf16, utf32 and utf8mb4. (MySQL 5.1 and 5.5 are the current stable 
releases.) But there are enough warnings about incompatibilities with utf8mb4 to 
make me pause before switching my private database to it, and I think the net 
will see MySQL databases with utf8 instead of utf8mb4 as long as MySQL exists, 
unless they decide to push people over to it.
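
The practical difference is byte length: a BMP-limited UTF-8 column holds at most three bytes per character, while every non-BMP character needs four. A small sketch in TypeScript of that check follows; the function names utf8ByteLength and fitsInThreeByteUtf8 are invented for illustration and have nothing to do with MySQL's own API.

```typescript
// Hypothetical helpers, for illustration only.

// Number of bytes UTF-8 uses to encode one code point.
// (Surrogate code points are not rejected here, to keep the sketch short.)
function utf8ByteLength(codePoint: number): number {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point");
  }
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3; // rest of the BMP
  return 4;                          // supplementary planes
}

// True if the code point fits in a UTF-8 encoding capped at three bytes.
function fitsInThreeByteUtf8(codePoint: number): boolean {
  return utf8ByteLength(codePoint) <= 3;
}

console.log(fitsInThreeByteUtf8(0x20ac));  // true  (U+20AC EURO SIGN, BMP)
console.log(fitsInThreeByteUtf8(0x1d49c)); // false (U+1D49C, non-BMP)
```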

(Ada's an issue too, though not one most people will have to deal with. While 
Ada 2005 added a UTF-32 string type, it left the UCS-2 string type as is. 
Again, I suspect a lot of nominally Unicode Ada programs are going to be BMP-only. 
Of course, UTF-8 as an ASCII superset is used, stuffed into strings labeled 
Latin-1; it's technically not conformant with the Ada standard but it works so 
long as you don't need much string processing.)

In any case, is the use of non-BMP characters still problematic in your corner 
of the computing world or is everything looking fine from where you are?

--
Kie ekzistas vivo, ekzistas espero.
