On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:

> In article <515941d8$0$29967$c3e8da3$54964...@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>> [...]
>> >> OK, that leads to the next question.  Is there any way I can (in
>> >> Python 2.7) detect when a string is not entirely in the BMP?  If I
>> >> could find all the non-BMP characters, I could replace them with
>> >> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).
>> Of course you can do this, but you should not. If your input data
>> includes character C, you should deal with character C and not just
>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>> should be unnecessary.
> The import job isn't done yet, but so far we've processed 116 million
> records and had to clean up four of them.  I can live with that.
> Sometimes practicality trumps correctness.

Well, true. It has to be said that few programming languages (and 
databases) make it easy to do the right thing. On the other hand, you're 
a programmer. Your job is to write correct code, not easy code.

> It turns out, the problem is that the version of MySQL we're using

Well there you go. Why don't you use a real database?

PostgreSQL has supported non-broken UTF-8 since at least version 8.1.

> doesn't support non-BMP characters.  Newer versions do (but you have to
> declare the column to use the utf8mb4 character set).  I could upgrade
> to a newer MySQL version, but it's just not worth it.

My brain just broke. So-called "UTF-8" in MySQL only includes up to a 
maximum of three-byte characters. There has *never* been a time where 
UTF-8 excluded four-byte characters. What were the developers thinking, 
arbitrarily cutting out support for 50% of UTF-8?
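
To make the byte counts concrete, here is a small sketch, assuming Python 3.3 or later. The helper name utf8_length and the sample characters are mine, chosen only for illustration. Anything in the BMP encodes to at most three bytes; anything beyond it needs four, which is exactly what a three-byte limit cannot store.

```python
# Illustrative only: how many bytes UTF-8 needs for a single code point.
# utf8_length is a hypothetical helper, not part of any library.
def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))

samples = [
    ("U+0041 LATIN CAPITAL LETTER A", "\u0041"),           # ASCII
    ("U+00E9 LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),  # BMP, two bytes
    ("U+20AC EURO SIGN", "\u20ac"),                        # BMP, three bytes
    ("U+1F600 GRINNING FACE", "\U0001F600"),               # outside the BMP
]

for name, ch in samples:
    print(name, utf8_length(ch))
```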

> Actually, I did try spinning up a 5.5 instance (one of the nice things
> of being in the cloud) and experimented with that, but couldn't get it
> to work there either.  I'll admit that I didn't invest a huge amount of
> effort to make that work before just writing this:
>     def bmp_filter(self, s):
>         """Filter a unicode string to remove all non-BMP (basic
>          multilingual plane) characters.  All such characters are
>          replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).
>          """
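
The body of the quoted method is not shown above. For illustration only, here is one way such a filter might look, assuming Python 3.3 or later, where every string is a sequence of full code points. The names _NON_BMP and has_non_bmp are made up for this sketch, and bmp_filter is written as a plain function rather than a method. On a narrow Python 2.7 build non-BMP characters show up as surrogate pairs, so this pattern would not work there as written.

```python
import re

# Assumption: Python 3.3+, so one non-BMP character is one code point.
# Matches any single code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def has_non_bmp(s):
    """Return True if s contains at least one non-BMP character."""
    return _NON_BMP.search(s) is not None

def bmp_filter(s):
    """Replace every non-BMP character in s with U+FFFD."""
    return _NON_BMP.sub("\ufffd", s)
```

With this, bmp_filter("a\U0001F600b") gives "a\ufffdb", and has_non_bmp lets you count affected records before deciding whether to mangle them at all.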

I expect that in 5-10 years, applications that remove or mangle non-BMP 
characters will be considered as unacceptable as applications that mangle 
BMP characters. Or for that matter, applications that cannot handle names 
with apostrophes.

Hell, if your customer base is in Asia, chances are that mangling non-BMP 
characters is *already* considered unacceptable.

