On 01/04/2013 18:07, Steven D'Aprano wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>
>> In article <515941d8$0$29967$c3e8da3$54964...@news.astraweb.com>,
>>  Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>>
>> [...]
>>>> OK, that leads to the next question.  Is there any way I can (in
>>>> Python 2.7) detect when a string is not entirely in the BMP?  If I
>>>> could find all the non-BMP characters, I could replace them with
>>>> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).

>>> Of course you can do this, but you should not. If your input data
>>> includes character C, you should deal with character C and not just
>>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>>> should be unnecessary.
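
FWIW, scrubbing non-BMP characters in 2.7 only takes a few lines, though
you have to allow for narrow (UTF-16) versus wide (UCS-4) builds. A rough
sketch, not tested against Roy's data ("scrub_non_bmp" is just an
illustrative name):

import re

if len(u'\U00010000') == 2:
    # Narrow (UTF-16) build: a non-BMP character is stored as a surrogate pair.
    _non_bmp = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')
else:
    # Wide (UCS-4) build: a non-BMP character is a single code point.
    _non_bmp = re.compile(u'[\U00010000-\U0010ffff]')

def scrub_non_bmp(text, replacement=u'\ufffd'):
    # Replace anything outside the BMP with U+FFFD REPLACEMENT CHARACTER.
    return _non_bmp.sub(replacement, text)

print repr(scrub_non_bmp(u'BMP \u2603 stays, \U0001f600 goes'))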

>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them.  I can live with that.
>> Sometimes practicality trumps correctness.
>
> Well, true. It has to be said that few programming languages (and
> databases) make it easy to do the right thing. On the other hand, you're
> a programmer. Your job is to write correct code, not easy code.


>> It turns out, the problem is that the version of MySQL we're using
>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/static/multibyte.html
>
> :-)
>
> PostgreSQL has supported non-broken UTF-8 since at least version 8.1.
>
>
>> doesn't support non-BMP characters.  Newer versions do (but you have to
>> declare the column to use the utf8mb4 character set).  I could upgrade
>> to a newer MySQL version, but it's just not worth it.
>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers thinking,
> arbitrarily cutting out support for 50% of UTF-8?

[snip]
50%? The BMP is one of 17 planes, so wouldn't that be 94%?
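
A quick sanity check of both figures, using nothing but the interpreter
(a throwaway sketch; the byte counts are exactly what MySQL's three-byte
"utf8" chokes on and utf8mb4 fixes):

# 16 of the 17 planes lie outside the BMP.
print 16.0 / 17 * 100                       # ~94.1

# BMP vs non-BMP characters under Python's own UTF-8 encoder:
print len(u'\u2603'.encode('utf-8'))        # snowman: 3 bytes
print len(u'\U0001f600'.encode('utf-8'))    # emoji: 4 bytes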

--
http://mail.python.org/mailman/listinfo/python-list
