On 08/17/2017 05:53 PM, Chris Angelico wrote:
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
      I'm cleaning up some data which has text description fields from
multiple sources.
A few more cases:

bytearray(b'\xe5\x81ukasz zmywaczyk')

This one has to be Polish, and the first character should be the
letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
very similar to the E5 81 that you have.

So here's an insane theory: something attempted to lower-case the byte
stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
like 0x45 or "E", which lower-cases by having 32 added to it, yielding
0xE5. Reversing this transformation yields sane data for several of
your strings - they then decode as UTF-8:

miguel Ángel santos
lidija kmetič
Łukasz zmywaczyk
jiří urbančík
Ľubomír mičko
petr urbančík
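
The reversal described above can be sketched in a few lines of Python. This is only a heuristic under the stated theory: the function name `undo_ascii_lowercase` is made up for illustration, and the byte range 0xE1..0xFA is simply 0xC1..0xDA with 32 added. A byte in that range may also be a legitimate UTF-8 lead byte, so the sketch only applies the repair when the data does not already decode cleanly.

```python
def undo_ascii_lowercase(raw: bytes) -> str:
    """Try to recover UTF-8 text whose lead bytes were 'lower-cased' as ASCII.

    Hypothetical helper: assumes the damage was exactly "add 0x20 to any
    byte whose low 7 bits look like an upper-case ASCII letter".
    """
    try:
        # Data that is already valid UTF-8 is left alone.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Bytes 0xC1..0xDA would have been turned into 0xE1..0xFA; undo that.
    repaired = bytes(b - 0x20 if 0xE1 <= b <= 0xFA else b for b in raw)
    # Still raises UnicodeDecodeError if the theory does not fit this string.
    return repaired.decode("utf-8")


print(undo_ascii_lowercase(bytearray(b'\xe5\x81ukasz zmywaczyk')))
```

ASCII letters that were lower-cased correctly cannot be restored this way, which is why only the non-ASCII letter comes back in upper case.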

   You're exactly right.  The database has columns "name" and
"normalized name".  Normalizing the name was done by forcing it
to lower case as if it were ASCII, even when the text was UTF-8.
That resulted in errors like

KACMAZLAR MEKANİK  -> kacmazlar mekanä°k

Anita Calçados -> anita calã§ados

Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta
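
These three errors can be reproduced with a small Python sketch. It makes two assumptions that are not confirmed in the thread: that the faulty normalizer ignored the high bit when deciding whether a byte was an upper-case letter, and that the damaged bytes were then displayed as Latin-1. Both function names are invented for illustration.

```python
def ascii_lower_ignoring_high_bit(data: bytes) -> bytes:
    # Add 0x20 to every byte whose low 7 bits fall in 'A'..'Z' (assumed bug).
    return bytes(b + 0x20 if 0x41 <= (b & 0x7F) <= 0x5A else b for b in data)


def broken_normalize(name: str) -> str:
    # Lower-case the UTF-8 bytes the wrong way, then view them as Latin-1,
    # which is an assumption about how the garbled values were displayed.
    damaged = ascii_lower_ignoring_high_bit(name.encode("utf-8"))
    return damaged.decode("latin-1")


for name in ("KACMAZLAR MEKANİK",
             "Anita Calçados",
             "Felfria Resor för att Koh Lanta"):
    print(name, "->", broken_normalize(name))
```

Only the UTF-8 lead bytes are hit (for example C3 becomes E3); continuation bytes have low 7 bits in the range 0x00..0x3F, so they never look like upper-case letters.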

   The "name" field is OK; it's just the "normalized name" field
that is sometimes garbled. Now that I know this, and have properly
captured the "name" field in UTF-8 where appropriate, I can
regenerate the "normalized name" field.  MySQL/MariaDB know how
to lower-case UTF-8 properly.

   Clean data at last.  Thanks.

   The database, by the way, is a historical snapshot of startup
funding, from Crunchbase.

                                John Nagle
--
https://mail.python.org/mailman/listinfo/python-list
