On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
>
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
>
> miguel Ángel santos
> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík
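
The reversal described in the quoted theory can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the thread: the function name `undo_highbit_lowercase` is made up here, and the byte range 0xE1-0xFA is an assumption that follows from the theory (it is what 0xC1-0xDA become when 0x20 is added).

```python
def undo_highbit_lowercase(raw: bytes) -> str:
    """Decode raw as UTF-8, retrying with the suspected lower-casing undone.

    Hypothetical helper: assumes the damage was an ASCII-style lower-casing
    that ignored the high bit, turning lead bytes 0xC1-0xDA into 0xE1-0xFA.
    """
    try:
        # Data that is already valid UTF-8 is returned unchanged.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Subtract 0x20 from every byte in the suspect range, then try again.
    repaired = bytes(b - 0x20 if 0xE1 <= b <= 0xFA else b for b in raw)
    # Still raises UnicodeDecodeError if the guess does not fit the data.
    return repaired.decode("utf-8")


name = undo_highbit_lowercase(b"\xe5\x81ukasz zmywaczyk")
```

For the bytearray quoted above, 0xE5 0x81 becomes 0xC5 0x81, which decodes to U+0141. The repair is only attempted when a plain UTF-8 decode fails, because 0xE1-0xEF are also legitimate lead bytes of three-byte UTF-8 sequences and would be corrupted by a blanket subtraction.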
I think you're right for those. I'm working from a MySQL dump of
supposedly LATIN-1 data, but LATIN-1 will accept anything. I've
found UTF-8 and Windows-1252 in there. It's quite possible that someone
lower-cased UTF-8 stored in a LATIN-1 field. There are lots of
questions on the web which complain about getting a Python decode error
on 0x9d, and the usual answer is "Use Latin-1". But that doesn't really
decode properly; it just doesn't generate an exception.
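
The point that Latin-1 "succeeds" without decoding anything meaningful is easy to check. The following is a minimal sketch; the helper name `try_decodings` and the choice of codecs to compare are assumptions made for illustration, and the sample bytes are taken from one of the 0x9d fragments in this message.

```python
import unicodedata


def try_decodings(raw: bytes, codecs=("utf-8", "cp1252", "latin-1")) -> dict:
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


sample = b"the bull run in Pamplona, Spain\x9d."
results = try_decodings(sample)

# Latin-1 maps every byte to the code point of the same value, so byte
# offsets and string offsets line up and 0x9d simply becomes U+009D.
suspect = results["latin-1"][sample.index(b"\x9d")]
category = unicodedata.category(suspect)  # "Cc": a C1 control character
```

Here UTF-8 and cp1252 both reject the sample (0x9d is a stray continuation byte in UTF-8 and an unassigned position in Python's cp1252 codec), while Latin-1 quietly yields an invisible control character.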
> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.
The 0x9d thing seems unrelated to the Polish names thing. 0x9d
shows up in the middle of English text that's otherwise ASCII.
Is this something that can appear as a result of cutting and
pasting from Microsoft Word?
I'd like to get 0x9d right, because it comes up a lot. The
Polish name thing is rare. There are only about a dozen of those
in 400MB of database dump. There are hundreds of 0x9d hits.
Here's some more 0x9d usage, each from a different data item:
Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\"
Everything
Netwise Depot is a \\"One Stop Web Shop\\"\x9d that provides
sustainable \\"green\\"\x9d living
are looking for a \\"Do It for Me\\"\x9d solution
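
Snippets like the ones above can be pulled out of a dump mechanically. This is a rough sketch rather than the method actually used for this message; the function names `contexts` and `preceding_bytes` and the default window width are invented for the example.

```python
import re
from collections import Counter


def contexts(data: bytes, needle: bytes = b"\x9d", width: int = 20):
    """Yield (offset, snippet) for each occurrence of needle in data."""
    for m in re.finditer(re.escape(needle), data):
        start = max(m.start() - width, 0)
        yield m.start(), data[start:m.end() + width]


def preceding_bytes(data: bytes, needle: bytes = b"\x9d") -> Counter:
    """Count which single byte comes immediately before each needle."""
    counts = Counter()
    for m in re.finditer(re.escape(needle), data):
        counts[data[max(m.start() - 1, 0):m.start()]] += 1
    return counts


hits = list(contexts(b"Spain\x9d. Everything", width=5))
```

Working on raw bytes avoids committing to any encoding before the data is understood, and the tally of preceding bytes gives a quick measure of how often the 0x9d follows a quote character.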
This has me puzzled. It's often, but not always, after a close quote.
"TM" or "(R)" might make sense, but what non-Unicode character set
has those? And "green"(tm) makes no sense.
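
Two things here can be checked directly in Python. The sketch below first looks at what Windows-1252 (cp1252 in Python) assigns to the relevant bytes, then tests one hypothesis that is not established anywhere in this thread: that \x9d and \x99 are the final bytes of UTF-8 encoded curly quotes whose leading bytes were lost somewhere. The helper name `utf8_tail` is made up, and the hypothesis should be treated as a guess to verify against the dump.

```python
import unicodedata


def utf8_tail(ch: str) -> int:
    """Return the last byte of the UTF-8 encoding of ch."""
    return ch.encode("utf-8")[-1]


# Windows-1252 does have trademark and registered signs.
tm_name = unicodedata.name(b"\x99".decode("cp1252"))   # TRADE MARK SIGN
reg_name = unicodedata.name(b"\xae".decode("cp1252"))  # REGISTERED SIGN

# Hypothesis (an assumption, not confirmed): the stray bytes are UTF-8 tails.
# U+201D encodes as E2 80 9D and U+2019 as E2 80 99.
tails = {
    hex(utf8_tail(ch)): unicodedata.name(ch)
    for ch in ("\u201d", "\u2019")
}
```

If the hypothesis holds, a \x9d after a close quote would be the remains of a right double quotation mark and the \x99 in "Doppleganger\x99s" the remains of a typographic apostrophe, which would fit the samples better than a trademark sign does.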
John Nagle
--
https://mail.python.org/mailman/listinfo/python-list