On 08/17/2017 10:12 PM, Ian Kelly wrote:

Here's some more 0x9d usage, each from a different data item:


Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"

This one seems like a good hint since \x99 here looks like it should
be an apostrophe. But what character set has an apostrophe there? The
best I can come up with is that 0xE2 0x80 0x99 is "right single
quotation mark" in UTF-8. Also known as the "smart apostrophe", so it
could have been entered by a word processor.

The problem is that if that's what it is, then two out of the three
bytes are outright missing. If the same thing happened to \x9d then
who knows what's missing from it?

One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
appearing after ending double quotes that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
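
For anyone who wants to check the byte sequences mentioned above, here is a small sketch using only the standard library. The variable names are illustrative. It shows what the full three-byte sequences decode to, and that a lone 0x9d is not valid UTF-8 by itself. It does not prove that this is how the data was damaged.

```python
import unicodedata

# The two full UTF-8 sequences discussed above.
candidates = (b"\xe2\x80\x99", b"\xe2\x80\x9d")

decoded = {}
for raw in candidates:
    ch = raw.decode("utf-8")
    decoded[raw] = ch
    print(raw, "U+%04X" % ord(ch), unicodedata.name(ch))

# With the first two bytes missing, the leftover byte cannot be
# decoded as UTF-8 on its own.
try:
    b"\x9d".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print("lone 0x9d valid UTF-8:", lone_byte_is_valid)
```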

    I was wondering if it was a signal to some word processor to
apply smart quote handling.

This has me puzzled.  It's often, but not always, after a close quote.
"TM" or "(R)" might make sense, but what non-Unicode character set
has those?  And "green"(tm) makes no sense.

CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
both have ® at \xae.

   That's helpful.  All those text snippets failed Windows-1252
decoding, though, because 0x9d isn't in Windows-1252.
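
The code page facts above can be checked with Python's built-in codecs. This is just a quick sketch with made-up variable names. "cp1252" is Python's codec name for Windows-1252.

```python
# 0x99 and 0xae as single bytes in the two code pages mentioned.
tm = b"\x99".decode("cp1252")
reg_cp1252 = b"\xae".decode("cp1252")
reg_latin1 = b"\xae".decode("latin-1")
print(tm, reg_cp1252, reg_latin1)

# 0x9d has no assignment in Python's cp1252 codec, so strict
# decoding raises instead of returning a character.
try:
    b"\x9d".decode("cp1252")
    cp1252_accepts_9d = True
except UnicodeDecodeError:
    cp1252_accepts_9d = False
print("cp1252 accepts 0x9d:", cp1252_accepts_9d)
```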

   I'm coming around to the idea that some of these snippets
have been previously mis-converted, which is why they make no sense.
Since, as someone pointed out, there was UTF-8 which had been
run through an ASCII-type lower casing algorithm, that's a reasonable
assumption.  Thanks for looking at this, everyone.  If a string won't
parse as either UTF-8 or Windows-1252, I'm just going to convert the
bogus stuff to the Unicode replacement character. I might remove
0x9d chars, since that never seems to affect readability.
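
A rough sketch of that fallback, in case it is useful to anyone. The function name and the drop_9d flag are made up for illustration. Two details are assumptions rather than anything settled in this thread: the last-resort step decodes as Windows-1252 with errors="replace", and 0x9d bytes are only dropped after UTF-8 has already failed, so that a valid 0xE2 0x80 0x9D sequence is left alone.

```python
def decode_snippet(raw: bytes, drop_9d: bool = False) -> str:
    """Decode raw bytes as UTF-8, then Windows-1252, then give up politely."""
    # First choice: strict UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Optionally discard stray 0x9d bytes (hypothetical option).
    if drop_9d:
        raw = raw.replace(b"\x9d", b"")

    # Second choice: strict Windows-1252.
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        # Last resort: undefined bytes become U+FFFD.
        return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_snippet(b"Doppleganger\x99s The Lounge\x9d"))
    print(decode_snippet(b"Doppleganger\x99s The Lounge\x9d", drop_9d=True))
```

Note that with Windows-1252 as the second choice, \x99 comes out as a trademark sign rather than an apostrophe, so this makes the text decodable but does not repair it.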

                                John Nagle

--
https://mail.python.org/mailman/listinfo/python-list
