On 2017-08-18 01:14, John Nagle wrote:
      I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding which character
set best represents what's there.
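
A minimal sketch of that per-field check, assuming the candidate list is just UTF-8 followed by Windows-1252. The helper name pick_encoding is hypothetical, not something from the original post:

```python
def pick_encoding(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes
    strictly, or (None, None) if every candidate rejects the bytes."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


# The same word encoded two ways lands on two different codecs.
utf8_field = "caf\u00e9".encode("utf-8")
cp1252_field = "caf\u00e9".encode("windows-1252")
print(pick_encoding(utf8_field))
print(pick_encoding(cp1252_field))
```

UTF-8 goes first because it is the stricter test: most Windows-1252 text with non-ASCII bytes is not valid UTF-8, while nearly any byte string decodes as Windows-1252.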

     Here's a hard case:

    g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')

    g1.decode("utf8")
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21: invalid start byte

    g1.decode("windows-1252")
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21: character maps to <undefined>

0x9d is unmapped in "windows-1252", according to

https://en.wikipedia.org/wiki/Windows-1252

So the Python codec isn't wrong here.
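
This can be checked from Python itself rather than from the Wikipedia table. A small sketch that probes every byte value against a codec; the helper name undefined_bytes is made up for illustration:

```python
def undefined_bytes(encoding):
    """List the single-byte values that the codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


unmapped = undefined_bytes("windows-1252")
print([hex(b) for b in unmapped])
# → ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(undefined_bytes("latin-1"))
# → []
```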

Trying "latin-1"

    g1.decode("latin-1")
    '\\"Perfect Gift Idea\\"\x9d Each time'

That just converts 0x9d in the input to U+009D in Unicode, since
latin-1 maps every byte to the code point with the same value.
That's "Operating System Command", a C1 control character (not the
"Windows" key). That's clearly wrong; some kind of quote was intended.
Any ideas?
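
Since latin-1 never fails, one way to catch this kind of field is to flag any C1 control characters left in the decoded text. A rough sketch, assuming that C1 controls (U+0080 to U+009F) never belong in a description field:

```python
import unicodedata

g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
text = g1.decode("latin-1")  # always succeeds, so it proves nothing by itself

# Collect (index, code point) for every C1 control character in the result.
suspects = [
    (i, hex(ord(ch)))
    for i, ch in enumerate(text)
    if unicodedata.category(ch) == "Cc" and ord(ch) >= 0x80
]
print(suspects)
# → [(21, '0x9d')]
```

A non-empty list means latin-1 was probably not the encoding the text was written in.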

It's preceded by something in quotes, so it might be ™ (trademark symbol, '\u2122') or something similar. No idea which encoding that would be, though.
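
One way to narrow the guess down is to encode a few candidate characters in a few encodings and see which byte sequences contain 0x9d. This is only a sketch: the candidate characters and the list of encodings below are my own picks, not anything established in this thread.

```python
TARGET = 0x9D

# Hypothetical shortlist: curly quotes plus the trademark sign.
candidates = {
    "\u201c": "LEFT DOUBLE QUOTATION MARK",
    "\u201d": "RIGHT DOUBLE QUOTATION MARK",
    "\u2018": "LEFT SINGLE QUOTATION MARK",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u2122": "TRADE MARK SIGN",
}
encodings = ("utf-8", "windows-1252", "mac_roman", "cp437")

hits = []
for char, label in candidates.items():
    for enc in encodings:
        try:
            encoded = char.encode(enc)
        except UnicodeEncodeError:
            continue  # this encoding has no byte for the character
        if TARGET in encoded:
            hits.append((label, enc, encoded.hex()))

print(hits)
# → [('RIGHT DOUBLE QUOTATION MARK', 'utf-8', 'e2809d')]
```

With this shortlist the only match is the UTF-8 form of a closing curly quote, whose last byte is 0x9d. That would fit a field where the leading 0xe2 0x80 bytes were lost or rewritten somewhere upstream, but the data shown here can't confirm it.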
--
https://mail.python.org/mailman/listinfo/python-list
