On Fri, Aug 18, 2017 at 10:14 AM, John Nagle <na...@animats.com> wrote:
>     I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
> set best represents what's there.
>
>    Here's a hard case:
>
>  g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
>
>  g1.decode("utf8")
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21:
> invalid start byte
>
>   g1.decode("windows-1252")
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21:
> character maps to <undefined>
>
> 0x9d is unmapped in "windows-1252", according to
>
> https://en.wikipedia.org/wiki/Windows-1252
>
> So the Python codec isn't wrong here.
>
> Trying "latin-1"
>
>   g1.decode("latin-1")
>  '\\"Perfect Gift Idea\\"\x9d Each time'
>
> That just converts 0x9d in the input to 0x9d in Unicode.
> That's "Operating System Command" (the "Windows" key?)
> That's clearly wrong; some kind of quote was intended.
> Any ideas?
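
For the per-field check described above, a minimal sketch of a fallback decoder might look like this. The name decode_field and the candidate order are assumptions for illustration, not anything from the original post. It tries each codec strictly and, if none fits, returns a lossy UTF-8 decode with U+FFFD so the field can be flagged for manual review:

```python
def decode_field(raw, candidates=("utf-8", "windows-1252")):
    """Try each candidate codec strictly; return (text, codec_used).

    Hypothetical helper: the candidate list and its order are assumptions.
    codec_used is None when every candidate rejected the bytes.
    """
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep the field, but mark bad bytes as U+FFFD.
    return raw.decode("utf-8", errors="replace"), None


g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
text, codec = decode_field(g1)
print(codec, ascii(text))
```

Strict decoding is tried first because UTF-8 rejects most non-UTF-8 input, whereas latin-1 accepts every byte and so can never signal a mismatch.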

Another possibility is that it's some kind of dash or ellipsis or
something, but I can't find a codec that maps 0x9d to one. (You already
have quote characters in there.) The nearest I can actually find is:

>>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
'\\"Perfect Gift Idea\\"\u200c Each time'
>>> import unicodedata
>>> unicodedata.name("\u200c")
'ZERO WIDTH NON-JOINER'

which, honestly, doesn't make a lot of sense either. :(
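
A rough way to repeat that search is to ask every codec Python knows about what it makes of the byte. This is only a sketch: survey_byte is a made-up name, and it only covers codecs listed in encodings.aliases, decoding the byte on its own with no context.

```python
import encodings.aliases
import unicodedata


def survey_byte(byte_value):
    """Return {codec_name: decoded_text} for every codec that accepts the byte."""
    raw = bytes([byte_value])
    found = {}
    for codec in sorted(set(encodings.aliases.aliases.values())):
        try:
            found[codec] = raw.decode(codec)
        except (UnicodeError, LookupError):
            # UnicodeError: the byte is unmapped or incomplete in this codec.
            # LookupError: the codec is unavailable here or is not a text encoding.
            continue
    return found


survey = survey_byte(0x9D)
for codec, text in survey.items():
    names = ", ".join(unicodedata.name(ch, "<unnamed>") for ch in text)
    print(f"{codec:15} {text!a:12} {names}")
```

Codecs that reject the byte (utf_8 and cp1252 among them) simply do not appear in the output, so whatever is listed is the full set of candidates to eyeball.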

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
