On Fri, Aug 18, 2017 at 10:14 AM, John Nagle <na...@animats.com> wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
> set best represents what's there.
>
> Here's a hard case:
>
> g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
>
> g1.decode("utf8")
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 21:
> invalid start byte
>
> g1.decode("windows-1252")
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 21:
> character maps to <undefined>
>
> 0x9d is unmapped in "windows-1252", according to
>
> https://en.wikipedia.org/wiki/Windows-1252
>
> So the Python codec isn't wrong here.
>
> Trying "latin-1"
>
> g1.decode("latin-1")
> '\\"Perfect Gift Idea\\"\x9d Each time'
>
> That just converts 0x9d in the input to 0x9d in Unicode.
> That's "Operating System Command" (the "Windows" key?)
> That's clearly wrong; some kind of quote was intended.
> Any ideas?

Another possibility is that it's some kind of dash or ellipsis or
something, but I can't find a codec that maps 0x9d to anything like
that. (You already have quote characters in there.) The nearest I can
actually find is:

>>> import unicodedata
>>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
'\\"Perfect Gift Idea\\"\u200c Each time'
>>> unicodedata.name("\u200c")
'ZERO WIDTH NON-JOINER'

which, honestly, doesn't make a lot of sense either. :(

ChrisA
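
A rough sketch of how that kind of search could be automated: decode the
single byte under a list of candidate codecs and report what each one
maps it to. The helper name probe_byte and the CANDIDATES list are
illustrative assumptions, not something from the original thread, and
the list is far from exhaustive.

```python
import unicodedata

# Candidate single-byte codecs to probe. This list is an assumption
# chosen for illustration; extend it with whatever sources are plausible.
CANDIDATES = [
    "cp1250", "cp1251", "cp1252", "cp1253", "cp1254",
    "cp1255", "cp1256", "cp1257", "cp1258",
    "latin-1", "mac_roman", "cp437", "cp850",
]


def probe_byte(byte_value, codecs=CANDIDATES):
    """Return {codec: (char, unicode_name)} for codecs that map the byte.

    Codecs in which the byte is undefined are left out of the result.
    """
    raw = bytes([byte_value])
    results = {}
    for codec in codecs:
        try:
            ch = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unmapped in this codec
        # Control characters have no Unicode name, so supply a default.
        results[codec] = (ch, unicodedata.name(ch, "<unnamed>"))
    return results


if __name__ == "__main__":
    for codec, (ch, name) in probe_byte(0x9D).items():
        print(f"{codec:10} U+{ord(ch):04X} {name}")
```

Codecs that do not define 0x9d (windows-1252 among them) simply drop out
of the output, so whatever remains is the short list worth inspecting by
eye.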