On Fri, 18 Aug 2017 10:14 am, John Nagle wrote:

> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
> set best represents what's there.
>
> Here's a hard case:
>
> g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')
py> unicodedata.name('\x9d'.decode('macroman'))
'LATIN SMALL LETTER U WITH GRAVE'

Doesn't seem too likely.

This may help:

http://i18nqa.com/debug/bug-double-conversion.html

There's always the possibility that it's just junk, or mojibake from some
other source, so it might not be anything sensible in any extended ASCII
character set.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
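The session above is Python 2 (where `str.decode` exists); under Python 3 you have to decode a bytes object instead. A small sketch of the same probing, trying the stray 0x9D byte against a few candidate 8-bit encodings, plus the double-conversion clue the linked page describes (the encoding names are standard Python codec names; nothing here is specific to the poster's data beyond the 0x9D byte):

```python
import unicodedata

sample = b'\x9d'   # the suspicious byte from the original post

# Probe the byte against a few candidate single-byte encodings.
for enc in ('cp1252', 'mac_roman', 'latin-1'):
    try:
        ch = sample.decode(enc)
        # Control characters have no name; fall back to a placeholder.
        print(enc, repr(ch), unicodedata.name(ch, '<unnamed>'))
    except UnicodeDecodeError:
        # 0x9D is one of the five unassigned positions in Windows-1252,
        # so a strict cp1252 decode refuses it.
        print(enc, 'cannot decode 0x9D (unassigned)')

# A hint that double conversion is involved: U+201D RIGHT DOUBLE
# QUOTATION MARK ends in byte 0x9D when encoded as UTF-8, which would
# explain a lone \x9d appearing right after a closing quote.
print('\u201d'.encode('utf-8'))   # b'\xe2\x80\x9d'
```

So mac_roman happily yields 'ù' (hence the unlikely LATIN SMALL LETTER U WITH GRAVE above), while strict cp1252 rejects the byte outright, and the UTF-8 encoding of the curly closing quote is one plausible origin for it.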