On Fri, Aug 18, 2017 at 4:24 PM, John Nagle <na...@animats.com> wrote:
> I'm coming around to the idea that some of these snippets
> have been previously mis-converted, which is why they make no sense.
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower-casing algorithm, that's a reasonable
> assumption. Thanks for looking at this, everyone. If a string won't
> parse as either UTF-8 or Windows-1252, I'm just going to convert the
> bogus stuff to the Unicode replacement character. I might remove
> 0x9d chars, since that never seems to affect readability.
That sounds like a good plan. Unless you can pin down a single coherent
encoding (even a broken one, like "UTF-8, then add 32 to everything
between 0xC1 and 0xDA"), all you can do is decode strings individually.
There just isn't enough context to do anything smarter than flipping
unparseable bytes to U+FFFD.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
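The fallback described upthread could be sketched roughly like this. `decode_lenient` is a hypothetical helper name, not anything from the original thread; it strips 0x9D bytes, tries UTF-8, then Windows-1252, and finally replaces anything unparseable with U+FFFD, matching the plan John describes:

```python
def decode_lenient(raw: bytes) -> str:
    """Hypothetical sketch of the decoding plan discussed above:
    try UTF-8, fall back to Windows-1252, and map anything that
    still won't parse to the Unicode replacement character."""
    # Drop 0x9D first; as noted upthread, removing it rarely
    # affects readability.
    raw = raw.replace(b"\x9d", b"")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        # cp1252 leaves a few code points undefined (0x81, 0x8D,
        # 0x8F, 0x90), so this can still fail.
        return raw.decode("windows-1252")
    except UnicodeDecodeError:
        pass
    # Neither encoding fits: flip undecodable bytes to U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

For example, `decode_lenient(b"caf\xe9")` falls through to the Windows-1252 branch and yields `"café"`, while a byte sequence valid in neither encoding comes back with U+FFFD in place of the bogus bytes.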