On Thu, Mar 10, 2016 at 2:33 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Thu, 10 Mar 2016 01:54 am, Chris Angelico wrote:
>
>> I have a source of occasional text files that basically just dumps
>> stuff on me without any metadata, and I have to figure out (a) what
>> the encoding is, and (b) what language the text is in.
>
> https://pypi.python.org/pypi/chardet
>
>> then I have two levels of heuristics to try to guess a
>> most-likely encoding
>
> I'm curious, what do you do?
Collect subtitle files from random internet contributors and determine
whether they add to the existing corpus of material. The first heuristic
level is chardet, as mentioned; but with the specific files that I'm
processing, it makes some semi-consistent errors, so I scripted around
that - e.g. "if chardet says ISO-8859-2, and these byte patterns exist,
it's probably actually codepage 1250". IIRC the second level is entirely
translating from an ISO-8859 encoding to the nearest-equivalent Windows
codepage.

> (I stress that trying to guess the character set or encoding from the
> text itself is a second-last ditch tactic, for when you really don't
> know and can't find out what the encoding is. The final, last-ditch
> tactic is to just say "bugger it, I'll pretend it's Latin-1" and get a
> mess of mojibake, but at least the ASCII characters will decode
> alright, and as an English speaker, that's all that's important to me
> :-)

What I do is attempt to guess, *and then hand it to the user*. I have a
little "cdless" script that does a chardet on a file, decodes
accordingly, and pipes the result into 'less' [1]. The most powerful
character encoding detection tool in my arsenal is 'less'.

Pretending that text is Latin-1 is actually a pretty good start. If I
didn't have chardet, I'd be mainly using this:

https://github.com/Rosuav/shed/blob/master/charconv.py

With no args, this will take the beginning of the file (it tries to get
one paragraph of up to 1KB) and decode it using all the ISO-8859-*
encodings, displaying the results for human analysis. That's
surprisingly effective for a manual job.
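For the curious, here's a rough sketch of those two ideas - NOT Chris's
actual scripts, just stdlib Python illustrating them. The
NEAREST_WINDOWS table is my own guess at plausible ISO-to-Windows
correspondences (same script, not byte-for-byte supersets), and
candidates() mimics what charconv.py is described as doing: decode a
sample under every ISO-8859-* codec and let a human eyeball the results.

```python
import codecs

# Hypothetical "nearest-equivalent Windows codepage" table: each Windows
# page covers the same script as the ISO set, though the byte layouts
# are not always identical (the Cyrillic pair differs entirely).
NEAREST_WINDOWS = {
    "iso-8859-1": "windows-1252",  # Western European
    "iso-8859-2": "windows-1250",  # Central European
    "iso-8859-5": "windows-1251",  # Cyrillic
    "iso-8859-7": "windows-1253",  # Greek
    "iso-8859-9": "windows-1254",  # Turkish
}

def candidates(data, limit=1024):
    """Decode the first `limit` bytes under every ISO-8859-* codec
    Python knows, returning {encoding: decoded_text} for human review."""
    sample = data[:limit]
    out = {}
    for n in range(1, 17):
        enc = "iso-8859-%d" % n
        try:
            codecs.lookup(enc)
        except LookupError:
            continue  # iso-8859-12 was never assigned
        # Latin-1 can never fail; other parts have unmapped bytes,
        # so replace rather than raise.
        out[enc] = sample.decode(enc, errors="replace")
    return out

if __name__ == "__main__":
    for enc, text in sorted(candidates("naïve café".encode("latin-1")).items()):
        print("%-12s %r" % (enc, text))
```

Run against real subtitle data, usually only one of the fifteen rows
reads as an actual language - which is the whole trick.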
A large number of European languages use a lot of ASCII letters and then
each have their own distinct non-ASCII characters in between; the only
truly confusable encodings are the ones that are entirely non-ASCII
(Cyrillic, Arabic, Greek, Hebrew - ISO-8859-5 through 8), and
mis-decoding one as another usually results in complete nonsense (words
with impossible vowel/consonant combinations, for instance). It does
take *linguistic* analysis (as opposed to purely mathematical/charcode
analysis), but it isn't too hard.

ChrisA

[1] ... and since Unix pipes carry bytes, not text, this involves
encoding it as UTF-8. But that's an implementation detail between cdless
and less.
-- 
https://mail.python.org/mailman/listinfo/python-list