On Tue, 23 Oct 2001, D. Dale Gulledge wrote: > Is there a reliable tool for determining the encoding of a file?
No, but here are a few ideas for whoever wants to make a good one (probably to be contributed to the GNU "file" utility). Depending on the amount of effort, you can distinguish different encodings quite well, as long as the text is long enough for the usual cryptanalytic techniques for breaking substitution ciphers to work (which usually means >500 characters):

- UTF-8 follows strict rules, and every other encoding (except for the UTF-8 subset ASCII, which usually does not need to be distinguished) will contain either malformed UTF-8 sequences (when it is an 8-bit encoding) or ISO 2022 sequences (when it is a CJK encoding), both of which make it pretty unlikely that a non-UTF-8 encoding is mistaken for UTF-8.

- EUC files similarly have characteristic byte sequences that are not allowed in these encodings, such as unpaired GR bytes.

- ISO 8859 files should be free of C1 codes and most C0 codes (except for the usual LF/TAB).

- Any file should be free of unused code positions.

- You can do a bit more with character and tuple frequency analysis. For various languages (English, German, French, C, Lisp) and their transliterations, you need a library of frequency tables for the various UCS characters/pairs; then you try all Something->UCS conversions until you find the best match between the resulting histogram and one in the library (read up on the "index of coincidence" [Friedman, ~1920] in introductory cryptanalysis textbooks such as Stinson).

- Add to that rules for which languages are likely to be encoded in which way.

- Add to that a library of clue patterns from standardized marker formats such as MIME headers, .htaccess files, Emacs headers, the locale, etc.

- Set up a rule-based resolution algorithm that merges the results of these tests based on rule priorities. For instance, the presence of malformed UTF-8 sequences should likely carry more weight than fragments of a MIME header that claim UTF-8.
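The first few structural tests above can be sketched in a few lines. This is my own minimal illustration (not code from the post): the UTF-8 check simply relies on the strictness of a standard decoder, and the ISO 8859 check rejects C1 codes and all C0 controls other than the usual LF/CR/TAB.

```python
def looks_like_utf8(data: bytes) -> bool:
    """UTF-8 follows strict rules, so a single malformed
    sequence is enough to rule the encoding out."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def plausible_iso_8859(data: bytes) -> bool:
    """ISO 8859 text should be free of C1 codes (0x80-0x9F)
    and most C0 controls, apart from the usual LF/CR/TAB."""
    allowed_c0 = {0x09, 0x0A, 0x0D}  # TAB, LF, CR
    for b in data:
        if b < 0x20 and b not in allowed_c0:
            return False
        if 0x80 <= b <= 0x9F:
            return False
    return True
```

For example, the Latin-1 bytes `b"caf\xe9"` fail the UTF-8 test (0xE9 starts a multi-byte sequence that never completes) but pass the ISO 8859 plausibility test, which is exactly the asymmetry the post describes.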
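The final merging step could be sketched as weighted voting (a deliberately simplified stand-in for the rule-based resolution the post proposes; the weights here are made up): hard structural evidence such as malformed UTF-8 sequences carries a large negative weight that overrides weak clues such as a MIME header fragment.

```python
def resolve(votes):
    """votes: iterable of (encoding, weight) pairs produced by the
    individual tests. Sum the weights per encoding and return the
    highest-scoring candidate, or None if there are no votes."""
    scores = {}
    for enc, weight in votes:
        scores[enc] = scores.get(enc, 0.0) + weight
    return max(scores, key=scores.get) if scores else None


# A MIME header fragment claims UTF-8 (weak clue, +1), but malformed
# UTF-8 sequences were found (hard evidence, -10), while the byte
# histogram fits ISO 8859-1 (+3):
resolve([("utf-8", 1), ("utf-8", -10), ("iso-8859-1", 3)])
# -> "iso-8859-1"
```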
Make all that configurable for the end users, as they are likely to have further a-priori knowledge of what encodings are to be expected. These beasts used to be called "expert systems" when I went to school and "A.I." was a research field, not a movie title ...

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/