On Sun, 26 Oct 2003, Marco Baroni wrote:

> So, I would like to ask you or anybody else: is there some kind of tool
> (e.g., a text editor) that I could use to discover which encoding is
> being used? (I tried with emacs but failed).
The only way I have successfully coped with massive amounts of data in unknown encodings and/or languages is n-gram analysis. I used it a few years ago to determine the encoding and language of Vietnamese/English web pages. To do it, though, you first need good data sets in known encodings and known languages. You then do a statistical 'closest match' of an unknown text against those known profiles to determine its encoding and language.

Here are some URLs that you might find useful:

http://lists.w3.org/Archives/Public/www-international/2001JulSep/0188.html
http://www.basistech.com/products/rli.html
http://odur.let.rug.nl/~vannoord/TextCat/
http://www.dougb.com/ident.html

-- 
Benjamin Franz

Gauss's law is always true, but it is not always useful.
  -- David J. Griffiths, "Introduction to Electrodynamics"
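P.S. In case it helps make the 'closest match' idea concrete, here is a rough sketch in Python. It builds character n-gram frequency profiles from known samples and picks the profile most similar (by cosine similarity) to the unknown text. The two toy training strings and their labels are placeholders of my own invention; a real detector would train on large corpora per encoding/language, along the lines of TextCat.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Build a relative-frequency profile of character n-grams for a text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two n-gram profiles."""
    dot = sum(freq * q.get(gram, 0.0) for gram, freq in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q)

def closest_match(unknown, profiles):
    """Return the label of the known profile most similar to the unknown text."""
    target = ngram_profile(unknown)
    return max(profiles, key=lambda label: similarity(target, profiles[label]))

# Toy stand-ins for real training corpora in known languages/encodings.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "german": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 20),
}

print(closest_match("the dog jumps over the fox", profiles))
```

The same machinery works for encoding detection if you build the profiles over raw bytes instead of characters, since each encoding leaves its own characteristic byte-sequence signature.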
