On Sun, 26 Oct 2003, Marco Baroni wrote:

> So, I would like to ask you or anybody else: is there some kind of tool
> (e.g., a text editor) that I could use to discover which encoding is
> being used? (I tried with emacs but failed).
The only way I have successfully coped with massive amounts of data in unknown encodings and/or languages is n-gram analysis. I used it a few years ago to determine the encoding and language of Vietnamese/English web pages. To do it, though, you first need good data sets in known encodings and known languages. You then do a statistical 'closest match' of an unknown text against those known profiles to determine its encoding and language.

Here are some URLs that you might find useful:

http://lists.w3.org/Archives/Public/www-international/2001JulSep/0188.html
http://www.basistech.com/products/rli.html
http://odur.let.rug.nl/~vannoord/TextCat/
http://www.dougb.com/ident.html

-- 
Benjamin Franz

Gauss's law is always true, but it is not always useful.
  -- David J. Griffiths, "Introduction to Electrodynamics"
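P.S. In case it helps make the 'closest match' idea concrete, here is a rough sketch in Python. It builds character n-gram frequency profiles from known samples and picks the profile most similar (by cosine similarity) to the unknown text. The two toy training strings and their labels are placeholders of my own invention; a real detector would train on large corpora per encoding/language, along the lines of TextCat.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Build a relative-frequency profile of character n-grams for a text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two n-gram profiles."""
    dot = sum(freq * q.get(gram, 0.0) for gram, freq in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q)

def closest_match(unknown, profiles):
    """Return the label of the known profile most similar to the unknown text."""
    target = ngram_profile(unknown)
    return max(profiles, key=lambda label: similarity(target, profiles[label]))

# Toy stand-ins for real training corpora in known languages/encodings.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "german": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 20),
}

print(closest_match("the dog jumps over the fox", profiles))
```

The same machinery works for encoding detection if you build the profiles over raw bytes instead of characters, since each encoding leaves its own characteristic byte-sequence signature.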
