John Burger wrote on 1/14/2004, 7:16 AM:

 > Mark E. Shoulson wrote:
 >
 > > If it's a heuristic we're after, then why split hairs and try to make
 > > all the rules ourselves?  Get a big ol' mess of training data in as
 > > many languages as you can and hand it over to a class full of CS
 > > graduate students studying Machine Learning.
 >
 > Absolutely my reaction.  All of these suggested heuristics are great,
 > but would almost certainly simply fall out of a more rigorous approach
 > using a generative probabilistic model, or some other classification
 > technique.  Useful features would include n-graph frequencies, as Mark
 > suggests, as well as lots of other things.  For particular
 > applications, you could use a cache model, e.g., using statistics from
 > other documents from the same web site, or other messages from the same
 > email address, or even generalizing across country-of-origin.
 > Additionally, I'm pretty sure that you could get some mileage out of
 > unsupervised data, that is, all of the documents in the training set
 > needn't be labeled with language/encoding.  And one thing we have a lot
 > of on the web is unsupervised data.
 >
 > I would be extremely surprised if such an approach couldn't achieve 99%
 > accuracy - and I really do mean 99%, or better.
 >
 > By the way, I still don't quite understand what's special about Thai.
 > Could someone elaborate?
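
(To illustrate the kind of n-graph classifier John describes, here is a 
minimal naive-Bayes-style sketch in Python. The training pairs, the 
bigram choice, and the add-alpha smoothing are my assumptions for 
illustration, not anything from an actual detector.)

    import math
    from collections import Counter

    def ngraphs(text, n=2):
        """Yield overlapping character n-graphs from text."""
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

    def train(labeled_docs, n=2):
        """labeled_docs: iterable of (label, text) pairs.
        Returns per-label n-graph counts."""
        models = {}
        for label, text in labeled_docs:
            models.setdefault(label, Counter()).update(ngraphs(text, n))
        return models

    def classify(text, models, n=2, alpha=1.0):
        """Pick the label maximizing the sum of log P(ngraph | label),
        with add-alpha smoothing over each label's vocabulary."""
        best_label, best_score = None, float("-inf")
        for label, counts in models.items():
            total = sum(counts.values())
            vocab = len(counts) + 1
            score = 0.0
            for g in ngraphs(text, n):
                p = (counts[g] + alpha) / (total + alpha * vocab)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    models = train([("en", "the quick brown fox jumps over the lazy dog"),
                    ("fr", "le vif renard brun saute par-dessus le chien")])
    print(classify("the lazy dog sleeps", models))   # -> "en"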


For languages other than Thai, Chinese, and Japanese, you usually see 
spaces between words, so you should see a high count of SPACE in the 
document. In those languages, SPACE probably occupies 10%-15% of the 
code points (just a guess: if the average word length is 9 characters, 
you get 10% SPACE, and if the average is shorter, the percentage of 
SPACE increases). But Thai, Chinese, and Japanese do not put spaces 
between words, so the percentage of SPACE code points will be quite 
different. For Korean it is hard to say; it depends on whether the text 
uses IDEOGRAPHIC SPACE or the single-byte SPACE, and also on which 
normalization form it is in. The percentage of SPACE will differ because 
one normalization form counts a Korean character as one Unicode code 
point, while the decomposed form may count it as three.
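
(A rough sketch of that SPACE-frequency check in Python; the 8% cutoff 
and the set of space characters counted are invented for illustration, 
not tuned values. The last two lines show the Korean normalization 
point: one precomposed Hangul syllable becomes three jamo under NFD.)

    import unicodedata

    SPACES = {"\u0020", "\u00A0", "\u3000"}   # ASCII, no-break, ideographic

    def space_ratio(text):
        """Fraction of code points that are space characters."""
        return sum(1 for ch in text if ch in SPACES) / len(text) if text else 0.0

    def looks_space_delimited(text, threshold=0.08):
        """Space-delimited scripts run roughly 10%-15% SPACE;
        Thai/Chinese/Japanese text runs far lower."""
        return space_ratio(text) >= threshold

    syllable = "\uD55C"   # HAN, a precomposed Hangul syllable
    print(len(unicodedata.normalize("NFC", syllable)))   # 1 code point
    print(len(unicodedata.normalize("NFD", syllable)))   # 3 jamo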

Shanjian Lee and Kat Momoi implemented a charset detector based on my 
early work and direction. They summarized it in a paper presented on 
September 11, 2001; see 
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html for 
details. It talks about a different set of issues and problems.


 > - John Burger
 >    MITRE


