Re: Detecting encoding in Plain text

Peter Kirk Wed, 14 Jan 2004 11:38:50 -0800

On 14/01/2004 09:25, Mark Davis wrote:

I'm not sure which "one suggested heuristic method" you are referring to, ...

Basically, the one that relies on UTF-16 text being likely to contain many zero bytes in either the odd or the even positions.
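
For illustration only, the zero-byte heuristic could be sketched roughly as follows in Python. The function name, the 0.3 ratio and the 2:1 imbalance test are assumptions made for this sketch, not anything proposed in the thread.

```python
def guess_utf16_order_by_zero_bytes(data, min_ratio=0.3):
    """Guess UTF-16 byte order from where zero bytes fall.

    Returns "utf-16-be", "utf-16-le", or None when the evidence is weak.
    min_ratio is an arbitrary, assumed cut-off.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    # Offsets are 0-based: even offsets hold the first byte of each code unit.
    even_zeros = data[0:2 * pairs:2].count(0)
    odd_zeros = data[1:2 * pairs:2].count(0)

    # Big-endian: the high (often zero) byte comes first, at even offsets.
    if even_zeros / pairs >= min_ratio and even_zeros > 2 * odd_zeros:
        return "utf-16-be"
    # Little-endian: the high byte comes second, at odd offsets.
    if odd_zeros / pairs >= min_ratio and odd_zeros > 2 * even_zeros:
        return "utf-16-le"
    return None


print(guess_utf16_order_by_zero_bytes("Hello, world".encode("utf-16-le")))
# Thai letters all have high byte 0x0E, so no zero bytes appear at all.
print(guess_utf16_order_by_zero_bytes("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35".encode("utf-16-be")))
```

Text made up entirely of characters outside U+0000..U+00FF, such as unspaced Thai, gives this check nothing to count, so it returns None rather than a guess.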

... but
you are jumping to conclusions. For example, one of the heuristics is to judge
which characters are more common when the bytes are interpreted as if they were in
different encoding schemes. When picking between UTF-16BE and UTF-16LE, U+0020 is
*still* much more common than U+2000, even in Thai.

Not necessarily. In certain texts neither might occur at all, so the heuristic fails.
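
To make the disagreement concrete, here is a rough Python sketch of the U+0020 versus U+2000 comparison, including the case where neither occurs. The function name and the simple "more spaces wins" rule are assumptions made for this sketch.

```python
def guess_utf16_order_by_spaces(data):
    """Pick the byte order under which U+0020 is more common than U+2000.

    The byte pair 00 20 is U+0020 in big-endian and U+2000 in little-endian;
    the pair 20 00 is the reverse. Returns None when the counts tie,
    which includes the case where neither pair occurs.
    """
    usable = len(data) - len(data) % 2
    be_spaces = 0  # pairs that read as U+0020 if the text is big-endian
    le_spaces = 0  # pairs that read as U+0020 if the text is little-endian
    for i in range(0, usable, 2):
        pair = data[i:i + 2]
        if pair == b"\x00\x20":
            be_spaces += 1
        elif pair == b"\x20\x00":
            le_spaces += 1

    if be_spaces == le_spaces:
        return None
    return "utf-16-be" if be_spaces > le_spaces else "utf-16-le"


thai_with_space = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u0e04\u0e23\u0e31\u0e1a"
thai_no_space = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
print(guess_utf16_order_by_spaces(thai_with_space.encode("utf-16-be")))
print(guess_utf16_order_by_spaces(thai_no_space.encode("utf-16-be")))
```

With a single space in the Thai sample the comparison settles the byte order; with no space at all it has nothing to compare and returns None.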

I agree with Mark S and others that more sophisticated methods are likely to be safer.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
