Consider CR and LF too.

Mark Davis wrote on 1/14/2004, 9:25 AM:

 > I'm not sure which "one suggested heuristic method" you are referring
 > to, but you are jumping to conclusions. For example, one of the
 > heuristics is to judge which characters are more common when the bytes
 > are interpreted as if they were in different encoding schemes. When
 > picking between UTF-16BE and UTF-16LE, U+0020 is *still* much more
 > common than U+2000, even in Thai.
 > Mark
 > __________________________________
 > ----- Original Message -----
 > From: "Peter Kirk" <[EMAIL PROTECTED]>
 > To: "John Burger" <[EMAIL PROTECTED]>
 > Sent: Wed, 2004 Jan 14 08:12
 > Subject: Re: Detecting encoding in Plain text
 > > On 14/01/2004 07:16, John Burger wrote:
 > >
 > > > ...
 > > > By the way, I still don't quite understand what's special about Thai.
 > > > Could someone elaborate?
 > > >
 > > I mentioned Thai because it is the only language I know of which does
 > > not use SPACE, U+0020. It also has at least some of its own
 > > punctuation. So a Thai text need not include any characters U+00xx -
 > > which rules out one suggested heuristic method.
 > >
 > > --
 > > Peter Kirk
 > > [EMAIL PROTECTED] (personal)
 > > [EMAIL PROTECTED] (work)
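
The byte-order heuristic quoted above, extended to CR and LF, can be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual detector: the function name guess_utf16_byte_order and the COMMON set are hypothetical, and the input is assumed to be BOM-less UTF-16 in one of the two byte orders.

```python
# Hypothetical sketch: guess UTF-16 byte order by counting how often
# SPACE, LF and CR appear when the bytes are read as BE versus LE.

# Code points assumed to be common in plain text (an assumption of this
# sketch): SPACE U+0020, LINE FEED U+000A, CARRIAGE RETURN U+000D.
COMMON = {0x0020, 0x000A, 0x000D}


def guess_utf16_byte_order(data: bytes) -> str:
    """Return 'utf-16-be', 'utf-16-le', or 'unknown' on a tie."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    be_hits = 0
    le_hits = 0
    for i in range(0, usable, 2):
        first, second = data[i], data[i + 1]
        # Big-endian reading: first byte is the high byte.
        if (first << 8) | second in COMMON:
            be_hits += 1
        # Little-endian reading: second byte is the high byte.
        if (second << 8) | first in COMMON:
            le_hits += 1
    if be_hits > le_hits:
        return "utf-16-be"
    if le_hits > be_hits:
        return "utf-16-le"
    return "unknown"


if __name__ == "__main__":
    # Thai text with line breaks but no U+0020.
    thai = "สวัสดี\nครับ\n"
    print(guess_utf16_byte_order(thai.encode("utf-16-le")))
```

Even when a text contains no SPACE at all, line terminators still give the two readings something to disagree on: U+000A read in the wrong byte order becomes U+0A00, which is not in the set being counted. A text with neither spaces nor line breaks yields a tie, and the sketch reports 'unknown' rather than guessing.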
