Consider CR and LF too.

Mark Davis wrote on 1/14/2004, 9:25 AM:
> I'm not sure which "one suggested heuristic method" you are referring to, but
> you are bounding to conclusions. For example, one of the heuristics is to judge
> what are more common characters when bytes are interpreted as if they were in
> different encoding schemes. When picking between UTF16-BE and LE, U+0020 is
> *still* much more common than U+2000, even in Thai.
>
> Mark
> __________________________________
> http://www.macchiato.com
>
> ----- Original Message -----
> From: "Peter Kirk" <[EMAIL PROTECTED]>
> To: "John Burger" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Wed, 2004 Jan 14 08:12
> Subject: Re: Detecting encoding in Plain text
>
> > On 14/01/2004 07:16, John Burger wrote:
> >
> > > ...
> > > By the way, I still don't quite understand what's special about Thai.
> > > Could someone elaborate?
> >
> > I mentioned Thai because it is the only language I know of which does
> > not use SPACE, U+0020. It also has at least some of its own
> > punctuation. So a Thai text need not include any characters U+00xx -
> > which rules out one suggested heuristic method.
> >
> > --
> > Peter Kirk
> > [EMAIL PROTECTED] (personal)
> > [EMAIL PROTECTED] (work)
> > http://www.qaya.org/
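
For what it's worth, the byte-order heuristic discussed in this thread (count how often common characters appear under each interpretation, with CR and LF counted alongside SPACE) might be sketched roughly as below. This is only an illustration under stated assumptions, not anyone's actual detector: the names `COMMON`, `count_common` and `guess_utf16_byte_order` are made up for the example, and the choice of exactly three "common" code points is an assumption.

```python
from typing import Optional

# Code points assumed (for this sketch) to be common in almost any plain text:
# SPACE, LINE FEED, CARRIAGE RETURN. Byte-swapped, they become U+2000, U+0A00
# and U+0D00, which are far rarer.
COMMON = {0x0020, 0x000A, 0x000D}


def count_common(data: bytes, byteorder: str) -> int:
    """Count 16-bit code units in COMMON when data is read with byteorder."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    return sum(
        1
        for i in range(0, usable, 2)
        if int.from_bytes(data[i:i + 2], byteorder) in COMMON
    )


def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'utf-16-be', 'utf-16-le', or None when the counts are tied."""
    be = count_common(data, "big")
    le = count_common(data, "little")
    if be > le:
        return "utf-16-be"
    if le > be:
        return "utf-16-le"
    return None  # no evidence either way; fall back to another heuristic


if __name__ == "__main__":
    # Thai text with no U+0020 at all, but with CR LF line endings.
    thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35\r\n\u0e04\u0e23\u0e31\u0e1a\r\n"
    print(guess_utf16_byte_order(thai.encode("utf-16-le")))
    print(guess_utf16_byte_order("hello world".encode("utf-16-be")))
```

Text that contains no spaces and no line breaks at all gives a tie, so the sketch returns None rather than guessing. A real detector would combine this with other evidence, such as a BOM or broader character-frequency statistics.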