Re: Detecting encoding in Plain text

D. Starner Wed, 14 Jan 2004 00:13:33 -0800


----- Original Message -----
From: Peter Kirk <[EMAIL PROTECTED]>
Date: Tue, 13 Jan 2004 09:03:48 -0800
To: Doug Ewell <[EMAIL PROTECTED]>
Subject: Re: Detecting encoding in Plain text

On 13/01/2004 08:34, Doug Ewell wrote:

>Peter Kirk <peterkirk at qaya dot org> wrote:
>
>>>If a certain Unicode plain text file uses ASCII punctuation OR spaces
>>>OR end-of-line characters, AND the file is not too short or has a
>>>very odd formatting, then the algorithm should work.
>>>
>>True. But there may be certain languages (perhaps Thai?) for which all
>>of these circumstances regularly occur together. It would be very
>>inconvenient for users of these languages if programs regularly
>>attribute the wrong encoding to their text.
>>
>
>Whether this is specifically true for Thai or not -- and I doubt that
>the "short file or odd formatting" condition could ever be considered
>language-dependent -- I would say an otherwise-good heuristic that
>performs badly for Thai ought to have special cases built in for Thai,
>rather than being discarded.
>

I may have confused you with what I wrote, but my "all of these circumstances" referred not to "the "short file or odd formatting" condition", but to Marco's "*all* these circumstances", which you snipped, which were originally:

>Some scripts include their own digits and punctuation; not all scripts use spaces;
>and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines.
>

I agree that heuristics should be adjusted for Thai. But problems may arise if they have to be adjusted individually, and without regression errors, for all 6000+ world languages.
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
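
For readers following the thread, here is a rough sketch of the kind of heuristic under discussion. It is not the algorithm Marco proposed, whose details are not reproduced in this message. It assumes the task is to choose among a few BOM-less UTF-16 and UTF-32 byte orders by counting how many ASCII punctuation, space, and end-of-line characters each candidate decoding yields. The names MARKERS, CANDIDATES, marker_count, and guess_encoding are hypothetical.

```python
import string

# Assumption: the "markers" are ASCII punctuation, space, and the
# end-of-line controls CR and LF, as listed in the quoted message.
MARKERS = set(string.punctuation) | {" ", "\n", "\r"}

# Assumption: only these four candidate encodings are considered.
CANDIDATES = ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def marker_count(data: bytes, encoding: str) -> int:
    """Count marker characters seen when data is decoded as encoding.

    Returns -1 if the bytes are not valid in that encoding at all.
    """
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    return sum(1 for ch in text if ch in MARKERS)


def guess_encoding(data: bytes):
    """Return the candidate with the most markers, or None if none has any."""
    scores = {enc: marker_count(data, enc) for enc in CANDIDATES}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# English text contains ASCII punctuation, spaces, and a line feed.
english = "Hello, world.\n".encode("utf-16-le")
print(guess_encoding(english))

# A short Thai string with no spaces, ASCII punctuation, or line controls
# gives the heuristic nothing to count.
thai = "สวัสดี".encode("utf-16-le")
print(guess_encoding(thai))
```

The second call shows the case raised above: when none of the expected characters occur, a count-based test has no evidence either way, so this sketch declines to answer rather than guess.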

