On 13/01/2004 08:34, Doug Ewell wrote:

Peter Kirk <peterkirk at qaya dot org> wrote:



If a certain Unicode plain text file uses ASCII punctuation OR spaces
OR end-of-line characters, AND the file is neither too short nor very
oddly formatted, then the algorithm should work.


True. But there may be certain languages (perhaps Thai?) for which all
of these circumstances regularly occur together. It would be very
inconvenient for users of these languages if programs regularly
attribute the wrong encoding to their text.



Whether this is specifically true for Thai or not -- and I doubt that the "short file or odd formatting" condition could ever be considered language-dependent -- I would say an otherwise-good heuristic that performs badly for Thai ought to have special cases built in for Thai, rather than being discarded.




I may have confused you with what I wrote, but my "all of these circumstances" referred not to the "short file or odd formatting" condition but to Marco's "*all* these circumstances", which you snipped. They were originally:

Some scripts include their own digits and punctuation; not all scripts use spaces; and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines.


I agree that heuristics should be adjusted for Thai. But problems may arise if they have to be adjusted individually, and without regression errors, for each of the world's 6000+ languages.
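
To make the concern concrete, here is a minimal sketch of the kind of heuristic under discussion. It is an assumption for illustration only: the thread does not spell out the algorithm, and the function name guess_utf16_byte_order is hypothetical. It guesses the byte order of BOM-less UTF-16 by counting code units in the ASCII range, so it gets no signal at all from text that uses native digits, no spaces, and U+2028 for line breaks.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order from ASCII-range code units.

    Hypothetical heuristic, not the algorithm from the thread.
    """
    big = little = 0
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        # U+0001..U+007F is 00 xx in big-endian and xx 00 in little-endian.
        if first == 0 and 0 < second < 0x80:
            big += 1
        elif second == 0 and 0 < first < 0x80:
            little += 1
    if big > little:
        return "utf-16-be"
    if little > big:
        return "utf-16-le"
    return "unknown"


# Thai words separated by an ASCII space and ended with an ASCII newline.
with_ascii = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u0e04\u0e23\u0e31\u0e1a\n"
# Thai digits, U+2028 LINE SEPARATOR, Thai letters: no ASCII-range characters.
without_ascii = "\u0e51\u0e52\u0e53\u2028\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

print(guess_utf16_byte_order(with_ascii.encode("utf-16-le")))
print(guess_utf16_byte_order(without_ascii.encode("utf-16-le")))
```

The first sample contains a space and a newline, which is enough for the count to favour one byte order; the second contains nothing in the ASCII range, so the sketch can only report "unknown" and a real detector would need some other, possibly script-specific, evidence.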


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




