At 20:28 07/11/30, Tex Texin wrote:

>One improvement you can make is that if you have non-ASCII characters, you 
>can assume UTF-8, but check that it is valid UTF-8.
>Most text in CP437 won't satisfy UTF-8 encoding rules.
>If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding 
>principles, then you can assume it is CP437.
>
>Martin Duerst published a nice Perl expression for checking UTF-8
>
>http://www.w3.org/International/questions/qa-forms-utf-8.en.php

That regular expression was motivated by some earlier research described in
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.

I didn't analyze CP437 specifically, but a multi-byte UTF-8 sequence
read as CP437 would typically show up as a box character followed by
an accented Latin character. Given that this combination is quite rare
in real text, my conclusion would be that CP437 is as easy to
distinguish from UTF-8 in practice as most other encodings.
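
To make the heuristic Tex describes concrete, here is a minimal sketch
in Python. It is only an illustration, not what any particular zip
implementation does: the function names guess_name_encoding and
decode_name are made up for this example, and it leans on Python's
strict UTF-8 decoder for the validity check instead of the Perl
expression from the page above.

```python
def guess_name_encoding(raw: bytes) -> str:
    """Guess the encoding of a raw file name (hypothetical helper).

    Returns "ascii", "utf-8" or "cp437".
    """
    # Pure ASCII decodes identically in both encodings, nothing to decide.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and surrogates.
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Non-ASCII bytes that are not valid UTF-8: assume CP437.
        return "cp437"


def decode_name(raw: bytes) -> str:
    """Decode a raw file name using the guessed encoding."""
    return raw.decode(guess_name_encoding(raw))
```

The fallback is a guess, not a proof: a CP437 name that happens to form
valid UTF-8 would be misread, which is the rare case discussed above.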

Otherwise, I agree with Bjoern's conclusions except his very last
one, "authors are best off if they avoid non-ASCII names". In this
day and age, authors more and more assume that file names in various
languages just work. The zip spec does a good job of making this possible
using UTF-8. It's a pity that some implementations are not up to the job.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[EMAIL PROTECTED]     

