At 20:28 07/11/30, Tex Texin wrote:
>One improvement you can make is that if you have non-ASCII characters, you
>can assume UTF-8, but check that it is valid UTF-8.
>Most text in CP437 won't satisfy UTF-8 encoding rules.
>If you have non-ASCII characters, and it doesn't satisfy UTF-8 encoding
>principles, then you can assume it is CP437.
>
>Martin Duerst published a nice Perl expression for checking UTF-8
>
>http://www.w3.org/International/questions/qa-forms-utf-8.en.php

That regular expression was motivated by some earlier research described
in http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
I didn't analyze CP437, but given that the combination of a box
character followed by an accented Latin character is quite rare,
my conclusion would be that CP437 is as easy to distinguish from
UTF-8 in practice as most other encodings.

Otherwise, I agree with Bjoern's conclusions except his very last one,
"authors are best off if they avoid non-ASCII names". In this day and
age, authors more and more assume that file names in various languages
just work. The zip spec does a good job of making this possible using
UTF-8. It's a pity that some implementations are not up to the job.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[EMAIL PROTECTED]
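
The heuristic discussed in this thread (pure ASCII needs no guess; otherwise
accept the bytes as UTF-8 if they are well-formed, and fall back to CP437 if
not) can be sketched in Python as follows. This is only an illustration: the
function name guess_name_encoding is hypothetical, and Python's strict UTF-8
decoder is used here in place of the Perl regular expression linked above.

```python
def guess_name_encoding(raw: bytes) -> str:
    """Guess the encoding of a file name given as raw bytes.

    Hypothetical helper, not taken from any zip implementation.
    Returns "ascii", "utf-8", or "cp437".
    """
    # Pure ASCII bytes mean the same thing in both encodings.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and surrogates.
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Non-ASCII bytes that are not well-formed UTF-8:
        # assume the legacy code page.
        return "cp437"


if __name__ == "__main__":
    for name in (b"readme.txt",
                 "r\u00e9sum\u00e9.txt".encode("utf-8"),
                 "r\u00e9sum\u00e9.txt".encode("cp437")):
        print(name, guess_name_encoding(name))
```

The guess can still be wrong for CP437 names whose bytes happen to form
well-formed UTF-8, for example a box-drawing byte (0xC2 to 0xDF) followed by
a byte in the range 0x80 to 0xBF. As argued above, such combinations are
rare in real file names.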
