Jungshik Shin wrote:

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for manually reviewing files with
unknown encoding.



It may be a good idea to filter out the 'UTF-8' representation of surrogate
code points (0xd800 - 0xdfff) as well. That is, the following can be added
to $utf8malformed:

\xed[\xa0-\xbf][\x80-\xbf]
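To check the fragment in isolation, here is a minimal sketch. It assumes the second bracket is meant to read [\xa0-\xbf], and it does not touch the original script: $surrogate and has_surrogate are names made up for this example, not part of Markus's code.

```perl
use strict;
use warnings;

# Assumed fragment: lead byte ED, second byte A0..BF, third byte 80..BF,
# i.e. the UTF-8-style encoding of U+D800..U+DFFF.
my $surrogate = qr/\xed[\xa0-\xbf][\x80-\xbf]/;

# Hypothetical helper: returns 1 if the raw byte string contains an
# encoded surrogate, 0 otherwise.
sub has_surrogate {
    my ($bytes) = @_;
    return $bytes =~ $surrogate ? 1 : 0;
}

# U+D800 encoded as if it were a character: ED A0 80
print has_surrogate("abc\xed\xa0\x80def") ? "surrogate\n" : "clean\n";  # prints "surrogate"

# U+D7FF (ED 9F BF) is the last code point below the surrogate range
print has_surrogate("\xed\x9f\xbf") ? "surrogate\n" : "clean\n";        # prints "clean"
```

In the real script the fragment would presumably be joined to the other alternatives in $utf8malformed with a '|'.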

In addition, the non-characters (0xfffe and 0xffff in every plane) may as well be filtered out:


 \xef\xbf[\xbe-\xbf]|
 [\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]

(and the 5- and 6-byte ones, if you want)
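As a sanity check on the byte values, here is a rough sketch that encodes 0xfffe and 0xffff for planes 0 to 16 by hand and counts how many the two alternatives catch. It assumes the bracket expression lists the four bytes without commas (a comma inside [...] would match a literal comma). The names $nonchar, encode_utf8_bytes and count_nonchar_hits are invented for this example, and the 5- and 6-byte forms are not covered.

```perl
use strict;
use warnings;

# Assumed fragment: 3-byte form for plane 0, 4-byte form for higher planes.
my $nonchar = qr/\xef\xbf[\xbe-\xbf]|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]/;

# Hypothetical helper: encode a code point >= 0x800 as raw UTF-8 bytes.
sub encode_utf8_bytes {
    my ($cp) = @_;
    if ($cp < 0x10000) {
        return pack("C3",
            0xe0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3f),
            0x80 | ($cp & 0x3f));
    }
    return pack("C4",
        0xf0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3f),
        0x80 | (($cp >> 6) & 0x3f),
        0x80 | ($cp & 0x3f));
}

# Count how many of the 34 non-characters (2 per plane, planes 0..16)
# are matched by the fragment.
sub count_nonchar_hits {
    my $hits = 0;
    for my $plane (0 .. 16) {
        for my $low (0xfffe, 0xffff) {
            $hits++ if encode_utf8_bytes(($plane << 16) | $low) =~ $nonchar;
        }
    }
    return $hits;
}

print count_nonchar_hits(), " of 34 non-characters matched\n";  # prints "34 of 34 non-characters matched"
```

The second byte cycles through 8f, 9f, af, bf because it carries the low two bits of the plane number followed by the top four bits of 0xfffe/0xffff, which are all ones.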






-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/


