Markus Kuhn wrote:
The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for manually reviewing files
with unknown encoding.
It may be a good idea to filter out the 'UTF-8' representation of surrogate code points
(0xd800 - 0xdfff) as well. That is, the following can be added to $utf8malformed:
\xed[\xa0-\xbf][\x80-\xbf]
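
As a quick sanity check of that pattern, here is a minimal sketch in Python (not the Perl of the script under discussion; the byte-level regex syntax is the same). It assumes the second character class is meant to be [\xa0-\xbf]. The helper utf8_3byte and the variable names are made up for this illustration.

```python
import re

# Pattern for 3-byte UTF-8 encodings of surrogates (U+D800..U+DFFF).
# Assumption: the second class is [\xa0-\xbf].
SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def utf8_3byte(cp):
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes by hand.

    Done manually because a strict UTF-8 encoder refuses surrogates.
    """
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


first = utf8_3byte(0xD800)   # ED A0 80, first surrogate
last = utf8_3byte(0xDFFF)    # ED BF BF, last surrogate
before = utf8_3byte(0xD7FF)  # ED 9F BF, just below the range
after = utf8_3byte(0xE000)   # EE 80 80, just above the range

for name, seq in [("first", first), ("last", last),
                  ("before", before), ("after", after)]:
    print(name, seq.hex(), bool(SURROGATE.search(seq)))
```

Only the first two sequences should be flagged; the neighbours U+D7FF and U+E000 fall outside the pattern.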
In addition, the noncharacters (0xfffe and 0xffff in every plane) may as well be filtered out:
\xef\xbf[\xbe-\xbf]|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]
(and the 5- and 6-byte ones, if you want)
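
To see why those byte values come out that way: the low 16 bits 0xfffe/0xffff fix the last two bytes at \xbf[\xbe-\xbf] and the low nibble of the second byte at f, so only the two plane bits in the second byte vary (8f, 9f, af, bf). A minimal Python sketch of the check follows, assuming the character class is written without commas and without the space after the alternation bar (inside a class the commas would match literally). All names in it are hypothetical.

```python
import re

# Noncharacter pattern: U+FFFE/U+FFFF in the BMP (3 bytes) or the same
# low 16 bits in a 4-byte sequence. Assumption: no commas in the class.
NONCHAR = re.compile(
    rb"\xef\xbf[\xbe-\xbf]"
    rb"|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]"
)


def noncharacters():
    """Yield U+nFFFE and U+nFFFF for planes 0 through 16."""
    for plane in range(17):
        for low in (0xFFFE, 0xFFFF):
            yield (plane << 16) | low


# Python's UTF-8 encoder accepts noncharacters, so it can be used directly.
encoded = [chr(cp).encode("utf-8") for cp in noncharacters()]
all_flagged = all(NONCHAR.fullmatch(seq) for seq in encoded)

# Ordinary characters next to the targets must not be flagged.
replacement_char = "\ufffd".encode("utf-8")  # EF BF BD
emoji = "\U0001f600".encode("utf-8")         # F0 9F 98 80
none_flagged = not any(NONCHAR.search(s) for s in (replacement_char, emoji))

print(len(encoded), all_flagged, none_flagged)
```

The lead-byte class [\xf0-\xf7] is wider than what is needed for planes 1 to 16 (f0 to f4), so the pattern also catches the same low 16 bits beyond 0x10ffff, as well as the overlong form starting with f0 8f.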
-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/