Re: Perl script to hunt for malformed/overlong UTF-8 sequences

Jungshik Shin Tue, 18 Mar 2003 20:10:05 -0800

Jungshik Shin wrote:

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file that contain non-ASCII bytes. With option -m, it looks for malformed and overlong UTF-8 sequences instead. Useful for manually reviewing files with unknown encoding.

It may be a good idea to filter out the 'UTF-8' representation of surrogate code points (0xd800 - 0xdfff) as well. That is, the following can be added to $utf8malformed:

\xed[\xa0-\xbf][\x80-\xbf]
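
To see why that byte range corresponds to U+D800..U+DFFF, here is a small sketch in Python rather than Perl (the script itself is not reproduced in this message). The names SURROGATE, encode_cp and has_surrogate are made up for illustration, and the pattern uses \xbf as the upper bound of the second byte.

```python
import re

# Hypothetical stand-in for the fragment that would be added to $utf8malformed.
SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")

def encode_cp(cp: int) -> bytes:
    # 'surrogatepass' lets Python emit the 3-byte form even for U+D800..U+DFFF,
    # which a strict UTF-8 encoder would refuse to produce.
    return chr(cp).encode("utf-8", "surrogatepass")

def has_surrogate(data: bytes) -> bool:
    # True if the byte string contains an encoded surrogate code point.
    return SURROGATE.search(data) is not None

if __name__ == "__main__":
    for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000):
        print(f"U+{cp:04X}", encode_cp(cp).hex(" "), has_surrogate(encode_cp(cp)))
```

The neighbours U+D7FF (ed 9f bf) and U+E000 (ee 80 80) fall just outside the pattern, so ordinary text such as Hangul syllables, which also start with 0xed, is not flagged.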

In addition, non-characters (0xfffe and 0xffff in every plane) might as well be filtered out:

\xef\xbf[\xbe-\xbf]|
[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]

(and the 5- and 6-byte ones, if you want)
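
As a rough check of the non-character patterns, the following Python sketch (again not the Perl script from this thread) encodes U+FFFE and U+FFFF for planes 0 through 16 and tests them. NONCHAR, has_nonchar and PLANE_NONCHARS are hypothetical names. The second-byte class is written without commas, since a comma inside a character class is matched literally.

```python
import re

# Hypothetical stand-in for the two alternatives suggested above.
NONCHAR = re.compile(
    rb"\xef\xbf[\xbe\xbf]"                             # U+FFFE, U+FFFF in plane 0
    rb"|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe\xbf]"   # the same pair in higher planes
)

def has_nonchar(data: bytes) -> bool:
    # True if the byte string contains an encoded xxFFFE or xxFFFF.
    return NONCHAR.search(data) is not None

# The 34 code points U+nFFFE and U+nFFFF for planes 0 through 16.
PLANE_NONCHARS = [(plane << 16) | low
                  for plane in range(17)
                  for low in (0xFFFE, 0xFFFF)]

if __name__ == "__main__":
    for cp in PLANE_NONCHARS:
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:06X}", encoded.hex(" "), has_nonchar(encoded))
```

Lead bytes 0xf5 through 0xf7 only occur for values above U+10FFFF, so this sketch never exercises them; they are kept in the class only because the suggestion above includes them.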


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
