[EMAIL PROTECTED] said:
> I am running Perl 5.8 and trying to filter out some invalid Unicode
> characters from Unicode texts of some South Asian languages. There
> are 28 such characters in my data (all control characters):

> 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B,
> 0x1C, 0x1D, 0x1F, 0x1E, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC,
> 0xF, 0xFFFF, 0xE

> The data is coded as utf-16 and I want to keep it this way when the
> invalid characters are removed. Is there an easy way to do this with
> Perl while keeping the textual quality intact? Any advice is welcome.
> Thanks. 

If your data are utf-16, are you actually saying that you have 28 
distinct 16-bit values scattered in your data and you want to remove 
them?  i.e.:

\x{0001} \x{0002} ... \x{0010} \x{0011} \x{0012} \x{0013} ... 

Or do you mean something like: these 28 byte values are showing up
"stranded" (unpaired with a second byte that would produce a valid
utf-16 code point)?  (Or do you mean something else besides these two 
possibilities?)

(Are you really seeing \x{ffff} in your data?  Or did you mean to 
indicate \x{00ff}?)

If your data contain ASCII control characters that have been rendered
in utf-16 form, these would not be considered "invalid"; in fact, some
of these control characters are quite common and useful in text
(including "tab", "line-feed", "carriage-return").  Still, if you want
to eliminate all of them, then the "tr" function is probably your best
bet.  First, you need to "decode" your utf-16 data into perl's internal
utf8 form (see the man pages for Encode, PerlIO, Encode::PerlIO, and
PerlIO::encoding) -- here's an example using PerlIO, dumping the "fixed"
text to STDOUT (e.g. for redirection to some other file):

  use Encode;
  ...

  # decode UTF-16 (BOM expected) into Perl's internal form on input
  open( IN, "<:encoding(UTF-16)", "utf16.file" ) or die "utf16.file: $!";
  # re-encode as UTF-16 on output
  binmode STDOUT, ":encoding(UTF-16)";

  while (<IN>) {
     tr/\x{0001}-\x{001f}//d;   # delete the C0 control characters
     print;
  }
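
(Saved as, say, strip16.pl -- that name is just for illustration -- this
could be run as "perl strip16.pl > cleaned.txt" to capture the filtered
UTF-16 output in another file.)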

(In many cases, specifying a Unicode character range like this is 
ill-conceived, but in this case it's unambiguous, and it does work.)

I'm not sure whether the above will really produce the result you
want, since it also removes carriage-returns and line-feeds, which may
cause some word-breaks to disappear from the data (and you would see
words likethis -- two words that have become oneword).  One workaround
is sketched below.
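
If keeping those breaks matters, one alternative (not something you
asked for, just a sketch) is to translate the controls to a space
instead of deleting them; "tr" replicates the last replacement
character when the replacement list is shorter than the search list:

   tr/\x{0001}-\x{001f}/ /;    # replace each control with a space
   # or, with the "s" flag, squash adjacent translated controls
   # into a single space:
   tr/\x{0001}-\x{001f}/ /s;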

If your data contain "stranded" single bytes, you have a real problem 
on your hands.  You can use the "decode" function provided by Encode.pm 
to trap strings that contain such errors, as follows:

 use Encode;
 ...

 eval { $_ = decode( 'UTF-16', $utf16string, Encode::FB_CROAK ) };

 if ( $@ ) {
    # $utf16string contains stuff that can't be interpreted as UTF-16
    ...
 }

But what you do inside that "if" block to handle the errors is not
likely to be obvious or easy -- stray bytes in a utf-16 stream mean 
there has been some form of corruption, and the question is: how do you 
figure out exactly where (and how pervasive) the corruption really is?
(Let alone how to fix it...)
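
One way to at least locate the first bad spot (a rough sketch, assuming
$utf16string holds the raw octets and begins with a BOM -- use
'UTF-16LE' or 'UTF-16BE' instead if you know the byte order) is to
decode with Encode::FB_QUIET, which returns what it could decode and
leaves the unprocessed octets behind in the source variable:

 use Encode;

 my $bytes = $utf16string;  # FB_QUIET consumes its source, so copy first
 my $good  = decode( 'UTF-16', $bytes, Encode::FB_QUIET );

 if ( length $bytes ) {
    # decoding stopped at the first octets it couldn't interpret
    my $offset = length( $utf16string ) - length( $bytes );
    print STDERR "undecodable bytes start at byte offset $offset\n";
 }

That tells you where the corruption starts, though not how far it
extends or how to repair it.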

        Dave Graff

