[EMAIL PROTECTED] said:

> I am running Perl 5.8 and trying to filter out some invalid Unicode
> characters from Unicode texts of some South Asian languages. There
> are 28 such characters in my data (all control characters):
> 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B,
> 0x1C, 0x1D, 0x1F, 0x1E, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC,
> 0xF, 0xFFFF, 0xE
>
> The data is coded as utf-16 and I want to keep it this way when the
> invalid characters are removed. Is there an easy way to do this with
> Perl while keeping the textual quality intact? Any advice is welcome.
> Thanks.

If your data are utf-16, are you actually saying that you have 28
distinct 16-bit values scattered in your data and you want to remove
them? I.e.:

  \x{0001} \x{0002} ... \x{0010} \x{0011} \x{0012} \x{0013} ...

Or do you mean something like: these 28 byte values are showing up
"stranded" (unpaired with a second byte that would produce a valid
utf-16 code point)? (Or do you mean something else besides these two
possibilities?)

(Are you really seeing \x{ffff} in your data? Or did you mean to
indicate \x{00ff}?)

If your data contain ASCII control characters that have been rendered
in utf-16 form, these would not be considered "invalid"; in fact, some
of these control characters are quite common and useful in text
(including "tab", "line-feed", and "carriage-return"). Still, if you
want to eliminate all of them, then the "tr" function is probably your
best bet.

First, you need to "decode" your utf-16 data into perl's internal utf8
form (see the man pages for Encode, PerlIO, Encode::PerlIO, and
PerlIO::encoding). Here's an example using PerlIO, dumping the "fixed"
text to STDOUT (e.g. for redirection to some other file):

  use Encode;
  ...
  open( IN, "<:encoding(UTF-16)", "utf16.file" ) or die "utf16.file: $!";
  binmode( STDOUT, ":encoding(UTF-16)" );
  while (<IN>) {
      tr/\x{0001}-\x{001f}//d;
      print;
  }

(In many cases, specifying a character range in unicode like this is
ill-conceived, but in this case it's no problem, and it does work.)
I'm not sure whether the above will really produce the result you
want, since it does remove carriage-returns and line-feeds, which may
cause some word breaks to disappear from the data (and you would see
words likethis -- words that used to be two but are now oneword).

If your data contain "stranded" single bytes, you have a real problem
on your hands. You can use the "decode" function provided by Encode.pm
to trap strings that contain such errors, as follows:

  use Encode;
  ...
  eval { $_ = decode( 'UTF-16', $utf16string, Encode::FB_CROAK ) };
  if ( $@ ) {
      # $utf16string contains stuff that can't be interpreted as UTF-16
      ...
  }

But what you do inside that "if" block to handle the errors is not
likely to be obvious or easy -- stray bytes in a utf-16 stream mean
there has been some form of corruption, and the question is: how do
you figure out exactly where (and how pervasive) the corruption really
is? (Let alone how to fix it...)

Dave Graff