John Delacour <[EMAIL PROTECTED]> writes:
>At 11:31 am +0100 16/9/03, [EMAIL PROTECTED] wrote:
>>Dear PERLists,
>>
>>I am running Perl 5.8. and trying to filter out some invalid Unicode
>>characters from Unicoded texts of some South Asian languages. There
>>are 28 such characters in my data (all control characters):
>>
>>0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19,
>>0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
>>0xB, 0xC, 0xF, 0xFFFF, 0xe

i.e.

0x00           ok
0x01..0x08     bad
0x09 (TAB)     ok
0x0A (LF)      ok
0x0B..0x0C     bad
0x0D (CR)      ok
0x0e..0x19     bad
0x1A           ok (why!)
0x1b..0x1f     bad
0x7f (DEL)     ok (why?)
0x80..0x9F     ok (why?)
0x100..0xFFFE  ok

The "bad" ones in my re-ordered table are valid Unicode characters.
(0xFFFF isn't.)
I think the earlier advice to convert to perl form and tr/// them out
is the best way to proceed.

>>
>>The data is coded as utf-16 and I want to keep it this way when the
>>invalid characters are removed. Is there an easy way to do this with
>>Perl while keeping the textual quality intact?

Losing 0x08 (BS) may lose you some over-strike.
In general, removing things _may_ make textual quality non-intact if
that quality included fixed-length fields or the like.

>
>Your question is not clear to me.

Your complaint isn't clear to me ;-)

>You say these are invalid Unicode
>characters and then list 8-bit characters. Are you saying that
>redundant "\x01" etc have got into the text somehow or that
>"\x{0001}" etc. are there?

"\x01" and "\x{0001}" are the same thing.

>Can you give us a sample of the offending
>text. Are you saying it is like the UTF-16 equivalent of the output
>of this? :
>
>perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'
>
>JD
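
The "convert to perl form and tr/// them out" advice could be sketched roughly as below. This is only an illustration under some assumptions: the helper name strip_controls is made up for this example, the data is assumed to fit in memory, and the encoding name defaults to 'UTF-16' (which expects a BOM on input). The character list is the one from the original question, so 0x1A is left alone.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the tr/// list below names the noncharacter U+FFFF
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): decode UTF-16 bytes to
# perl's internal form, delete the listed characters, encode back.
sub strip_controls {
    my ($bytes, $enc) = @_;

    # Assumption: 'UTF-16' input carries a BOM; pass 'UTF-16LE' or
    # 'UTF-16BE' explicitly for data without one.
    $enc ||= 'UTF-16';

    my $text = decode($enc, $bytes);

    # Delete 0x01..0x08, 0x0B..0x0C, 0x0E..0x19, 0x1B..0x1F and U+FFFF.
    # TAB, LF, CR and 0x1A are kept, as in the original list.
    # Depending on the Encode version, U+FFFF may already have been
    # replaced by U+FFFD during decode, in which case it will not match.
    $text =~ tr/\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{19}\x{1B}-\x{1F}\x{FFFF}//d;

    return encode($enc, $text);
}
```

For BOM-less little-endian data one would call strip_controls($bytes, 'UTF-16LE'); the result stays in the same encoding as the input. Whether dropping these characters is harmless still depends on the caveats above about BS and fixed-length fields.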