John Delacour <[EMAIL PROTECTED]> writes:
>At 11:31 am +0100 16/9/03, [EMAIL PROTECTED] wrote:
>>Dear PERLists,
>>
>>I am running Perl 5.8. and trying to filter out some invalid Unicode 
>>characters from Unicoded texts of some South Asian languages. There 
>>are 28 such characters in my data (all control characters):
>>
>>0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 
>>0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 
>>0xB, 0xC, 0xF, 0xFFFF, 0xe

i.e. 
   0x00          ok
   0x01..0x08    bad
   0x09 (TAB)    ok
   0x0A (LF)     ok
   0x0B..0x0C    bad    
   0x0D  (CR)    ok
   0x0e..0x19    bad
   0x1A          ok   (why?)
   0x1b..0x1f    bad 
   0x7f  DEL     ok   (why?)
   0x80..0x9F    ok   (why?)
   0x100..0xFFFE ok

The "bad" ones in my re-ordered table are valid Unicode characters. 
(0xFFFF isn't)      

I think the earlier advice to convert to perl's internal form and tr/// 
them out is the best way to proceed.
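
A minimal sketch of that approach, under some assumptions: the data 
starts with a BOM (so Encode's plain 'UTF-16' can detect the byte order), 
and the sub name strip_listed_controls is made up for illustration. 
Note that encode('UTF-16', ...) writes big-endian with a BOM; if the 
original byte order must be preserved, name 'UTF-16LE' or 'UTF-16BE' 
explicitly instead. How Encode treats U+FFFF while decoding may depend 
on the Encode version, so check that case against your own data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-16 octets (with BOM), returns UTF-16 octets
# with the 28 listed code points deleted.
sub strip_listed_controls {
    my ($utf16_octets) = @_;

    # Convert the UTF-16 octets to perl's internal character form.
    my $text = decode('UTF-16', $utf16_octets);

    {
        # Older perls may warn about the literal U+FFFF; silence that only here.
        no warnings 'utf8';

        # Delete exactly the code points from the list:
        # 0x01..0x08, 0x0B..0x0C, 0x0E..0x19, 0x1B..0x1F and 0xFFFF.
        $text =~ tr/\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{19}\x{1B}-\x{1F}\x{FFFF}//d;
    }

    # Convert back to UTF-16 (big-endian with BOM).
    return encode('UTF-16', $text);
}
```

TAB, LF, CR and 0x1A are not in the tr/// list, so they pass through 
untouched, as do all ordinary characters.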
 

>>
>>The data is coded as utf-16 and I want to keep it this way when the 
>>invalid characters are removed. Is there an easy way to do this with 
>>Perl while keeping the textual quality intact?

Losing 0x08 (BS) may lose you some over-strike.
In general, removing things _may_ make textual quality non-intact
if that quality included fixed-length fields or the like.
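
If field widths do matter, one option (not something the original 
poster asked for, just a sketch) is to replace each unwanted character 
with a space instead of deleting it, so character counts stay the same. 
The sub name blank_listed_controls is hypothetical, and the input is 
assumed to be already decoded to perl's internal form.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces the listed control characters with spaces,
# so the string keeps its length. 0xFFFF is left out here for brevity.
sub blank_listed_controls {
    my ($text) = @_;

    # With a one-character replacement list, tr/// maps every character
    # in the search list to that one character (here a space).
    (my $copy = $text) =~ tr/\x{01}-\x{08}\x{0B}\x{0C}\x{0E}-\x{19}\x{1B}-\x{1F}/ /;

    return $copy;
}
```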

>
>Your question is not clear to me.  

Your complaint isn't clear to me ;-)

>You say these are invalid Unicode 
>characters and then list 8-bit characters. Are you saying that 
>redundant "\x01" etc have got into the text somehow or that 
>"\x{0001}" etc. are there?  

"\x01" and "\x{0001}" are the same thing. 
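
A quick check of that claim, as a sketch: both escapes yield the same 
one-character string, and encoding it as UTF-16BE gives the two octets 
00 01. The sub name same_char_demo is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical demo: returns (are-the-strings-equal, hex of the UTF-16BE octets).
sub same_char_demo {
    # Both escapes denote the single character U+0001.
    my $equal = ("\x01" eq "\x{0001}") ? 1 : 0;

    # In UTF-16BE that character is the octet pair 0x00 0x01.
    my $hex = unpack('H*', encode('UTF-16BE', "\x01"));

    return ($equal, $hex);
}
```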

>Can you give us a sample of the offending 
>text.  Are you saying it is like the UTF-16 equivalent of the output 
>of this? :
>
>perl -e 'print qq~\x17\x{6017}\x18\x{6001}~'
>
>JD
