Yes, od and hexdump are present on the system.

I ran hexdump on file2.txt, which has the following contents:

                               ÿÝ.Anodizado
     Ultima actualización: 06-Mar-2004     
http://www.kr2-egb.com.ar/anodizado.htm 
¿Que es el anodizado?
Pagina creada en noviembre del 2003    « Volver al inicio  
<index.htm>        :ð,    .            <84>     .


and the dump was:

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 2020 ddff
0000020 412e 6f6e 6964 617a 6f64 200a 2020 5520
0000030 746c 6d69 2061 6361 7574 6c61 7a69 6361
0000040 f369 3a6e 3020 2d36 614d 2d72 3032 3430
0000050 2020 2020 6820 7474 3a70 2f2f 7777 2e77
0000060 726b 2d32 6765 2e62 6f63 2e6d 7261 612f
0000070 6f6e 6964 617a 6f64 682e 6d74 51bf 6575
0000080 6520 2073 6c65 6120 6f6e 6964 617a 6f64
0000090 0a3f 6150 6967 616e 6320 6572 6461 2061
00000a0 6e65 6e20 766f 6569 626d 6572 6420 6c65
00000b0 3220 3030 2033 2020 ab20 5620 6c6f 6576
00000c0 2072 6c61 6920 696e 6963 206f 693c 646e
00000d0 7865 682e 6d74 203e 2020 2020 2020 3a20
00000e0 2cf0 2020 2020 202e 2020 2020 2020 2020
00000f0 2020 8420 2020 2020 2e20 2020 2020 2020
0000100 2020 2020 2020 2020 2020 000a
000010b

Is this helpful?
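
A note on reading the dump above: hexdump without options prints 16-bit words in host byte order, so on this machine each pair is swapped (`412e` is the bytes 2e 41, i.e. ".A", and `ddff` is ff dd). A minimal Perl sketch that prints the bytes in file order instead might look like this; `hex_lines` is a hypothetical helper name, and the sample string is made up for illustration rather than read from file2.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: format a byte string as "offset  hex bytes",
# 16 bytes per line, in the order the bytes appear in the data.
sub hex_lines {
    my ($data) = @_;
    my @lines;
    for (my $off = 0; $off < length $data; $off += 16) {
        my $chunk = substr $data, $off, 16;
        # One two-digit hex value per byte, separated by spaces.
        my $hex = join ' ', map { sprintf '%02x', ord } split //, $chunk;
        push @lines, sprintf '%07x %s', $off, $hex;
    }
    return @lines;
}

# Illustrative sample: two spaces, \xff \xdd, ".A", newline.
print "$_\n" for hex_lines("  \xff\xdd.A\n");
```

To use it on a real file, the data would need to be read with the filehandle in binary mode (binmode) so that no byte is translated on the way in.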

I reduced the file further so that it contains only the characters I want to
erase. It is now:

                              ÿÝ.Anodizado
         :ð,    .            <84>     .


and the corresponding hex dump is:

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 2020 ddff
0000020 412e 6f6e 6964 617a 6f64 200a 2020 2020
0000030 2020 3a20 2cf0 2020 2020 202e 2020 2020
0000040 2020 2020 2020 8420 2020 2020 2e20 2020
0000050 2020 2020 2020 2020 2020 2020 2020 000a
000005f

And this is what remains if I erase the word "Anodizado":

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 ff20 2edd
0000020 200a 2020 2020 2020 3a20 2cf0 2020 2020
0000030 202e 2020 2020 2020 2020 2020 3c20 3438
0000040 203e 2020 2020 0a2e
0000048

I also tried this:
hexdump -c file2.txt

And the result was:
0000000
0000010                         377 335   .  \n
0000020           : 360   ,                   .
0000030                           <   8   4   >                       .
0000040  \n
0000041

So I did:
$text=~s/\377//g;
$text=~s/\335//g;
$text=~s/\360//g;
$text=~s/\204//g;

This cleaned the text up a bit more. Any suggestions?
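
For reference, the four substitutions can probably be collapsed into a single tr///d, which deletes every listed byte in one pass. This is only a sketch: the sample string is modelled on the reduced dump and is not the real file, and `$clean` is an illustrative variable name. Only the four bytes identified so far are listed, because the full file also contains bytes above \x7f that look like legitimate Latin-1 text (f3 for "ó", bf for "¿", ab for "«"), so deleting the whole \x80-\xff range would damage the Spanish text.

```perl
use strict;
use warnings;

# Sample modelled on the reduced file: spaces, \xff \xdd, ".Anodizado",
# newline, ":" \xf0 ",", and a stray \x84. Not the actual file contents.
my $text = "  \xff\xdd.Anodizado\n  :\xf0,  .  \x84  .\n";

# Copy the string, then delete the four unwanted bytes from the copy.
# For single characters this has the same effect as the four s///g lines.
(my $clean = $text) =~ tr/\xff\xdd\xf0\x84//d;

print $clean;
```

If more stray bytes turn up, they can be appended to the tr list; the hex escapes (\xff) and the octal ones used above (\377) name the same bytes.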






On Oct 3, 2008, at 7:33 AM, Brian Raven wrote:

> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Alejandro Santillan Iturres
> Sent: 02 October 2008 19:45
> To: [email protected]
> Cc: [EMAIL PROTECTED]
> Subject: Re: regexp to "clean" a text file
>
>> Thank you William, Bill and Tim. Finally s/[\x00-\x1f]//g did the
>> trick, almost perfect.
>> The original file is the Palm database of memo pads. The text is
>> there, plain. Several mixed control characters were present.
>> The system I am working on is a Fedora Linux box. I have no hex utility
>> installed to make the dump, so I don't know if the ^E is really a ^E.
>
> I find that a little hard to believe. Try 'hexdump', or if that isn't
> present you should at least have 'od'. If neither of them is installed,
> your Linux installation sounds a bit broken. Unless you can identify
> which characters are to be kept or discarded, you will find it difficult
> to 'clean' your data effectively.
>
> HTH
>
> -- 
> Brian Raven
>
> _______________________________________________
> ActivePerl mailing list
> [email protected]
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

