Re: regexp to "clean" a text file

Bill Luebkert Sat, 04 Oct 2008 20:17:06 -0700

Alejandro Santillan Iturres wrote:
> Yesss, od and hexdump are present on the system.
> 
> I did hexdump file2.txt, where file2.txt has the following contents:
> 
>                                ÿÝ.Anodizado
>      Ultima actualización: 06-Mar-2004     
> http://www.kr2-egb.com.ar/anodizado.htm 
> ¿Que es el anodizado?
> Pagina creada en noviembre del 2003    « Volver al inicio  
> <index.htm>        :ð,    .            <84>     .
> 
> 
> and the dump was:
> 
> 0000000 2020 2020 2020 2020 2020 2020 2020 2020
> 0000010 2020 2020 2020 2020 2020 2020 2020 ddff
> 0000020 412e 6f6e 6964 617a 6f64 200a 2020 5520
> 0000030 746c 6d69 2061 6361 7574 6c61 7a69 6361
> 0000040 f369 3a6e 3020 2d36 614d 2d72 3032 3430
> 0000050 2020 2020 6820 7474 3a70 2f2f 7777 2e77
...


> Is this helpful?

I still think what you should do is make an array of 256 characters,
index the array with each incoming character, and replace the ones that
need replacing with a space or a binary 0.  Then delete the binary 0's
afterwards with a tr/\x00//d - basically a translate table and a delete.

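Otherwise you have a whole bunch of s/// to do on each line, which could
get expensive.  You could probably also do the job with a single tr/// -
where the search list has the characters you want to replace and the
replacement list has the replacement characters.  Note that tr/// does
not interpolate variables, so a $line =~ tr/$from/$to/ built at run time
has to go through a string eval.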
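
A minimal sketch of the tr/// route, assuming the bytes mentioned in this
thread (0xff, 0xdd, 0xf0, 0x84 to delete) and, as an assumption taken from
the sample table, 0x01 and 0x05 mapped to a space.  The names clean_static
and build_translator are hypothetical helpers, not anything from the
original code:

```perl
use strict;
use warnings;

# Fixed lists known at compile time: with /d, search characters that have
# no counterpart in the replacement list are deleted, so \x01 and \x05
# become spaces and the remaining four bytes are dropped.
sub clean_static {
    my ($line) = @_;
    $line =~ tr/\x01\x05\xff\xdd\xf0\x84/  /d;
    return $line;
}

# Lists only known at run time: tr/// does not interpolate, so the
# operator is compiled once through a string eval and reused.
sub build_translator {
    my ($from, $to) = @_;
    my $code = eval "sub { (my \$s = shift) =~ tr/$from/$to/; \$s }"
        or die "bad tr lists: $@";
    return $code;
}

my $to_space = build_translator('\x01\x05', '  ');
print clean_static("\xff\xdd.Anodizado\x01ok"), "\n";
print $to_space->("a\x01b\x05c"), "\n";
```

Compiling the translator once and calling it per line keeps the cost of
the eval out of the loop.  Only pass lists you control to
build_translator, since they end up inside eval'd code.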

> So I did:
> $text=~s/\377//g;
> $text=~s/\335//g;
> $text=~s/\360//g;
> $text=~s/\204//g;
> 
> And this cleaned a bit more. Any suggestions?

Here are some sample ideas - you may be able to speed this up by
benchmarking a few other ways:

use strict;
use warnings;

# translate table, populated with the identity mapping by default
my @TT = map { chr } 0 .. 255;

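# sample line (the two junk bytes are written as escapes so the result
# does not depend on the encoding this script is saved in)
my $line1 = "                       \xff\xdd.Anodizado<br>\x00    Ultima\n";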

# characters to change
my %h = (                       # characters to substitute/delete
   chr (5) => ' ',              # space
   chr (1) => ' ',              # space
   chr (0xdd) => chr 0,                 # delete
   chr (0xff) => chr 0,                 # delete
);

# modify translate table for %h hash substitutes
foreach my $key (keys %h) {
        $TT[ord $key] = $h{$key};
#       printf "\$key=%c(0x%02x) => 0x%02x\n", ord ($key), ord ($key),
#         ord $h{$key};
}

my @lines = ($line1);

# do for each line

foreach my $line (@lines) {     # do a line at a time

        print "line before='$line'\n";

        my $len = length $line;
        for (my $ii = 0; $ii < $len; ++$ii) {
                # translate each character
                my $ch = substr $line, $ii, 1;
                my $ci = ord $ch;
                substr $line, $ii, 1, $TT[$ci];
        }

        $line =~ tr/\x00//d;    # drop \x00's

        print "line after='$line'\n";
}

__END__
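
On the benchmarking point, here is a rough sketch of how the
per-character loop could be timed against a single tr/// using the core
Benchmark module.  The sub names clean_loop and clean_tr and the sample
string are made up for illustration, and the mapping is the same one as
in the sample script (0x01 and 0x05 to a space, 0xdd and 0xff dropped):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Translate table: identity by default, then the substitutions.
my @TT = map { chr } 0 .. 255;
@TT[1, 5]       = (' ', ' ');
@TT[0xdd, 0xff] = ("\x00", "\x00");

# Per-character table lookup followed by a delete of the \x00 markers.
sub clean_loop {
    my ($line) = @_;
    for my $ii (0 .. length($line) - 1) {
        substr($line, $ii, 1, $TT[ ord(substr($line, $ii, 1)) ]);
    }
    $line =~ tr/\x00//d;
    return $line;
}

# Same mapping in one tr///: the first two search characters map to a
# space; the rest have no replacement and are deleted because of /d.
sub clean_tr {
    my ($line) = @_;
    $line =~ tr/\x01\x05\x00\xdd\xff/  /d;
    return $line;
}

my $sample = (' ' x 23) . "\xff\xdd.Anodizado<br>\x00    Ultima\n";
die "results differ\n" unless clean_loop($sample) eq clean_tr($sample);

# Run each candidate for at least one CPU second and print a comparison.
cmpthese(-1, {
    loop => sub { clean_loop($sample) },
    tr   => sub { clean_tr($sample) },
});
```

The equality check before cmpthese is there so that only equivalent
implementations get compared; the timings themselves will depend on your
machine and on the real input.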