Hamann, T.D. (Thomas) wrote:
Hi,

Hello,

I see that you've found the problem, but I'd like to make some comments.


I am having a rather unusual problem with a script that I wrote last
year to clean unwanted contents out of UTF-8 encoded text files. It
worked fine in the past, but when I try to run it now I get an error
message and somehow all newlines are removed from the resulting file.
Nothing was changed between 2011 and 2012 in the script, which I give
below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

It might be better to have some error checking here:

@ARGV == 2 or die "usage: filecleaner.plx <source file name> <destination file name>\n";

my ( $source, $destination ) = @ARGV;


open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";

while (<IN>) {
     # Replaces all tab-characters with spaces:
     s/\t/ /g;

Replacing single characters is better done with the tr/// operator:

      tr/\t/ /;
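
As a small illustration (the sample line and the sub name are made up for this sketch), tr/// maps characters one-to-one without involving the regex engine, and it returns the number of characters it changed:

```perl
use strict;
use warnings;

# Returns a copy of the line with tabs turned into spaces, plus the
# number of characters that tr/// changed.
sub tabs_to_spaces {
    my ($line) = @_;
    my $count = ( $line =~ tr/\t/ / );
    return ( $line, $count );
}

# Hypothetical sample line, not taken from the original files.
my ( $clean, $count ) = tabs_to_spaces("one\ttwo\tthree\n");
print "changed $count character(s): $clean";
```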


     # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
     s/ - / — /g;
     # Removes the leading space(s) from a variety of unwanted combinations:
     s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

It is better to use a character class instead of alternation for single character alternatives:

       s/( +)([ .,:;!\])\n])/$2/g;

And you don't need to capture $1 if you are not going to use it:

       s/ +([ .,:;!\])\n])/$1/g;

Nor do you need to capture anything at all:

       s/ +(?=[ .,:;!\])\n])//g;
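
A quick way to convince yourself that the three forms behave alike is to run them side by side. This is only a sketch: the sub names and the sample line are invented here, and it shows agreement on this one input rather than proving equivalence in general.

```perl
use strict;
use warnings;

# The three substitutions shown above, each applied to its own copy.
sub with_alternation {
    my ($s) = @_;
    $s =~ s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    return $s;
}

sub with_class {
    my ($s) = @_;
    $s =~ s/ +([ .,:;!\])\n])/$1/g;
    return $s;
}

sub with_lookahead {
    my ($s) = @_;
    $s =~ s/ +(?=[ .,:;!\])\n])//g;
    return $s;
}

# Hypothetical sample line.
my $sample = "word , more  ; end .\n";

# Each call prints the same cleaned line.
for my $form ( \&with_alternation, \&with_class, \&with_lookahead ) {
    print $form->($sample);
}
```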


     # Removes multiple dots:
     s/\.+/./g;
     # Removes multiple commas:
     s/,+/,/g;
     # Removes multiple colons:
     s/:+/:/g;
     # Removes multiple semi-colons:
     s/;+/;/g;

Those four substitution operators can be replaced with one transliteration:

      tr/.,:;//s;
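
With an empty replacement list, tr/// maps each listed character to itself, and the /s modifier squeezes runs of the same character down to one. A minimal sketch comparing it with the four substitutions (sub names and sample text are made up):

```perl
use strict;
use warnings;

# Squeeze repeated punctuation with a single transliteration.
sub squeeze_with_tr {
    my ($s) = @_;
    $s =~ tr/.,:;//s;
    return $s;
}

# The same job done with the four original substitutions.
sub squeeze_with_subst {
    my ($s) = @_;
    $s =~ s/\.+/./g;
    $s =~ s/,+/,/g;
    $s =~ s/:+/:/g;
    $s =~ s/;+/;/g;
    return $s;
}

# Hypothetical sample line.
my $sample = "Wait... yes,, no:: maybe;;\n";
print squeeze_with_tr($sample);
print squeeze_with_subst($sample);
```

Only runs of the same character are squeezed, so a mixed run such as "..,," becomes ".," with either version.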


     # Removes commas before dots:
     s/(,+)(\.)/$2/g;

Again, no need to capture anything:

       s/,+(?=\.)//g;


     # Removes the trailing spaces and dots behind two types of brackets:

This removes trailing spaces OR dots, not trailing spaces AND dots:

     s/(\(|\[)( +|\.+)/$1/g;

       s/(?<=[([])( +|\.+)//g;
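
To show the difference, here is a sketch that compares the "either" pattern with one that removes any mix of spaces and dots. The second pattern is my assumption about what the comment intended, and the sub names and sample strings are made up.

```perl
use strict;
use warnings;

# Removes one run of spaces OR one run of dots after a bracket.
sub strip_either_run {
    my ($s) = @_;
    $s =~ s/(?<=[(\[])( +|\.+)//g;
    return $s;
}

# Removes any mix of spaces and dots after a bracket
# (assumed intent, not confirmed by the original poster).
sub strip_mixed_run {
    my ($s) = @_;
    $s =~ s/(?<=[(\[])[ .]+//g;
    return $s;
}

# Hypothetical sample line: a bracket followed by a space and two dots.
my $sample = "( ..text)\n";
print strip_either_run($sample);    # the dots survive
print strip_mixed_run($sample);     # spaces and dots are both gone
```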


     # Removes empty sets of brackets:
     s/(\(|\[)(\)|\])//g;

       s/\(\)|\[\]//g;
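
The two versions are not quite the same: the original also removes mismatched pairs such as "(]", while the second only removes "()" and "[]". A small sketch to show it (sub names and sample text are invented):

```perl
use strict;
use warnings;

# Original pattern: any opening bracket followed by any closing bracket.
sub strip_any_pair {
    my ($s) = @_;
    $s =~ s/(\(|\[)(\)|\])//g;
    return $s;
}

# Stricter pattern: only properly matched empty pairs.
sub strip_matched_pair {
    my ($s) = @_;
    $s =~ s/\(\)|\[\]//g;
    return $s;
}

# Hypothetical sample line with one mismatched pair.
my $sample = "a () b [] c (] d\n";
print strip_any_pair($sample);
print strip_matched_pair($sample);
```

Which behaviour you want depends on whether mismatched brackets can occur in your files.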


     # Removes whitespace at beginning of line:
     s/^\s+//;
     # Removes whitespace at end of line:
     s/\s+$//;
     # Prints all non-empty lines to file:
     if (!/^\s*$/) {

       if ( /\S/ ) {


         print OUT $_;
     }
}

close IN;
close OUT;
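
One more thing to be aware of: \s matches "\n", so s/\s+$// removes the line ending along with any trailing spaces, and print does not put it back. Whether that is related to the missing newlines you mentioned I can't say, but here is a sketch of the trimming step that restores the line ending, assuming you want one output line per non-empty input line. The sub name and sample lines are made up.

```perl
use strict;
use warnings;

# Trims a line the same way as the loop above and returns it ready for
# printing, or undef when nothing but whitespace is left.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;    # \s matches "\n", so the line ending goes too
    return unless $line =~ /\S/;
    return "$line\n";     # put the line ending back (assumed intent)
}

# Hypothetical input lines.
for my $raw ( "  first line \n", "\t\n", "second line\n" ) {
    my $clean = clean_line($raw);
    print $clean if defined $clean;
}
```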



John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.                   -- Albert Einstein
