Hamann, T.D. (Thomas) wrote:
Hi,
Hello,
I see that you've found the problem but I'd like to make some comments.
I am having a rather unusual problem with a script that I wrote last
year to clean unwanted contents out of UTF-8 encoded text files. It
worked fine in the past, but when I try to run it now I get an error
message and somehow all newlines are removed from the resulting file.
Nothing was changed between 2011 and 2012 in the script, which I give
below:
#!/usr/bin/perl
# filecleaner.plx
use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';
my $source = shift @ARGV;
my $destination = shift @ARGV;
It might be better to have some error checking here:
@ARGV == 2 or die "usage: filecleaner.plx <source file name> <destination file name>\n";
my ( $source, $destination ) = @ARGV;
open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";
while (<IN>) {
# Replaces all tab-characters with spaces:
s/\t/ /g;
Single-character replacements are better done with the tr/// operator:
tr/\t/ /;
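As a quick check that the two forms do the same job here, a minimal sketch (the helper names and the sample line are made up for this illustration):

```perl
use strict;
use warnings;

# Hypothetical helpers, only for comparing the two operators.
sub tabs_with_s  { my ($line) = @_; $line =~ s/\t/ /g; return $line }
sub tabs_with_tr { my ($line) = @_; $line =~ tr/\t/ /; return $line }

my $tab_sample = "one\ttwo\t\tthree\n";   # made-up input line

print tabs_with_s($tab_sample) eq tabs_with_tr($tab_sample)
    ? "same result\n"
    : "different result\n";
```

tr/// does no pattern matching at all, it only maps characters, which is why it is the usual choice for one-for-one replacements like this.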
# Replaces all hyphens that are both preceded and trailed by a space by
# long dashes preceded and trailed by a space:
s/ - / — /g;
# Removes the leading space(s) from a variety of unwanted combinations:
s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
It is better to use a character class instead of alternation for single
character alternatives:
s/( +)([ .,:;!\])\n])/$2/g;
And you don't need to capture $1 if you are not going to use it:
s/ +([ .,:;!\])\n])/$1/g;
Nor do you need to capture anything at all:
s/ +(?=[ .,:;!\])\n])//g;
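To convince yourself that the four variants agree, here is a small sketch that runs each of them on the same line. The anonymous subs and the sample line are assumptions made for the illustration, not part of the original script:

```perl
use strict;
use warnings;

# The four variants from above, each wrapped in an anonymous sub
# so they can be applied to a copy of the same input.
my @variants = (
    sub { my ($s) = @_; $s =~ s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g; return $s },
    sub { my ($s) = @_; $s =~ s/( +)([ .,:;!\])\n])/$2/g;          return $s },
    sub { my ($s) = @_; $s =~ s/ +([ .,:;!\])\n])/$1/g;            return $s },
    sub { my ($s) = @_; $s =~ s/ +(?=[ .,:;!\])\n])//g;            return $s },
);

my $space_sample  = "a , b  ( c ) !\n";   # made-up input line
my @space_results = map { $_->($space_sample) } @variants;

# Print each result so they can be compared by eye.
print "variant ", $_ + 1, ": $space_results[$_]" for 0 .. $#space_results;
```

The lookahead version only deletes the spaces and never touches the character that follows them, so there is nothing to capture and nothing to put back.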
# Removes multiple dots:
s/\.+/./g;
# Removes multiple commas:
s/,+/,/g;
# Removes multiple colons:
s/:+/:/g;
# Removes multiple semi-colons:
s/;+/;/g;
Those four substitution operators can be replaced with one transliteration:
tr/.,:;//s;
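A short sketch of what the /s (squeeze) modifier does, using a made-up sample line. With an empty replacement list each character maps to itself, and /s collapses runs of the same character into one:

```perl
use strict;
use warnings;

my $punct_sample = "Wait.... what,, no::: really;; a.,b\n";   # made-up input

# Copy the sample, then squeeze repeated dots, commas, colons and semicolons.
(my $squeezed = $punct_sample) =~ tr/.,:;//s;

# For comparison: the four separate substitutions from the original script.
(my $substituted = $punct_sample) =~ s/\.+/./g;
$substituted =~ s/,+/,/g;
$substituted =~ s/:+/:/g;
$substituted =~ s/;+/;/g;

print $squeezed;
print $squeezed eq $substituted ? "same result\n" : "different result\n";
```

Note that only runs of the same character are squeezed, so a mixed pair such as ".," is left alone, just as with the four substitutions.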
# Removes commas before dots:
s/(,+)(\.)/$2/g;
Again, no need to capture anything:
s/,+(?=\.)//g;
# Removes the trailing spaces and dots behind two types of brackets:
Note that this removes trailing spaces OR dots, not trailing spaces AND dots:
s/(\(|\[)( +|\.+)/$1/g;
s/(?<=[([])( +|\.+)//g;
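If the intent really was to remove any mixture of spaces and dots after an opening bracket (that is my assumption, the comment in the script is ambiguous), a character class with + would do it. A minimal sketch on a made-up line:

```perl
use strict;
use warnings;

my $bracket_sample = "( . text) [..  more]\n";   # made-up input line

# Removes one run of spaces OR one run of dots after ( or [
(my $either = $bracket_sample) =~ s/(?<=[(\[])( +|\.+)//g;

# Assumed alternative: removes any mix of spaces and dots after ( or [
(my $both = $bracket_sample) =~ s/(?<=[(\[])[ .]+//g;

print $either;
print $both;
```

With the first form a dot that follows the removed spaces (or spaces that follow the removed dots) is left behind; the second form takes the whole run.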
# Removes empty sets of brackets:
s/(\(|\[)(\)|\])//g;
s/\(\)|\[\]//g;
# Removes whitespace at beginning of line:
s/^\s+//;
# Removes whitespace at end of line:
s/\s+$//;
# Prints all non-empty lines to file:
if (!/^\s*$/) {
if ( /\S/ ) {
print OUT $_;
}
}
close IN;
close OUT;
John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein