At 11:27 AM +0000 1/4/12, Hamann, T.D. (Thomas) wrote:
Hi, I am having a rather unusual problem with a script that I wrote
last year to clean unwanted contents out of UTF-8 encoded text
files. It worked fine in the past, but when I try to run it now I
get an error message and somehow all newlines are removed from the
resulting file. Nothing was changed between 2011 and 2012 in the
script, which I give below:

    #!/usr/bin/perl
    # filecleaner.plx
    use warnings;
    use strict;
    use utf8;
    use open ':encoding(utf8)';

    my $source = shift @ARGV;
    my $destination = shift @ARGV;

    open IN, $source or die "Can't read source file $source: $!\n";
    open OUT, ">$destination" or die "can't write on file $destination: $!\n";

    while (<IN>) {
        # Replaces all tab-characters with spaces:
        s/\t/ /g;

        # Replaces all hyphens that are both preceded and trailed by a
        # space by long dashes preceded and trailed by a space:
        s/ - / - /g;

        # Removes the leading space(s) from a variety of unwanted
        # combinations:
        s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

Character classes can save you some typing and improve readability,
and it is not necessary to capture what you don't want:

    s/ +([ .,:;!\])\n])/$1/g;
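
As a quick check that the two forms behave the same, here is a small
sketch. The sub names and the sample strings are made up for
illustration; only the two substitutions come from this thread.

```perl
use strict;
use warnings;

# The alternation from the original script.
sub strip_alternation {
    my ($s) = @_;
    $s =~ s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    return $s;
}

# The same substitution written with a character class and one capture.
sub strip_class {
    my ($s) = @_;
    $s =~ s/ +([ .,:;!\])\n])/$1/g;
    return $s;
}

# Hypothetical sample lines that exercise each punctuation character.
my @samples = ("a , b .\n", "x  ; y :", "( z ) [ w ] !", "end   \n");

for my $s (@samples) {
    my $verdict = strip_alternation($s) eq strip_class($s) ? "same" : "DIFFERENT";
    print "$verdict\n";
}
```

Inside the character class only the closing square bracket needs a
backslash; the dot and the closing parenthesis are literal there.
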

        # Removes multiple dots:
        s/\.+/./g;

        # Removes multiple commas:
        s/,+/,/g;

        # Removes multiple colons:
        s/:+/:/g;

        # Removes multiple semi-colons:
        s/;+/;/g;

        # Removes commas before dots:
        s/(,+)(\.)/$2/g;

You have already replaced successive commas with a single comma, so
the + quantifier isn't needed here.

        # Removes the trailing spaces and dots behind two types of
        # brackets:
        s/(\(|\[)( +|\.+)/$1/g;

        # Removes empty sets of brackets:
        s/(\(|\[)(\)|\])//g;

        # Removes whitespace at beginning of line:
        s/^\s+//;

        # Removes whitespace at end of line:
        s/\s+$//;

Whitespace includes the newline character, so this last substitution
is what removes your newlines.
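
To see the effect in isolation, here is a minimal sketch. The sub names
and the sample string are invented, and the [ \t] variant is just one
possible alternative, not something from the original script.

```perl
use strict;
use warnings;

# \s matches the newline too, so this strips it along with the spaces.
sub trim_all_whitespace {
    my ($s) = @_;
    $s =~ s/\s+$//;
    return $s;
}

# Hypothetical alternative: strip only trailing spaces and tabs.
# $ can match just before a final newline, so the newline survives.
sub trim_spaces_and_tabs {
    my ($s) = @_;
    $s =~ s/[ \t]+$//;
    return $s;
}

my $line = "some text   \n";

print "with \\s+:     ", (trim_all_whitespace($line) =~ /\n\z/ ? "newline kept" : "newline gone"), "\n";
print "with [ \\t]+:  ", (trim_spaces_and_tabs($line) =~ /\n\z/ ? "newline kept" : "newline gone"), "\n";
```
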

        # Prints all non-empty lines to file:
        if (!/^\s*$/) {
            print OUT $_;
        }
    }

    close IN;
    close OUT;

The error message
("Malformed UTF-8 character (unexpected continuation byte 0x97, with
no preceding start byte) at filecleaner.plx line 23") seems to refer
to the long dash in line 23. This was copied out of a UTF-8 encoded
file in 2011. If I change that to another UTF-8 long dash copied
from another UTF-8 file downloaded off the internet, the error
message goes away. However, if I copy the dash out of a supposedly
UTF-8 encoded file made in Word I get the error message.
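Sounds like the Word long dash isn't valid UTF-8. Byte 0x97 is the
long dash in Windows-1252, so that file was probably saved in that
encoding rather than in UTF-8.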
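
If you want to check what that byte is, a rough sketch with the core
Encode module follows. The sub names are made up, and I am only
assuming the Word file was saved as Windows-1252; the byte value is the
one from your error message.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Code point you get when the byte is read as Windows-1252.
sub cp1252_codepoint {
    my ($byte) = @_;
    return ord(decode('cp1252', $byte));
}

# True if the byte string decodes as strict UTF-8. With FB_CROAK,
# decode() dies on malformed input, so the eval catches that.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, because decode() may modify its argument
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

printf "0x97 as Windows-1252: U+%04X\n", cp1252_codepoint("\x97");
print  "0x97 alone is valid UTF-8: ", (is_valid_utf8("\x97") ? "yes" : "no"), "\n";
print  "E2 80 94 is valid UTF-8:   ", (is_valid_utf8("\xE2\x80\x94") ? "yes" : "no"), "\n";
```

U+2014 is the long (em) dash, and E2 80 94 is how that same character
is written in UTF-8.
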
With the dash fixed, however, the newlines still get stripped out of
the file, which leaves me at a complete loss, since nothing in the
code ought to chomp off newline characters.
I suggest you chomp the input and add a newline when you print.
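
Something along these lines, as a sketch only. The clean_line sub and
the in-memory sample input are invented for the example, and I have
left out your other substitutions to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper: cleans one line and returns it without a
# newline, or undef if nothing is left after trimming.
sub clean_line {
    my ($line) = @_;
    chomp $line;            # take the newline off first
    $line =~ s/^\s+//;      # whitespace at beginning of line
    $line =~ s/\s+$//;      # whitespace at end of line
    return length($line) ? $line : undef;
}

# In-memory sample input standing in for the real source file.
my $input = "  first line  \n\n\tsecond line\n";
open my $in, '<', \$input or die "Can't open in-memory input: $!\n";

while (my $line = <$in>) {
    my $cleaned = clean_line($line);
    print "$cleaned\n" if defined $cleaned;    # add the newline back on output
}

close $in;
```

That way it no longer matters whether a substitution eats the newline,
because the output always gets exactly one.
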
What could cause such behaviour? Corrupt script file? Corrupted perl
installation? Some stupid recent Windows update that screwed up
UTF-8 and/or file handling in Windows XP? Before I went on Christmas
holidays things were fine...
Can't help you there.
--
Jim Gibson
[email protected]