Re: Bizarre problem: Known good script (in 2011) fails to work in 2012

Jim Gibson Wed, 04 Jan 2012 08:10:57 -0800

At 11:27 AM +0000 1/4/12, Hamann, T.D. (Thomas) wrote:

Hi,

I am having a rather unusual problem with a script that I wrote last year to clean unwanted contents out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message and somehow all newlines are removed from the resulting file. Nothing was changed between 2011 and 2012 in the script, which I give below:

    #!/usr/bin/perl
    # filecleaner.plx
    use warnings;
    use strict;
    use utf8;
    use open ':encoding(utf8)';

    my $source = shift @ARGV;
    my $destination = shift @ARGV;

    open IN, $source or die "Can't read source file $source: $!\n";
    open OUT, ">$destination" or die "can't write on file $destination: $!\n";

    while (<IN>) {
        # Replaces all tab-characters with spaces:
        s/\t/ /g;
        # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
        s/ - / - /g;
        # Removes the leading space(s) from a variety of unwanted combinations:
        s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

Character classes can save you some typing and improve readability, and it is not necessary to capture what you don't want:


    s/ +([ .,:;!\])\n])/$1/g;
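As a quick sanity check, the two forms can be compared side by side. This is only a sketch: the sub names and the sample strings are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Alternation form, as in the quoted script.
sub strip_alternation {
    my ($text) = @_;
    $text =~ s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    return $text;
}

# Character-class form: only the character after the spaces is captured.
sub strip_class {
    my ($text) = @_;
    $text =~ s/ +([ .,:;!\])\n])/$1/g;
    return $text;
}

# Made-up sample lines; both forms should agree on each of them.
for my $sample ("word , word .\n", "a  ; b ] c )\n", "end   \n") {
    print strip_alternation($sample) eq strip_class($sample)
        ? "same\n"
        : "differ\n";
}
```

Inside a character class only the closing bracket needs a backslash here; the dot, the exclamation mark and the closing parenthesis are literal.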

        # Removes multiple dots:
        s/\.+/./g;
        # Removes multiple commas:
        s/,+/,/g;
        # Removes multiple colons:
        s/:+/:/g;
        # Removes multiple semi-colons:
        s/;+/;/g;
        # Removes commas before dots:
        s/(,+)(\.)/$2/g;

You have already replaced successive commas with a single comma, so + isn't needed here.
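A minimal sketch of the simplified pair of substitutions, assuming they run in the same order as in the script. The sub name and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper combining the two comma-related steps.
sub tidy_commas {
    my ($text) = @_;
    $text =~ s/,+/,/g;     # collapse runs of commas first, as the script does
    $text =~ s/,\././g;    # at most one comma can now precede a dot
    return $text;
}

print tidy_commas("one,,, two,,. three,.\n");
```

No capture groups are needed because the replacement is always a literal dot.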

        # Removes the trailing spaces and dots behind two types of brackets:
        s/(\(|\[)( +|\.+)/$1/g;
        # Removes empty sets of brackets:
        s/(\(|\[)(\)|\])//g;
        # Removes whitespace at beginning of line:
        s/^\s+//;
        # Removes whitespace at end of line:
        s/\s+$//;


Whitespace (\s) includes the newline character, so s/\s+$// also removes the newline at the end of each line!
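The effect is easy to reproduce in isolation. The following is a sketch with a made-up input line; the sub name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the script's end-of-line substitution to a copy.
sub strip_trailing_whitespace {
    my ($line) = @_;
    $line =~ s/\s+$//;    # \s matches "\n", so the line terminator goes too
    return $line;
}

my $line     = "some text   \n";
my $stripped = strip_trailing_whitespace($line);

print "newline still present\n" if $stripped =~ /\n\z/;
print "newline removed\n"       if $stripped !~ /\n/;
```

If only spaces and tabs should go, a narrower class such as [ \t]+ in place of \s+ would leave the newline alone.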

        # Prints all non-empty lines to file:
        if (!/^\s*$/) {
            print OUT $_;
        }
    }

    close IN;
    close OUT;

The error message ("Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer to the long dash in line 23. This was copied out of a UTF-8 encoded file in 2011. If I change that to another UTF-8 long dash copied from another UTF-8 file downloaded off the internet, the error message goes away. However, if I copy the dash out of a supposedly UTF-8 encoded file made in Word I get the error message.



Sounds like the Word long dash isn't valid UTF-8.
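One possible explanation, offered as an assumption rather than a diagnosis: byte 0x97 is the em dash in Windows-1252, while the same character in UTF-8 is the three bytes E2 80 94. The sketch below uses the core Encode module to show the difference; the sub names are hypothetical, and whether the Word file really was saved as Windows-1252 is a guess.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical helper: reinterpret Windows-1252 bytes and re-encode as UTF-8.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('cp1252', $bytes));
}

my $word_dash = "\x97";    # assumed: an em dash saved as Windows-1252
my $fixed     = cp1252_to_utf8($word_dash);

printf "0x97 alone valid UTF-8? %s\n", is_valid_utf8($word_dash) ? 'yes' : 'no';
printf "re-encoded bytes: %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $fixed;
printf "re-encoded valid UTF-8? %s\n", is_valid_utf8($fixed) ? 'yes' : 'no';
```

If that is what happened, re-saving the script itself as UTF-8 in an editor should make the warning go away, which would match the observation that a dash copied from a genuine UTF-8 file works.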

With the dash fixed, however, the newlines still get stripped out of the file, which leaves me at a complete loss, since nothing in the code ought to chomp off newline characters.



I suggest you chomp the input and add a newline when you print.
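A rough sketch of that approach, reduced to a few of the script's substitutions. The sub name clean_line is made up, and returning an empty string for blank lines is just one way to skip them.

```perl
use strict;
use warnings;

# Hypothetical helper: cleans one line and returns it with exactly one
# trailing newline, or an empty string if nothing but whitespace is left.
sub clean_line {
    my ($line) = @_;
    chomp $line;              # remove the line terminator up front
    $line =~ s/\t/ /g;        # tabs to spaces, as in the script
    $line =~ s/^\s+//;        # leading whitespace
    $line =~ s/\s+$//;        # trailing whitespace; no newline left to lose
    return $line =~ /\S/ ? "$line\n" : '';
}

print clean_line("  \tkeep me  \n");
print clean_line("   \n");
print clean_line("last line without newline");
```

In the script itself this would mean a chomp at the top of the while loop and print OUT "$_\n" at the bottom.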

What could cause such behaviour? Corrupt script file? Corrupted perl installation? Some stupid recent Windows update that screwed up UTF-8 and/or file handling in Windows XP? Before I went on Christmas holidays things were fine...


Can't help you there.

--
Jim Gibson