At 11:27 AM +0000 1/4/12, Hamann, T.D. (Thomas) wrote:
Hi,

I am having a rather unusual problem with a script that I wrote last year to clean unwanted contents out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message and somehow all newlines are removed from the resulting file. Nothing was changed between 2011 and 2012 in the script, which I give below:

    #!/usr/bin/perl
    # filecleaner.plx
    use warnings;
    use strict;
    use utf8;
    use open ':encoding(utf8)';

    my $source = shift @ARGV;
    my $destination = shift @ARGV;

    open IN, $source or die "Can't read source file $source: $!\n";
    open OUT, ">$destination" or die "can't write on file $destination: $!\n";

    while (<IN>) {

        # Replaces all tab-characters with spaces:
        s/\t/ /g;

        # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
        s/ - / - /g;

        # Removes the leading space(s) from a variety of unwanted combinations:
        s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

Character classes can save you some typing and improve readability, and it is not necessary to capture what you don't want:

    s/ +([ .,:;!\])\n])/$1/g;
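As a quick check, here is a minimal sketch that runs the original alternation and a character class covering the same characters over a few sample strings. The helper names and the sample strings are made up for this comparison; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution as written in the original script.
sub strip_with_alternation {
    my ($text) = @_;
    $text =~ s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    return $text;
}

# Hypothetical helper: the same characters in a character class,
# capturing only the character that has to be kept.
sub strip_with_class {
    my ($text) = @_;
    $text =~ s/ +([ .,:;!\])\n])/$1/g;
    return $text;
}

# Made-up sample lines; each should give the same result with both helpers.
for my $sample ("word , word .\n", "list ( a ; b ) !\n", "end   \n") {
    my $verdict = strip_with_alternation($sample) eq strip_with_class($sample)
        ? "same"
        : "DIFFERENT";
    print "$verdict: ", strip_with_class($sample);
}
```

Inside the class only the closing square bracket needs a backslash; the dot, the exclamation mark and the closing parenthesis are literal there.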

        # Removes multiple dots:
        s/\.+/./g;

        # Removes multiple commas:
        s/,+/,/g;

        # Removes multiple colons:
        s/:+/:/g;

        # Removes multiple semi-colons:
        s/;+/;/g;

        # Removes commas before dots:
        s/(,+)(\.)/$2/g;

You have already replaced successive commas with a single comma, so + isn't needed here.
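A small sketch of that simplification, assuming the comma-collapsing substitution has already run on the line. The helper name and the sample text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the two comma-related steps in the script's order.
sub tidy_commas {
    my ($line) = @_;
    $line =~ s/,+/,/g;     # collapse runs of commas, as the script already does
    $line =~ s/,\././g;    # at most one comma can now precede a dot
    return $line;
}

print tidy_commas("one,,,. two,. three."), "\n";   # prints "one. two. three."
```

No capture groups are needed because the replacement is always a literal dot.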

        # Removes the trailing spaces and dots behind two types of brackets:
        s/(\(|\[)( +|\.+)/$1/g;

        # Removes empty sets of brackets:
        s/(\(|\[)(\)|\])//g;

        # Removes whitespace at beginning of line:
        s/^\s+//;

        # Removes whitespace at end of line:
        s/\s+$//;

Whitespace includes the newline character, so \s+$ removes it along with any trailing spaces!
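To see this in isolation, here is a minimal sketch that applies the same end-of-line substitution to one made-up string. The helper name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the script's end-of-line cleanup to one line.
sub strip_trailing_whitespace {
    my ($line) = @_;
    $line =~ s/\s+$//;    # \s matches "\n" too, so the newline goes as well
    return $line;
}

my $cleaned = strip_trailing_whitespace("some text   \n");
print $cleaned =~ /\n/ ? "newline kept\n" : "newline removed\n";
```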

        # Prints all non-empty lines to file:
        if (!/^\s*$/) {
            print OUT $_;
        }
    }

    close IN;
    close OUT;

The error message ("Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer to the long dash in line 23. This was copied out of a UTF-8 encoded file in 2011. If I change that to another UTF-8 long dash copied from another UTF-8 file downloaded off the internet, the error message goes away. However, if I copy the dash out of a supposedly UTF-8 encoded file made in Word I get the error message.


Sounds like the Word long dash isn't valid UTF-8.
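One possible explanation, offered as a guess rather than a diagnosis: in the Windows-1252 code page the single byte 0x97 is the em dash, while in UTF-8 a byte of that value can only be a continuation byte. That would fit the error message if the dash reached the script file in Windows-1252 form. The sketch below only inspects that one byte using the core Encode module; it does not look at the actual script file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte named in the error message.
my $byte = "\x97";

# Interpreted as Windows-1252, the byte is U+2014 (em dash).
my $char = decode('cp1252', $byte);
printf "cp1252 0x97 decodes to U+%04X\n", ord $char;

# On its own the byte is not well-formed UTF-8:
# utf8::decode works in place and returns false in that case.
my $copy = $byte;
my $is_utf8 = utf8::decode($copy);
print "valid UTF-8 on its own: ", ($is_utf8 ? "yes" : "no"), "\n";
```

If that turns out to be the cause, writing the dash in the script as the escape \x{2014} would avoid depending on how the editor saves the file.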

With the dash fixed, however, the newlines still get stripped out of the file, which leaves me at a complete loss, since nothing in the code ought to chomp off newline characters.


I suggest you chomp the input and add a newline when you print.
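A minimal sketch of that suggestion, showing only the whitespace steps and the final print. The helper name, the lexical filehandles and the in-memory sample data are assumptions made to keep the example self-contained; the real script reads from and writes to files.

```perl
use strict;
use warnings;

# Hypothetical helper: trims each line read from $in and writes it to $out.
sub clean_lines {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;               # take the newline off first
        $line =~ s/^\s+//;         # whitespace at beginning of line
        $line =~ s/\s+$//;         # whitespace at end of line
        next if $line eq '';       # skip lines that are now empty
        print {$out} "$line\n";    # put the newline back when printing
    }
    return;
}

# In-memory filehandles stand in for the source and destination files.
my $input  = "  first line  \n\n\tsecond line\n";
my $output = '';
open my $in,  '<', \$input  or die "Can't open in-memory input: $!";
open my $out, '>', \$output or die "Can't open in-memory output: $!";
clean_lines($in, $out);
close $in;
close $out;

print $output;
```

With the newline handled explicitly, the substitutions in the loop body can no longer remove it by accident.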

What could cause such behaviour? Corrupt script file? Corrupted perl installation? Some stupid recent Windows update that screwed up UTF-8 and/or file handling in Windows XP? Before I went on Christmas holidays things were fine...

Can't help you there.

--
Jim Gibson
j...@gibson.org