Okay, some further testing using a family member's Windows XP PC and a fresh 
install of ActivePerl seems to have revealed the culprit:

Changing
    s/\s+$//;
to
    s/(\s+$)(\n)/$2/;

fixed the issue. 
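
For reference, \s also matches "\n", so s/\s+$// removes the line ending along 
with the trailing spaces, whereas the replacement only matches when a "\n" 
follows the whitespace. The sketch below compares the two and shows one 
possible alternative. The helper name strip_trailing and the sample strings are 
made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): removes all trailing
# whitespace, including any "\r" or "\n", then appends exactly one "\n".
sub strip_trailing {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return "$line\n";
}

# The original substitution: \s matches "\n" too, so the line ending is lost.
(my $original = "some text  \n") =~ s/\s+$//;

# The replacement from this message: keeps the "\n" that follows the spaces...
(my $patched = "some text  \n") =~ s/(\s+$)(\n)/$2/;

# ...but does not match at all when the last line has no "\n".
(my $last_line = "some text  ") =~ s/(\s+$)(\n)/$2/;

print strip_trailing("some text \t\r\n");
```

Note that strip_trailing turns a whitespace-only line into a bare "\n", which 
the script's final if (!/^\s*$/) check would still skip.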

Since the script worked fine until about 3 weeks ago and I copied the original 
code from http://www.perlmonks.org/?node_id=2258, I can only surmise that 
Microsoft must have changed the way Windows XP deals with newlines in a very 
recent update, which they could have communicated to the outside world. :( 

(Oh well, another reason to dislike Microsoft, I guess.)

Now for another question: How much code will this change break? 

Thomas


________________________________________
From: Hamann, T.D. (Thomas) [ham...@nhn.leidenuniv.nl]
Sent: Wednesday, 4 January 2012 12:27
To: beginners@perl.org
Subject: Bizarre problem: Known good script (in 2011) fails to work in 2012

Hi,

I am having a rather unusual problem with a script that I wrote last year to 
clean unwanted contents out of UTF-8 encoded text files. It worked fine in the 
past, but when I try to run it now I get an error message and somehow all 
newlines are removed from the resulting file. Nothing was changed between 2011 
and 2012 in the script, which I give below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";

while (<IN>) {
    # Replaces all tab-characters with spaces:
    s/\t/ /g;
    # Replaces all hyphens that are both preceded and trailed by a space by
    # long dashes preceded and trailed by a space:
    s/ - / — /g;
    # Removes the leading space(s) from a variety of unwanted combinations:
    s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    # Removes multiple dots:
    s/\.+/./g;
    # Removes multiple commas:
    s/,+/,/g;
    # Removes multiple colons:
    s/:+/:/g;
    # Removes multiple semi-colons:
    s/;+/;/g;
    # Removes commas before dots:
    s/(,+)(\.)/$2/g;
    # Removes the trailing spaces and dots behind two types of brackets:
    s/(\(|\[)( +|\.+)/$1/g;
    # Removes empty sets of brackets:
    s/(\(|\[)(\)|\])//g;
    # Removes whitespace at beginning of line:
    s/^\s+//;
    # Removes whitespace at end of line:
    s/\s+$//;
    # Prints all non-empty lines to file:
    if (!/^\s*$/) {
        print OUT $_;
    }
}

close IN;
close OUT;
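
As an aside on the file handling, here is a minimal sketch of the same 
open/print/close pattern written with three-argument open, lexical filehandles 
and an explicit encoding layer on each handle instead of the "use open" pragma. 
It is only an illustration under stated assumptions: the temporary file stands 
in for $destination and the sample text is invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical temporary file standing in for $destination.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Three-argument open: the mode and encoding layer are separate from the name.
open my $out, '>:encoding(UTF-8)', $path
    or die "Can't write to $path: $!\n";
print {$out} "caf\x{E9} \x{2014} ok\n";
close $out or die "Can't close $path: $!\n";

# Read the line back through the same layer to check the round trip.
open my $in, '<:encoding(UTF-8)', $path
    or die "Can't read $path: $!\n";
my $line = <$in>;
close $in;
```

With the layer spelled out per handle, it is visible at each open call how the 
file is decoded or encoded.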

The error message ("Malformed UTF-8 character (unexpected continuation byte 
0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer 
to the long dash in line 23. This was copied out of a UTF-8 encoded file in 
2011. If I change that to another UTF-8 long dash copied from another UTF-8 
file downloaded off the internet, the error message goes away. However, if I 
copy the dash out of a supposedly UTF-8 encoded file made in Word, I get the 
error message.
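
One possible reading of this error, offered as an assumption rather than a 
diagnosis: byte 0x97 is the long (em) dash in the Windows-1252 code page, so 
the dash may have reached the script file in that encoding rather than in 
UTF-8. The sketch below checks the byte values with the core Encode module and 
writes the dash as the escape \x{2014}, which keeps the script itself plain 
ASCII. The sample string is invented for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 0x97 is the em dash in Windows-1252; in UTF-8 a byte of that form can
# only be a continuation byte, which matches the wording of the error.
my $from_cp1252 = decode('cp1252', "\x97");

# In UTF-8 the same character (U+2014) is the three bytes E2 80 94.
my $utf8_bytes = encode('UTF-8', "\x{2014}");

# Writing the dash as an escape means the encoding the script file is saved
# in no longer affects this substitution.
my $text = "a - b";
$text =~ s/ - / \x{2014} /g;

printf "U+%04X\n", ord $from_cp1252;
```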

With the dash fixed, however, the newlines still get stripped out of the file, 
which leaves me at a complete loss, since nothing in the code ought to chomp 
off newline characters.

What could cause such behaviour? Corrupt script file? Corrupted perl 
installation? Some stupid recent Windows update that screwed up UTF-8 and/or 
file handling in Windows XP? Before I went on Christmas holidays things were 
fine...

Any ideas?

Thanks,

Thomas
--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/


