Okay, some further testing using a family member's Windows XP PC and a fresh install of ActivePerl seems to have revealed the culprit:
Changing

    s/\s+$//;

to

    s/(\s+$)(\n)/$2/;

fixed the issue.

Since the script worked fine until about three weeks ago, and I copied the original code from http://www.perlmonks.org/?node_id=2258, I can only surmise that Microsoft changed the way Windows XP deals with newlines in a very recent update. They could at least have told the outside world about it. :( (Oh well, another reason to dislike Microsoft, I guess.)

Now for another question: how much code will this change break?

Thomas

________________________________________
From: Hamann, T.D. (Thomas) [ham...@nhn.leidenuniv.nl]
Sent: Wednesday, 4 January 2012 12:27
To: beginners@perl.org
Subject: Bizarre problem: Known good script (in 2011) fails to work in 2012

Hi,

I am having a rather unusual problem with a script that I wrote last year to clean unwanted content out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message, and somehow all newlines are removed from the resulting file.
Nothing was changed in the script between 2011 and 2012. I give it below:

#!/usr/bin/perl
# filecleaner.plx
use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";

while (<IN>) {
    # Replaces all tab-characters with spaces:
    s/\t/ /g;
    # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
    s/ - / — /g;
    # Removes the leading space(s) from a variety of unwanted combinations:
    s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    # Removes multiple dots:
    s/\.+/./g;
    # Removes multiple commas:
    s/,+/,/g;
    # Removes multiple colons:
    s/:+/:/g;
    # Removes multiple semi-colons:
    s/;+/;/g;
    # Removes commas before dots:
    s/(,+)(\.)/$2/g;
    # Removes the trailing spaces and dots behind two types of brackets:
    s/(\(|\[)( +|\.+)/$1/g;
    # Removes empty sets of brackets:
    s/(\(|\[)(\)|\])//g;
    # Removes whitespace at beginning of line:
    s/^\s+//;
    # Removes whitespace at end of line:
    s/\s+$//;
    # Prints all non-empty lines to file:
    if (!/^\s*$/) {
        print OUT $_;
    }
}

close IN;
close OUT;

The error message ("Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer to the long dash in line 23. This was copied out of a UTF-8 encoded file in 2011. If I change it to another UTF-8 long dash copied from another UTF-8 file downloaded off the internet, the error message goes away. However, if I copy the dash out of a supposedly UTF-8 encoded file made in Word, I get the error message again.

With the dash fixed, however, the newlines still get stripped out of the file, which leaves me at a complete loss, since nothing in the code ought to chomp off newline characters.

What could cause such behaviour? Corrupt script file? Corrupted perl installation?
Some stupid recent Windows update that screwed up UTF-8 and/or file handling in Windows XP? Before I went on Christmas holidays things were fine...

Any ideas?

Thanks,

Thomas

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
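
For reference, here is a minimal sketch comparing the two end-of-line substitutions discussed in this thread, assuming only core Perl. The helper names (strip_original, strip_keep_newline, strip_horizontal, visible) are hypothetical and not part of filecleaner.plx. In Perl regular expressions \s matches "\n" and "\r" as well as spaces and tabs, so s/\s+$// can consume the line terminator, while the replacement substitution puts the final newline back. The third variant is an alternative that is not taken from the thread: it restricts the character class so the newline is never matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Original substitution from the script: \s also matches "\n" and "\r",
# so the line terminator is removed along with the trailing spaces.
sub strip_original {
    my ($line) = @_;
    $line =~ s/\s+$//;
    return $line;
}

# Replacement from this thread: capture the final newline and put it back.
# A last line without a trailing newline is left unchanged by this one.
sub strip_keep_newline {
    my ($line) = @_;
    $line =~ s/(\s+$)(\n)/$2/;
    return $line;
}

# Alternative (an assumption, not from the thread): strip only spaces,
# tabs and carriage returns, so "\n" can never be part of the match.
sub strip_horizontal {
    my ($line) = @_;
    $line =~ s/[ \t\r]+$//;
    return $line;
}

# Makes line terminators visible when printing the results.
sub visible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return "\"$text\"";
}

for my $input ("foo  \n", "foo \r\n") {
    print "input:              ", visible($input), "\n";
    print "strip_original:     ", visible(strip_original($input)), "\n";
    print "strip_keep_newline: ", visible(strip_keep_newline($input)), "\n";
    print "strip_horizontal:   ", visible(strip_horizontal($input)), "\n";
}
```

With either of the last two variants the loop's print OUT $_; writes a newline for every line that had one. Another option, also an assumption rather than something from the thread, would be to keep s/\s+$// and print "$_\n" instead.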