Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread Hamann, T.D. (Thomas)
Hi,

I am having a rather unusual problem with a script that I wrote last year to 
clean unwanted contents out of UTF-8 encoded text files. It worked fine in the 
past, but when I try to run it now I get an error message and somehow all 
newlines are removed from the resulting file. Nothing was changed between 2011 
and 2012 in the script, which I give below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, '<', $source or die "Can't read source file $source: $!\n";
open OUT, '>', $destination or die "can't write on file $destination: $!\n";

while (<IN>) {
# Replaces all tab-characters with spaces:
s/\t/ /g;
# Replaces all hyphens that are both preceded and trailed by a space by
# long dashes preceded and trailed by a space:
s/ - / — /g; 
# Removes the leading space(s) from a variety of unwanted combinations:
s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
# Removes multiple dots:
s/\.+/./g;
# Removes multiple commas:
s/,+/,/g;
# Removes multiple colons:
s/:+/:/g;
# Removes multiple semi-colons:
s/;+/;/g;
# Removes commas before dots:
s/(,+)(\.)/$2/g;
# Removes the trailing spaces and dots behind two types of brackets:
s/(\(|\[)( +|\.+)/$1/g;
# Removes empty sets of brackets:
s/(\(|\[)(\)|\])//g;
# Removes whitespace at beginning of line:
s/^\s+//;
# Removes whitespace at end of line:
s/\s+$//;
# Prints all non-empty lines to file:
if (!/^\s*$/) {
print OUT $_;
}
}

close IN;
close OUT;

The error message (Malformed UTF-8 character (unexpected continuation byte 
0x97, with no preceding start byte) at filecleaner.plx line 23) seems to refer 
to the long dash in line 23. This was copied out of a UTF-8 encoded file in 
2011. If I change that to another UTF-8 long dash copied from another UTF-8 
file downloaded off the internet, the error message goes away. However, if I 
copy the dash out of a supposedly UTF-8 encoded file made in Word I get the 
error message.

With the dash fixed, however, the newlines still get stripped out of the file, 
which leaves me at a complete loss, since nothing in the code ought to chomp 
off newline characters.

What could cause such behaviour? Corrupt script file? Corrupted perl 
installation? Some stupid recent Windows update that screwed up UTF-8 and/or 
file handling in Windows XP? Before I went on Christmas holidays things were 
fine...

Any ideas?

Thanks,

Thomas
--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




RE: Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread Hamann, T.D. (Thomas)
Okay, some further testing using a family member's Windows XP PC and a fresh 
install of ActivePerl seems to have revealed the culprit:

Changing 
s/\s+$//;
to:
s/(\s+$)(\n)/$2/;

fixed the issue. 

Since the script worked fine until about 3 weeks ago and I copied the original 
code from http://www.perlmonks.org/?node_id=2258, I can only surmise that 
Microsoft must have changed the way Windows XP deals with newlines in a very 
recent update. Which they could have communicated with the outer world. :( 

(oh well, another reason to dislike Microsoft, I guess).

Now for another question: How much code will this change break? 

Thomas







Re: Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread Jim Gibson

At 11:27 AM + 1/4/12, Hamann, T.D. (Thomas) wrote:
Hi, I am having a rather unusual problem with a script that I wrote
last year to clean unwanted contents out of UTF-8 encoded text
files. It worked fine in the past, but when I try to run it now I
get an error message and somehow all newlines are removed from the
resulting file. Nothing was changed between 2011 and 2012 in the
script, which I give below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, '<', $source or die "Can't read source file $source: $!\n";
open OUT, '>', $destination or die "can't write on file $destination: $!\n";

while (<IN>) {
# Replaces all tab-characters with spaces:
s/\t/ /g;
# Replaces all hyphens that are both preceded and trailed by a space by
# long dashes preceded and trailed by a space:
s/ - / — /g;
# Removes the leading space(s) from a variety of unwanted combinations:
s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;


Character classes can save you some typing and improve readability, 
and it is not necessary to capture what you don't want:


s/ +([ .,:;!\])\n])/$1/g;

# Removes multiple dots:
s/\.+/./g;
# Removes multiple commas:
s/,+/,/g;
# Removes multiple colons:
s/:+/:/g;
# Removes multiple semi-colons:
s/;+/;/g;
# Removes commas before dots:
s/(,+)(\.)/$2/g;


You have already replaced successive commas with a single comma, so + 
isn't needed here.


# Removes the trailing spaces and dots behind two types of brackets:
s/(\(|\[)( +|\.+)/$1/g;
# Removes empty sets of brackets:
s/(\(|\[)(\)|\])//g;
# Removes whitespace at beginning of line:
s/^\s+//;
# Removes whitespace at end of line:
s/\s+$//;


Whitespace includes the new line character!
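
A minimal sketch of that point. The helper name strip_trailing_ws is made up for illustration; the substitution inside it is the one from the script:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the script's end-of-line rule to one string.
sub strip_trailing_ws {
    my ($line) = @_;
    $line =~ s/\s+$//;    # \s matches space, tab, CR, LF and FF
    return $line;
}

my $line = "some text  \n";
print "newline survives\n" if strip_trailing_ws($line) =~ /\n/;
print "newline removed\n"  if strip_trailing_ws($line) eq "some text";
```

Only the second message is printed: the trailing spaces and the newline are all matched by \s+ and removed together.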

# Prints all non-empty lines to file:
if (!/^\s*$/) {
print OUT $_;
}
}

close IN;
close OUT;

The error message (Malformed UTF-8 character (unexpected continuation byte
0x97, with no preceding start byte) at filecleaner.plx line 23) seems to
refer to the long dash in line 23. This was copied out of a UTF-8 encoded
file in 2011. If I change that to another UTF-8 long dash copied from
another UTF-8 file downloaded off the internet, the error message goes
away. However, if I copy the dash out of a supposedly UTF-8 encoded file
made in Word I get the error message.



Sounds like the long dash copied from Word isn't valid UTF-8.
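
The byte 0x97 in the error message is consistent with this: in the Windows-1252 code page, 0x97 is the long (em) dash, while in UTF-8 the same character takes three bytes. The following is a sketch using the core Encode module. The helper hyphen_to_dash is hypothetical, and writing the dash as an escape is only one possible way to keep the script independent of the encoding it was saved in:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 0x97 alone is not valid UTF-8, but in Windows-1252 it is the em dash.
my $from_word = decode('cp1252', "\x97");
printf "U+%04X\n", ord $from_word;    # U+2014

# The same character encoded as UTF-8 takes three bytes.
my $utf8_bytes = encode('UTF-8', "\x{2014}");
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes), "\n";    # E2 80 94

# Hypothetical helper: the dash is written as an escape, so the source file
# contains only ASCII and cannot be saved with the wrong dash bytes.
sub hyphen_to_dash {
    my ($text) = @_;
    $text =~ s/ - / \x{2014} /g;
    return $text;
}
```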

With the dash fixed, however, the newlines still get stripped out of 
the file, which leaves me at a complete loss, since nothing in the 
code ought to chomp off newline characters.



I suggest you chomp the input and add a newline when you print.

What could cause such behaviour? Corrupt script file? Corrupted perl 
installation? Some stupid recent Windows update that screwed up 
UTF-8 and/or file handling in Windows XP? Before I went on Christmas 
holidays things were fine...


Can't help you there.

--
Jim Gibson
j...@gibson.org





Re: Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread Rob Dixon

On 04/01/2012 14:02, Hamann, T.D. (Thomas) wrote:


Okay, some further testing using a family member's Windows XP PC and
a fresh install of ActivePerl seems to have revealed the culprit:

Changing
 s/\s+$//;
to:
s/(\s+$)(\n)/$2/;

fixed the issue.

Since the script worked fine until about 3 weeks ago and I copied
the original code from http://www.perlmonks.org/?node_id=2258, I can only
surmise that Microsoft must have changed the way Windows XP deals with
newlines in a very recent update. Which they could have communicated
with the outer world. :(

(oh well, another reason to dislike Microsoft, I guess).

Now for another question: How much code will this change break?


I'm afraid something else must have changed, as /\s+/ has always matched
HT, LF, CR, FF, and space, so that line would always remove a trailing
newline.

In your eagerness to find fuel for your hatred for Microsoft you are
forgetting that Perl normalizes all native file records so that they end
with \n when they are read from the file. Such arbitrary nonsense
impedes proper bug-fixing and has no place on this list - Microsoft is
not a football team.

The usual solution to your problem is to 'chomp' the line terminator
from the end of the line before applying the edits, and then adding it
back again on output.
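
A minimal sketch of that chomp-then-restore pattern. The helper clean_line is hypothetical and covers only the two whitespace rules plus the blank-line check, not the whole script:

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original script: cleans one line and
# returns it with exactly one newline, or an empty string for blank lines.
sub clean_line {
    my ($line) = @_;
    chomp $line;          # take the line terminator off first
    $line =~ s/^\s+//;    # leading whitespace
    $line =~ s/\s+$//;    # trailing whitespace; the newline is already gone
    return $line =~ /\S/ ? "$line\n" : '';
}

print clean_line("  first line \t\n");
print clean_line("   \n");
print clean_line("second line\n");
```

This prints "first line" and "second line" on separate lines and drops the blank one. Because the newline is added back explicitly, the edits in between are free to strip any trailing whitespace.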

Precisely why your program has changed behaviour I cannot tell, but be
assured that the code you show has always removed trailing newlines and
the problem must lie elsewhere.

Rob





Re: Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread John W. Krahn

Hamann, T.D. (Thomas) wrote:

Hi,


Hello,

I see that you've found the problem but I'd like to make some comments.



I am having a rather unusual problem with a script that I wrote last
year to clean unwanted contents out of UTF-8 encoded text files. It
worked fine in the past, but when I try to run it now I get an error
message and somehow all newlines are removed from the resulting file.
Nothing was changed between 2011 and 2012 in the script, which I give
below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;


It might be better to have some error checking here:

@ARGV == 2 or die "usage: filecleaner.plx <source file name> <destination file name>\n";


my ( $source, $destination ) = @ARGV;



open IN, '<', $source or die "Can't read source file $source: $!\n";
open OUT, '>', $destination or die "can't write on file $destination: $!\n";

while (<IN>) {
 # Replaces all tab-characters with spaces:
 s/\t/ /g;


Replacing single characters would be better using the tr/// operator:

  tr/\t/ /;



 # Replaces all hyphens that are both preceded and trailed by a space by
 # long dashes preceded and trailed by a space:
 s/ - / — /g;
 # Removes the leading space(s) from a variety of unwanted combinations:
 s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;


It is better to use a character class instead of alternation for single 
character alternatives:


   s/( +)([ .,:;!\])\n])/$2/g;

And you don't need to capture $1 if you are not going to use it:

   s/ +([ .,:;!\])\n])/$1/g;

Nor do you need to capture anything at all:

   s/ +(?=[ .,:;!\])\n])//g;
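
A quick way to compare the three forms on a sample line. The sub names are made up for illustration, and agreement is only shown for this one input:

```perl
use strict;
use warnings;

# Original form: two capture groups, only the second is used.
sub two_groups { my ($t) = @_; $t =~ s/( +)([ .,:;!\])\n])/$2/g; return $t }

# One capture group.
sub one_group  { my ($t) = @_; $t =~ s/ +([ .,:;!\])\n])/$1/g;   return $t }

# Lookahead: the following character is checked but not consumed.
sub lookahead  { my ($t) = @_; $t =~ s/ +(?=[ .,:;!\])\n])//g;   return $t }

my $sample = "one , two  ; three )\n";
print lookahead($sample);    # prints "one, two; three)" followed by a newline
print "all three agree\n"
    if two_groups($sample) eq one_group($sample)
    and one_group($sample) eq lookahead($sample);
```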



 # Removes multiple dots:
 s/\.+/./g;
 # Removes multiple commas:
 s/,+/,/g;
 # Removes multiple colons:
 s/:+/:/g;
 # Removes multiple semi-colons:
 s/;+/;/g;


Those four substitution operators can be replaced with one transliteration:

  tr/.,:;//s;
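
A small sketch of what that transliteration does. The sub name squeeze_punctuation is hypothetical:

```perl
use strict;
use warnings;

sub squeeze_punctuation {
    my ($text) = @_;
    # Empty replacement list plus /s: each listed character maps to itself,
    # and a run of the same character collapses to a single one.
    $text =~ tr/.,:;//s;
    return $text;
}

print squeeze_punctuation("Wait... what,, then;;; done::\n");
# prints "Wait. what, then; done:" followed by a newline
```

Runs of different characters are left alone, so ".," stays ".,", which matches what the four separate substitutions did.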



 # Removes commas before dots:
 s/(,+)(\.)/$2/g;


Again, no need to capture anything:

   s/,+(?=\.)//g;



 # Removes the trailing spaces and dots behind two types of brackets:


This removes trailing spaces OR dots, not trailing spaces AND dots


 s/(\(|\[)( +|\.+)/$1/g;


   s/(?<=[([])( +|\.+)//g;
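
The spaces OR dots remark can be checked directly. This sketch assumes the replacement is meant as a lookbehind, (?<=...). The sub names and the [ .]+ variant are illustrative and not from the thread:

```perl
use strict;
use warnings;

# The rule as posted: captures the bracket and puts it back.
sub original_rule   { my ($t) = @_; $t =~ s/(\(|\[)( +|\.+)/$1/g;   return $t }

# Lookbehind form: the bracket is checked but not part of the match.
sub lookbehind_rule { my ($t) = @_; $t =~ s/(?<=[(\[])(?: +|\.+)//g; return $t }

# Assumed variant that removes any mix of spaces and dots after the bracket.
sub mixed_rule      { my ($t) = @_; $t =~ s/(?<=[(\[])[ .]+//g;      return $t }

my $sample = "( ..text) [..item]";
print original_rule($sample),   "\n";    # (..text) [item]
print lookbehind_rule($sample), "\n";    # (..text) [item]
print mixed_rule($sample),      "\n";    # (text) [item]
```

The first two forms strip the space after "(" but leave the dots that follow it; only the character-class variant removes both.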



 # Removes empty sets of brackets:
 s/(\(|\[)(\)|\])//g;


   s/\(\)|\[\]//g;



 # Removes whitespace at beginning of line:
 s/^\s+//;
 # Removes whitespace at end of line:
 s/\s+$//;
 # Prints all non-empty lines to file:
 if (!/^\s*$/) {


   if ( /\S/ ) {



 print OUT $_;
 }
}

close IN;
close OUT;




John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.   -- Albert Einstein
