Re: Bizarre problem: Known good script (in 2011) fails to work in 2012
Hamann, T.D. (Thomas) wrote:
> Hi,

Hello,

I see that you've found the problem but I'd like to make some comments.

> I am having a rather unusual problem with a script that I wrote last year to clean unwanted contents out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message and somehow all newlines are removed from the resulting file.
>
> Nothing was changed between 2011 and 2012 in the script, which I give below:
>
> #!/usr/bin/perl
> # filecleaner.plx
> use warnings;
> use strict;
> use utf8;
> use open ':encoding(utf8)';
>
> my $source = shift @ARGV;
> my $destination = shift @ARGV;

It might be better to have some error checking here:

    @ARGV == 2 or die "usage: filecleaner.plx \n";
    my ( $source, $destination ) = @ARGV;

> open IN, $source or die "Can't read source file $source: $!\n";
> open OUT, ">$destination" or die "can't write on file $destination: $!\n";
>
> while (<IN>) {
> # Replaces all tab-characters with spaces:
> s/\t/ /g;

Replacing single characters would be better using the tr/// operator:

    tr/\t/ /;

> # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
> s/ - / — /g;
> # Removes the leading space(s) from a variety of unwanted combinations:
> s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

It is better to use a character class instead of alternation for single character alternatives:

    s/( +)([ .,:;!\])\n])/$2/g;

And you don't need to capture $1 if you are not going to use it:

    s/ +([ .,:;!\])\n])/$1/g;

Nor do you need to capture anything at all:

    s/ +(?=[ .,:;!\])\n])//g;

> # Removes multiple dots:
> s/\.+/./g;
> # Removes multiple commas:
> s/,+/,/g;
> # Removes multiple colons:
> s/:+/:/g;
> # Removes multiple semi-colons:
> s/;+/;/g;

Those four substitution operators can be replaced with one transliteration:

    tr/.,:;//s;

> # Removes commas before dots:
> s/(,+)(\.)/$2/g;

Again, no need to capture anything:

    s/,+(?=\.)//g;

> # Removes the trailing spaces and dots behind two types of brackets:

This removes trailing spaces OR dots, not trailing spaces AND dots.

> s/(\(|\[)( +|\.+)/$1/g;

    s/(?<=[([])( +|\.+)//g;

> # Removes empty sets of brackets:
> s/(\(|\[)(\)|\])//g;

    s/\(\)|\[\]//g;

> # Removes whitespace at beginning of line:
> s/^\s+//;
> # Removes whitespace at end of line:
> s/\s+$//;
> # Prints all non-empty lines to file:
> if (!/^\s*$/) {

    if ( /\S/ ) {

> print OUT $_;
> }
> }
>
> close IN;
> close OUT;

John
--
Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
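The rewrites suggested in this reply can be gathered into one routine. The following is only a sketch: the helper name `clean_line` and the sample input are invented for illustration, the em dash is written as `\x{2014}` so the snippet does not depend on the encoding of the script file, and the whitespace trimming at the line ends is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): applies the
# suggested tr/// and lookaround rewrites to a single line of text.
sub clean_line {
    my ($line) = @_;
    for ($line) {                          # alias $_ to $line
        tr/\t/ /;                          # tabs to spaces
        s/ - / \x{2014} /g;                # spaced hyphen to spaced em dash (U+2014)
        s/ +(?=[ .,:;!\])\n])//g;          # spaces before punctuation, space or newline
        tr/.,:;//s;                        # squeeze runs of . , : ;
        s/,+(?=\.)//g;                     # commas directly before a dot
        s/(?<=[(\[])(?: +|\.+)//g;         # spaces OR dots right after ( or [
        s/\(\)|\[\]//g;                    # empty bracket pairs
    }
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';        # avoid "Wide character" warnings
print clean_line("a\tb ,, c - d ( x) [] end ..\n");
```

Note that `tr/.,:;//s` only squeezes runs of the same character, which matches what the four separate substitutions did.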
Re: Bizarre problem: Known good script (in 2011) fails to work in 2012
On 04/01/2012 14:02, Hamann, T.D. (Thomas) wrote:
> Okay, some further testing using a family member's Windows XP PC and a fresh install of ActivePerl seems to have revealed the culprit:
>
> Changing
>
> s/\s+$//;
>
> to:
>
> s/(\s+$)(\n)/$2/;
>
> fixed the issue.
>
> Since the script worked fine until about 3 weeks ago and I copied the original code from http://www.perlmonks.org/?node_id=2258, I can only surmise that Microsoft must have changed the way Windows XP deals with newlines in a very recent update. Which they could have communicated with the outer world. :( (oh well, another reason to dislike Microsoft, I guess).
>
> Now for another question: How much code will this change break?

I'm afraid something else must have changed, as /\s+/ has always matched HT, LF, CR, FF, and space, so that line would always remove a trailing newline.

In your eagerness to find fuel for your hatred for Microsoft you are forgetting that Perl normalizes all native file records so that they end with "\n" when they are read from the file. Such arbitrary nonsense impedes proper bug-fixing and has no place on this list - Microsoft is not a football team.

The usual solution to your problem is to 'chomp' the line terminator from the end of the line before applying the edits, and then add it back again on output.

Precisely why your program has changed behaviour I cannot tell, but be assured that the code you show has always removed trailing newlines, and the problem must lie elsewhere.

Rob
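The chomp-then-restore approach described in this reply might look roughly like the sketch below. It reads from and writes to in-memory strings so that it runs without any files; that choice, the lexical filehandles, and the sample text are assumptions made for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

# Sample input held in memory (illustrative only).
my $input  = "  first line \t\n\n   \nsecond line\n";
my $output = '';

open my $in,  '<', \$input  or die "Can't read in-memory input: $!\n";
open my $out, '>', \$output or die "Can't write in-memory output: $!\n";

while ( my $line = <$in> ) {
    chomp $line;                 # take the line terminator off first
    $line =~ s/^\s+//;           # leading whitespace
    $line =~ s/\s+$//;           # trailing whitespace; the newline is already gone
    next unless length $line;    # skip lines that are now empty
    print {$out} "$line\n";      # put the terminator back on output
}

close $in;
close $out;

print $output;
```

Because the terminator is removed before any editing and re-added only at print time, none of the substitutions in between can swallow it.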
Re: Bizarre problem: Known good script (in 2011) fails to work in 2012
At 11:27 AM + 1/4/12, Hamann, T.D. (Thomas) wrote:
> Hi,
>
> I am having a rather unusual problem with a script that I wrote last year to clean unwanted contents out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message and somehow all newlines are removed from the resulting file.
>
> Nothing was changed between 2011 and 2012 in the script, which I give below:
>
> #!/usr/bin/perl
> # filecleaner.plx
> use warnings;
> use strict;
> use utf8;
> use open ':encoding(utf8)';
>
> my $source = shift @ARGV;
> my $destination = shift @ARGV;
>
> open IN, $source or die "Can't read source file $source: $!\n";
> open OUT, ">$destination" or die "can't write on file $destination: $!\n";
>
> while (<IN>) {
> # Replaces all tab-characters with spaces:
> s/\t/ /g;
> # Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
> s/ - / — /g;
> # Removes the leading space(s) from a variety of unwanted combinations:
> s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;

Character classes can save you some typing and improve readability, and it is not necessary to capture what you don't want:

    s/ +([ .,:;!\])\n])/$1/g;

> # Removes multiple dots:
> s/\.+/./g;
> # Removes multiple commas:
> s/,+/,/g;
> # Removes multiple colons:
> s/:+/:/g;
> # Removes multiple semi-colons:
> s/;+/;/g;
> # Removes commas before dots:
> s/(,+)(\.)/$2/g;

You have already replaced successive commas with a single comma, so + isn't needed here.

> # Removes the trailing spaces and dots behind two types of brackets:
> s/(\(|\[)( +|\.+)/$1/g;
> # Removes empty sets of brackets:
> s/(\(|\[)(\)|\])//g;
> # Removes whitespace at beginning of line:
> s/^\s+//;
> # Removes whitespace at end of line:
> s/\s+$//;

Whitespace includes the newline character!

> # Prints all non-empty lines to file:
> if (!/^\s*$/) {
> print OUT $_;
> }
> }
>
> close IN;
> close OUT;
>
> The error message ("Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer to the long dash in line 23. This was copied out of a UTF-8 encoded file in 2011. If I change that to another UTF-8 long dash copied from another UTF-8 file downloaded off the internet, the error message goes away. However, if I copy the dash out of a supposedly UTF-8 encoded file made in Word I get the error message.

Sounds like the Word long dash isn't valid UTF-8.

> With the dash fixed, however, the newlines still get stripped out of the file, which leaves me at a complete loss, since nothing in the code ought to chomp off newline characters.

I suggest you chomp the input and add a newline when you print.

> What could cause such behaviour? Corrupt script file? Corrupted perl installation? Some stupid recent Windows update that screwed up UTF-8 and/or file handling in Windows XP? Before I went on Christmas holidays things were fine...

Can't help you there.

--
Jim Gibson
j...@gibson.org
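One plausible reading of the error message, not confirmed anywhere in the thread, is that the dash reached the script file as a Windows-1252 byte. In that code page the em dash is the single byte 0x97, whereas in UTF-8 the same character (U+2014) is the three bytes E2 80 94, and a lone 0x97 is an invalid continuation byte. The sketch below uses the core Encode module to show the difference; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for illustration: the dash was saved as Windows-1252.
my $cp1252_bytes = "\x97";

# Interpreted as cp1252, the byte is the em dash, U+2014.
my $char = decode( 'cp1252', $cp1252_bytes );
printf "code point: U+%04X\n", ord $char;

# The same character encoded as UTF-8 takes three bytes.
my $utf8_bytes = encode( 'UTF-8', $char );
print 'UTF-8 bytes: ',
    join( ' ', map { sprintf '%02X', ord } split //, $utf8_bytes ), "\n";

# Strict UTF-8 decoding of the lone 0x97 byte fails.
my $lone_byte = "\x97";
my $decoded_ok = eval {
    decode( 'UTF-8', $lone_byte, Encode::FB_CROAK | Encode::LEAVE_SRC );
    1;
};
print 'lone 0x97 is valid UTF-8: ', ( $decoded_ok ? 'yes' : 'no' ), "\n";
```

If this reading is right, re-saving the script as UTF-8, or writing the dash as `\x{2014}` in the substitution, would sidestep the problem.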
RE: Bizarre problem: Known good script (in 2011) fails to work in 2012
Okay, some further testing using a family member's Windows XP PC and a fresh install of ActivePerl seems to have revealed the culprit:

Changing

s/\s+$//;

to:

s/(\s+$)(\n)/$2/;

fixed the issue.

Since the script worked fine until about 3 weeks ago and I copied the original code from http://www.perlmonks.org/?node_id=2258, I can only surmise that Microsoft must have changed the way Windows XP deals with newlines in a very recent update. Which they could have communicated with the outer world. :( (oh well, another reason to dislike Microsoft, I guess).

Now for another question: How much code will this change break?

Thomas

From: Hamann, T.D. (Thomas) [ham...@nhn.leidenuniv.nl]
Sent: Wednesday, 4 January 2012 12:27
To: beginners@perl.org
Subject: Bizarre problem: Known good script (in 2011) fails to work in 2012

Hi,

I am having a rather unusual problem with a script that I wrote last year to clean unwanted contents out of UTF-8 encoded text files. It worked fine in the past, but when I try to run it now I get an error message and somehow all newlines are removed from the resulting file.

Nothing was changed between 2011 and 2012 in the script, which I give below:

#!/usr/bin/perl
# filecleaner.plx
use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";

while (<IN>) {
# Replaces all tab-characters with spaces:
s/\t/ /g;
# Replaces all hyphens that are both preceded and trailed by a space by long dashes preceded and trailed by a space:
s/ - / — /g;
# Removes the leading space(s) from a variety of unwanted combinations:
s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
# Removes multiple dots:
s/\.+/./g;
# Removes multiple commas:
s/,+/,/g;
# Removes multiple colons:
s/:+/:/g;
# Removes multiple semi-colons:
s/;+/;/g;
# Removes commas before dots:
s/(,+)(\.)/$2/g;
# Removes the trailing spaces and dots behind two types of brackets:
s/(\(|\[)( +|\.+)/$1/g;
# Removes empty sets of brackets:
s/(\(|\[)(\)|\])//g;
# Removes whitespace at beginning of line:
s/^\s+//;
# Removes whitespace at end of line:
s/\s+$//;
# Prints all non-empty lines to file:
if (!/^\s*$/) {
print OUT $_;
}
}

close IN;
close OUT;

The error message ("Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer to the long dash in line 23. This was copied out of a UTF-8 encoded file in 2011. If I change that to another UTF-8 long dash copied from another UTF-8 file downloaded off the internet, the error message goes away. However, if I copy the dash out of a supposedly UTF-8 encoded file made in Word I get the error message.

With the dash fixed, however, the newlines still get stripped out of the file, which leaves me at a complete loss, since nothing in the code ought to chomp off newline characters.

What could cause such behaviour? Corrupt script file? Corrupted perl installation? Some stupid recent Windows update that screwed up UTF-8 and/or file handling in Windows XP? Before I went on Christmas holidays things were fine...

Any ideas?

Thanks,

Thomas
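The two substitutions discussed in this message can be compared directly. The sketch below is illustrative only (the sample strings and the `show` helper are invented here). It relies on two documented regex behaviours: `\s` matches the newline, and `$` can match just before a newline at the end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: make newlines visible when printing.
sub show {
    my ($s) = @_;
    $s =~ s/\n/\\n/g;
    return "[$s]";
}

my $line = "some text  \n";

# Original form: \s+ includes the newline, so it is removed too.
( my $stripped = $line ) =~ s/\s+$//;

# Changed form: $ matches before the final newline, which is then
# captured as $2 and put back.
( my $kept = $line ) =~ s/(\s+$)(\n)/$2/;

# Caveat: a last line with no terminating newline is left untrimmed,
# because the pattern requires a literal \n after the whitespace.
( my $no_newline = "some text  " ) =~ s/(\s+$)(\n)/$2/;

print show($stripped),   "\n";
print show($kept),       "\n";
print show($no_newline), "\n";
```

The changed form keeps the newline but depends on every line ending in one, which is one reason the replies prefer chomping the line first and adding the newline back on output.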