Re: newlines on win32, old mac, and unix
> # order matters
> $raw_text =~ s/\015\012/\n/g;
> $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
> $raw_text =~ s/\015/\n/g unless "\n" eq "\015";

Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?

> Since the newline convention is not necessarily the one in the runtime
> platform you cannot write a line-oriented script. If files are too big
> to slurp then you'd work on chunks, but need to check by hand whether a
> CRLF has been cut in the middle.

I'm reading each line in a while loop, so it should work fine on a large file?

--
Anthony Ettinger
Signature: http://chovy.dyndns.org/hcard.html

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response
Re: newlines on win32, old mac, and unix
Anthony Ettinger wrote:
> # order matters
> $raw_text =~ s/\015\012/\n/g;
> $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
> $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>
> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?

The string "cJ" in your example is completely different from the string "\n", and even if you had used "\cJ" it would still not be the same some of the time. You also don't have the /g option in your example.

John
--
use Perl; program fulfillment
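The two points in this reply can be checked with a short script. This is only an illustrative sketch: the sample string and the variable names ($dos, $literal, $fixed) are assumptions made up for the example, not taken from the thread.

```perl
use strict;
use warnings;

# Two DOS-style lines: each ends with CR (\015) followed by LF (\012).
my $dos = "one\015\012two\015\012";

# Replacement "cJ" (no backslash) is the two literal characters 'c' and 'J',
# and without /g only the first CRLF is replaced.
(my $literal = $dos) =~ s/\cM\cJ/cJ/;

# With the backslash and /g, every CRLF becomes a single LF.
(my $fixed = $dos) =~ s/\cM\cJ/\cJ/g;

print "literal ok\n" if $literal eq "onecJtwo\015\012";
print "fixed ok\n"   if $fixed   eq "one\012two\012";
```

Note that "\cJ" is always LF, while "\n" is whatever the platform's Perl defines it to be, which is why the two are not interchangeable everywhere.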
Re: newlines on win32, old mac, and unix
On 6/19/06, John W. Krahn [EMAIL PROTECTED] wrote:
> Anthony Ettinger wrote:
> > # order matters
> > $raw_text =~ s/\015\012/\n/g;
> > $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
> > $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
> >
> > Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?
>
> The string cJ in your example is completely different than the string \n
> and even if you had used \cJ it would still not be the same some of the
> time and you don't have the /g option on your example.

Not according to the perlport page; it reads as though they are synonymous with each other. Also, why would a newline not be at the end of a line? I don't see that /g *has* to be there except for the mac files, which is what I have.

--
Anthony Ettinger
Signature: http://chovy.dyndns.org/hcard.html
Re: newlines on win32, old mac, and unix
Anthony Ettinger wrote:
> On 6/19/06, John W. Krahn [EMAIL PROTECTED] wrote:
> > Anthony Ettinger wrote:
> > > # order matters
> > > $raw_text =~ s/\015\012/\n/g;
> > > $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
> > > $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
> > >
> > > Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?
> >
> > The string cJ in your example is completely different than the string \n
> > and even if you had used \cJ it would still not be the same some of the
> > time and you don't have the /g option on your example.
>
> Not according to the perlport page, it reads as though they are
> synonymous with each other. Also, why would a newline not be at the end
> of a line? I don't see that /g *has* to be there except for the mac
> files, which is what I have.

I don't have the original post for context, but a lot depends on what the input record separator contains, what layer PerlIO is using, and what operating system the program is running on.

John
--
use Perl; program fulfillment
Re: newlines on win32, old mac, and unix
On Jun 19, 2006, at 22:45, Anthony Ettinger wrote:
> > # order matters
> > $raw_text =~ s/\015\012/\n/g;
> > $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
> > $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>
> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?

The regexp is OK, the replacement string is not, because "\cJ" is not necessarily eq "\n". The latter is portable, the former is not.

> > Since the newline convention is not necessarily the one in the runtime
> > platform you cannot write a line-oriented script. If files are too big
> > to slurp then you'd work on chunks, but need to check by hand whether a
> > CRLF has been cut in the middle.
>
> I'm reading each line in a while loop, so it should work fine on a
> large file?

The while loops over lines ***as long as they are encoded using the conventions of the runtime platform***. The diamond operator uses $/ as separator, which in turn is "\n" by default.

Since the purpose of your script is to deal with *any* newline convention, in general a while loop like

    while (my $line = <$fh>) { ... }

looks suspicious. The variable should be called $chunk_of_text instead of $line, because you don't know whether you'll get a line. A loop like that may signal that the programmer does not fully understand what's going on.

For instance, TextWrangler is known to use old-Mac conventions by default (last time I checked). If you read a file like that with that while loop on either Unix or Windows, you'll slurp the entire file in a single iteration. That is, $line will contain the whole file.

In general, to be robust to newline conventions you need to do some munging by hand before using regular, portable line-oriented idioms.

-- fxn
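The single-iteration behaviour described above is easy to reproduce without a real file. The following is a minimal sketch that assumes a Perl where "\n" is LF (Unix or Windows) and uses an in-memory filehandle; the data and the names $mac_data, @records and @lines are invented for the illustration.

```perl
use strict;
use warnings;

# Old-Mac style data: three lines, each terminated by CR (\015) only.
my $mac_data = "alpha\015beta\015gamma\015";

# First pass: default $/ ("\n"). There is no LF in the data, so the
# "line" loop sees everything as one record.
open my $fh, '<', \$mac_data or die "cannot open in-memory handle: $!";
binmode $fh;
my @records;
while (my $chunk_of_text = <$fh>) {
    push @records, $chunk_of_text;
}
close $fh;

# Second pass: tell Perl the record separator is CR for this read only.
open $fh, '<', \$mac_data or die "cannot open in-memory handle: $!";
binmode $fh;
my @lines = do { local $/ = "\015"; <$fh> };
close $fh;

printf "default separator: %d record(s)\n", scalar @records;
printf "CR separator:      %d record(s)\n", scalar @lines;
```

Setting $/ by hand only helps when the convention of the input is known in advance; for input of unknown origin, normalizing the newlines first remains the more robust route.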
newlines on win32, old mac, and unix
I have to write a simple function which strips out the various newlines on text files, and replaces them with the standard unix newline \n. After reading the perlport doc, I'm even more confused now.

    LF eq \012 eq \x0A eq \cJ eq chr(10) eq ASCII 10
    CR eq \015 eq \x0D eq \cM eq chr(13) eq ASCII 13

             | Unix | DOS  | Mac |
        ---------------------------
        \n   |  LF  |  LF  |  CR |
        \r   |  CR  |  CR  |  LF |
        \n * |  LF  | CRLF |  CR |
        \r * |  CR  |  CR  |  LF |
        ---------------------------
        * text-mode STDIO

In text-mode, I open the file, and do the following:

    while (defined(my $line = <INFILE>)) {
        my $outline;
        if ($line =~ m/\cM\cJ/) {
            print "dos\n";
            ($outline = $line) =~ s/\cM\cJ/\cJ/; #win32
        } elsif ($line =~ m/\cM(?!\cJ)/) {
            print "mac\n";
            ($outline = $line) =~ s/\cM/\cJ/g; #mac
        } else {
            print "other\n";
            $outline = $line; #default
        }
        print OUTFILE $outline;
    }

It works fine on unix when I run the unit tests on old mac files, win, and unix files and do a hexdump -C on them. However, when I run it on win32 perl 5.6.1, it is not doing any replacement. The lines remain unchanged.

My understanding is that \n is a reference (depending on which OS your perl is running on) to CR (mac), CRLF (dos), and LF (unix) in text-mode STDIO.

So replacing CR (not followed by LF) with LF should work on mac, and CRLF with LF on dos, and leaving LF untouched on *nix (other), then it shouldn't be a problem. However, it appears that \cJ is actually different on win32 than it is on unix. So is \cJ actually \cM\cJ on win32?

--
Anthony Ettinger
Signature: http://chovy.dyndns.org/hcard.html
Re: newlines on win32, old mac, and unix
On Jun 13, 2006, at 20:26, Anthony Ettinger wrote:
> I have to write a simple function which strips out the various newlines
> on text files, and replaces them with the standard unix newline \n

In Perl "\n" depends on the system: it is eq "\012" everywhere except in MacOS pre-X, where it is "\015". The standard Unix newline is "\012".

> In text-mode, I open the file, and do the following:
>
> while (defined(my $line = <INFILE>)) {
>     my $outline;
>     if ($line =~ m/\cM\cJ/) {
>         print "dos\n";
>         ($outline = $line) =~ s/\cM\cJ/\cJ/; #win32
>     } elsif ($line =~ m/\cM(?!\cJ)/) {
>         print "mac\n";
>         ($outline = $line) =~ s/\cM/\cJ/g; #mac
>     } else {
>         print "other\n";
>         $outline = $line; #default
>     }
>     print OUTFILE $outline;
> }
>
> It works fine on unix when I run the unit tests on old mac files, win,
> and unix files and do a hexdump -C on them. However, when I run it on
> win32 perl 5.6.1, it is not doing any replacement. The lines remain
> unchanged.
>
> My understanding is that \n is a reference (depending on which OS your
> perl is running on) to CR (mac), CRLF (dos), and LF (unix) in text-mode
> STDIO.

That is a common misconception. The string "\n" has length 1 always, everywhere. It is not CRLF on Windows.

To explain this properly I'd need to reproduce here an article I've written for Perl.com, not yet published. But to address the problem it is enough to say that in text mode there is an I/O layer (PerlIO) that does some magic back and forth between "\n" and the native newline convention. That's the way portability is accomplished, inherited from C.

To be able to deal with any newline convention the way you want in a portable way, you disable that magic by enabling binmode on the filehandle.

The easiest solution is to slurp the text and s/// it like this (written inline):

    binmode $in_fh;
    my $raw_text = do { local $/; <$in_fh> };

    # order matters
    $raw_text =~ s/\015\012/\n/g;
    $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
    $raw_text =~ s/\015/\n/g unless "\n" eq "\015";

    # $raw_text is normalized here

Since the newline convention is not necessarily the one in the runtime platform you cannot write a line-oriented script. If files are too big to slurp then you'd work on chunks, but need to check by hand whether a CRLF has been cut in the middle.

-- fxn
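For files too big to slurp, the chunked approach mentioned above could look roughly like the sketch below. It is one possible reading of the advice, not code from the thread: the helper name normalize_newlines, the default chunk size, and the choice to write a literal "\012" (the Unix newline) with binmode on the output instead of "\n" are all assumptions. A CR at the very end of a chunk is held back until the next read, so a CRLF cut in the middle is still seen as one newline.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_fh to $out_fh, turning CRLF and lone CR
# into LF, reading at most $chunk_size bytes at a time.
sub normalize_newlines {
    my ($in_fh, $out_fh, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, tune as needed
    binmode $in_fh;               # disable newline translation on input
    binmode $out_fh;              # and on output, so "\012" stays "\012"

    my $carry = '';               # a trailing CR held back from the last chunk
    while (1) {
        my $n = read($in_fh, my $buf, $chunk_size);
        die "read failed: $!" unless defined $n;
        last if $n == 0;

        my $chunk = $carry . $buf;
        $carry = '';
        # A CR at the end may be the first half of a CRLF: hold it back.
        $carry = "\015" if $chunk =~ s/\015\z//;

        $chunk =~ s/\015\012/\012/g;    # order matters: CRLF first
        $chunk =~ s/\015/\012/g;        # then any remaining lone CR
        print {$out_fh} $chunk;
    }
    # A CR still held at end of input was a lone (old-Mac) newline.
    print {$out_fh} "\012" if length $carry;
    return;
}

# Usage with in-memory handles and a tiny chunk size to force a split CRLF.
my $input  = "a\015\012b\015c\012d\015";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
normalize_newlines($in, $out, 2);
close $in;
close $out;
print "normalized\n" if $output eq "a\012b\012c\012d\012";
```

With real files the two handles would come from ordinary open calls; the in-memory handles here only keep the example self-contained.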