Re: newlines on win32, old mac, and unix

2006-06-19 Thread Anthony Ettinger

>   # order matters
>   $raw_text =~ s/\015\012/\n/g;
>   $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
>   $raw_text =~ s/\015/\n/g unless "\n" eq "\015";



Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?





> Since the newline convention is not necessarily the one in the
> runtime platform you cannot write a line-oriented script. If files
> are too big to slurp then you'd work on chunks, but need to check by
> hand whether a CRLF has been cut in the middle.



I'm reading each line in a while loop, so it should work fine on a large file?



--
Anthony Ettinger
Signature: http://chovy.dyndns.org/hcard.html

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: newlines on win32, old mac, and unix

2006-06-19 Thread John W. Krahn
Anthony Ettinger wrote:
>> # order matters
>> $raw_text =~ s/\015\012/\n/g;
>> $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
>> $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>
> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?

The string "cJ" in your example is completely different from the string "\n",
and even if you had used "\cJ" it would still not be the same some of the time.
You also don't have the /g option in your example.
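
The /g point can be illustrated with a minimal sketch. It assumes a hypothetical slurped string $text holding several CRLF pairs; when a loop reads one line at a time each string holds at most one line ending, so the difference only shows up on multi-line strings like this one.

```perl
use strict;
use warnings;

# Hypothetical sample: three lines terminated by CRLF.
my $text = "one\015\012two\015\012three\015\012";

# Without /g only the first CRLF is replaced.
(my $first_only = $text) =~ s/\015\012/\012/;

# With /g every CRLF is replaced.
(my $all = $text) =~ s/\015\012/\012/g;

# Count the CRs left behind in each result.
my $left_first = () = $first_only =~ /\015/g;
my $left_all   = () = $all        =~ /\015/g;

print "without /g: $left_first CR(s) left\n";
print "with /g:    $left_all CR(s) left\n";
```

The explicit octal escapes are used on both sides so the result does not depend on what "\n" means on the platform running the script.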


John
-- 
use Perl;
program
fulfillment





Re: newlines on win32, old mac, and unix

2006-06-19 Thread Anthony Ettinger

On 6/19/06, John W. Krahn [EMAIL PROTECTED] wrote:

> Anthony Ettinger wrote:
>>> # order matters
>>> $raw_text =~ s/\015\012/\n/g;
>>> $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
>>> $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>>
>> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?
>
> The string "cJ" in your example is completely different from the string "\n",
> and even if you had used "\cJ" it would still not be the same some of the time.
> You also don't have the /g option in your example.



Not according to the perlport page, it reads as though they are
synonymous with each other. Also, why would a newline not be at the
end of a line? I don't see that /g *has* to be there except for the
mac files, which is what I have.




--
Anthony Ettinger




Re: newlines on win32, old mac, and unix

2006-06-19 Thread John W. Krahn
Anthony Ettinger wrote:
> On 6/19/06, John W. Krahn [EMAIL PROTECTED] wrote:
>> Anthony Ettinger wrote:
>>>> # order matters
>>>> $raw_text =~ s/\015\012/\n/g;
>>>> $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
>>>> $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>>>
>>> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?
>>
>> The string "cJ" in your example is completely different from the string "\n",
>> and even if you had used "\cJ" it would still not be the same some of the time.
>> You also don't have the /g option in your example.
>
> Not according to the perlport page, it reads as though they are
> synonymous with each other. Also, why would a newline not be at the
> end of a line? I don't see that /g *has* to be there except for the
> mac files, which is what I have.

I don't have the original post for context, but a lot depends on what the input
record separator contains, what layer PerlIO is using, and what operating
system the program is running on.


John




Re: newlines on win32, old mac, and unix

2006-06-19 Thread Xavier Noria

On Jun 19, 2006, at 22:45, Anthony Ettinger wrote:


>>   # order matters
>>   $raw_text =~ s/\015\012/\n/g;
>>   $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
>>   $raw_text =~ s/\015/\n/g unless "\n" eq "\015";
>
> Does it make any difference if I use s/\cM\cJ/cJ/ vs. s/\015\012/\n/g ?


The regexp is OK, the replacement string is not, because "\cJ" is not
necessarily eq "\n". The latter is portable, the former is not.



>> Since the newline convention is not necessarily the one in the
>> runtime platform you cannot write a line-oriented script. If files
>> are too big to slurp then you'd work on chunks, but need to check by
>> hand whether a CRLF has been cut in the middle.
>
> I'm reading each line in a while loop, so it should work fine on a
> large file?


The while loops over lines ***as long as they are encoded using the
conventions of the runtime platform***. The diamond operator uses $/
as separator, which in turn is "\n" by default. Since the purpose of
your script is to deal with *any* newline convention, in general a
while loop like

  while (my $line = <$fh>) { ... }

looks suspicious. The variable should be called $chunk_of_text
instead of $line, because you don't know whether you'll get a line.
A loop like that may signal the programmer does not fully understand
what's going on.


For instance, TextWrangler is known to use old-Mac conventions by
default (last time I checked). If you read a file like that with that
while loop on either Unix or Windows you'll slurp the entire file in a
single iteration. That is, $line will contain the whole file.
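
The single-iteration effect can be reproduced without a real file. The following is only a sketch: the sample data and the helper count_records are hypothetical, and it uses an in-memory filehandle (Perl 5.8 or later) holding three CR-separated lines.

```perl
use strict;
use warnings;

# Hypothetical old-Mac style data: three lines separated by CR only.
my $data = "one\015two\015three\015";

# Count the records that readline returns for a given record separator.
sub count_records {
    my ($content, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$content or die "cannot open in-memory file: $!";
    binmode $fh;
    my $count = 0;
    while (defined(my $chunk = <$fh>)) {
        $count++;
    }
    close $fh;
    return $count;
}

my $with_lf = count_records($data, "\012");   # separator never appears
my $with_cr = count_records($data, "\015");   # separator matches the data

print "records with LF separator: $with_lf\n";
print "records with CR separator: $with_cr\n";
```

With LF as the separator the whole content comes back as one record; with CR it comes back as three.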


In general, to be robust to newline conventions you need to do some
munging by hand before using regular, portable line-oriented idioms.
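
One way such munging could look, as a sketch only: the helper name normalized_lines is hypothetical, and it folds the three ordered substitutions discussed in this thread into a single alternation, which is a variation rather than the code posted earlier. It assumes the input is small enough to slurp.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a filehandle in binary mode, turn CRLF,
# lone CR and lone LF into "\n", and return the lines without terminators.
sub normalized_lines {
    my ($fh) = @_;
    binmode $fh;                         # no translation by the I/O layer
    my $raw = do { local $/; <$fh> };    # slurp everything
    return () unless defined $raw;
    # CRLF is listed first so it is consumed as a pair at each position.
    $raw =~ s/\015\012|\015|\012/\n/g;
    return split /\n/, $raw;             # trailing empty lines are dropped
}

# Mixed conventions in one in-memory "file" (illustrative data).
my $mixed = "unix\012dos\015\012mac\015last";
open my $fh, '<', \$mixed or die "cannot open in-memory file: $!";
my @lines = normalized_lines($fh);
close $fh;

print "$_\n" for @lines;
```

After this step the usual line-oriented idioms (foreach over @lines, chomp-free processing) can be applied regardless of where the file came from.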


-- fxn






newlines on win32, old mac, and unix

2006-06-13 Thread Anthony Ettinger

I have to write a simple function which strips out the various
newlines on text files, and replaces them with the standard unix
newline "\n". After reading the perlport doc, I'm even more confused
now.

  LF  eq  \012  eq  \x0A  eq  \cJ  eq  chr(10)  eq  ASCII 10
  CR  eq  \015  eq  \x0D  eq  \cM  eq  chr(13)  eq  ASCII 13



       | Unix | DOS  | Mac  |
  ---------------------------
  \n   |  LF  |  LF  |  CR  |
  \r   |  CR  |  CR  |  LF  |
  \n * |  LF  | CRLF |  CR  |
  \r * |  CR  |  CR  |  LF  |
  ---------------------------
  * text-mode STDIO


In text-mode, I open the file, and do the following:

   while (defined(my $line = <INFILE>))
   {
       my $outline;
       if ($line =~ m/\cM\cJ/)
       {
           print "dos\n";
           ($outline = $line) =~ s/\cM\cJ/\cJ/; #win32
       } elsif ($line =~ m/\cM(?!\cJ)/) {
           print "mac\n";
           ($outline = $line) =~ s/\cM/\cJ/g; #mac
       } else {
           print "other\n";
           $outline = $line; #default
       }

       print OUTFILE $outline;
   }

It works fine on unix when I run the unit tests on old mac files, win,
and unix files and do a hexdump -C on them...however, when I run it
on win32 perl 5.6.1, it is not doing any replacement. The lines remain
unchanged.

My understanding is that "\n" is a reference (depending on which OS your
perl is running on) to CR (mac), CRLF (dos), and LF (unix) in
text-mode STDIO. So replacing CR (not followed by LF) with LF should
work on mac, and CRLF with LF on dos, and leaving LF untouched on *nix
(other)...then it shouldn't be a problem...however it appears that
\cJ is actually different on win32 than it is on unix.

So is \cJ actually \cM\cJ on win32?



--
Anthony Ettinger




Re: newlines on win32, old mac, and unix

2006-06-13 Thread Xavier Noria

On Jun 13, 2006, at 20:26, Anthony Ettinger wrote:


> I have to write a simple function which strips out the various
> newlines on text files, and replaces them with the standard unix
> newline "\n"


In Perl "\n" depends on the system: it is eq "\012" everywhere except
in MacOS pre-X, where it is "\015". The standard Unix newline is "\012".



> In text-mode, I open the file, and do the following:
>
>    while (defined(my $line = <INFILE>))
>    {
>        my $outline;
>        if ($line =~ m/\cM\cJ/)
>        {
>            print "dos\n";
>            ($outline = $line) =~ s/\cM\cJ/\cJ/; #win32
>        } elsif ($line =~ m/\cM(?!\cJ)/) {
>            print "mac\n";
>            ($outline = $line) =~ s/\cM/\cJ/g; #mac
>        } else {
>            print "other\n";
>            $outline = $line; #default
>        }
>
>        print OUTFILE $outline;
>    }
>
> It works fine on unix when I run the unit tests on old mac files, win,
> and unix files and do a hexdump -C on them...however, when I run it
> on win32 perl 5.6.1, it is not doing any replacement. The lines remain
> unchanged.
>
> My understanding is that "\n" is a reference (depending on which OS your
> perl is running on) to CR (mac), CRLF (dos), and LF (unix) in
> text-mode STDIO.


That is a common misconception. The string "\n" has length 1 always,
everywhere. It is not CRLF on Windows.


To explain this properly I'd need to reproduce here an article I've
written for Perl.com, not yet published. But to address the problem
it is enough to say that in text mode there is an I/O layer (PerlIO)
that does some magic back and forth between "\n" and the native newline
convention. That's the way portability is accomplished, inherited
from C.
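
The translation can be observed from any platform by requesting the :crlf layer explicitly, which is what a text-mode filehandle on Windows gets by default. This is a sketch under that assumption; the temporary file comes from the core File::Temp module and the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "\n" is a single character inside the program.
my $newline_length = length "\n";

# Write one line through an explicit :crlf layer.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
open my $out, '>:crlf', $path or die "cannot write $path: $!";
print {$out} "hello\n";
close $out or die "cannot close $path: $!";

# Read the bytes back with no translation at all.
open my $in, '<:raw', $path or die "cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Show each byte on disk as a three-digit octal number.
my $on_disk = join ' ', map { sprintf '%03o', ord } split //, $bytes;
print "length of \"\\n\": $newline_length\n";
print "bytes on disk: $on_disk\n";
```

The program printed a one-character "\n", yet the file ends in the two bytes 015 012: the layer, not the string, produced the CRLF.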


To be able to deal with any newline convention the way you want in a
portable way you disable that magic by enabling binmode on the
filehandle. The easiest solution is to slurp the text and s/// it
like this (written inline):


  binmode $in_fh;
  my $raw_text = do { local $/; <$in_fh> };

  # order matters
  $raw_text =~ s/\015\012/\n/g;
  $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
  $raw_text =~ s/\015/\n/g unless "\n" eq "\015";

  # $raw_text is normalized here

Since the newline convention is not necessarily the one in the  
runtime platform you cannot write a line-oriented script. If files  
are too big to slurp then you'd work on chunks, but need to check by  
hand whether a CRLF has been cut in the middle.
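
For the chunked case, one possible way to do that check is to hold back a trailing CR until the next chunk arrives. This is a sketch, not a tested production routine: normalize_in_chunks and its $chunk_size parameter are hypothetical, and it writes Unix LF ("\012") on a binmode output handle.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out in fixed-size chunks, converting
# CRLF and lone CR to LF and leaving existing LFs alone.
sub normalize_in_chunks {
    my ($in, $out, $chunk_size) = @_;
    binmode $in;
    binmode $out;
    my $pending_cr = 0;    # true if the previous chunk ended in a CR
    while (read($in, my $chunk, $chunk_size)) {
        # Put a held-back CR in front again, so a CRLF that was cut in
        # the middle is seen as a pair.
        $chunk = "\015" . $chunk if $pending_cr;
        # Hold back a trailing CR: its LF may be in the next chunk.
        $pending_cr = $chunk =~ s/\015\z// ? 1 : 0;
        $chunk =~ s/\015\012|\015/\012/g;
        print {$out} $chunk;
    }
    print {$out} "\012" if $pending_cr;    # input ended with a lone CR
    return;
}

# Illustrative data: CRLF, lone CR, LF and a final CR, read 2 bytes at a
# time so that the CRLF is split across two chunks.
my $input  = "a\015\012b\015c\012d\015";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
normalize_in_chunks($in, $out, 2);
close $in;
close $out;

print "normalized ok\n" if $output eq "a\012b\012c\012d\012";
```

A real script would pass file handles and a much larger chunk size; the tiny size here only forces the split-CRLF case.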


-- fxn

