Re: Getting the dir structure

2004-12-05 Thread Dan Jones
On Sat, 2004-12-04 at 17:25 -0800, John W. Krahn wrote:
 Dan Jones wrote:
  On Thu, 2004-12-02 at 00:13 -0800, Mr M senthil kumar wrote:
  
 SNIP
 
 I have a file with thousands of lines like:
 /abc/def/ijk/test.txt
 /pqr/lmn/test1.t
 I want to get the directory where the files test.txt and test1.txt are
 lying.
 
 /SNIP
 
 Hi,
 You can try the following:
 
 #!/usr/bin/perl
 open (IN,"input_file") || die "Cannot open file: $!";
 open (OUT,">output_file") || die "Cannot send the output: $!";
  
  
  I see this a lot.  One thing that immediately occurs to me is that if
  opening IN succeeds but opening OUT fails, the program dies without
  closing IN.  Is this acceptable code in the Perl world or should the
  code close all open files before dying?
 
 The operating system handles all resources like files and memory so when the 
 program exits the operating system frees up all file handle resources and the 
 memory allocated for the strings "input_file" and "Cannot open file: $!".
 If the operating system didn't do this then programming would be a *LOT* harder
 and a *LOT* less robust!
 
 As an analogy:  If you rent a hotel/motel room you *could* clean it up 
 yourself before you check out, but most people don't.  :-)

I understand that the system will usually clean up the messes you leave
behind.  However, in application programming with higher level
languages, it's considered extremely poor programming practice to rely
on this behavior.  My question isn't whether the system will close the
file, it's whether this is considered acceptable program behavior.
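
For anyone who prefers not to lean on the operating system here, one common idiom is a lexical filehandle with the three-argument form of open. The sketch below is only an illustration, not code from the thread: copy_dirs is a hypothetical helper name, and the directory-stripping substitution is my own guess at what the original poster wanted. Perl closes a lexical handle when the variable goes out of scope, so $in is released even if the second open dies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): write the directory part of
# each path found in $in_name to $out_name.
sub copy_dirs {
    my ($in_name, $out_name) = @_;

    # Lexical handles are closed automatically when they go out of scope,
    # including when die() unwinds out of this sub.
    open(my $in,  '<', $in_name)  or die "Cannot open $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot open $out_name: $!";

    while (my $line = <$in>) {
        chomp $line;
        (my $dir = $line) =~ s{/[^/]*\z}{};   # strip the final path component
        print {$out} "$dir\n";
    }

    # Closing explicitly lets us report write errors (e.g. a full disk).
    close($out) or die "Cannot close $out_name: $!";
    close($in);
    return;
}
```

It would be called as copy_dirs('input_file', 'output_file'). The explicit close on the output handle is there for error reporting rather than for cleanup.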


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: Getting the dir structure

2004-12-04 Thread Dan Jones
On Thu, 2004-12-02 at 00:13 -0800, Mr M senthil kumar wrote:
 
 SNIP
  I have a file with thousands of lines like:
  /abc/def/ijk/test.txt
  /pqr/lmn/test1.t
  I want to get the directory where the files test.txt and test1.txt are
  lying.
 /SNIP
 
 Hi,
 You can try the following:
 
 #!/usr/bin/perl
 open (IN,"input_file") || die "Cannot open file: $!";
 open (OUT,">output_file") || die "Cannot send the output: $!";

I see this a lot.  One thing that immediately occurs to me is that if
opening IN succeeds but opening OUT fails, the program dies without
closing IN.  Is this acceptable code in the Perl world or should the
code close all open files before dying?








Re: Sufficient effort

2004-12-02 Thread Dan Jones
On Thu, 2004-12-02 at 09:38 +0100, Gunnar Hjalmarsson wrote:
 Casey West wrote:
  And I should point out that I agree. However, that doesn't excuse bad
  behavior in response. Furthermore, I expect more from a responder
  than a questioner.
 
 Okay, I've made up my mind. I don't have the required patience with the
 lazy dogs, so I don't fit here. Of course, you can always follow Jenda's
 advice and ignore them. And I will. I'll ignore this list from now on.
 
 Good luck! And beware of offending the flashes.

Have you stopped to consider that the reaction you have to your posting
style being questioned is the same reaction some people have to your
posts?  If you're willing to abandon the list over someone questioning
your style, how many people have also been willing to abandon it when
they're slammed for violating some unknown etiquette?

The intent here wasn't to attack you personally; it was to say that
perhaps there's a better way to do things.  (Does that sound familiar?)
If you do leave, your knowledge will be sorely missed.








Reading large mbox files

2004-11-27 Thread Dan Jones
I'm looking to write a utility to do some processing on email messages
stored in mbox format.  Some mbox files can be quite large, hundreds of
megs or perhaps gigs in size.  Obviously, reading in the whole file at
once isn't feasible.  The most obvious method is to set $/ to the
regex /\n\nFrom / (messages in mbox format are separated by a blank line
and begin with a From line) and to read in email messages one at a time.
It seems to me that this would be quite slow.  Another possibility that
springs to mind is to read in chunks of 64k or so of data and then
split those chunks into individual messages.  This will complicate the
program logic, however, as the chunks will inevitably split the last
message in two.  I'd then either have to back up the offset into the
file to point to the beginning of the message or to store the beginning of
the message, read in a new chunk, get the last half of the message off
the new chunk, combine it with the stored beginning of the message, then
process it.
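
To make the chunked approach concrete, here is a rough sketch of the "store the beginning, read a new chunk, combine" variant. It is an illustration under assumptions, not a tested mbox parser: each_message is a hypothetical helper name, and messages are assumed to be separated by the plain string "\n\nFrom ". The incomplete tail of each chunk simply stays in a buffer until the next read completes it.

```perl
use strict;
use warnings;

# Hypothetical helper: read $path in fixed-size chunks and call $callback
# once per message. Messages are assumed to be separated by "\n\nFrom ".
sub each_message {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;

    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $buffer = '';

    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # Hand over every complete message; the incomplete tail stays in $buffer.
        while ((my $pos = index($buffer, "\n\nFrom ")) >= 0) {
            $callback->(substr($buffer, 0, $pos + 1));
            substr($buffer, 0, $pos + 2) = '';   # next message starts at "From "
        }
    }
    $callback->($buffer) if length $buffer;      # last message in the file
    close($fh);
    return;
}
```

A call might look like each_message($ARGV[0], sub { print "Message:\n$_[0]\n" }). Because the search runs over the whole buffer, a separator that straddles two chunks is still found.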

I'm aware that there are a number of modules which deal with mail and
mbox handling, but so far none of them seem to make doing what I'm
trying to do easy.  Reinventing the wheel isn't always a waste of time -
it's sometimes a very good way to learn how wheels are constructed and
how to use your tools to construct wheels.  This gives you insight and
practice when you have to use those same tools to construct
non-wheels. :)

Any thoughts or pointers to discussions on how to handle large files in
Perl would be welcome.








Re: Reading large mbox files

2004-11-27 Thread Dan Jones
On Sat, 2004-11-27 at 23:07 +0100, Gunnar Hjalmarsson wrote:
 Dan Jones wrote:
  The most obvious method is to set $/ to the regex /\n\nFrom /
  (messages in mbox format are separated by a blank line and begin with
  a From line) and to read in email messages one at a time.
 
  From perldoc perlvar:
 Remember: the value of $/ is a string, not a regex.

Yes, I've since realized that.

 So that method isn't obvious at all; it's not available. (But you can
 set $/ to the *string* "\n\nFrom ".)

Well, the idea of resetting $/ is obvious, even if I misstated the
details.

  It seems to me that this would be quite slow.
 
 What made you draw that conclusion?

Because I cut my teeth on C (and later C++).  That means I have lots of
habits and mind sets that are great for a C programmer but not so great
for a wannabe JAPH.  Disk I/O is slow.  It's generally _much_ faster to
read in decent size chunks from the hard drive than to do repeated reads
of small size.  What I didn't consider is that Larry's a pretty bright
guy, and he's probably forgotten more about low level I/O than I know.
After thinking about it a bit more, I realized that Perl is almost
certainly doing the buffering for you.  I'd be very surprised if reading
via the angle operators correlated to doing actual disk reads.  Instead,
I'd wager that Perl slurps in a big chunk and just feeds it to you as
you request it.  One of the advantages of using higher level languages
is that they'll do a lot of the schlock work for you, if you remember to
let them do it!

 If you want to preserve the message separators with the right messages,
 line by line processing may be the easiest approach:
 
  my $msgsep = qr(^From );
  my $msg;
 
  while (<MBOX>) {
      if ( /$msgsep/ ) {
          processmsg( \$msg ) if $msg;
          $msg = $_;
      } else {
          $msg .= $_;
      }
  }
  processmsg( \$msg );
 
  sub processmsg {
      ...
  }
 
 (The $msgsep regex should probably be more specified.)

Since the message separator doesn't change, I think it's easier just to
remove it from the end of the message and add it back in to the next
message.  I had to extract the first message before going in to the
loop, since the message separator isn't missing from it and adding it
back in leads to it being present twice at the beginning of the message.
Here's what I've come up with so far:

#!/usr/bin/perl

use strict;
use warnings;

use Term::ReadKey;

sub ProcMessage($);
sub Pause();

die "usage: rdmbox mailbox" unless ($ARGV[0]);

open MAILBOX, $ARGV[0] or die "Can't open $ARGV[0]";

local $/ = "\n\nFrom ";

$_ = <MAILBOX>;
$_ =~ s/\n\nFrom $//;
ProcMessage($_);

while(<MAILBOX>) {
    $_ =~ s/\n\nFrom $//;
    ProcMessage("From $_");

    if(Pause() == 0) {
        last;
    }
}

close MAILBOX;

sub ProcMessage($){
    my $message = shift;
    print "Message:\n $message \n\n";
}

sub Pause() {
    open(TTY, "</dev/tty");
    ReadMode "raw";
    my $key = ReadKey 0, *TTY;
    ReadMode "normal";
    if($key eq "q") {
        return 0;
    }
    else {
        return 1;
    }
}


Comments on the above code are welcome, even if it touches on some other
issue.  If it isn't obvious by now, I'm trying to learn here! Basic
error checking is included but error messages are a bit terse right
now.  






Re: Reading large mbox files

2004-11-27 Thread Dan Jones
On Sun, 2004-11-28 at 04:09 +0100, Gunnar Hjalmarsson wrote:
 Dan Jones wrote:
  
  local $/ = "\n\nFrom ";
  
  $_ = <MAILBOX>;
  $_ =~ s/\n\nFrom $//;
  ProcMessage($_);
  
  while(<MAILBOX>) {
      $_ =~ s/\n\nFrom $//;
      ProcMessage("From $_");
  
      if(Pause() == 0) {
          last;
      }
  }
  
  close MAILBOX;
  
  sub ProcMessage($){
      my $message = shift;
      print "Message:\n $message \n\n";
  }
 
 Seems fine to me. The only concern is paragraphs that start with
 "From ", without e.g. ">" being prepended to those lines. I suppose you
 know whether that is an issue to count with.

That isn't supposed to happen.  The program writing to the mailbox is
responsible for checking for that and prepending a ">" to those lines.
See here if you're interested:

http://en.wikipedia.org/wiki/Mbox

On to the next issue.  One of the things I want to do is to check for
duplicate messages.  The common way to do that is to simply check the
Message ID.  The widely used procmail recipe uses that method, as does
the formail utility.  However, Message IDs are not guaranteed to be
unique.  If a collision does occur, you lose a message.  My thought is
to hash the message body, and store that hash value.  If a Message ID
collision occurs as you're processing the mailbox, you check the hash
values to be sure they're the same before deleting the message.

The problem is that Perl has a variable type called hash.  (Yes, I
know, you probably heard that somewhere before.)  Searching for
information on using hashing functions in Perl leads to pages and pages
of information dealing with the hash variable type.  Perl obviously uses
an internal hashing function to generate its hash variables.  Is it
possible to access that function from a script?  If not, does anyone
know of a module or pointer to information on hashing functions for
Perl?
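
One possible direction, offered as a sketch rather than an answer to whether Perl's internal hashing function is reachable: the Digest::MD5 module bundled with Perl provides md5_hex, which returns a hex digest of a string. The helper name is_duplicate, the %seen table, and the header/body split on the first blank line are all assumptions of mine for illustration. A message counts as a duplicate only when both its Message-ID and its body digest match an earlier one.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;   # Message-ID => digest of the first body seen with that ID

# Hypothetical helper: returns 1 only when both the Message-ID and the
# body digest match a message that was already seen, otherwise 0.
sub is_duplicate {
    my ($message) = @_;

    # Headers end at the first blank line; everything after it is the body.
    my ($headers, $body) = split /\n\n/, $message, 2;
    $body = '' unless defined $body;

    my ($id) = $headers =~ /^Message-ID:\s*(\S+)/mi;
    return 0 unless defined $id;            # no ID: treat as unique

    my $digest = md5_hex($body);
    if (exists $seen{$id}) {
        return $seen{$id} eq $digest ? 1 : 0;
    }
    $seen{$id} = $digest;
    return 0;
}
```

As a simplification, only the first digest per Message-ID is remembered, so a third message that repeats a colliding second one would not be caught. Storing a list of digests per ID would close that gap.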








RE: Recursively counting a matching pattern on a single line.

2004-10-28 Thread Dan Jones
On Wed, 2004-10-27 at 20:07, S.A. Birl wrote:
 On Oct 27, [EMAIL PROTECTED] ([EMAIL PROTECTED]:
 
 Brian:
 Brian:  If you want to make sure they are alternating like <><> etc... I would do
 Brian:  this:
 Brian:
 Brian:  $_ = $line;
 Brian:
 Brian:  @syms = m/[<>]/g;
 Brian:  $string = join("", @syms);
 Brian:  if ($string !~ m/^(<>)*$/)
 Brian:  {
 Brian:      ## Scream here!
 Brian:  }
 Brian:
 Brian:  The regular expression:
 Brian:
 Brian:  m/^(<>)*$/
 Brian:
 Brian:  will ensure that it starts with < and ends with > and anything in between
 Brian:  will be >< which I think should do the trick. That logic is pretty hairy
 Brian:  though and I could be missing something.
 
 
 
 Wouldnt m/[<>]/g literally match <> and not characters?
 
 Why wouldnt it be m/[<.+>]/g ?

Brackets in a regular expression match any single character listed inside
the brackets.  It's the RE equivalent to an OR.  So m/[<>]/g will match each
instance of '<' or '>'.  The match will be a single character long.






RE: Determine the Binary format of a file

2004-10-28 Thread Dan Jones
On Wed, 2004-10-27 at 23:19, Jim wrote:
  
  Have any backups? Paper reports?
  
  If all else fails, you could always hire some interns and 
  turn it into a massive data [re-]entry project, provided that 
  a paper trail exists...
  
 
 LOL! If I don't figure it out tonight, gonna tell my boss to renew the
 software :)

If you have access to the software, you might be able to create a new
file and put unique data into it, such as strings of repeating numbers
or letters, then try to reverse engineer that format.








Re: what is something like this - $seen{$1}

2004-10-26 Thread Dan Jones
On Tue, 2004-10-26 at 22:00, Chasecreek Systemhouse wrote:
 Interesting.
 
 Why doesn't this skip already seen letters, I used the
 case-insensitive modifier...
 
 %seen = ( );
 $string = "AaBbCcDdEeFf";
 while ($string =~ /(.)/gi) {
     $seen{$1}++;
 }
 print "\n\nunique chars are: ", sort(keys %seen), "\n";
 
 'A' and 'a' are the same, or is the logic only char() oriented?
 
 I guess I'm forced to use lc();

The 'i' modifier only affects the matching of the RE.  If you had a
letter in your RE, it would match either case of that letter.  It
doesn't change the case of the letter, just affects which ones match. 
Since the RE has no letters, the 'i' modifier doesn't do anything here. 
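
If the goal is to treat 'A' and 'a' as the same key, folding the captured character with lc before using it as the hash key is one way to get there. A minimal sketch using the same sample string:

```perl
use strict;
use warnings;

my %seen;
my $string = 'AaBbCcDdEeFf';

# /i would change nothing here: '.' matches any character in either case.
# Folding the captured character with lc() is what merges 'A' and 'a'.
while ($string =~ /(.)/g) {
    $seen{ lc $1 }++;
}
print "unique chars are: ", join('', sort keys %seen), "\n";
```

Each key ends up with a count of 2, one for the upper-case and one for the lower-case occurrence.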






Re: Broken Subroutine

2004-10-20 Thread Dan Jones
On Wed, 2004-10-20 at 23:19, Ron Smith wrote:
 The following is the code:

 #!/usr/bin/perl -w
 use strict;

 my @paths = `dir /b/s`;   # print @paths;
 my @basenames = basenames(@paths);

 sub basenames {
     foreach (@_) {
         if ($_ =~ /(\w+)\.\d+\.\w+$/) {
             @basenames = $1;  # print "@basenames\n";
         }
     }
 }

First, while it's allowable, it seems to me that you're asking for
trouble by using the same name multiple times.  Perl may not have
difficulty keeping sub var, $var, @var and @var declared again at a
different scope separated, but programmers sure do.  For instance, do
you mean for the array @basenames inside the subroutine to be the same
as the array @basenames that you declared outside the subroutine?  If
so, why are you trying to assign a value to it when it's already
(theoretically) being populated inside the subroutine?

I say theoretically because when you assign a scalar to an array via
'=', you're essentially creating a new array with one element.  Any
values that were in the array are lost.  If you're trying to add an
additional element to the array, you need to use push (to add to the end
of the array) or unshift (to add to the beginning of the array.)
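
A short illustration of the difference, with made-up values:

```perl
use strict;
use warnings;

my @names = ('one', 'two');

@names = 'three';          # list assignment: @names is now just ('three')
push    @names, 'four';    # append:  ('three', 'four')
unshift @names, 'zero';    # prepend: ('zero', 'three', 'four')

print "@names\n";
```

The original two elements are gone after the assignment; only push and unshift keep what was already there.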

If you intended to use the same variable inside and outside the
subroutine (just one version of @basenames), then don't bother assigning
to the variable.  Just call the subroutine.  I don't recommend it, but
it will work.  On the other hand, if you intended to have two different
variables, change the name of one of them (and you'll need to declare it
with "my" inside the subroutine).  Then explicitly return the array
from the subroutine.

Something like this (untested code!):

bad way

#!/usr/bin/perl -w
use strict;

my @paths = `dir /b/s`;
my @basenames;

procbasenames(@paths);

sub procbasenames {
    foreach (@_) {
        if ($_ =~ /(\w+)\.\d+\.\w+$/) {
            push @basenames, $1;
        }
    }
}

better way

#!/usr/bin/perl -w
use strict;

my @paths = `dir /b/s`;
my @basenames = procbasenames(@paths);

sub procbasenames {
    my @basenamematches;
    foreach (@_) {
        if ($_ =~ /(\w+)\.\d+\.\w+$/) {
            push @basenamematches, $1;
        }
    }
    return @basenamematches;
}







Re: Regex to match valid host or dns names

2004-10-13 Thread Dan Jones
On Wed, 2004-10-13 at 15:06, K.Prabakar wrote:
  example below, it fails to match host-no.top-level as a valid host
  name. I modify the regex several times - but still don't get the right
  outlook.
  
  my @hosts = qw(192.168.22.1 192.168.22.18 localhost another.host.domain
  host-no.top-level my.host.domain.com);
  foreach (@hosts){
  # Works ok
  push (@ips, $_ ) if $_ =~ /^\d{1,3}\.\d{1,3}\.\d{1|3}/; 
   
  # Can't match host-no.top-level. 
  push (@dns, $_) if $_ =~ /^\w+-?[\w+]?\.?[\w+.{1}]*\w+$/;
  }
  
 
 
   
  /^\w+-?[\w+]?\.?[\w+.{1}]*\w+$/--Here you look for only one - and 
 also not allowing any other non-word characters (like hyphen).
 
 The . can match any character even other than - .
 
 You can think like this:(For IP's)
  search for a number with maximum 3 digits and 
 then followed by the same kind of 3 numbers but prefixed with a dot.
 Try this --- $_ =~ /^\d{1,3}[\.\d{1,3}]{3}/
 
 You can think like this:(For DNS's)
 search for a WORD which may(-?) contain hyphen 
 within it and then followed by the same kind of zero-or-more-WORDs 
 but prefixed with a dot which is a normal dns name pattern.
 
 Try this  $_ =~ /^\w\w*-?\w+?[\.\w\w*-?\w+?]*$/
 
 But this will allow IP's also in your @dns because \w can match digits 
 also.

Isn't this easily solved?

foreach (@hosts){
    if($_ =~ /^\d{1,3}[\.\d{1,3}]{3}/) {
        push (@ips, $_ );
    }
    elsif($_ =~ /^\w\w*-?\w+?[\.\w\w*-?\w+?]*$/) {
        push (@dns, $_);
    }
}
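
One caveat about the patterns quoted in this thread: square brackets form a character class, so [\.\d{1,3}]{3} matches three characters drawn from the set of dot, digit, brace and comma, not three dotted groups. Grouping needs parentheses. The sketch below is my own variation on the same if/elsif idea, using the sample host list from the question. The exact patterns are assumptions about what was intended, and \w also admits underscores, which real host names may not allow.

```perl
use strict;
use warnings;

my @hosts = qw(192.168.22.1 192.168.22.18 localhost another.host.domain
               host-no.top-level my.host.domain.com);
my (@ips, @dns);

foreach my $host (@hosts) {
    # Four groups of 1-3 digits separated by dots (no range check on each octet).
    if ($host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/) {
        push @ips, $host;
    }
    # Dot-separated labels; each label is word characters with optional
    # single hyphens between runs of them.
    elsif ($host =~ /^\w+(?:-\w+)*(?:\.\w+(?:-\w+)*)*$/) {
        push @dns, $host;
    }
}
print "IPs: @ips\nDNS: @dns\n";
```

Because the IP test runs first, dotted-quad addresses never reach the host-name pattern, so it does not matter that \w also matches digits.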



