Re: Reading large mbox files

Dan Jones Sat, 27 Nov 2004 18:02:58 -0800

On Sat, 2004-11-27 at 23:07 +0100, Gunnar Hjalmarsson wrote:
> Dan Jones wrote:
> > The most obvious method is to set $/ to the regex /\n\nFrom /
> > (messages in mbox format are separated by a blank line and begin with
> > a From line) and to read in email messages one at a time.
> 
>  From "perldoc perlvar":
> "Remember: the value of $/ is a string, not a regex."


Yes, I've since realized that.

> So that method isn't obvious at all; it's not available. (But you can
> set $/ to the *string* "\n\nFrom ".)

Well, the idea of resetting $/ is obvious, even if I misstated the
details.
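
To make the string-versus-regex point concrete, here is a minimal sketch that reads from an in-memory filehandle instead of a real mailbox. The sample data and variable names are made up for illustration; only $/ itself comes from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical two-message mbox, held in a string for illustration.
my $mbox = "From alice Sat Nov 27 18:00:00 2004\nSubject: one\n\nfirst body\n\n"
         . "From bob Sat Nov 27 18:01:00 2004\nSubject: two\n\nsecond body\n";

open my $fh, '<', \$mbox or die "Can't open in-memory file: $!";

my @records;
{
    # $/ is matched as a literal string, never as a pattern.
    local $/ = "\n\nFrom ";
    @records = <$fh>;
}
close $fh;

printf "%d records\n", scalar @records;
```

Each record except the last still ends with the separator, and every record after the first has lost its leading "From ", which is why the separator has to be stripped and put back by hand.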

> > It seems to me that this would be quite slow.
> 
> What made you draw that conclusion?

Because I cut my teeth on C (and later C++).  That means I have lots of
habits and mindsets that are great for a C programmer but not so great
for a wannabe JAPH.  Disk I/O is slow.  It's generally _much_ faster to
read decent-sized chunks from the hard drive than to do repeated small
reads.  What I didn't consider is that Larry's a pretty bright
guy, and he's probably forgotten more about low-level I/O than I know.
After thinking about it a bit more, I realized that Perl is almost
certainly doing the buffering for you.  I'd be very surprised if each read
via the angle operators corresponded to an actual disk read.  Instead,
I'd wager that Perl slurps in a big chunk and just feeds it to you as
you request it.  One of the advantages of using higher-level languages
is that they'll do a lot of the schlock work for you, if you remember to
let them do it!
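
For anyone who does want the C-style "read a fixed-size chunk" behaviour from the angle operators, $/ can also be set to a reference to an integer, as described in perlvar. The sketch below is only an illustration on made-up in-memory data (the 4096-byte size is arbitrary) and says nothing about speed.

```perl
use strict;
use warnings;

# Hypothetical data: 10,000 bytes held in memory instead of a real file.
my $data = "x" x 10_000;
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my @sizes;
{
    # A reference to an integer makes <$fh> return records of at most
    # that many bytes instead of delimiter-terminated lines.
    local $/ = \4096;
    while (my $chunk = <$fh>) {
        push @sizes, length $chunk;
    }
}
close $fh;

print "@sizes\n";
```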

> If you want to preserve the message separators with the right messages,
> line by line processing may be the easiest approach:
> 
>      my $msgsep = qr(^From );
>      my $msg;
> 
>      while (<MBOX>) {
>          if ( /$msgsep/ ) {
>              processmsg( \$msg ) if $msg;
>              $msg = $_;
>          } else {
>              $msg .= $_;
>          }
>      }
>      processmsg( \$msg );
> 
>      sub processmsg {
>          ...
>      }
> 
> (The $msgsep regex should probably be more specified.)
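
As a rough sketch of what a more specific separator regex could look like: the pattern below assumes the common "From sender weekday month day time year" layout of the separator line. mbox variants differ, so treat both the pattern and the sample lines (including the example.org address) as assumptions, not a definitive rule.

```perl
use strict;
use warnings;

# Assumed shape: "From <sender> <weekday> <month> <day> <hh:mm[:ss]> <year>".
my $msgsep = qr/^From \S+\s+
                (?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+
                (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+
                \d{1,2}\s+\d{2}:\d{2}(?::\d{2})?\s+\d{4}/x;

my @lines = (
    "From dan\@example.org Sat Nov 27 18:02:58 2004\n",   # separator line
    "From what I can tell, this is body text.\n",         # ordinary body line
);

for my $line (@lines) {
    printf "%-9s %s", ($line =~ $msgsep ? 'separator' : 'body'), $line;
}
```

A pattern like this avoids treating a body line that merely starts with "From " as the start of a new message.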

Since the message separator doesn't change, I think it's easier just to
remove it from the end of each message and add "From " back to the start
of the next one.  I had to extract the first message before going into the
loop, since it still has its own "From " and adding another would leave
it present twice at the beginning of the message.
Here's what I've come up with so far:

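#!/usr/bin/perl

use strict;
use warnings;

use Term::ReadKey;

die "usage: rdmbox mailbox\n" unless @ARGV;

# Three-argument open with a lexical filehandle; report why it failed.
open my $mailbox, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";

# Read one message at a time: each record ends at the next "\n\nFrom ".
local $/ = "\n\nFrom ";

# The first message still carries its own "From ", so handle it separately.
my $first = <$mailbox>;
if (defined $first) {
    chomp $first;    # chomp strips the trailing $/, i.e. "\n\nFrom "
    ProcMessage($first);
}

while (my $message = <$mailbox>) {
    chomp $message;
    ProcMessage("From $message");    # restore the "From " eaten by $/
    last if Pause() == 0;
}

close $mailbox;

sub ProcMessage {
    my $message = shift;
    print "Message:\n $message \n\n";
}

# Wait for a keypress on the terminal; return 0 if the user pressed "q".
sub Pause {
    open my $tty, '<', '/dev/tty' or die "Can't open /dev/tty: $!\n";
    ReadMode('raw', $tty);
    my $key = ReadKey(0, $tty);
    ReadMode('normal', $tty);
    close $tty;
    return (defined $key && $key eq 'q') ? 0 : 1;
}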


Comments on the above code are welcome, even if they touch on some other
issue.  If it isn't obvious by now, I'm trying to learn here!  Basic
error checking is included, but the error messages are a bit terse right
now.


