Re: Home made mail news search tool, and folded header lines

R. Joseph Newton Sat, 27 Mar 2004 12:46:23 -0800

Harry Putnam wrote:


> >> Something like:
> >> [...] snipped getopts and other unrelated stuff
> >>       while(<FILE>){
> >>           chomp;
> >>           my $line = $_;
> >
> > Why here.  Since you are doing this with each line, you could write in the loop
> > control:
> > while (my $line = <FILE>) {
>
> Not sure I understand the advantage.  In my formulation, `$line' is
> minus the trailing newline... which I've found to be nearly always a plus.

Different question.  Sorry, that was sloppy of me, but this still can be handled with
more clarity by declaring the loop control variable in the loop control.

while (my $line = <FILE>) {
   chomp $line;

The chomp should indeed have been there, but it is generally better when the statement
tells you what is being chomped.  Defaulted function calls are pretty much the [Daddy]
Bushisms of programming.  They are not really complete statements, but the listener
[Perl] is expected to fill in the gaps in logic.
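
A minimal, self-contained sketch of that loop shape.  The in-memory filehandle and
the sample text are stand-ins of mine, not part of the original script; a lexical
filehandle is used in place of the bareword FILE.

```perl
use strict;
use warnings;

# Stand-in for the real mailbox file: an in-memory filehandle (assumption).
my $text = "From: someone\nSubject: test\n";
open my $fh, '<', \$text or die "Opening in-memory file: $!";

my @lines;
while (my $line = <$fh>) {
    chomp $line;            # names the variable being chomped
    push @lines, $line;     # $line is scoped to this loop only
}
close $fh;

print "$_\n" for @lines;
```

Declaring $line in the loop control keeps it from leaking into the rest of the
program, and every statement in the body names what it operates on.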


> >>           ## @hdregs is an array of several regex for the headers
> >>           for($ii=0;$ii<=$#hdregs;$ii++){
> >
> > Why no space between clauses?  Why no space around assignment
> > operators?
>
> Just how I've become accustomed to writing code.  Probably not a good
> plan for when others need to read and revise it.

For yourself too, I would hope.

> > Why a C-style for loop?  Are you using the index somewhere?
>
> Well yes, sort of.  I wanted a way to ensure that each reg has hit at
> least once.  Otherwise we don't print.  So I used a formulation like
> this (Not posted previously for clarity):
>
>          if ($data{$hdregs[$ii]}++ == 0) {
>            ## it will only be 0 once
>            $hdelem_hit_cnt++;
>          }

You probably need a hash if what you are looking for is uniqueness.  Hashes are ideally
suited to elements that must remain unique.
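
A rough sketch of how a hash could stand in for the separate hit counter.  The
pattern list and the input lines are hypothetical, chosen only to illustrate; the
real @hdregs would take their place.

```perl
use strict;
use warnings;

# Hypothetical patterns and input lines, for illustration only.
my @header_patterns = (qr/^From: /, qr/^Subject: /, qr/^Date: /);
my @lines = (
    'From: a@example.com',
    'Subject: first',
    'Subject: second',
);

my %seen;   # keyed by pattern, so each pattern is recorded at most once
for my $line (@lines) {
    for my $pattern (@header_patterns) {
        $seen{$pattern}++ if $line =~ $pattern;
    }
}

# Every pattern has hit at least once when the key count equals the list size.
my $all_hit = scalar(keys %seen) == scalar(@header_patterns) ? 1 : 0;
print "patterns hit: ", scalar(keys %seen), " of ", scalar(@header_patterns), "\n";
```

The number of keys already is the count of distinct patterns that matched, so no
index variable or second counter has to be kept in step with the hash.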

>
> Then before printing we compare $hdelem_hit_cnt to ($#hdregs + 1):

I hate to say it, but I have to break here.  Hint--when you create an identifier,
pronounce the identifier aloud, once.  Then say it real fast, ten times.  If you start
to gag, choose an identifier that you can pronounce aloud.  Okay, I took a deep
breath, now.  I think I can trudge onward.  It doesn't need to be a trudge,
though--that's what bugs me.

>  sub test_hdr_good {
>     if ($hdelem_hit_cnt == ($#hdregs + 1)) {
>       $test_hdr_good = "TRUE";
>       $hdelem_hit_cnt = 0;
>     }
>  }
>
> They should be the same if all regs have hit at least once.  If not
> the same... we don't print.

Don't count on any predetermined set of headers being hit.  There are very few headers
that appear in each and every message.  At a time when I had 13,038 messages in a
given mailbox, these were the only header items to appear in exactly that many
messages:

Date: 13139
X-Mozilla-Status: 13139
X-UIDL: 13139
From: 13139
X-Mozilla-Status2: 13139
Subject: 13139
though Received: 83781, Delivered-To: 20954, X-SMTPD: 16868 may also have appeared in
all.

I would suggest that rather than counting hits, you simply be prepared to handle
undefs or empty strings if a header line of interest is not present.  There simply are
not enough significant header tags duplicated to have this aspect of the problem
dominate your overall strategy.  If you have a particular interest in the transmission
path, then maybe it is worth the extra effort.  Sorting headers for a searchable
archive does not require that great a focus.
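
Something along these lines would cover the missing-header case.  This is only a
sketch; header_or_default and the sample message are names I made up for the
example.

```perl
use strict;
use warnings;

# Hypothetical parsed message with no Subject or Date header.
my %message = ( 'From' => 'a@example.com' );

# Return the header value, or the default when it is undef or empty.
sub header_or_default {
    my ($message, $name, $default) = @_;
    my $value = $message->{$name};
    return $default if !defined($value) || $value eq '';
    return $value;
}

for my $name (qw(From Subject Date)) {
    printf "%-8s %s\n", "$name:", header_or_default(\%message, $name, '[none]');
}
```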


>
>
> >>              if($line =~ /$hdregs[$ii]/){
> >
> > Right now, you have just gotten quite a bit of information about this line,
> > including [with the same amount of effort] the type of header line involved.
> >
> >>
> >>                 ## Capture the line
> >>                 push @hits,$line;
> >
> > You now have thrown away the type information for the line, by throwing it back
> > in an unsorted bag.  As Joe Ben Stamper said "When you fall, fall in the
> > direction of your work".  These lines should probably be going into a hash,
> > keyed to the portion of the line before the colon.  You may wish to throw out
> > about 3/4 of them, since there are hundreds of different attributes carried in
> > header lines, and only a small subset is going to be useful for data
> > management.  Under any circumstances, you should probably try to capture *all*
> > the information available at this point.
>
> I'm not following you here.  The code does capture the entire line.

Is that what you want?  It seems to me that a line of text is much less informative
and programmatically useful than a key-value pair in a hash.  The value for Received
lines should be an array, though, probably.
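
A sketch of that layout, assuming the header lines have already been unfolded.  The
sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical, already-unfolded header lines.
my @header_lines = (
    'Received: from mx1.example.com by mx2.example.com',
    'Received: from host.example.org by mx1.example.com',
    'From: a@example.com',
    'Subject: Re: folded header lines',
);

my %header = ( 'Received' => [] );
for my $line (@header_lines) {
    my ($name, $value) = split /:\s*/, $line, 2;
    next unless defined $value;                  # not a "Name: value" line
    if ($name eq 'Received') {
        push @{ $header{'Received'} }, $value;   # keep every hop, in order
    } else {
        $header{$name} = $value;                 # last occurrence wins
    }
}

print "Subject: $header{'Subject'}\n";
print "Hops:    ", scalar @{ $header{'Received'} }, "\n";
```

The limit of 2 on split keeps colons inside the value intact, so a subject such as
"Re: ..." survives whole, and the Received values stay in the order they appeared.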

>
> And using Randy's concatenation technique, including folded lines
> (concat'ed)
>
> Prior to printing the array is sorted like this:
>       for(sort @hits){
>           print...
>       }
> So that the output has some sort of uniformity.
>
> Further, if I key a hash with stuff before colon, repeated hits like
> on `Received' lines  will disappear into the ether.

Yes, you will have to watch for the Received lines for special handling.  The other
main header lines will be unique, though.

>
>
> I plan to use this code for tracking Received lines at times.
>
> [...]
>
> > Then buffer the input.  Declare a variable outside of the loop to hold the
> > previous line.  If the line currently being read begins with whitespace, join it
> > to the $current_line with a newline.  It might take a little restructuring of
> > the sequence within the loop.  This is one case where a priming read could be of
> > assistance, since your loop could then have something in the buffer to spit out
> > unless the line being read has space at the start.
>
> Looks like you and Randy hit on the same thing for that situation.
> Randy posted a nifty way to do just that.  I just didn't quite follow
> the code at first.
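
For reference, here is a sketch of the buffering approach with a priming read.  It
is not Randy's code, which I am not reproducing here.  The sample header is
invented, and I join continuation lines with a single space rather than a newline,
which is my own choice for the example.

```perl
use strict;
use warnings;

# Stand-in for a message header read from a mailbox (assumption).
my $raw = join '',
    "Received: from mx1.example.com\n",
    "\tby mx2.example.com;\n",
    "  Sat, 27 Mar 2004 12:46:23 -0800\n",
    "Subject: test\n",
    "\n",
    "body text\n";
open my $fh, '<', \$raw or die "Opening in-memory file: $!";

my @headers;
my $buffer = <$fh>;                  # priming read
chomp $buffer if defined $buffer;
while (defined(my $line = <$fh>)) {
    chomp $line;
    if ($line =~ /^\s+(.*)/) {       # continuation: append to the buffer
        $buffer .= " $1";
    } else {
        push @headers, $buffer;      # previous header is complete
        $buffer = $line;
        last if $line eq '';         # blank line ends the header
    }
}
# Flush the buffer if the input ended without a blank line.
push @headers, $buffer if defined $buffer and $buffer ne '';
close $fh;

print "$_\n" for @headers;
```

Because the buffer always holds the header currently being assembled, a header is
emitted only once the next line proves it is finished.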

I think part of the problem here is that you never really defined the problem well.  I
went back to the root post, and there you pretty much jumped right into what things
you wanted to do with what coding structures.  That is not a very effective way to
program.  If your goal is clear, in terms of what information you want out of the
process, you can usually find very straightforward ways to get there.

You really are not all that new to Perl.  I've been seeing your posts about as long as
I've been on the list.  I think the habit of thinking in code may affect your
progress, though.  Code and coding structures should come after you have worked out
the logic of the process.  Going back to your original post, I am wondering:  How
important is the path trace to a search tool?  Other common fields are unique per
message.

Come to think of it, I used a somewhat similar strategy to yours in a mailbox parser I
wrote earlier this year:

  sub parse_header {
    my ($line, $source_mailbox, $count, $message_db_out) = @_;
    my $message_info = {};
    my $count_string = sprintf "%05d", $count;
    my $header_out;

    open $header_out, '>', "hdr/hdr$count_string.txt" or
     die "Opening header $count to write: $!";
    # stop at the blank line that ends the header, or at end of file
    while (defined $line and $line ne "\n") {
      extract_header_items($line, $message_info);
      print $header_out $line;
      $line = <$source_mailbox>;
    }
    close $header_out or die "Closing header file $count on write: $!";
    index_message($count_string, $message_info, $message_db_out);
    return $line, $message_info;
  }

  sub extract_header_items {
    my ($line, $message_info) = @_;

    chomp $line;
    if ($line =~ /^From - /) {
      ($message_info->{'Received-Date'} = $line) =~ s/From - (.*.)/$1/;
    } elsif ( $line =~
    /^(To: |From: |Date: |References: |In-Reply-To: )/) {
      my ($field_name, $field_value) = split /:\s+/, $line, 2;
      $field_value =~ s/,/;/g;
      $message_info->{$field_name} = $field_value;
    } elsif ($line =~ /^Subject: /) {
      ($message_info->{'Subject'} = $line) =~ s/Subject: //;
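    } elsif ($line =~ s/^Message-ID: //i) {
      $message_info->{'Message-ID'} = $line;
    } elsif ($line =~ /^In-Reply-To: /i) {
      # Reached only for case variants such as "In-reply-to:"; the exact
      # spelling "In-Reply-To: " is already caught by the branch above.
      merge_as_Reference($line, $message_info);
    }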
    $message_info->{'Subject'} = '[no subject]'
     if not defined $message_info->{'Subject'};
  }

  sub merge_as_Reference {
    my ($line, $message_info) = @_;

    $line =~ s/^In-Reply-To: //i;
    if (my $refs = $message_info->{'References'}) {
      # \Q...\E: message IDs contain regex metacharacters such as . and +
      $refs .= " $line" if ($refs !~ /\Q$line\E/);
      $message_info->{'References'} = $refs;
    } else {
      $message_info->{'References'} = $line;
    }
  }

Hmmm, looks like I never touched the Received headers, except to print them out to
file.  Those are the only header tags I have found that have multiline values.  [Oops,
guess there are a couple of others, but not many.]  All [most] other header fields I
have encountered among the 20,000+ posts I have tested with resolve on a single line.
So the complication is limited here.

Joseph


