Inspired by "Filters that fight back", by Paul Graham
 http://www.paulgraham.com/ffb.html
I found a reference to a short script that scans e-mail for URL's,
and then turns around and automatically references the offending page.
Well, I'm not interested in doing that at the moment, but I have
enhanced the script (and fixed a bug) to make it do a decent job of
extracting e-mails from an mbox. Here's the script, also attached.

#! /usr/bin/perl -w
 
 
use lib '..', '.';
 
use Mail::Box::Manager 2.00;
use LWP::Simple;
use URI;
use HTML::Entities;
 
# Basic idea of using Mail::Box::Manager from:
# http://radio.weblogs.com/0111823/2003/11/16.html#a373
# Open the folder
my $mgr=Mail::Box::Manager->new;
my $folder = $mgr->open
       ( $ARGV[0],
         extract => 'ALWAYS' # Take the body
       );
die "Cannot open '$ARGV[0]': $!\n"
       unless defined $folder;
 
# Process all messages in this folder.
my @messages = $folder->messages;
foreach my $message (@messages)
{
   my @match=($message->decoded =~ /\bhref="(http[^>"]*)">.*/gi );
   foreach my $match(@match)
   {
      my $u   = URLDecode(HTML::Entities::decode($match));
      my $url = URI->new($u)->as_string;
      print $url,"\n";
      my $content="";
      # Try and fetch the page, no-op for now.
      # my $rc=getstore($match,$content);
   }
}
 
# Finish
$folder->close;
exit 0;
 
 
# from: http://glennf.com/writing/hexadecimal.url.encoding.html
sub URLDecode {
    my $theURL = $_[0];
    $theURL =~ tr/+/ /;
    $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
    $theURL =~ tr [\200-\377] [\000-\177];   # delete 8th bit
    $theURL =~ s/[^[:print:]]//g;            # remove non-printables
    $theURL =~ s/\s//g;              # remove white space
    $theURL =~ s/<!--(.|\n)*-->//g;
    return $theURL;
}

I ran it as follows, with good results:
  % extract_urls.pl spam-mail | sort -u
where 'spam-mail' is an mbox.

Attachment: extract_urls.pl
Description: Binary data

Reply via email to