Michael W. Cocke wrote:
I was
told a while back that the best way to extract urls from emails was to
use code from SpamAssassin. Ok - Now, I need to do just that. Any
pointers? I've looked thru the code in SpamCopURI, but unless there
are some docs hidden somewhere I can't even figure out the entry
point. Are there some docs hidden somewhere (I hope!)?
Thanks!
Mike-
Here is a little something I use to extract URLs from messages. It
takes a message on STDIN, runs it through an empty instance of SA (no
rules, no configs loaded), and prints the URIs to STDOUT.
#!/usr/bin/perl
use strict;
use warnings;
use Mail::SpamAssassin;
use Mail::SpamAssassin::PerMsgStatus;
main();
#
# Slurp the message from STDIN and print the URIs found in it.
sub main {
    my $msg = '';
    while (<STDIN>) { $msg .= $_; }
    my $data = geturi(\$msg);
    print $data;
    exit;
}
#
# Parse the message with SA and return its unique URIs, one per line.
sub geturi {
    my ($message) = shift;
    my $sa = create_saobj();
    $sa->init(0);
    my $mail = $sa->parse($$message);
    my $msg = Mail::SpamAssassin::PerMsgStatus->new($sa, $mail);
    my @uris = $msg->get_uri_list();
    my %uri_list;
    foreach my $uri (@uris) {
        # Skip schemes that are not links to fetchable content.
        next if ($uri =~ m/^(cid|mailto|javascript):/i);
        $uri_list{$uri} = 1;
    }
    # The trailing "" makes join() end the output with a newline.
    my $uris = join("\n", keys %uri_list, "");
    return $uris;
}
#
# Build an SA object with no rules or user preferences loaded.
sub create_saobj {
    my %setup_args = ( rules_filename => undef, site_rules_filename => undef,
                       userprefs_filename => undef, userstate_dir => undef,
                       local_tests_only => 1, dont_copy_prefs => 1
    );
    my $sa = Mail::SpamAssassin->new(\%setup_args);
    return $sa;
}
#
# EOF
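One side effect of collecting the URIs as hash keys in geturi() is that they come out in arbitrary order. If first-seen order matters to you, the dedupe step could be written along these lines instead. This is only a sketch in plain Perl: unique_in_order is a made-up helper name, and the sample URIs are placeholders, not real data.

```perl
use strict;
use warnings;

# Hypothetical helper: drop cid:/mailto:/javascript: URIs and duplicates,
# keeping the first occurrence of each URI in its original position.
sub unique_in_order {
    my %seen;
    return grep { !m/^(?:cid|mailto|javascript):/i && !$seen{$_}++ } @_;
}

# Placeholder input, standing in for the list from get_uri_list().
my @uris = unique_in_order(
    'http://example.com/a',
    'mailto:someone@example.com',
    'http://example.com/b',
    'http://example.com/a',
);
print "$_\n" for @uris;
```

The grep keeps a URI only the first time it is seen, so the result follows the order of the input list rather than the hash's internal order.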
# cat corpus/spam/canselon.com.html | perl parse_uri.pl
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_02.gif
./unsubscribeOffers.html
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_01.gif
http://images.loveouroffers.com/general/8675_usub/spacer.gif
list.html?clientid=12em=offerid=1mailerid=1emailid=0
http://list.html/?clientid=12em=offerid=1mailerid=1emailid=0
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_03.jpg
http:///unsubscribeOffers.html
http://./unsubscribeOffers.html
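The output above mixes absolute URLs with relative paths and host-less variants such as http:///unsubscribeOffers.html. If you only want links that could actually be fetched, a post-filter along these lines might help. It is a minimal sketch in plain Perl: keep_absolute_http is a hypothetical helper, not part of SpamAssassin, and the hostname pattern is a deliberately rough assumption.

```perl
use strict;
use warnings;

# Hypothetical post-filter: keep only http/https URIs whose host part
# consists of at least two dot-separated labels.
sub keep_absolute_http {
    my @uris = @_;
    return grep { m{^https?://[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[:/?#]|\z)}i } @uris;
}

# Sample input copied from the output above.
my @sample = (
    'http://images.loveouroffers.com/general/8675_usub/USUB_101_b_02.gif',
    './unsubscribeOffers.html',
    'http://list.html/?clientid=12em=offerid=1mailerid=1emailid=0',
    'http:///unsubscribeOffers.html',
    'http://./unsubscribeOffers.html',
);

print "$_\n" for keep_absolute_http(@sample);
```

Note that the http://list.html/ entry still passes, since list.html is syntactically a valid hostname. Weeding that out would need a check against a TLD list or a DNS lookup, which this sketch does not attempt.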
Enjoy. Also, I only get digest copies from this list and don't check
them all, so please cc me if you want me to see your reply. :)
--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com