Re: Using SA code to extract URLs ?

2007-01-13 Thread Jeff Chan
On Friday, January 12, 2007, 6:10:32 PM, Michael Cocke wrote:
> I was told a while back that the best way to extract urls from emails
> was to use code from SpamAssassin.  Ok - Now, I need to do just that.
> Any pointers?  I've looked thru the code in SpamCopURI, but unless there
> are some docs hidden somewhere I can't even figure out the entry point.
>   Are there some docs hidden somewhere (I hope!)?

Yes, SpamAssassin is a very good way to extract URLs from mails.

Listen to Theo.  SpamCopURI is a patch for an older version of
SpamAssassin so that it could use SURBLs.  The code built into
the latest SpamAssassin for getting URIs is likely more complete
and effective.

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/



Re: Using SA code to extract URLs ?

2007-01-13 Thread Dallas Engelken

Michael W. Cocke wrote:
> I was told a while back that the best way to extract urls from emails was to
> use code from SpamAssassin.  Ok - Now, I need to do just that. Any
> pointers?  I've looked thru the code in SpamCopURI, but unless there
> are some docs hidden somewhere I can't even figure out the entry
> point.  Are there some docs hidden somewhere (I hope!)?
>
> Thanks!
>
> Mike-


Here is a little something I use to extract URLs from messages.  It
takes a message on STDIN, runs it through an empty instance of SA (no
rules, no configs loaded), and prints the URIs to STDOUT, one per line.


#!/usr/bin/perl

use strict;
use warnings;

use Mail::SpamAssassin;
use Mail::SpamAssassin::PerMsgStatus;

main();

# 

sub main {
 # Slurp the whole message from STDIN.
 my $msg = '';
 while (<STDIN>) { $msg .= $_; }
 my $data = geturi(\$msg);
 print $data;
 exit;
}

# 

# Takes a reference to the raw message text and returns the unique URIs
# found in it as a newline-terminated string.
sub geturi {
 my ($message) = shift;
 my $sa = create_saobj();
 $sa->init(0);
 my $mail = $sa->parse($$message);
 my $msg = Mail::SpamAssassin::PerMsgStatus->new($sa, $mail);
 my @uris = $msg->get_uri_list();
 my %uri_list;
 foreach my $uri (@uris) {
   # Skip schemes that are not web links.
   next if ($uri =~ m/^(cid|mailto|javascript):/i);
   $uri_list{$uri} = 1;
 }
 # The trailing empty string makes the output end with a newline.
 my $uris = join("\n", keys %uri_list, "");
 return $uris;
}

# 

# Builds a SpamAssassin object with no rules or user preferences loaded.
sub create_saobj {
 my %setup_args = ( rules_filename => undef, site_rules_filename => undef,
                    userprefs_filename => undef, userstate_dir => undef,
                    local_tests_only => 1, dont_copy_prefs => 1
  );
 my $sa = Mail::SpamAssassin->new(\%setup_args);
 return $sa;
}

# 
# EOF



# cat corpus/spam/canselon.com.html | perl parse_uri.pl
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_02.gif
./unsubscribeOffers.html
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_01.gif
http://images.loveouroffers.com/general/8675_usub/spacer.gif
list.html?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://list.html/?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_03.jpg
http:///unsubscribeOffers.html
http://./unsubscribeOffers.html
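
The sample output mixes absolute URLs with relative paths and the
variants derived from them (such as http:///unsubscribeOffers.html).
If only the hosts matter, a small post-filter may help.  This is a
minimal sketch in core Perl, not part of the original script: the sub
name hosts_from_uris is hypothetical, and the regex is a deliberately
simple assumption rather than a full URI parser.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only absolute http/https URIs whose host has
# at least two dot-separated labels, and return the unique lower-cased
# hosts in sorted order.
sub hosts_from_uris {
    my @uris = @_;
    my %seen;
    foreach my $uri (@uris) {
        # scheme://host, followed by a port, path, query, fragment or end
        next unless $uri =~ m{^https?://([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?=[:/?#]|$)}i;
        $seen{ lc $1 } = 1;
    }
    return sort keys %seen;
}

# A few lines taken from the sample output.
my @sample = (
    'http://images.loveouroffers.com/general/8675_usub/spacer.gif',
    './unsubscribeOffers.html',
    'http:///unsubscribeOffers.html',
    'http://./unsubscribeOffers.html',
);
print "$_\n" for hosts_from_uris(@sample);
```

Note that a derived URI such as http://list.html/ still passes this
filter, because it is syntactically a host name; rejecting it would
need a check against a list of valid TLDs.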


Enjoy.  Also, I only get digest copies from this list and don't check
them all, so please cc me if you want me to see it. :)


--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com



Using SA code to extract URLs ?

2007-01-12 Thread Michael W. Cocke
I was told a while back that the best way to extract urls from emails 
was to use code from SpamAssassin.  Ok - Now, I need to do just that. 
Any pointers?  I've looked thru the code in SpamCopURI, but unless there 
are some docs hidden somewhere I can't even figure out the entry point. 
 Are there some docs hidden somewhere (I hope!)?


Thanks!

Mike-


Re: Using SA code to extract URLs ?

2007-01-12 Thread Theo Van Dinter
On Fri, Jan 12, 2007 at 09:10:32PM -0500, Michael W. Cocke wrote:
> I was told a while back that the best way to extract urls from emails
> was to use code from SpamAssassin.  Ok - Now, I need to do just that.
> Any pointers?  I've looked thru the code in SpamCopURI, but unless there
> are some docs hidden somewhere I can't even figure out the entry point.
>   Are there some docs hidden somewhere (I hope!)?

PerMsgStatus->get_uri_detail() or related functions?  perldoc the PMS module
for more information.
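
A rough sketch of what that could look like, assuming the function
meant here is get_uri_detail_list() as described in perldoc
Mail::SpamAssassin::PerMsgStatus, which returns a hash reference keyed
by raw URI with entries such as types, cleaned and domains.  The sub
names summarize_uri_detail and uri_detail_for are hypothetical, and the
sample data is hand-written, not real SpamAssassin output.

```perl
use strict;
use warnings;

# Summarise a hash reference shaped like the documented return value of
# get_uri_detail_list():
#   raw_uri => { types => {...}, cleaned => [...], domains => {...} }
# Returns one line per raw URI: the URI, where it was seen, its domains.
sub summarize_uri_detail {
    my ($detail) = @_;
    my @lines;
    foreach my $uri (sort keys %$detail) {
        my $info    = $detail->{$uri};
        my $types   = join ',', sort keys %{ $info->{types}   || {} };
        my $domains = join ',', sort keys %{ $info->{domains} || {} };
        push @lines, "$uri\ttypes=$types\tdomains=$domains";
    }
    return @lines;
}

# Hypothetical wrapper that fetches the detail hash for a raw message.
# SpamAssassin is loaded lazily, so the summariser above can be tried
# without it installed.
sub uri_detail_for {
    my ($raw_message) = @_;
    require Mail::SpamAssassin;
    require Mail::SpamAssassin::PerMsgStatus;
    my $sa = Mail::SpamAssassin->new(
        { local_tests_only => 1, dont_copy_prefs => 1 } );
    $sa->init(0);
    my $mail = $sa->parse($raw_message);
    my $pms  = Mail::SpamAssassin::PerMsgStatus->new($sa, $mail);
    return $pms->get_uri_detail_list();
}

# Hand-written sample data (an assumption about typical contents).
my %sample = (
    'http://www.example.com/a' => {
        types   => { a => 1, parsed => 1 },
        cleaned => ['http://www.example.com/a'],
        domains => { 'example.com' => 1 },
    },
);
print "$_\n" for summarize_uri_detail(\%sample);
```

With SpamAssassin installed, summarize_uri_detail(uri_detail_for($text))
should give the same kind of listing for a real message.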

-- 
Randomly Selected Tagline:
When it comes to defense, redundancy is the minimum. - Michael Warfield

