improving SURBL without the foot-shooting

Daniel Quinlan 4 Oct 2004 23:48:04 -0000

By foot-shooting, I mean sacrificing our corpus by extracting whitelist
entries from it.  (I don't want us to hit a sacrifice fly when we have
two outs and a man on first.)


There are three basic methods for improving SURBL:

  1. adding good entries to the whitelist
  2. removing good (and perhaps neutral) entries from the blacklist
  3. adding bad entries to the blacklist

And there are multiple ways to identify goodness and badness. I suspect
you aren't employing all of the good ways.  And you can use any list of
URLs as an input to those identifiers.  Since #3 does not seem to be
something that SURBL is lacking, I'll focus on how to find good URLs.

Lists of mostly good URLs:

  1. http://dmoz.org/ - Open Directory Project

     huge human-edited list of URLs that are mostly good

  2. http://www.google.com/ - Google

     highly ranked URLs are mostly good; there is an API for
     programmatic access

  3. Top N lists of sites:

     http://www.pcmag.com/category2/0,1738,7488,00.asp

     again, mostly good

  4. SURBL query traffic

     mostly good if you subtract the blacklisted ones
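
A rough sketch of item 4, assuming the query traffic is available as a
plain list of queried domain names and the blacklist as a set of names.
The function name, the min_queries threshold, and the input shapes are
all hypothetical; this is not anything SURBL is known to run.

```python
from collections import Counter


def whitelist_candidates(queried_domains, blacklist, min_queries=2):
    """Return frequently queried, non-blacklisted domains, most queried first.

    min_queries is an arbitrary illustrative cutoff, not a tuned value.
    """
    # Normalize case and drop any trailing root dot before counting.
    counts = Counter(d.lower().rstrip(".") for d in queried_domains)
    blocked = {d.lower().rstrip(".") for d in blacklist}
    return [
        domain
        for domain, n in counts.most_common()
        if n >= min_queries and domain not in blocked
    ]
```

The output is only a candidate list; it would still need to pass the
identifiers and filters before anything is whitelisted.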

Identifiers and filters:

  1. SBL queries on the A records of the site domain's name servers
     (NS->A): if one is listed, it's probably a spammer.

  2. Domain hits 1, 2, or more of the above good lists.

  3. Registration cost: cheaper registrar = more likely to be spam,
     more expensive registrar = less likely to be spam.

  4. Listed in Catherine Hampton's SpamBouncer domain list = almost
     certainly spam.

  5. Domain is listed in ROKSO = definitely spam

  6. Domain is listed in Yahoo (similar to DMOZ) = probably not spam

  7. Domain appears in Wikipedia = probably not spam
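
A minimal sketch of how a few of these identifiers might be combined,
under stated assumptions.  dnsbl_query_name follows the usual DNSBL
convention (reversed IPv4 octets prepended to the zone, here
sbl.spamhaus.org); actually resolving the name servers and the query
name is left out.  classify, its labels, the two-hit threshold, and the
rule that bad evidence overrides good evidence are all hypothetical
choices for illustration.  Registration cost is not modeled.

```python
import ipaddress


def dnsbl_query_name(ip, zone="sbl.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    # IPv4Address raises ValueError on malformed input.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def classify(domain, good_lists, bad_lists, ns_in_sbl=False, min_good_hits=2):
    """Label a domain from list membership and an SBL result.

    good_lists: sets such as DMOZ, Yahoo, Wikipedia, top-N sites.
    bad_lists:  sets such as ROKSO or the SpamBouncer domain list.
    ns_in_sbl:  True if a name server address was found in SBL.
    """
    # Assumption: any bad signal outweighs any number of good-list hits.
    if ns_in_sbl or any(domain in bad for bad in bad_lists):
        return "probably-spam"
    good_hits = sum(domain in good for good in good_lists)
    if good_hits >= min_good_hits:
        return "probably-good"
    return "unknown"
```

A domain that lands in "unknown" would simply stay where it is; only
the two confident labels would feed the whitelist or blacklist.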

I hope these ideas help.  I've tried to focus on databases that can be
downloaded or accessed.

I think your continual nagging of SA corpus submitters for ham hits is
really weak and only serves to damage the SpamAssassin scoring process
and your own evaluation of efficacy.  I hope nobody (else) gives you
their SA ham hits, otherwise we'll have no way to know if the above ways
to reduce FPs are actually working for the average user.

I think we need to care about more than just appearing to be accurate.

Daniel

-- 
Daniel Quinlan                     ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
