By foot-shooting, I mean sacrificing our corpus by extracting whitelist entries from it. (I don't want us to hit a sacrifice fly when we have two outs and a man on first.)

There are three basic methods for improving SURBL:

  1. adding good entries to the whitelist
  2. removing good (and perhaps neutral) entries from the blacklist
  3. adding bad entries to the blacklist

And there are multiple ways to identify goodness and badness. I suspect you aren't employing all of the good ways. And you can use any list of URLs as an input to those identifiers.

Since #3 does not seem to be something that SURBL is lacking, I'll focus on how to find good URLs.

Lists of mostly good URLs:

  1. http://dmoz.org/ - Open Directory Project
     huge human-edited list of URLs that are mostly good
  2. http://www.google.com/ - Google
     highly ranked URLs are mostly good, there is an API for programmatic access
  3. Top N lists of sites: http://www.pcmag.com/category2/0,1738,7488,00.asp
     again, mostly good
  4. SURBL query traffic
     mostly good if you subtract the blacklisted ones

Identifiers and filters:

  1. SBL queries on NS->A records for domain of site, if it's listed, then it's probably a spammer.
  2. Hits 1, 2, or more of the above good lists.
  3. Registration cost: cheaper registrar = more likely to be spam, more expensive registrar = less likely to be spam.
  4. Listed in Catherine Hampton's SpamBouncer domain list = almost certainly spam.
  5. Domain is listed in ROKSO = definitely spam
  6. Domain is listed in Yahoo (similar to DMOZ) = probably not spam
  7. Domain appears in Wikipedia = probably not spam

I hope these ideas help. I've tried to focus on databases that can be downloaded or accessed.

I think your continual nagging of SA corpus submitters for ham hits is really weak and only serves to damage the SpamAssassin scoring process and your own evaluation of efficacy. I hope nobody (else) gives you their SA ham hits, otherwise we'll have no way to know if the above ways to reduce FPs are actually working for the average user.

I think we need to care about more than just appearing to be accurate.

Daniel

-- 
Daniel Quinlan                     ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/ sessions & more)
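
As an illustration of the "Identifiers and filters" list in the message, here is a minimal sketch of how list hits might be combined into a whitelist or blacklist suggestion. Everything in it is an assumption made for illustration: the list keys, the weights, the thresholds, and the function names (`sbl_query_name`, `score`, `classify`) are hypothetical and not part of SURBL or SpamAssassin. The only outside convention it relies on is the usual DNSBL query form (IPv4 octets reversed, then the zone name), with `sbl.spamhaus.org` assumed as the SBL zone. Registration cost is left out because the message gives no way to quantify it.

```python
# Hypothetical weights: positive = evidence the domain is good,
# negative = evidence it is spam. The values are illustrative only.
GOOD_LISTS = {"dmoz": 2, "google_top": 2, "top_n": 2, "yahoo": 1, "wikipedia": 1}
BAD_LISTS = {"sbl_nameserver": -3, "spambouncer": -4, "rokso": -10}


def sbl_query_name(ipv4, zone="sbl.spamhaus.org"):
    """Build the DNSBL query name for a nameserver's A record.

    DNSBLs are queried by reversing the IPv4 octets and appending the zone.
    This only builds the name; it performs no DNS lookup.
    """
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not an IPv4 address: %r" % (ipv4,))
    return ".".join(reversed(octets)) + "." + zone


def score(hits):
    """Sum the weights of every list the domain appears on; unknown keys count 0."""
    weights = dict(GOOD_LISTS)
    weights.update(BAD_LISTS)
    return sum(weights.get(name, 0) for name in hits)


def classify(hits, whitelist_at=3, blacklist_at=-3):
    """Turn a set of list hits into a suggestion for a human reviewer."""
    total = score(hits)
    if total >= whitelist_at:
        return "whitelist-candidate"
    if total <= blacklist_at:
        return "blacklist-candidate"
    return "undecided"


if __name__ == "__main__":
    print(sbl_query_name("192.0.2.1"))
    print(classify({"dmoz", "wikipedia"}))
    print(classify({"rokso", "dmoz", "google_top"}))
```

With these made-up weights, a single ROKSO hit outweighs several good-list hits, which mirrors the "definitely spam" wording in the list, while one weak good-list hit alone is not enough to suggest whitelisting.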