On Friday, April 2, 2004, 9:02:59 PM, Loren Wilton wrote:
> Jeff, I had a look at your list at some random time a few days ago.  I
> noticed that the top 90% or so of the reports looked pretty solid.  At the
> instant I looked, the bottom 10% of the reports were most all highly
> suspect.  This is where the yahoo and geocities and other whitelist stuff
> was showing up.  Some other reports (and I can't remember what any of them
> were) also seemed somewhat suspect, even though they probably weren't on a
> whitelist.
>
> I concluded that only the top 90% of your reports should be used in the
> blocking test, and ignore the reports with less than 10% of the
> highest-scoring report.  Now, perhaps this percentage fluctuates with time, I
> certainly haven't made multiple checks to see.  And maybe after whitelist
> removal the rest of the bottom 10% really is spam.
>
> But I think it would be an interesting experiment to compare the reliability
> of the top 90% to the reliability of the entire collection.

Thanks for checking this over for us!  It looks like you visited:

  http://spamcheck.freeapp.net/top-sites.html

which does not have the whitelist entries removed from it and
which does not go all the way down to the threshold of 10 spams.
The full list, which is about 11000 entries, can be seen at:

  http://spamcheck.freeapp.net/top-sites.txt

This is the basis for the thresholded list of 400 or so domains at:

  http://spamcheck.freeapp.net/top-sites-domains

which doesn't show the counts used for thresholding, but they all
got over 10 counts.  It does, however, have some duplicates
(like www.domain.com for domain.com) eliminated and, perhaps most
importantly, *has had the whitelisted domains and two-level ccTLDs
removed*.  It is the basis for the RBL:

  http://spamcheck.freeapp.net/surbl.bind

Due to the whitelisting and thresholding, the domains that make
it into SURBL are quite spammy, hopefully and probably more than
the 90% you estimated on the unfiltered list.

Cheers,

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/
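For anyone who wants to experiment with this on their own data, the filtering described above (threshold on report count, collapse www. duplicates, drop whitelisted domains and two-level ccTLDs) might be sketched roughly as follows. This is only an illustration, not the actual SURBL processing: the function name, the sample data, the order of the steps, and the choice of treating the threshold as "at least 10" are all assumptions.

```python
def build_blocklist(counts, whitelist, two_level_cctlds, threshold=10):
    """Return the sorted domains that survive thresholding and whitelisting.

    Hypothetical helper: `counts` maps a reported host name to the number
    of spam reports that mentioned it. The step order is an assumption.
    """
    blocked = set()
    for host, count in counts.items():
        # Drop entries below the report threshold (assumed inclusive).
        if count < threshold:
            continue
        # Collapse www.domain.com into domain.com.
        domain = host.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        # Skip whitelisted domains and bare two-level ccTLDs such as co.uk.
        if domain in whitelist or domain in two_level_cctlds:
            continue
        blocked.add(domain)
    return sorted(blocked)


# Made-up sample data, for illustration only.
counts = {
    "www.spammer-example.com": 42,
    "spammer-example.com": 17,
    "geocities.com": 120,
    "co.uk": 35,
    "rare-example.net": 3,
}
whitelist = {"yahoo.com", "geocities.com"}
two_level_cctlds = {"co.uk", "com.au"}

print(build_blocklist(counts, whitelist, two_level_cctlds))
```

With the sample data, only spammer-example.com survives: the two spellings collapse into one entry, geocities.com is whitelisted, co.uk is a two-level ccTLD, and rare-example.net falls below the threshold.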
