Jeff Chan wrote:
> On Thursday, February 16, 2006, 9:13:36 PM, Matt Kettler wrote:
>
>> I'm only presenting evidence of accuracy problems in relation to why the
>> URIBLs collectively wield a great deal of power in SpamAssassin scoring.
>> I'm not really complaining about uribl.com, I'm complaining about URIBLs
>> as a whole. That's both uribl.com and surbl. Whenever I use the term
>> URIBL in all caps, I mean all URI dns-based blacklists. If you prefer,
>> I'll retract my uribl.com example, and point out that less than an hour
>> later, I got a ws.surbl.org FP.
>>
>
> There may be some value in not lumping together URIBL.com and
> SURBL.org lists. As you can see the performance of the lists are
> different, and the way they're created is different too. That
> makes it harder for us to respond to comments that seem to not
> take those differences into account.
>

Did you see Theo's test data from yesterday?

 35.418  41.1930   0.0000  1.000  0.90  0.00  URIBL_JP_SURBL
 34.665  40.3177   0.0000  1.000  0.88  0.00  URIBL_SC_SURBL
 26.069  30.3204   0.0000  1.000  0.80  0.00  URIBL_AB_SURBL
 28.024  32.5464   0.2915  0.991  0.61  0.00  URIBL_OB_SURBL
 48.113  55.7492   1.2873  0.977  0.55  0.00  URIBL_BLACK
  0.293   0.3406   0.0000  1.000  0.47  0.00  URIBL_PH_SURBL
  0.000   0.0000   0.0000  0.500  0.42  0.00  URIBL_RED
  0.000   0.0000   0.0000  0.500  0.42  0.01  T_URIBL_XS_SURBL
 37.539  42.4763   7.2626  0.854  0.38  0.00  URIBL_WS_SURBL
  0.548   0.3446   1.7974  0.161  0.03  0.00  URIBL_GREY

I consider that "highly similar" for JP, SC, AB, OB and WS.

Also, even if there are some differences, even 10% overlap would have the effect I'm talking about.

I personally would like to see some statistics, but at this point we don't have any test data on this, so we're arguing your theory vs mine.

I'd love to see some results for some meta tests:

meta SURBL_MULTI2 ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 2)
meta SURBL_MULTI3 ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 3)
meta SURBL_MULTI4 ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 4)

In particular, I'm concerned about the ham hits of even multi 2. Theo?

>> 3) I'm even more concerned about the monoculture of the URIBLs.
>>
>
> I suppose it depends on your point of view. From my point of
> view the various lists are different in terms of sources and
> listing logic. As you can see from the results posted, they have
> fairly different performance in terms of spam and ham hits, but
> those measurements don't take into account the underlying
> tools and sources that go into making them, which varies between
> lists.
>

I don't see the difference from the recent results posted by Theo.

>
>> uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all
>> more-or-less the same list. Paul argued against that statement, but in
>> my mind his arguments are weak at best.
>> There IS considerable overlap
>> between these lists. Contrary to Paul's statements, you only need to be
>> reported once by a spamcop spamtrap or trusted feed to be on SC.
>>
>
> That's only partially correct. Paul's statement is correct for
> most SpamCop reports. It takes many reports to get on SC for
> most domains except the ones that resolve into known spammer
> networks.
>
> There are no "trusted feeds" for SC.

Not on your end, but keep in mind that spamcop trusts their spamtraps with a 5x bias.

> The data in SC comes from
> SpamCop reports. I don't know the number of SpamCop users, but
> they're probably many. The way I deal with the issue of trust is
> to aggregate the reports in various ways and ignore some of the
> noise that would lead to FPs. And all SURBL lists are subject to
> whitelisting as a final arbiter. So even if a SpamCop user
> wanted us to blacklist say google.com or yahoo.com, we won't.
>
>
>> JP
>> monitors 18,000 domains, not just two people. AB accepts feeds directly
>> from spamcop and does different analysis on them. Ultimately it is
>> possible for a single copy of an email to cause a listing in
>> uribl_black, SC, WS, JP, and OB all at the same time.
>>
>
> Not really. It takes a fairly large and widespread spam run to
> get onto multiple (SURBL) lists.

So why do some small-spread legitimate mailings with special-purpose domains end up multi-listed? I've seen this happen a number of times in the past 3 weeks.

This *IS* real. It's not terribly common in terms of % of email, but maybe 1 in 1000 ham mails I get has a double-listed link in it.
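
For anyone who wants to put numbers on the multi-listing question, here is a rough sketch of how the SURBL_MULTI counts from the meta tests above could be tallied outside SpamAssassin. It assumes you already have, for each message, a ham/spam label and the set of rule names that hit; that input format and the function name multi_hit_rates are made up for illustration and are not part of any SpamAssassin tool. Note that it mirrors the rules exactly as written, so a threshold of 2 means "more than 2 lists hit", i.e. three or more.

```python
from collections import Counter

# The five SURBL-derived rules summed in the SURBL_MULTI* meta tests.
SURBL_RULES = {
    "URIBL_JP_SURBL",
    "URIBL_SC_SURBL",
    "URIBL_AB_SURBL",
    "URIBL_OB_SURBL",
    "URIBL_WS_SURBL",
}


def multi_hit_rates(messages, thresholds=(2, 3, 4)):
    """Return {threshold: {"spam": pct, "ham": pct}}.

    `messages` is an iterable of (is_spam, rule_names) pairs. This input
    shape is an assumption for the sketch, not a SpamAssassin format.
    A message counts toward a threshold N when MORE than N of the five
    SURBL rules hit, matching the `(sum) > N` form of the meta rules.
    """
    totals = Counter()
    hits = {t: Counter() for t in thresholds}

    for is_spam, rule_names in messages:
        kind = "spam" if is_spam else "ham"
        totals[kind] += 1
        listed_on = len(SURBL_RULES & set(rule_names))
        for t in thresholds:
            if listed_on > t:
                hits[t][kind] += 1

    return {
        t: {
            kind: (100.0 * hits[t][kind] / totals[kind]) if totals[kind] else 0.0
            for kind in ("spam", "ham")
        }
        for t in thresholds
    }
```

The ham percentage for the lowest threshold is the figure of interest here: if the lists were truly independent it should be far below the ham rate of any single list, and if they share sources it will not be.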