Re: Over-scoring of SURBL lists...

Matt Kettler Fri, 17 Feb 2006 07:20:09 -0800

Jeff Chan wrote:
> On Thursday, February 16, 2006, 9:13:36 PM, Matt Kettler wrote:
>   
>> I'm only presenting evidence of accuracy problems in relation to why the
>> URIBLs collectively wield a great deal of power in SpamAssassin scoring.
>> I'm not really complaining about uribl.com, I'm complaining about URIBLs
>> as a whole. That's both uribl.com and surbl. Whenever I use the term
>> URIBL in all caps, I mean all URI dns-based blacklists. If you prefer,
>> I'll retract my uribl.com example, and point out that less than an hour
>> later, I got a ws.surbl.org FP.
>>     
>
> There may be some value in not lumping together URIBL.com and
> SURBL.org lists.  As you can see, the performance of the lists is
> different, and the way they're created is different too.  That
> makes it harder for us to respond to comments that seem to not
> take those differences into account.
>   
Did you see Theo's test data from yesterday?


OVERALL%   SPAM%     HAM%      S/O   RANK   SCORE  NAME
 35.418  41.1930   0.0000    1.000   0.90    0.00  URIBL_JP_SURBL
 34.665  40.3177   0.0000    1.000   0.88    0.00  URIBL_SC_SURBL
 26.069  30.3204   0.0000    1.000   0.80    0.00  URIBL_AB_SURBL
 28.024  32.5464   0.2915    0.991   0.61    0.00  URIBL_OB_SURBL
 48.113  55.7492   1.2873    0.977   0.55    0.00  URIBL_BLACK
  0.293   0.3406   0.0000    1.000   0.47    0.00  URIBL_PH_SURBL
  0.000   0.0000   0.0000    0.500   0.42    0.00  URIBL_RED
  0.000   0.0000   0.0000    0.500   0.42    0.01  T_URIBL_XS_SURBL
 37.539  42.4763   7.2626    0.854   0.38    0.00  URIBL_WS_SURBL
  0.548   0.3446   1.7974    0.161   0.03    0.00  URIBL_GREY

I consider that "highly similar" for JP, SC, AB, OB and WS.

Also, even if there are some differences, even 10% overlap would have
the effect I'm talking about.

I personally would like to see some statistics, but at this point we
don't have any test data on this, so we're arguing your theory vs mine.

I'd love to see some results for some meta tests:

meta SURBL_MULTI2   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 2)
meta SURBL_MULTI3   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 3)
meta SURBL_MULTI4   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL + URIBL_OB_SURBL + URIBL_WS_SURBL) > 4)

In particular, I'm concerned about the ham hits of even multi 2.
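
For anyone who wants to approximate those numbers from existing per-message
rule hits before a full mass-check, here is a rough Python sketch. It assumes
you already have, for each message, a spam/ham label and the set of rule names
that fired; the function name and input shape are hypothetical, not part of
any SpamAssassin tool. It mirrors the meta rules as written, so note that
`> 2` only fires when three or more of the five lists hit.

```python
from collections import Counter

# The five SURBL rules summed in the meta tests above.
SURBL_RULES = {
    "URIBL_JP_SURBL",
    "URIBL_SC_SURBL",
    "URIBL_AB_SURBL",
    "URIBL_OB_SURBL",
    "URIBL_WS_SURBL",
}


def multi_hit_rates(messages, thresholds=(2, 3, 4)):
    """Return the fraction of spam and ham exceeding each threshold.

    `messages` is an iterable of (is_spam, rule_names) pairs. This input
    format is an assumption made for the sketch.
    """
    totals = Counter()
    hits = {t: Counter() for t in thresholds}
    for is_spam, rule_names in messages:
        label = "spam" if is_spam else "ham"
        totals[label] += 1
        # Number of distinct SURBL lists that hit this message.
        listed = len(SURBL_RULES & set(rule_names))
        for t in thresholds:
            if listed > t:  # same strict comparison as the meta rules
                hits[t][label] += 1
    return {
        t: {
            label: (hits[t][label] / totals[label]) if totals[label] else 0.0
            for label in ("spam", "ham")
        }
        for t in thresholds
    }
```

The ham column of the result is the figure I care about: any nonzero ham rate
at the lowest threshold means stacked scores are landing on legitimate mail.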

Theo?
>> 3) I'm even more concerned about the monoculture of the URIBLs.
>>     
>
> I suppose it depends on your point of view.  From my point of
> view the various lists are different in terms of sources and
> listing logic.  As you can see from the results posted, they have
> fairly different performance in terms of spam and ham hits, but
> those measurements don't take into account the underlying
> tools and sources that go into making them, which varies between
> lists.
>   

I don't see the difference from the recent results posted by Theo.
>   
>> uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all
>> more-or-less the same list. Paul argued against that statement, but in
>> my mind his arguments are weak at best. There IS considerable overlap
>> between these lists. Contrary to Paul's statements, you only need to be
>> reported once by a spamcop spamtrap or trusted feed to be on SC.
>>     
>
> That's only partially correct.  Paul's statement is correct for
> most SpamCop reports.  It takes many reports to get on SC for
> most domains except the ones that resolve into known spammer
> networks.
>   
> There are no "trusted feeds" for SC.  
Not on your end, but keep in mind that spamcop trusts their spamtraps
with a 5x bias.


> The data in SC comes from
> SpamCop reports.  I don't know the number of SpamCop users, but
> they're probably many.  The way I deal with the issue of trust is
> to aggregate the reports in various ways and ignore some of the
> noise that would lead to FPs.  And all SURBL lists are subject to
> whitelisting as a final arbiter.  So even if a SpamCop user
> wanted us to blacklist say google.com or yahoo.com, we won't.
>
>   
>> JP
>> monitors 18,000 domains, not just two people. AB accepts feeds directly
>> from spamcop and does different analysis on them. Ultimately it is
>> possible for a single copy of an email to cause a listing in
>> uribl_black, SC, WS, JP, and OB all at the same time.
>>     
>
> Not really.  It takes a fairly large and widespread spam run to
> get onto multiple (SURBL) lists.  
So why do some small-spread legitimate mailings with special-purpose
domains end up multi-listed? I've seen this happen a number of times in
the past 3 weeks. This *IS* real.

It's not terribly common in terms of % of email, but maybe 1 in 1000 ham
mails I get has a double-listed link in it.
