Re: Over-scoring of SURBL lists...

Jeff Chan Fri, 17 Feb 2006 07:41:17 -0800

On Friday, February 17, 2006, 7:19:50 AM, Matt Kettler wrote:
> Jeff Chan wrote:
>> On Thursday, February 16, 2006, 9:13:36 PM, Matt Kettler wrote:
>>   
>>> I'm only presenting evidence of accuracy problems in relation to why the
>>> URIBLs collectively wield a great deal of power in SpamAssassin scoring.
>>> I'm not really complaining about uribl.com, I'm complaining about URIBLs
>>> as a whole. That's both uribl.com and surbl. Whenever I use the term
>>> URIBL in all caps, I mean all URI dns-based blacklists. If you prefer,
>>> I'll retract my uribl.com example, and point out that less than an hour
>>> later, I got a ws.surbl.org FP.
>>>     
>>
>> There may be some value in not lumping together URIBL.com and
>> SURBL.org lists.  As you can see the performance of the lists is
>> different, and the way they're created is different too.  That
>> makes it harder for us to respond to comments that seem to not
>> take those differences into account.
>>   
> Did you see Theo's test data from yesterday?


Yes.  I was referring to lumping URIBL.com with SURBL.org mostly.

> OVERALL%   SPAM%     HAM%      S/O   RANK   SCORE  NAME
>  35.418  41.1930   0.0000    1.000   0.90    0.00  URIBL_JP_SURBL
>  34.665  40.3177   0.0000    1.000   0.88    0.00  URIBL_SC_SURBL
>  26.069  30.3204   0.0000    1.000   0.80    0.00  URIBL_AB_SURBL
>  28.024  32.5464   0.2915    0.991   0.61    0.00  URIBL_OB_SURBL
>  48.113  55.7492   1.2873    0.977   0.55    0.00  URIBL_BLACK
>   0.293   0.3406   0.0000    1.000   0.47    0.00  URIBL_PH_SURBL
>   0.000   0.0000   0.0000    0.500   0.42    0.00  URIBL_RED
>   0.000   0.0000   0.0000    0.500   0.42    0.01  T_URIBL_XS_SURBL
>  37.539  42.4763   7.2626    0.854   0.38    0.00  URIBL_WS_SURBL
>   0.548   0.3446   1.7974    0.161   0.03    0.00  URIBL_GREY

> I consider that "highly similar" for JP, SC, AB, OB and WS.

As similar as 30 and 40, and 0, .3 and 7 are, I suppose.
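
The numbers being compared are the second and third columns of the quoted table. A minimal sketch, assuming those columns are SPAM% and HAM% and the fourth is S/O (spam hits as a share of all hits), which is consistent with the quoted values. The function name and the 0.5 fallback for rules with no hits are illustrative assumptions, not taken from any tool:

```python
# Rows copied from the quoted table: (SPAM%, HAM%, reported S/O, rule name).
ROWS = [
    (41.1930, 0.0000, 1.000, "URIBL_JP_SURBL"),
    (40.3177, 0.0000, 1.000, "URIBL_SC_SURBL"),
    (30.3204, 0.0000, 1.000, "URIBL_AB_SURBL"),
    (32.5464, 0.2915, 0.991, "URIBL_OB_SURBL"),
    (55.7492, 1.2873, 0.977, "URIBL_BLACK"),
    (0.3406, 0.0000, 1.000, "URIBL_PH_SURBL"),
    (0.0000, 0.0000, 0.500, "URIBL_RED"),
    (0.0000, 0.0000, 0.500, "T_URIBL_XS_SURBL"),
    (42.4763, 7.2626, 0.854, "URIBL_WS_SURBL"),
    (0.3446, 1.7974, 0.161, "URIBL_GREY"),
]


def spam_over_overall(spam_pct, ham_pct):
    """Share of a rule's hits that land on spam, from its SPAM% and HAM%."""
    total = spam_pct + ham_pct
    if total == 0:
        # Assumption: a rule with no hits is reported as 0.5, matching the
        # 0.500 shown for URIBL_RED and T_URIBL_XS_SURBL in the table.
        return 0.5
    return spam_pct / total


for spam_pct, ham_pct, reported, name in ROWS:
    computed = spam_over_overall(spam_pct, ham_pct)
    print(f"{name:18s} computed S/O={computed:.3f}  table S/O={reported:.3f}")
```

Read this way, the spam hit rates of JP, SC, AB, OB and WS are in the same range (30 to 42 percent), while the ham hit rates run from 0 to over 7 percent.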

> Also, even if there are some differences, even 10% overlap would have
> the effect I'm talking about.

> I personally would like to see some statistics, but  at this point, we
> don't have any test data on this so we're arguing your theory vs mine.

> I'd love to see some results for some meta tests:

> meta SURBL_MULTI2   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL +
> URIBL_OB_SURBL+  URIBL_WS_SURBL) >2)
> meta SURBL_MULTI3   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL +
> URIBL_OB_SURBL+  URIBL_WS_SURBL) >3)
> meta SURBL_MULTI4   ((URIBL_JP_SURBL + URIBL_SC_SURBL + URIBL_AB_SURBL +
> URIBL_OB_SURBL+  URIBL_WS_SURBL) >4)

> In particular, I'm concerned about the ham hits of even multi 2.

I'd be concerned about it too, but it seldom seems to happen.
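
For anyone wanting to check this against their own corpus before the meta rules are tested, here is a rough sketch of the same counting in Python. It assumes each message is represented as the set of rule names that hit it and that each rule contributes 1 to the sum when it hits; the helper names and the sample corpus are hypothetical.

```python
SURBL_RULES = [
    "URIBL_JP_SURBL",
    "URIBL_SC_SURBL",
    "URIBL_AB_SURBL",
    "URIBL_OB_SURBL",
    "URIBL_WS_SURBL",
]


def lists_hit(hits):
    """Count how many of the five SURBL rules appear in a message's hits."""
    return sum(1 for rule in SURBL_RULES if rule in hits)


def meta_fires(hits, threshold):
    """Mirror a meta expression of the form (A + B + C + D + E) > threshold."""
    return lists_hit(hits) > threshold


def multi_hit_rate(messages, threshold):
    """Fraction of messages for which the meta expression would be true."""
    if not messages:
        return 0.0
    return sum(1 for hits in messages if meta_fires(hits, threshold)) / len(messages)


# Hypothetical ham corpus: one double-listed message out of four.
ham = [
    set(),
    {"URIBL_WS_SURBL"},
    {"URIBL_WS_SURBL", "URIBL_OB_SURBL"},
    set(),
]
print("more than 1 list:", multi_hit_rate(ham, 1))
print("more than 2 lists:", multi_hit_rate(ham, 2))
```

Note that with a strict comparison, `> 2` is true only when at least three lists hit. Measuring the double-listed case under discussion would need `> 1`.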

> Theo?
>>> 3) I'm even more concerned about the monoculture of the URIBLs.
>>>     
>>
>> I suppose it depends on your point of view.  From my point of
>> view the various lists are different in terms of sources and
>> listing logic.  As you can see from the results posted, they have
>> fairly different performance in terms of spam and ham hits, but
>> those measurements don't take into account the underlying
>> tools and sources that go into making them, which vary between
>> lists.
>>   

> I don't see the difference from the recent results posted by Theo.

That's like saying two different RBLs that hit a similar
percentage of spams must therefore have the same policies, even
when they may have no data in common.  It's not a conclusion that
can be drawn from that kind of measurement.
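
A toy illustration of that point, using made-up data: two lists can hit the same fraction of a corpus while sharing no entries at all. The domains, list contents and function names below are hypothetical.

```python
def hit_rate(listed, messages):
    """Fraction of messages containing at least one listed domain."""
    if not messages:
        return 0.0
    return sum(1 for domains in messages if domains & listed) / len(messages)


def overlap(list_a, list_b):
    """Jaccard overlap of two lists: shared entries over all entries."""
    union = list_a | list_b
    if not union:
        return 0.0
    return len(list_a & list_b) / len(union)


# Hypothetical spam corpus: each message is the set of domains it links to.
spam = [
    {"a.example"},
    {"b.example"},
    {"c.example"},
    {"d.example"},
]

# Two hypothetical blacklists built from different sources.
list_a = {"a.example", "b.example"}
list_b = {"c.example", "d.example"}

print("list_a hit rate:", hit_rate(list_a, spam))
print("list_b hit rate:", hit_rate(list_b, spam))
print("overlap between lists:", overlap(list_a, list_b))
```

Both lists hit half of the corpus, yet their overlap is zero. Hit percentages alone cannot say how much data two lists share; that takes a direct comparison of the list contents or of per-message co-hits.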

>>> uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all
>>> more-or-less the same list. Paul argued against that statement, but in
>>> my mind his arguments are weak at best. There IS considerable overlap
>>> between these lists. Contrary to Paul's statements, you only need to be
>>> reported once by a spamcop spamtrap or trusted feed to be on SC.
>>>     
>>
>> That's only partially correct.  Paul's statement is correct for
>> most SpamCop reports.  It takes many reports to get on SC for
>> most domains except the ones that resolve into known spammer
>> networks.
>>   
>> There are no "trusted feeds" for SC.

> Not on your end, but keep in mind that spamcop trusts their spamtraps
> with a 5x bias.

Our feeds are SpamCop user and mole reports, not SpamCop trap data.

>> The data in SC comes from
>> SpamCop reports.  I don't know the number of SpamCop users, but
>> they're probably many.  The way I deal with the issue of trust is
>> to aggregate the reports in various ways and ignore some of the
>> noise that would lead to FPs.  And all SURBL lists are subject to
>> whitelisting as a final arbiter.  So even if a SpamCop user
>> wanted us to blacklist say google.com or yahoo.com, we won't.
>>
>>   
>>> JP
>>> monitors 18,000 domains, not just two people. AB accepts feeds directly
>>> from spamcop and does different analysis on them. Ultimately it is
>>> possible for a single copy of an email to cause a listing in
>>> uribl_black, SC, WS, JP, and OB all at the same time.
>>>     
>>
>> Not really.  It takes a fairly large and widespread spam run to
>> get onto multiple (SURBL) lists.

> So why do some small-spread legitimate mailings with special-purpose
> domains end up multi-listed? I've seen this happen a number of times in
> the past 3 weeks. This *IS* real.

> It's not terribly common in terms of % of email, but maybe 1 in 1000 ham
> mails I get has a double-listed link in it.

I don't know.  It's hard to consider in the abstract.

Perhaps you'd care to name an example.

Cheers,

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/
