Re: Over-scoring of SURBL lists...

Jeff Chan Thu, 16 Feb 2006 23:08:31 -0800

On Thursday, February 16, 2006, 9:13:36 PM, Matt Kettler wrote:
> I'm only presenting evidence of accuracy problems in relation to why the
> URIBLs collectively wield a great deal of power in SpamAssassin scoring.
> I'm not really complaining about uribl.com, I'm complaining about URIBLs
> as a whole. That's both uribl.com and surbl. Whenever I use the term
> URIBL in all caps, I mean all URI dns-based blacklists. If you prefer,
> I'll retract my uribl.com example, and point out that less than an hour
> later, I got a ws.surbl.org FP.


There may be some value in not lumping together the URIBL.com and
SURBL.org lists.  As you can see, the performance of the lists is
different, and the way they're created is different too.  That
makes it harder for us to respond to comments that seem not to
take those differences into account.

> 3) I'm even more concerned about the monoculture of the URIBLs.

I suppose it depends on your point of view.  From my point of
view the various lists are different in terms of sources and
listing logic.  As you can see from the results posted, they have
fairly different performance in terms of spam and ham hits, but
those measurements don't take into account the underlying
tools and sources that go into making them, which vary between
lists.

> uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all
> more-or-less the same list. Paul argued against that statement, but in
> my mind his arguments are weak at best. There IS considerable overlap
> between these lists. Contrary to Paul's statements, you only need to be
> reported once by a spamcop spamtrap or trusted feed to be on SC.

That's only partially correct.  Paul's statement is correct for
most SpamCop reports.  It takes many reports to get on SC for
most domains except the ones that resolve into known spammer
networks.

There are no "trusted feeds" for SC.  The data in SC comes from
SpamCop reports.  I don't know the number of SpamCop users, but
there are probably many.  The way I deal with the issue of trust is
to aggregate the reports in various ways and ignore some of the
noise that would lead to FPs.  And all SURBL lists are subject to
whitelisting as a final arbiter.  So even if a SpamCop user
wanted us to blacklist say google.com or yahoo.com, we won't.
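
The aggregate-then-whitelist idea can be illustrated with a minimal
sketch.  This is not SURBL's actual code: the report threshold, the
function name and the sample domains are all assumptions made up for
the example.  It only shows reports being counted per domain, low-count
noise being ignored, and the whitelist having the final say.

```python
from collections import Counter

# Hypothetical values for illustration only; not SURBL's real settings.
MIN_REPORTS = 20
WHITELIST = {"google.com", "yahoo.com"}


def build_blacklist(reported_domains, min_reports=MIN_REPORTS, whitelist=WHITELIST):
    """Return domains reported at least min_reports times, minus whitelisted ones."""
    counts = Counter(domain.lower() for domain in reported_domains)
    return {
        domain
        for domain, n in counts.items()
        # Drop low-count noise first; the whitelist is the final arbiter.
        if n >= min_reports and domain not in whitelist
    }


# Made-up report stream: one heavily reported domain, one whitelisted
# domain that users reported anyway, and a single stray report.
reports = ["spamdomain.example"] * 25 + ["google.com"] * 40 + ["oneoff.example"]
listed = build_blacklist(reports)
print(sorted(listed))
```

In this sketch the whitelisted domain never gets listed no matter how
many reports it collects, and a single report is not enough to list
anything.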

> JP
> monitors 18,000 domains, not just two people. AB accepts feeds directly
> from spamcop and does different analysis on them. Ultimately it is
> possible for a single copy of an email to cause a listing in
> uribl_black, SC, WS, JP, and OB all at the same time.

Not really.  It takes a fairly large and widespread spam run to
get onto multiple (SURBL) lists.  If we made it even less
sensitive, more people would probably complain about FN rates.
Some already do. 

> It might be
> possible for that one email to list in AB via spamcop, but I'm not sure
> if they have a multi-report requirement or not.

In most cases, it's not possible for one mail to result in AB
listing.

> Sure it's unlikely, but
> there is enough overlap to have it be possible. If that one email is
> mis-classified you have a whopper of a FP problem to deal with.

It appears that you may misestimate the sensitivity of the lists.

> Combining 1-3 you have a serious problem. Due to 2 FPs are relatively
> commonplace, and due to 3 any FPs tend to cascade quickly into multiple
> URIBLs. Due to 1, these rules wield considerable power (> +12) that even
> BAYES_00 can't put a dent in (-2.599)

I and several other people spend hours looking at SURBL FPs and
potential FPs every day.  Very few seem to appear on more than one
SURBL.

I can't speak for URIBL.com.  I don't know about their FPs.

> Ultimately my major problem isn't with the URIBLs themselves. My problem
> is with the structure of the rules in SA 3.1.0 and the outrageously high
> scores they have in SA 3.1.0.

> Really, I think Chris S had a good idea earlier when he suggested just
> rolling all of surbl into one rule. Ditto for uribl.com, but it's only
> got one list worth rolling up. (grey is interesting, but I don't think
> you'd want to aggregate grey and black into a single rule. The FP rate
> of grey would hurt black's score potential). Collectively, these two
> rules should have less than 5.0 as a total score.

I can't speak for URIBL.com, but the reason for having separate
scores for different SURBLs was so that their relative
performance could be scored differently, both in terms of spam
and ham hit rates.

Sometimes spam from a pure blackhat spam gang will hit only one
list initially, so it's important to have a score high enough for a
single constituent list to identify it as spam.

On the other hand, some lists do have higher FP rates than
others, so they should be able to get a lower score.  Separate
scores also facilitate that.

If the SpamAssassin community can come up with some clever new
way to score things so that both FNs and FPs decrease, then I'm
all for that.  It's not clear to me that the current scoring
system is suboptimal, though maybe some of the assumptions are
incorrect or inappropriate for application to URIBLs.
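
To make the trade-off concrete, here is a minimal sketch comparing
separate per-list scores with a single capped rule.  The per-list
scores and the 4.5 cap are invented for illustration and are not the
SA 3.1.0 values; only the BAYES_00 score (-2.599) comes from this
thread, and 5.0 is assumed as the spam threshold.

```python
# Hypothetical per-list scores, for illustration only (not SA 3.1.0 values).
LIST_SCORES = {"WS": 2.0, "SC": 4.0, "JP": 4.0, "AB": 3.0, "OB": 3.0}
BAYES_00 = -2.599   # score quoted earlier in this thread
THRESHOLD = 5.0     # assumed spam threshold


def separate_total(hits):
    """Sum the individual scores of every list the message hit."""
    return sum(LIST_SCORES[name] for name in hits)


def capped_total(hits, cap=4.5):
    """Treat all lists as one rule whose contribution never exceeds cap."""
    return min(separate_total(hits), cap)


def is_spam(score):
    return score >= THRESHOLD


all_lists = ["WS", "SC", "JP", "AB", "OB"]
one_list = ["SC"]

# A message that Bayes considers ham (BAYES_00) but that hits URI lists.
for hits in (one_list, all_lists):
    sep = separate_total(hits) + BAYES_00
    cap = capped_total(hits) + BAYES_00
    print(hits, round(sep, 3), is_spam(sep), round(cap, 3), is_spam(cap))
```

With these made-up numbers a cap keeps a multi-list hit from
overriding BAYES_00, but it also scores a hit on all five lists only
slightly higher than a hit on one, which is the information separate
scores are meant to preserve.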

> This is a stark contrast to a default SA 3.1.0, where the URIBL's from
> surbl.org collectively total 19.715 points by themselves, and 21.354
> when you factor in sbl too.

Only extremely spammy domains will tend to get onto multiple
SURBLs.

Maybe you can post some counterexamples to back up your
discourse.  That might be helpful for understanding exactly what
you're getting at.

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/
