RE: Over-scoring of SURBL lists...

Dallas L. Engelken Fri, 17 Feb 2006 13:05:44 -0800

> -----Original Message-----
> From: Matt Kettler [mailto:[EMAIL PROTECTED] 
> Sent: Friday, February 17, 2006 05:14
> To: Dallas L. Engelken
> Cc: users@spamassassin.apache.org
> Subject: Re: Over-scoring of SURBL lists...
> 
> Dallas L. Engelken wrote:
> >> -----Original Message-----
> >> From: Matt Kettler [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, February 16, 2006 22:50
> >> To: Chris Santerre
> >> Cc: users@spamassassin.apache.org
> >> Subject: Re: Over-scoring of SURBL lists...
> >>
> >> Chris Santerre wrote:
> >>> Matt Kettler wrote:
> >>>> My FPs fall into two categories:
> >>>>
> >>>> 1) URIs that would likely never appear outside of a specialty
> >>>> newsletter. I've had lots of hits on things like:
> >>>> -Authors of programmer's tools
> >>>> -producers of electronic parts
> >>>> -producers of embedded computer systems (Note: embedded, not normal
> >>>> computers..
> >>>> companies like versalogic.com that make parts that only a kiosk
> >>>> manufacturer or extreme geek would use)
> >>>
> >>> Agreed. And we have seen these be more JoeJobs. But some are not. Some
> >>> simply hire mass emailers thinking they are legit, only to find out
> >>> they are not. Just because they are legit for you, doesn't mean they
> >>> haven't spammed someone else. You ask, we remove.
> >>
> >> Yes, the only problem is that I'm getting tired of having to track
> >> down sample emails for FPs so I can find which URI a URIBL FPed on.
> >>
> >> But really, how often or not a URIBL FP's isn't really the point. The
> >> point is they DO FP, and it's really quite common for FP's to be
> >> multi-listed. That multi-listing wields some hefty score biases, way
> >> beyond the power of any other rule in spamassassin other than
> >> BLACKLIST_* and GTUBE.
> >>
> >> I merely find it to be a big problem that URIBLs on the general whole
> >> are rather FP prone, and prone to "cascades" of FPs which unleashes
> >> havoc from the strong scores the perceptron gave them.
> >>
> >> I think the reason the perceptron gave them such high scores is that
> >> a lot of URIBL FP problems get fixed fairly quickly, within a matter
> >> of hours. Ditto for a lot of FN problems.
> >>
> >> By the time the mass-checks are run, the URI's in the corpus emails
> >> are likely well sorted by the reports given to the URIBLs.
> >>
> >
> > Sounds like someone's having a bad day ;)
> >
> >
> >   
> 
> First, a pre-statement:
> 
> I'm only presenting evidence of accuracy problems in relation 
> to why the URIBLs collectively wield a great deal of power in 
> SpamAssassin scoring.
> I'm not really complaining about uribl.com, I'm complaining 
> about URIBLs as a whole. That's both uribl.com and surbl. 
> Whenever I use the term URIBL in all caps, I mean all URI 
> dns-based blacklists. If you prefer, I'll retract my 
> uribl.com example, and point out that less than an hour 
> later, I got a ws.surbl.org FP.
> 
> > And let me remind you,
> >
> > 1) you control which uribl's you run
> > 2) you control how they score
> 
> 
> 1)  I'm talking about the default setup of SA 3.1.0 and the 
> perceptron assigned default scores for the URIBLs it uses.. 
> Not customization.
> Default, stock SA 3.1.0 setup. Note that doesn't really 
> involve uribl.com, but does involve surbl and sbl.
> 
> 2) I do have serious concerns about the accuracy problems of 
> both surbl.org and uribl.com. Particularly in light of #2. 
> uribl.com presents a larger portion of this problem at my 
> site, but surbl has the same basic problems.
> 
> 3) I'm even more concerned about the monoculture of the URIBLs.
> uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all 
> more-or-less the same list. Paul argued against that 
> statement, but in my mind his arguments are weak at best. 
> There IS considerable overlap between these lists. Contrary to 
> Paul's statements, you only need to be reported once by a 
> spamcop spamtrap or trusted feed to be on SC. JP monitors 
> 18,000 domains, not just two people. AB accepts feeds 
> directly from spamcop and does different analysis on them. 
> Ultimately it is possible for a single copy of an email to 
> cause a listing in uribl_black, SC, WS, JP, and OB all at the 
> same time. It might be possible for that one email to list in 
> AB via spamcop, but I'm not sure if they have a multi-report 
> requirement or not. Sure it's unlikely, but there is enough 
> overlap to have it be possible. If that one email is 
> mis-classified you have a whopper of a FP problem to deal with.
>


I think that is a benefit of the single-list classification in
URIBL.com.  We don't crosslist domains (ok, we had a small bug, but
that's been fixed).

> 
> Combining 1-3 you have a serious problem. Due to 2, FPs are 
> relatively commonplace, and due to 3 any FPs tend to cascade 
> quickly into multiple URIBLs. Due to 1, these rules wield 
> considerable power (> +12) that even BAYES_00 can't put a 
> dent in (-2.599)
> 
> Ultimately my major problem isn't with the URIBLs themselves. 
> My problem is with the structure of the rules in SA 3.1.0 and 
> the outrageously high scores they have in SA 3.1.0.
> 
> Really, I think Chris S had a good idea earlier when he 
> suggested just rolling all of surbl into one rule. Ditto for 
> uribl.com, but it's only got one list worth rolling up. (grey 
> is interesting, but I don't think you'd want to aggregate 
> grey and black into a single rule. The FP rate of grey would 
> hurt black's score potential). Collectively, these two rules 
> should have less than 5.0 as a total score.
> 
> This is a stark contrast to a default SA 3.1.0, where the 
> URIBL's from surbl.org collectively total 19.715 points by 
> themselves, and 21.354 when you factor in sbl too.
> 
> 

I'd agree that wrapping up SURBL into a single test could have benefit.
You could still have SURBL_JP, SURBL_OB and other rules standalone and
scoring 0.01, so end users have a choice if they want to adjust them.

score SURBL 1.5
score URIBL 1.5

score URIBL_SURBL 1.0

score SURBL_JP 0.01
score SURBL_OB 0.01
score SURBL_WS 0.01
etc..

The result would be no URIBL-only FPs.  OTOH, you may end up with a
shit-ton of people bitching about spam accuracy dropping in stock 3.2
installs if you make these changes.  
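
To put rough numbers on that, here is a minimal Python sketch of the
arithmetic. It is only an illustration, not how SA computes anything
internally. The rule names and scores are the ones proposed above, the
stock totals are the figures quoted earlier in this thread, and 5.0 is
SA's default required_score. The helper total_score is made up for this
example, and I'm assuming the worst case is every one of these rules
hitting the same message.

```python
# Illustrative only: sums the scores proposed in this mail and compares
# them with SpamAssassin's default spam threshold.
REQUIRED_SCORE = 5.0  # SpamAssassin default required_score

# Scores proposed above (the "etc.." entries are not filled in).
PROPOSED = {
    "SURBL": 1.5,
    "URIBL": 1.5,
    "URIBL_SURBL": 1.0,
    "SURBL_JP": 0.01,
    "SURBL_OB": 0.01,
    "SURBL_WS": 0.01,
}

# Figures quoted in this thread for a stock SA 3.1.0 install.
STOCK_SURBL_TOTAL = 19.715
STOCK_SURBL_PLUS_SBL_TOTAL = 21.354
BAYES_00 = -2.599


def total_score(hits, scores):
    """Hypothetical helper: sum the scores of the rules that hit."""
    return round(sum(scores.get(rule, 0.0) for rule in hits), 3)


# Worst case under the proposal: every listed rule fires on one message.
worst_case = total_score(PROPOSED.keys(), PROPOSED)
print("proposed worst case:", worst_case)
print("URIBL-only FP possible:", worst_case >= REQUIRED_SCORE)

# Stock install: all surbl.org lists hit, and Bayes says "ham".
stock_with_bayes = round(STOCK_SURBL_TOTAL + BAYES_00, 3)
print("stock surbl total + BAYES_00:", stock_with_bayes)
print("still tagged as spam:", stock_with_bayes >= REQUIRED_SCORE)
```

With only the rules listed here, the URIBL hits alone stay under the
threshold, while the stock totals clear it even with BAYES_00 pulling
the other way.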

Dallas
