I'm also catching up on this thread and wasn't sure where to reply, so
I'll make my observations here.


Matt, I think you have a legitimate concern.  I think I can sum up the
points of view as follows:

1. For grey URIs (perhaps Scott's, for example) and/or FPs due to
non-spam URIs being listed, a user's bayes_00 score should be capable of
dropping the message below the 5.0 threshold.  As it stands right now,
the high scores of the uribls mean this cannot happen.

2. No, the uribls work great at *overriding* an erroneous bayes_00 score
caused by short URI-only spam messages.  They work well because the URIs
are collected in a unique manner by each list and hand-verified.

3. WTF, the uribls can't possibly FP!!!  Er, what I really mean is, why
haven't you been reporting these URIs as FPs?



Now for my opinion: I agree with Matt that the potential for FPs due to
multiple listings of grey URIs or even non-spam URIs exists and I think
he's shown that it is more than theoretical.  However, even in a
strictly theoretical argument I would still argue that the uribls
together should not be so powerful that they simply cannot be countered. 

I think it's fair to say that SpamAssassin has been designed such that
no one spam sign by itself should be utterly overpowering.  I would tend
to group the uribls together as a single type of spam-sign, even though
the vectors for getting listed happen to be different.  On the other
hand, it certainly helps accuracy that each URI is hand-checked.  On the
other other hand, as we know there will always be a grey area.


With that said, I think the idea of a base uribl score, plus additional
points per uribl has some merit.  Something like

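# URIBL_SURBL_ANY is a made-up name for the combined "any list hit" rule
meta URIBL_SURBL_ANY (URIBL_WS_SURBL || URIBL_JP_SURBL .... etc)
score URIBL_SURBL_ANY 3
score URIBL_WS_SURBL 1.5
... etc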

as an example.  (I think this may have been suggested already, but I
have read so many posts now that I can't remember what came from my head
and what came from yours.)
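
To make the arithmetic concrete, here is a rough Python sketch of how I
read the "base plus per-list" idea.  It is not SpamAssassin code.  Only
the 5.0 threshold, the base of 3 and the 1.5 per list come from this
post; the function names and the -2.5 used for a ham-ish Bayes result are
made up for illustration, and I'm assuming a message is tagged once its
total reaches the threshold.

```python
# Illustrative sketch only, not SpamAssassin code.
THRESHOLD = 5.0        # spam threshold mentioned in this thread
BASE_SCORE = 3.0       # awarded once if any uribl lists the URI
PER_LIST_SCORE = 1.5   # awarded for each uribl that lists the URI


def uribl_total(lists_hit):
    """Base score once, plus a fixed increment per list that hit."""
    if lists_hit <= 0:
        return 0.0
    return BASE_SCORE + PER_LIST_SCORE * lists_hit


def is_spam(lists_hit, other_points=0.0):
    """Assumes a message is tagged when its total reaches the threshold."""
    return uribl_total(lists_hit) + other_points >= THRESHOLD


if __name__ == "__main__":
    hypothetical_bayes = -2.5  # made-up value, not a real rule score
    for n in range(6):
        print(n, uribl_total(n), is_spam(n, hypothetical_bayes))
```

With those made-up numbers, a single listing can still be pulled under
the threshold by a negative Bayes score, while several listings together
cannot.  Where exactly that crossover should sit is the thing to argue
about.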



Now for a scoring question: isn't the perceptron supposed to factor out
decisively overlapping rules?  If so, why the enormously high scores for
all the different uribls?  In my stats, 50% of the spam that hits SURBL
hits 4 or 5 of the SURBL lists.  Shouldn't the perceptron have noticed
that and lowered the scores?  Or is the mass-check bug that Theo
mentioned keeping the scores from being deflated?



OK, I think I'm done rambling now!

Chris Thielen






