I'm also catching up on this thread and wasn't sure where to reply, so I'll make my observations here.

Matt, I think you have a legitimate concern. I think I can sum up the points of view as follows:

1. For grey URIs (perhaps Scott's, for example) and/or FPs due to non-spam URIs being listed, a user's bayes_00 score should be capable of dropping the spam below the 5.0 threshold. As it sits right now, the high scores of the uribls mean this cannot happen.

2. No, the uribls work great at *overriding* an erroneous bayes_00 score caused by short URI-only spam messages. It works well because the URIs are collected in a unique manner by each list and hand-verified.

3. WTF, the uribls can't possibly FP!!! Er, what I really mean is, why haven't you been reporting these URIs as FPs?

Now for my opinion:

I agree with Matt that the potential for FPs due to multiple listings of grey URIs, or even non-spam URIs, exists, and I think he's shown that it is more than theoretical. However, even in a strictly theoretical argument I would still argue that the uribls together should not be so powerful that they simply cannot be countered. I think it's fair to say that SpamAssassin has been designed such that no one spam sign by itself should be utterly overpowering. I would tend to group the uribls together as a single type of spam sign, even though the vectors for getting listed happen to be different. On the other hand, it certainly helps accuracy that each URI is hand-checked. On the other other hand, as we know, there will always be a grey area.

With that said, I think the idea of a base uribl score plus additional points per uribl has some merit. Something like

    meta (URIBL_WS_SURBL | URIBL_JP_SURBL .... etc) 3
    score URIBL_WS_SURBL 1.5
    ... etc

as an example. (I think this may have been suggested already, but I have read so many posts now that I can't remember what came from my head and what came from yours.)

Now for a scoring question: isn't the perceptron supposed to factor out decisively overlapping rules? If so, why the enormously high scores for all the different uribls?
From my stats, 50% of the spam that hits SURBL hits 4 or 5 of the SURBL lists. Shouldn't the perceptron have noticed that and lowered the scores? Or is the mass-check bug that Theo mentioned keeping the scores from being deflated?

OK, I think I'm done rambling now!

Chris Thielen
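
P.S. To make the arithmetic of the base-plus-per-list idea concrete, here is a rough Python sketch. The 3 and the 1.5 are the numbers from my example and 5.0 is the threshold discussed in this thread. Everything else is made up for illustration: the function names are hypothetical, the -2.5 is only a stand-in for a bayes_00 score, and none of this is SpamAssassin code or an actual SpamAssassin score.

```python
# Rough sketch of "base uribl score plus additional points per uribl".
# Numbers taken from the post: threshold 5.0, base 3, 1.5 per list.
# HYPOTHETICAL: the function names and the Bayes stand-in value below.

THRESHOLD = 5.0        # spam threshold discussed in the thread
BASE_SCORE = 3.0       # score of the meta rule that fires if any list hits
PER_LIST_SCORE = 1.5   # extra points for each individual list that hits
ASSUMED_BAYES_00 = -2.5  # placeholder only, not a real SpamAssassin score


def uribl_score(lists_hit, base=BASE_SCORE, per_list=PER_LIST_SCORE):
    """Total uribl contribution for a message listed on `lists_hit` lists."""
    if lists_hit <= 0:
        return 0.0
    return base + per_list * lists_hit


def is_spam(lists_hit, other_score, threshold=THRESHOLD):
    """True if the uribl contribution plus all other rules reaches the threshold."""
    return uribl_score(lists_hit) + other_score >= threshold


if __name__ == "__main__":
    for n in range(0, 6):
        total = uribl_score(n) + ASSUMED_BAYES_00
        verdict = "spam" if is_spam(n, ASSUMED_BAYES_00) else "ham"
        print(f"{n} list(s): uribl={uribl_score(n):.1f} total={total:.1f} {verdict}")
```

With these placeholder numbers, a message listed on one or two lists can still be pulled under the threshold by Bayes, while one listed on four or five cannot. Where that crossover should sit is exactly the tuning question.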