* jdow wrote (23/12/05 12:06):
> From: "Chris Lear" <[EMAIL PROTECTED]>
>>* jdow wrote (23/12/05 11:26):
>>> From: "Chris Lear" <[EMAIL PROTECTED]>
>>> 
>>>> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
>>>> therefore skewing the scoring of some mail quite badly.
>>>> The weird thing is that the uris that spamassassin is complaining about
>>>> aren't uris at all. The mail in question is auto-created reports of cvs
>>>> diffs, so it's slightly unusual.
>> 
>> [...]
>>>> 
>>>> I've had a bit of a look at the regexps that spamassassin uses to work
>>>> out what is a uri, and it seems that "updated.by=Updated" is treated as
>>>> a uri because .by is a valid tld and spamassassin looks for "schemeless"
>>>> uris, then prepends http:// for the tests.
>>>> 
>>>> I'm running spamassassin 3.1.0 on perl 5.8.2.
>>>> 
>>>> Does anyone have any suggestions, apart from simply reducing the score
>>>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>>>> guarantee that only real uris are parsed as such?
>>> 
>>> Before you drop the score precipitously, check whether there is some
>>> other characteristic of the emails that trigger falsely which can be
>>> used to apply a negative score. If there is such a characteristic,
>>> then generate the appropriate negative score. If not, weigh how
>>> effective the rule is for you. The version of "sa-stats.pl" that is
>>> on the SARE site helps figure this out nicely.
>>> 
>>> That said it's close to a "50/50" rule that hits on very few messages
>>> here so should have a low score. (It hit on 6 messages out of 75000.)
>>> Cutting it out completely here seems like it would be effective TODAY.
>>> That could change. At one time it was quite necessary. Spammer fads
>>> change.
>> 
>> I've reduced the score, and a quick check shows that that rule hits
>> almost nothing anyway, so it's not a big problem. The bayes rules were
>> keeping the false positives from doing much damage in any case.
>> But spamassassin uses uris for lots of things, and if it's commonly
>> parsing (reasonably) normal text as uris, I would expect that to be a
>> problem in more rules than just SARE_URI_EQUALS.
> 
> That is a standalone rule.
> 
> And I do note that many of the SARE rules have severe problems in very
> specific cases. There are some mailing lists that are not well filtered
> for spam whose postings trigger some of the "too effective to toss"
> SARE rules. I've developed some massive meta rules to at least
> partially get a handle on the problem. (A "number of times XXX hit"
> option would be nice to have for this.)
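
For what it's worth, I can see how that kind of compensating rule or
meta rule would work. Something along these lines in local.cf is
presumably what you mean (the rule names, the header test and the
scores here are invented purely for illustration; I haven't run this):

  # Lower the weight of the overly aggressive rule itself...
  score    SARE_URI_EQUALS  1.0

  # ...and/or give the known-good automated mail a compensating rule.
  header   LOCAL_CVS_DIFF   Subject =~ /\bcvs\b/i
  describe LOCAL_CVS_DIFF   Looks like an automated CVS diff report
  score    LOCAL_CVS_DIFF   0.001

  # Only subtract points when the false positive actually happens.
  meta     LOCAL_CVS_FP     (LOCAL_CVS_DIFF && SARE_URI_EQUALS)
  describe LOCAL_CVS_FP     Automated CVS report hit by SARE_URI_EQUALS
  score    LOCAL_CVS_FP     -4.0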

Sorry to go on, but I wonder whether you've missed my point. The
SARE_URI_EQUALS rule is working fine. It just looks at the uris that
spamassassin gives it, and complains when they contain "=".
The problem is that spamassassin is treating things that aren't uris as
uris. So SARE_URI_EQUALS is working on dud data.
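
As far as I can see, the rule itself is just a pattern run over every
uri that spamassassin extracts. I haven't checked the exact SARE
definition, but it presumably amounts to something like this (the
_SKETCH name is mine, not the real rule):

  # Sketch only; the actual SARE rule text may differ.
  uri      SARE_URI_EQUALS_SKETCH  /=/
  describe SARE_URI_EQUALS_SKETCH  URI contains an equals sign

So it can only ever be as good as the list of uris it is handed.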

In this specific case, the e-mail contains the text
"updated.by=Updated". This is not a uri, and nor should it be treated as
one. But spamassassin thinks it is (becasue .by is a valid tld), so, as
far as I can tell, *all* uri rules will check it. It so happens that
SARE_URI_EQUALS hits in this case, but other uri rules are vulnerable to
false positives if the uri parsing is wrong, aren't they?
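
To make it concrete, here is a rough stand-alone Perl sketch of that
kind of schemeless-uri heuristic. It is not spamassassin's actual
parser and the tld list is cut down to a handful of entries, but it
shows how the CVS text falls straight into the uri tests:

  #!/usr/bin/perl
  # Toy heuristic: anything shaped like host.tld gets picked up and
  # given an http:// scheme before the uri rules see it.
  use strict;
  use warnings;

  my @tlds   = qw(com net org by);   # .by (Belarus) is a real tld
  my $tld_re = join '|', @tlds;

  my $text = 'updated.by=Updated';

  while ($text =~ /\b([a-z0-9.-]+\.(?:$tld_re)\b\S*)/gi) {
      my $uri = $1;
      $uri = "http://$uri" unless $uri =~ m{^[a-z]+://}i;
      print "extracted uri: $uri\n";  # prints http://updated.by=Updated
  }

That "uri" is then handed to SARE_URI_EQUALS and every other uri rule.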

Chris
