* jdow wrote (23/12/05 12:06):
> From: "Chris Lear" <[EMAIL PROTECTED]>
>>* jdow wrote (23/12/05 11:26):
>>> From: "Chris Lear" <[EMAIL PROTECTED]>
>>>
>>>> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
>>>> therefore skewing the scoring of some mail quite badly.
>>>> The weird thing is that the uris that spamassassin is complaining about
>>>> aren't uris at all. The mail in question is auto-created reports of cvs
>>>> diffs, so it's slightly unusual.
>>
>> [...]
>>>>
>>>> I've had a bit of a look at the regexps that spamassassin uses to work
>>>> out what is a uri, and it seems that "updated.by=Updated" is treated as
>>>> a uri because .by is a valid tld and spamassassin looks for "schemeless"
>>>> uris, then prepends http:// for the tests.
>>>>
>>>> I'm running spamassassin 3.1.0 on perl 5.8.2.
>>>>
>>>> Does anyone have any suggestions, apart from simply reducing the score
>>>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>>>> guarantee that only real uris are parsed as such?
>>>
>>> Before you drop the score precipitously check if there is some other
>>> characteristic of the emails that trigger falsely which can be used to
>>> apply a negative score. If there is such a characteristic then generate
>>> the appropriate negative score. If not weigh how effective the rule is
>>> for you. The version of "sa-stats.pl" that is on the SARE site helps
>>> figure this out nicely.
>>>
>>> That said it's close to a "50/50" rule that hits on very few messages
>>> here so should have a low score. (It hit on 6 messages out of 75000.)
>>> Cutting it out completely here seems like it would be effective TODAY.
>>> That could change. At one time it was quite necessary. Spammer fads
>>> change.)
>>
>> I've reduced the score, and a quick check shows that that rule hits
>> almost nothing anyway, so it's not a big problem. The bayes rules were
>> keeping the false positives from doing much damage, anyway.
>> But spamassassin uses uris for lots of things, and if it's commonly
>> parsing (reasonably) normal text as uris, I would expect that to be a
>> problem in more rules than just SARE_URI_EQUALS.
>
> That is a standalone rule.
>
> And I do note that many of the SARE rules have severe problems in very
> specific cases. There are some mailing lists that are not well filtered
> for spam which have postings which trigger some of the "too effective
> to toss" SARE rules. I've developed some massive meta rules to at least
> partially get a handle on the problem. (A number of times XXX hit option
> would be nice to have for this.)
Sorry to go on, but I wonder whether you've missed my point. The
SARE_URI_EQUALS rule is working fine. It just looks at the uris that
spamassassin gives it, and complains when they contain "=". The problem
is that spamassassin is treating things that aren't uris as uris, so
SARE_URI_EQUALS is working on dud data.

In this specific case, the e-mail contains the text "updated.by=Updated".
This is not a uri, and nor should it be treated as one. But spamassassin
thinks it is (because .by is a valid tld), so, as far as I can tell, *all*
uri rules will check it. It so happens that SARE_URI_EQUALS hits in this
case, but other uri rules are vulnerable to false positives if the uri
parsing is wrong, aren't they?

Chris
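
To make the failure mode above concrete, here is a minimal sketch in Python (not SpamAssassin's Perl). The TLD list, the regex, and the function names are all hypothetical and far simpler than SpamAssassin's real uri parser; the sketch only assumes the behaviour described above, namely that schemeless text ending in a valid tld is accepted and then has http:// prepended.

```python
import re

# Hypothetical, tiny TLD list; a real filter would use the full registry.
TLDS = ("com", "net", "org", "by")

# Simplified stand-in for schemeless-uri detection: a dotted host name
# ending in a known TLD, followed by any run of non-space characters.
SCHEMELESS = re.compile(
    r"\b(?:[a-z0-9-]+\.)+(?:%s)\b\S*" % "|".join(TLDS),
    re.IGNORECASE,
)


def extract_uris(text):
    """Return candidate uris, prepending http:// to each schemeless match."""
    return ["http://" + m.group(0) for m in SCHEMELESS.finditer(text)]


def uri_has_equals(uris):
    """Stand-in for a rule that fires when any extracted uri contains '='."""
    return any("=" in u for u in uris)


uris = extract_uris("updated.by=Updated")
print(uris)                  # ['http://updated.by=Updated']
print(uri_has_equals(uris))  # True
```

In this sketch the rule itself behaves correctly on what it is given; the false positive comes entirely from the extraction step, so any other rule fed the same list would see the same bogus uri.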