Re: SARE_URI_EQUALS false positives
* Loren Wilton wrote (24/12/2005 00:23):

>> Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such?
>
> Several.

Hi. Thanks for the response. I'm replying rather late due to pressures of Christmas.

> 1. Change your report generator to remove the extraneous dot between updated and by. Or change it to the more common underscore, if you insist on these words being connected for some reason.
> 2. Put spaces around the equal sign.

These are fine suggestions, but sadly not practical. The e-mails are auto-generated diffs from cvs commits. The files being committed are java properties files. In particular, the updated.by property contains internationalised versions of the phrase Updated by. The more common underscore would be unusual in the java properties file, and expecting the developers to change the way they work to avoid SARE misfires is a slightly overzealous reaction to the spam problem, I think. However, it is possible if there's no sensible alternative. The second suggestion is only a workaround, not a fix, anyway, because spamassassin will still check http://updated.by as a uri.

> 3. If you are reluctant for the correct fix, drop the score on the uri_equals rule to 4 or maybe 3, depending on what else your report manages to hit.

I am reluctant to use the correct fix. Actually I'm inclined to think that the word correct is being misapplied here. I've changed the scores appropriately, though.

> 4. You could submit a Bugzilla on the parsing of that phrase. But frankly I consider the bug in the report generation, not SA's parsing of strange syntax.

The reason I didn't submit a bug was that I was not sure there was one - hence the original query. And I'm still not going to submit a bug, because I'm persuaded that there is not one.

What bothered me (and still does a bit) was that the string updated.by=anything matches a rule that looks for uris of the form http(s)://*=*. I.e. the http(s) is conjured out of nowhere for schemeless uris. I can see the point, but I thought it would be worth bringing a possible problem to light. It's a possible problem, not a bug per se, and the subsequent discussion shows that people take different views on the seriousness of this kind of parsing issue.

One thing that hasn't been mentioned in respect of this is that if spamassassin is looking aggressively for schemeless uris, it could in some cases create quite a lot of unwanted uri checking traffic.

I'm happy to stick with what I've got now. I've sent some examples off as indicated so that the SARE corpus will contain my mail in future.

Chris
Re: SARE_URI_EQUALS false positives
Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100: Is foo.tld=bar a valid hostname part in a URI? foo.tld=bar is a valid URL with foo.tld being the hostname and =bar being the query part. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
Re: SARE_URI_EQUALS false positives
List Mail User wrote on Mon, 26 Dec 2005 16:46:00 -0800 (PST): How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F inside of HTML? i.e. http://www.cnn.com/2003/ - from a phishing spam, the full line was: You mean it displayed like this in the mail agent *after* Q decoding and worked? Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
Re: SARE_URI_EQUALS false positives
Kai Schaetzl wrote:

> Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100:
>
>> Is foo.tld=bar a valid hostname part in a URI?
>
> foo.tld=bar is a valid URL with foo.tld being the hostname and =bar being the query part.

Are you sure? My understanding is that the query part must be in the url-path, so it must come after at least one slash. Something like

    scheme://[user[:password]@]host[:port]/[path[?queryargs]]

plus the fact that

    scheme://[user[:password]@]host[:port]

is the same as the one with a trailing slash, and absence of the scheme part assumes http.

Running http://www.google.com=test in my Firefox results in a DNS lookup error.
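The disagreement above can be checked against a generic URL parser. The following is only a sketch using Python's standard urllib.parse, which is neither what SpamAssassin nor what any browser uses; the helper name host_and_query is made up for illustration. It shows that '=' does not terminate the authority part of a URL, whereas '?' does.

```python
from urllib.parse import urlsplit

def host_and_query(url):
    """Return (hostname, query) as a generic URL parser sees them."""
    parts = urlsplit(url)
    return parts.hostname, parts.query

# '=' is not a delimiter, so it stays inside the host name.
print(host_and_query("http://www.google.com=test"))

# '?' ends the authority part even when no '/' precedes it.
print(host_and_query("http://www.google.com?test"))
```

With this parser the first URL yields the host name www.google.com=test and an empty query, which is consistent with the DNS lookup error reported above; the second yields the host www.google.com and the query test.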
Re: SARE_URI_EQUALS false positives
On Tue, Dec 27, 2005 at 09:17:09PM +0100, mouss wrote:

> are you sure? my understanding is that query part must be in the url-path, so must come after at least one slash. something like

I don't know about =bar, but if it were ?bar, many browsers will assume there's supposed to be a / before the ?. One could argue that =bar looks like quoted-printable encoding, but that's another discussion. ;)

--
Randomly Generated Tagline: I'm Bond ... Covalent Bond. - Farside Cartoon
Re: SARE_URI_EQUALS false positives
List Mail User wrote:

> How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F inside of HTML? i.e. http://www.cnn.com/2003/ - from a phishing spam, the full line was:
>
> =3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E

I thought these were only interpreted in quoted-printable, which SA can handle anyway. So I'm talking about the decoded body (not rawbody). A MUA that interprets '=' this way even in HTML mail would be seriously broken. I guess some MUAs will still guess that a message is in QP even if the header says it's plain text, but this should be handled by the decoder anyway.

> which itself was a continuation of a previous line. If you allow for more than just ASCII or UTF-8, there are quite a few words that can be built from the first six letters of the alphabet - and a much greater amount if we include elite-speak. The above example need not have been a phish using cnn.com, but just as easily could have been a spamvertised domain or have been valid non-spam HTML.
>
> Unfortunately the case of MUAs accepting non-standard (re. illegal) HTML constructs is the most common case (e.g. Outlook and OE as well as many more MUAs which *need* to be able to read the same emails under MS Win*). And still more cases of URIs exist, which are not parsed by SA, but can have constructs like these with embedded domain names (e.g. Message-ID: lines).
>
> Life would be much easier if all URIs were contained within '<' and '>' (as at least one standard requires). The problem is that sometimes '=' is a word break, and sometimes it is used as either a continuation or meta-character. Find a rule with a very good rate at disambiguating these cases (for example, an '=' as the final character on a line can probably almost never be ignored) and file a Bugzilla; I'm sure the developers would at least look at whatever you come up with.
>
> Remember to also handle '%', '#' and '$' while you're at it :-)

Well, one can find rules for the case of http://..., but it's hard to get ones when there is no scheme part, because, as you said before, it is hard to guess how foo REMOVESPACE REMOVESPACE something.example would be interpreted by the MUA (in http://foo REMOVESPACE ..., there is no ambiguity as foo is clearly part of the uri).

BTW, I have some rules to tag mail when the host part of the uri contains '' (I don't see why the hostname part should contain this). I wonder if I can just tag if the host part contains anything but \d\w\.-_:@? This would obviously reject encoded URIs. Is there enough ham where the hostname part is encoded?
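The host-part check proposed above could be prototyped outside SpamAssassin before writing a real rule. This is a minimal sketch in Python, not a SpamAssassin rule; the names ALLOWED_AUTHORITY and suspicious_host are hypothetical, and the character class is simply the one suggested in the message (digits, word characters, '.', '-', '_', ':' and '@').

```python
import re
from urllib.parse import urlsplit

# Characters accepted in the authority part, following the class suggested
# above. re.ASCII keeps \w limited to [A-Za-z0-9_].
ALLOWED_AUTHORITY = re.compile(r"[\w.\-:@]*", re.ASCII)

def suspicious_host(uri):
    """True if the authority part has a character outside the allowed set."""
    authority = urlsplit(uri).netloc
    return ALLOWED_AUTHORITY.fullmatch(authority) is None

print(suspicious_host("http://updated.by=Updated"))        # '=' in the host part
print(suspicious_host("http://www%2Eexample%2Ecom/"))      # encoded host part
print(suspicious_host("http://www.example.com/page?a=b"))  # '=' only in the query
```

As the message anticipates, a check like this flags every percent-encoded host name, and it would also flag bracketed IPv6 literals, so it would need testing against real ham before being given any score.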
Re: SARE_URI_EQUALS false positives
... List Mail User wrote on Mon, 26 Dec 2005 16:46:00 -0800 (PST): How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F inside of HTML? i.e. http://www.cnn.com/2003/ - from a phishing spam, the full line was: You mean it displayed like this in the mail agent *after* Q decoding and worked? Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com I'm old fashioned and use a command line MUA, but a quick check by copying the old email to a XP box and loading it into Outlook, shows that, yes, it does work there. Paul Shupak [EMAIL PROTECTED]
Re: SARE_URI_EQUALS false positives
List Mail User wrote:

> updated.by - check http://www.tld.by/cgi-bin/registry.cgi
> You'll see that update.by is a registered domain! Therefore updated.by is indeed a URI. QED

The question is: if foo.example-DEMUNGED is listed in uribl/surbl, does that make it a bad string in mail?

If it appears as http://somethin.foo.example-DEMUNGED, or even as a textual www.foo.example-DEMUNGED, we may consider it risky. But if it appears as:

    telnet smtp.foo.example-DEMUNGED

or

    Dec 26 23:41:53 bobo postfix/smtpd[7560]: connect from foo.example-DEMUNGED[192.0.2.56]
    ...

then checking *BLs is questionable. There are more chances to block someone reporting a spammy session or asking for help than seeing a spammer advertize his site via a log line...

I believe this is the most important issue that uribl encounters: is the URI used to advertize or is it an example/report/...? If we solve this, we'll feel very happy.
Re: SARE_URI_EQUALS false positives
Loren Wilton wrote:

>> Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such?
>
> Several.
>
> 1. Change your report generator to remove the extraneous dot between updated and by. Or change it to the more common underscore, if you insist on these words being connected for some reason.
> 2. Put spaces around the equal sign.
> 3. If you are reluctant for the correct fix, drop the score on the uri_equals rule to 4 or maybe 3, depending on what else your report manages to hit.
> 4. You could submit a Bugzilla on the parsing of that phrase. But frankly I consider the bug in the report generation, not SA's parsing of strange syntax.

Is foo.tld=bar a valid hostname part in a URI? I doubt that. Now, would a MUA show that as a URI followed by bar?

I think that SA should provide an option to enable/disable: uri_broken_mua, so that people not caring for broken MUAs can avoid such false positives.
Re: SARE_URI_EQUALS false positives
...

Mouss,

>> List Mail User wrote:
>>
>> updated.by - check http://www.tld.by/cgi-bin/registry.cgi
>> You'll see that update.by is a registered domain! Therefore updated.by is indeed a URI. QED
>
> the question is: if foo.example-DEMUNGED is listed in uribl/surbl, does that make it a bad string in mail?
>
> If it appears as http://somethin.foo.example-DEMUNGED, or even as a textual www.foo.example-DEMUNGED, we may consider it risky But if it appears as: telnet smtp.foo.example-DEMUNGED or Dec 26 23:41:53 bobo postfix/smtpd[7560]: connect from foo.example-DEMUNGED[192.0.2.56] ... then checking *BLs is questionable. There are more chances to block someone reporting a spammy session or asking for help than seeing a spammer advertize his site via a log line...
>
> I believe this is the most important issue that uribl encounters: is the URI used to advertize or is it an example/report/...? if we solve this, we'll feel very happy.

There are several parts to the answer, but the first and most important part can be phrased as: barring a special case, yes, a spam domain in mail is bad (period). Now, there are more than a few special cases. One immediate case is that no abuse@ email account should be doing content filtering. Another obvious case is that any person or mailing list which discusses spam needs to be whitelisted, set up to avoid filtering, or have some other action taken to configure it not to trip spam filters. The case you listed of an example/report would/should always come under these situations, but there are still others.

If you file a complaint with any party about an abuse situation, you should be prepared to have your own message quoted back to you (this one has to include organizations like ICANN, the internic, ARIN, RIPE, etc.). If you discuss spam or abuse with another person or on a list, again you should be prepared to be answered similarly (this case I have been guilty of forgetting more than once). There are still more possible cases that can be hard to expect - e.g. I recently got an email from a hosting service that I have locally BL'd which was sent addressed to customers (I am *not* one), but which I was copied on (I have spoken by telephone and email to the business' managers and staff on a few occasions) - fortunately they sent it to an account which is only used for certain types of complaints and communication, and which bypasses the BLs at the MTA level (still hits SA). Also, there are some companies/newsletters which may be on quite a few BLs, but are solicited mail at my site, so they *must* be whitelisted (at the MTA, in SA, in DCC, etc.).

If you accept requests for help (with abuse issues, or even allowing such things), you should either be using a dedicated account or be prepared to FP on the emails. (Yes, I know not everyone controls one or more domains and can not create special purpose accounts trivially.)

Even the simplest case of a bare domain name is clearly bad. How can you distinguish (without building/writing a natural language parser) the difference between saying "I got spam from example.com for ..." and "Copy example.com into your browser to see our specials..."? The second format is fairly common in spam. You could try to somehow score a bare name differently, but then what if it is embedded in a scripting language, HTML or obfuscated with character translations (e.g. %45xample%2Eco%4D or similar)? This kind of style can still be dangerous.

There are many examples of non-distributed rules (i.e. not part of SA distributions) which conflict with common styles of email writing and quoting (e.g. the SARE chicken-pox rules and large chunks of source code is a common example). Most of the standard SA rules are safe under normal conditions, but if some automated tool creates text containing a string which happens to be formatted the same as a spam domain, there will be a conflict (e.g. if updated.by were spammy - or even if a local rule penalized non-RFC compliant TLDs, since .by doesn't have a whois server, so any string with .by, .my, .de, .mx, etc. at its end could cause problems).

I don't think you can find any way to tell if something is actually advertising even if you did have a natural language parser. Consider the case where the mail contains an image of a watch, pills or scantily clad young woman, random text (not random words, but literary chaff) and a bare domain name. To a human it may be obvious what is happening, but you'd need a very complex recognizer to get a computer to know it was advertising. It could be a picture of your cousin with the poem she won a prize for and the domain it is published at, sent by a relative (example from mail I've actually received) or it could be an advertisement for child pornography. How can you tell (especially when it comes from a DUL host via a cable ISP)? Not an easy case, and not one I expect to be solved in my lifetime.

Paul
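The character-translation example mentioned above (%45xample%2Eco%4D) uses ordinary percent-encoding, so undoing it is mechanical. The snippet below is a small sketch with Python's standard urllib.parse.unquote; the helper name normalise_domain is made up for illustration, and nothing here is taken from SpamAssassin itself.

```python
from urllib.parse import unquote

def normalise_domain(text):
    """Undo %XX escapes and lower-case the result before any list lookup."""
    return unquote(text).lower()

print(normalise_domain("%45xample%2Eco%4D"))  # prints "example.com"
```

Decoding is the easy part. As the message argues, it still says nothing about whether the decoded name is being advertised or merely mentioned.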
Re: SARE_URI_EQUALS false positives
...

> Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would a MUA show that as a URI followed by bar?
>
> I think that SA should provide an option to enable/disable: uri_broken_mua, so that people not caring for broken MUAs can avoid such false positives.

How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F inside of HTML? i.e. http://www.cnn.com/2003/ - from a phishing spam, the full line was:

    =3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E

which itself was a continuation of a previous line. If you allow for more than just ASCII or UTF-8, there are quite a few words that can be built from the first six letters of the alphabet - and a much greater amount if we include elite-speak. The above example need not have been a phish using cnn.com, but just as easily could have been a spamvertised domain or have been valid non-spam HTML.

Unfortunately the case of MUAs accepting non-standard (re. illegal) HTML constructs is the most common case (e.g. Outlook and OE as well as many more MUAs which *need* to be able to read the same emails under MS Win*). And still more cases of URIs exist, which are not parsed by SA, but can have constructs like these with embedded domain names (e.g. Message-ID: lines).

Life would be much easier if all URIs were contained within '<' and '>' (as at least one standard requires). The problem is that sometimes '=' is a word break, and sometimes it is used as either a continuation or meta-character. Find a rule with a very good rate at disambiguating these cases (for example, an '=' as the final character on a line can probably almost never be ignored) and file a Bugzilla; I'm sure the developers would at least look at whatever you come up with.

Remember to also handle '%', '#' and '$' while you're at it :-)

Paul Shupak
[EMAIL PROTECTED]
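For readers who want to see what the quoted line decodes to, here is a short sketch using Python's standard quopri module. It only illustrates quoted-printable decoding in general; the helper name decode_qp and the www.example.com line are made up for illustration, and this is not how SpamAssassin or any particular MUA does it.

```python
import quopri

def decode_qp(raw):
    """Decode a quoted-printable byte string and return it as text."""
    return quopri.decodestring(raw).decode("ascii")

line = (b"=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
        b"=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E")
print(decode_qp(line))

# An '=' as the last character of a line is a soft line break:
# the decoder joins the two physical lines.
print(decode_qp(b"http://www.example=\n.com/"))
```

The first call prints the URL wrapped in angle brackets, and the second prints http://www.example.com/. That second case shows why a trailing '=' cannot simply be treated as a word break.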
Re: SARE_URI_EQUALS false positives
From: Chris Lear [EMAIL PROTECTED]

> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is therefore skewing the scoring of some mail quite badly. The weird thing is that the uris that spamassassin is complaining about aren't uris at all. The mail in question is auto-created reports of cvs diffs, so it's slightly unusual.
>
> I've tried to condense the debug information. Here it is:
>
> This is some of the output from spamassassin -D false_positive
>
> [16733] dbg: uri: parsed uri found, updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, updated.by=Mis
> [16733] dbg: uri: parsed uri found, http://updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
> [16733] dbg: uri: parsed uri found, updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
> [16733] dbg: uri: parsed uri found, http://updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
>
> These parsed uris are not links in the e-mail. They are just text. I've had a bit of a look at the regexps that spamassassin uses to work out what is a uri, and it seems that updated.by=Updated is treated as a uri because .by is a valid tld and spamassassin looks for schemeless uris, then prepends http:// for the tests.
>
> I'm running spamassassin 3.1.0 on perl 5.8.2.
>
> Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such?

Before you drop the score precipitously check if there is some other characteristic of the emails that trigger falsely which can be used to apply a negative score. If there is such a characteristic then generate the appropriate negative score. If not weigh how effective the rule is for you. The version of sa-stats.pl that is on the SARE site helps figure this out nicely.

That said it's close to a 50/50 rule that hits on very few messages here so should have a low score. (It hit on 6 messages out of 75000.) Cutting it out completely here seems like it would be effective TODAY. That could change. At one time it was quite necessary. Spammer fads change.)

{^_^}
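The behaviour Chris describes (find a bare host name that ends in a valid TLD, then prepend http://) can be imitated with a few lines of Python. This is a rough sketch for illustration only: it is not SpamAssassin's actual regular expression, and KNOWN_TLDS, SCHEMELESS and find_schemeless_uris are hypothetical names with a deliberately tiny TLD list.

```python
import re

# Assumption: a tiny stand-in for a real TLD list.
KNOWN_TLDS = ("by", "com", "org", "net")

SCHEMELESS = re.compile(
    r"(?<![\w@/.])"                # not inside a word, mail address or URL
    r"((?:[a-z0-9-]+\.)+(?:%s))"   # host name ending in a known TLD
    r"(?![a-z0-9.-])"              # the TLD must end here
    r"(\S*)"                       # whatever follows, up to whitespace
    % "|".join(KNOWN_TLDS),
    re.IGNORECASE,
)

def find_schemeless_uris(text):
    """Return candidate URIs with 'http://' prepended."""
    return ["http://" + m.group(1) + m.group(2)
            for m in SCHEMELESS.finditer(text)]

print(find_schemeless_uris("+updated.by=Updated"))
print(find_schemeless_uris("mail user@example.com about the properties file"))
```

The first call returns ['http://updated.by=Updated'], the same string that shows up in the debug output above, while the second returns an empty list. Any rule that then looks for '=' in a URI will fire on the first result, even though the text was never meant to be a link.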
Re: SARE_URI_EQUALS false positives
* jdow wrote (23/12/05 11:26): From: Chris Lear [EMAIL PROTECTED] I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is therefore skewing the scoring of some mail quite badly. The weird thing is that the uris that spamassassin is complaining about aren't uris at all. The mail in question is auto-created reports of cvs diffs, so it's slightly unusual. [...] I've had a bit of a look at the regexps that spamassassin uses to work out what is a uri, and it seems that updated.by=Updated is treated as a uri because .by is a valid tld and spamassassin looks for schemeless uris, then prepends http:// for the tests. I'm running spamassassin 3.1.0 on perl 5.8.2. Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such? Before you drop the score precipitously check if there is some other characteristic of the emails that trigger falsely which can be used to apply a negative score. If there is such a characteristic then generate the appropriate negative score. If not weigh how effective the rule is for you. The version of sa-stats.pl that is on the SARE site helps figure this out nicely. That said it's close to a 50/50 rule that hits on very few messages here so should have a low score. (It hit on 6 messages out of 75000.) Cutting it out completely here seems like it would be effective TODAY. That could change. At one time it was quite necessary. Spammer fads change.) I've reduced the score, and a quick check shows that that rule hits almost nothing anyway, so it's not a big problem. The bayes rules were keeping the false positives from doing much damage, anyway. But spamassassin uses uris for lots of things, and if it's commonly parsing (reasonably) normal text as uris, I would expect that to be a problem in more rules than just SARE_URI_EQUALS. Chris
Re: SARE_URI_EQUALS false positives
From: Chris Lear [EMAIL PROTECTED] * jdow wrote (23/12/05 11:26): From: Chris Lear [EMAIL PROTECTED] I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is therefore skewing the scoring of some mail quite badly. The weird thing is that the uris that spamassassin is complaining about aren't uris at all. The mail in question is auto-created reports of cvs diffs, so it's slightly unusual. [...] I've had a bit of a look at the regexps that spamassassin uses to work out what is a uri, and it seems that updated.by=Updated is treated as a uri because .by is a valid tld and spamassassin looks for schemeless uris, then prepends http:// for the tests. I'm running spamassassin 3.1.0 on perl 5.8.2. Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such? Before you drop the score precipitously check if there is some other characteristic of the emails that trigger falsely which can be used to apply a negative score. If there is such a characteristic then generate the appropriate negative score. If not weigh how effective the rule is for you. The version of sa-stats.pl that is on the SARE site helps figure this out nicely. That said it's close to a 50/50 rule that hits on very few messages here so should have a low score. (It hit on 6 messages out of 75000.) Cutting it out completely here seems like it would be effective TODAY. That could change. At one time it was quite necessary. Spammer fads change.) I've reduced the score, and a quick check shows that that rule hits almost nothing anyway, so it's not a big problem. The bayes rules were keeping the false positives from doing much damage, anyway. But spamassassin uses uris for lots of things, and if it's commonly parsing (reasonably) normal text as uris, I would expect that to be a problem in more rules than just SARE_URI_EQUALS. That is a standalone rule. 
And I do note that many of the SARE rules have severe problems in very specific cases. There are some mailing lists that are not well filtered for spam which have postings which trigger some of the too effective to toss SARE rules. I've developed some massive meta rules to at least partially get a handle on the problem. (A number of times XXX hit option would be nice to have for this.) {^_^}
Re: SARE_URI_EQUALS false positives
* jdow wrote (23/12/05 12:06): From: Chris Lear [EMAIL PROTECTED] * jdow wrote (23/12/05 11:26): From: Chris Lear [EMAIL PROTECTED] I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is therefore skewing the scoring of some mail quite badly. The weird thing is that the uris that spamassassin is complaining about aren't uris at all. The mail in question is auto-created reports of cvs diffs, so it's slightly unusual. [...] I've had a bit of a look at the regexps that spamassassin uses to work out what is a uri, and it seems that updated.by=Updated is treated as a uri because .by is a valid tld and spamassassin looks for schemeless uris, then prepends http:// for the tests. I'm running spamassassin 3.1.0 on perl 5.8.2. Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such? Before you drop the score precipitously check if there is some other characteristic of the emails that trigger falsely which can be used to apply a negative score. If there is such a characteristic then generate the appropriate negative score. If not weigh how effective the rule is for you. The version of sa-stats.pl that is on the SARE site helps figure this out nicely. That said it's close to a 50/50 rule that hits on very few messages here so should have a low score. (It hit on 6 messages out of 75000.) Cutting it out completely here seems like it would be effective TODAY. That could change. At one time it was quite necessary. Spammer fads change.) I've reduced the score, and a quick check shows that that rule hits almost nothing anyway, so it's not a big problem. The bayes rules were keeping the false positives from doing much damage, anyway. But spamassassin uses uris for lots of things, and if it's commonly parsing (reasonably) normal text as uris, I would expect that to be a problem in more rules than just SARE_URI_EQUALS. 
> That is a standalone rule. And I do note that many of the SARE rules have severe problems in very specific cases. There are some mailing lists that are not well filtered for spam which have postings which trigger some of the too effective to toss SARE rules. I've developed some massive meta rules to at least partially get a handle on the problem. (A number of times XXX hit option would be nice to have for this.)

Sorry to go on, but I wonder whether you've missed my point. The SARE_URI_EQUALS rule is working fine. It just looks in the uris that spamassassin gives it, and complains when they contain =. The problem is that spamassassin is treating things that aren't uris as uris. So SARE_URI_EQUALS is working on dud data.

In this specific case, the e-mail contains the text updated.by=Updated. This is not a uri, and nor should it be treated as one. But spamassassin thinks it is (because .by is a valid tld), so, as far as I can tell, *all* uri rules will check it. It so happens that SARE_URI_EQUALS hits in this case, but other uri rules are vulnerable to false positives if the uri parsing is wrong, aren't they?

Chris
Re: SARE_URI_EQUALS false positives
updated.by - check http://www.tld.by/cgi-bin/registry.cgi You'll see that update.by is a registered domain! Therefore updated.by is indeed a URI. QED Paul Shupak [EMAIL PROTECTED]
Re: SARE_URI_EQUALS false positives
Hello Chris,

Friday, December 23, 2005, 3:04:29 AM, you wrote:

CL> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
CL> therefore skewing the scoring of some mail quite badly.
...
CL> Does anyone have any suggestions, apart from simply reducing the
CL> score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there
CL> no way to guarantee that only real uris are parsed as such?

Send me a couple of sample emails with this problem so I can add them to my ham corpus, and SARE_URI_EQUALS will automagically drop its score to 1.666 (or lower). No SARE rule with ham scores more than 1.666.

Best is to put them into an mbox file, zip, and email.

Thanks.

Bob Menschel
Re: SARE_URI_EQUALS false positives
> Does anyone have any suggestions, apart from simply reducing the score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to guarantee that only real uris are parsed as such?

Several.

1. Change your report generator to remove the extraneous dot between updated and by. Or change it to the more common underscore, if you insist on these words being connected for some reason.
2. Put spaces around the equal sign.
3. If you are reluctant for the correct fix, drop the score on the uri_equals rule to 4 or maybe 3, depending on what else your report manages to hit.
4. You could submit a Bugzilla on the parsing of that phrase. But frankly I consider the bug in the report generation, not SA's parsing of strange syntax.

Loren