Re: SARE_URI_EQUALS false positives

2006-01-03 Thread Chris Lear
* Loren Wilton wrote (24/12/2005 00:23):
 Does anyone have any suggestions, apart from simply reducing the score
 for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
 guarantee that only real uris are parsed as such?
 
 Several.

Hi. Thanks for the response. I'm replying rather late due to pressures
of Christmas.

 
 1.Change your report generator to remove the extraneous dot between
 updated and by.  Or change it to the more common underscore, if you insist
 on these words being connected for some reason.
 
 2.Put spaces around the equal sign.

These are fine suggestions, but sadly not practical. The e-mails are
auto-generated diffs from cvs commits. The files being committed are
java properties files. In particular, the updated.by property contains
internationalised versions of the phrase Updated by. The more common
underscore would be unusual in the java properties file, and expecting
the developers to change the way they work to avoid SARE misfires is a
slightly overzealous reaction to the spam problem, I think. However, it
is possible if there's no sensible alternative.
The second suggestion is only a workaround, not a fix, anyway, because
spamassassin will still check http://updated.by as a uri.

 
 3.If you are reluctant for the correct fix, drop the score on the
 uri_equals rule to 4 or maybe 3, depending on what else your report manages
 to hit.

I am reluctant to use the correct fix. Actually I'm inclined to think
that the word correct is being misapplied here. I've changed the
scores appropriately, though.

 
 4.You could submit a Bugzilla on the parsing of that phrase.  But
 frankly I consider the bug in the report generation, not SA's parsing of
 strange syntax.

The reason I didn't submit a bug was that I was not sure there was one -
hence the original query. And I'm still not going to submit a bug,
because I'm persuaded that there is not one. What bothered me (and still
does a bit) was that the string updated.by=anything matches a rule
that looks for uris of the form http(s)://*=*. I.e. the http(s) is
conjured out of nowhere for schemeless uris. I can see the point, but I
thought it would be worth bringing a possible problem to light. It's a
possible problem, not a bug per se, and the subsequent discussion shows
that people take different views on the seriousness of this kind of
parsing issue. One thing that hasn't been mentioned in respect of this
is that if spamassassin is looking aggressively for schemeless uris, it
could in some cases create quite a lot of unwanted uri checking traffic.

I'm happy to stick with what I've got now. I've sent some examples off
as indicated so that the SARE corpus will contain my mail in future.

Chris


Re: SARE_URI_EQUALS false positives

2005-12-27 Thread Kai Schaetzl
Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100:

 Is foo.tld=bar a valid hostname part in a URI?

foo.tld=bar is a valid URL with foo.tld being the hostname and =bar 
being the query part.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: SARE_URI_EQUALS false positives

2005-12-27 Thread Kai Schaetzl
List Mail User wrote on Mon, 26 Dec 2005 16:46:00 -0800 (PST):

 How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F 
 inside of HTML?   i.e. http://www.cnn.com/2003/ - from a phishing spam, 
 the full line was:

You mean it displayed like this in the mail agent *after* Q decoding and 
worked?

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: SARE_URI_EQUALS false positives

2005-12-27 Thread mouss
Kai Schaetzl a écrit :
 Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100:
 
 
Is foo.tld=bar a valid hostname part in a URI?
 
 
 foo.tld=bar is a valid URL with foo.tld being the hostname and =bar 
 being the query part.
 

are you sure? my understanding is that query part must be in the
url-path, so must come after at least one slash. something like
scheme://[user[:password]@]host[:port]/[path[?queryargs]]
plus the fact that:
scheme://[user[:password]@]host[:port]  is the same as the one with a
trailing slash, and
absence of the scheme part assumes http.

running http://www.google.com=test on my firefox results in a dns lookup
error.
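
For comparison, a generic URL splitter shows the same behaviour as the Firefox test. This is a minimal sketch using Python's standard urllib.parse, not SpamAssassin's own (Perl) parser, which may well behave differently. Without a '/' or '?' after the host, the "=test" stays inside the authority part, which would explain the DNS lookup error:

```python
from urllib.parse import urlsplit

# No '/' or '?' after the host, so "=test" remains part of the authority.
no_slash = urlsplit("http://www.google.com=test")
print(no_slash.netloc)    # www.google.com=test
print(no_slash.query)     # (empty string)

# With '?', the query is split off even though there is no '/'.
with_query = urlsplit("http://foo.tld?bar")
print(with_query.netloc)  # foo.tld
print(with_query.query)   # bar
```

So under this splitter, foo.tld=bar is not "hostname plus query"; the whole string is treated as the host.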



Re: SARE_URI_EQUALS false positives

2005-12-27 Thread Theo Van Dinter
On Tue, Dec 27, 2005 at 09:17:09PM +0100, mouss wrote:
 are you sure? my understanding is that query part must be in the
 url-path, so must come after at least one slash. something like

I don't know about =bar, but if it were ?bar, many browsers will assume
there's supposed to be a / before the ?.

One could argue that =bar looks like quoted-printable encoding, but that's
another discussion. ;)

-- 
Randomly Generated Tagline:
I'm Bond ... Covalent Bond.- Farside Cartoon




Re: SARE_URI_EQUALS false positives

2005-12-27 Thread mouss
List Mail User a écrit :
 
   How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F
 inside of HTML?   i.e. http://www.cnn.com/2003/ - from a phishing spam,
 the full line was:
 
 =3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E
 

I thought these were only interpreted in quoted printable, which SA can
handle anyway. So I'm talking about decoded body (not rawbody). a MUA
that interprets '=' this way even in html mail would be seriously
broken. I guess some MUAs will still guess that a message is in QP even
if the header says it's plain text, but this should anyway be handled by
the decoder
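
To illustrate what the decoder step does with the quoted line, here is a small sketch using Python's standard quopri module. It only shows the quoted-printable decoding itself; it is not how SA or any particular MUA implements it:

```python
import quopri

# The quoted-printable line from the phishing example quoted in this thread.
raw = (b"=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
       b"=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E")

decoded = quopri.decodestring(raw).decode("ascii")
print(decoded)
# <http://www.cnn.com/2003/WORLD/africa/07/20/kenya.crash/index.html>
```

Once the body has been decoded like this, the '=' characters are gone and the URI is an ordinary one enclosed in angle brackets.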

 which itself was a continuation of a previous line.  If you allow for more
 than just ASCII or UTF-8, there are quite a few words that can be built
 from the first six letters of the alphabet - and a much greater amount if
 we include elite-speak.  The above example need not have been a phish
 using cnn.com, but just as easily could have been a spamvertised domain or
 have been valid non-spam HTML.  Unfortunately the case of MUAs accepting
 non-standard (re. illegal) HTML constructs is the most common case (e.g.
 Outlook and OE as well as many more MUAs which *need* to be able to read
 the same emails under MS Win*).  And still more cases of URIs exist, which
 are not parsed by SA, but can have constructs like these with embedded
 domain names (e.g. Message-ID: lines).  Life would be much easier if all
 URIs were contained within '<' and '>' (as at least one standard requires).
 
   The problem is that sometimes '=' is a word break, and sometimes
 it is used as either a continuation or meta-character.  Find a rule with
 a very good rate at disambiguating these cases (for example, an '=' as the
 final character on a line can probably almost never be ignored), and file
 a Bugzilla;  I'm sure the developers would at least look at whatever you
 come up with.  Remember to also handle '%', '#' and '$' while you're at
 it:-)
 

well, one can find rules for the case of http://..., but it's hard to
get ones when there is no scheme part. because as you said before, it is
hard to guess how foo REMOVESPACE  REMOVESPACE something.example
would be interpreted by the MUA (in http://foo REMOVESPACE ..., there
is no ambiguity as foo is clearly part of the uri).

BTW. I have some rules to tag mail when the host part of the uri
contains '' (I don't see why the hostname part should contain this). I
wonder if I can just tag if the host part contains any but \d\w\.-_:@?
This would obviously reject encoded URIs. Is there enough ham where the
hostname part is encoded?
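
A rough sketch of that host-part check, written in Python rather than as an SA rule. The function name is hypothetical, and restricting \w to ASCII is an assumption (it means internationalised host names would also be tagged):

```python
import re
from urllib.parse import urlsplit

# Anything other than letters, digits, '_', '.', '-', ':' and '@' is suspicious.
SUSPICIOUS = re.compile(r"[^\w.\-:@]", re.ASCII)

def host_part_is_suspicious(uri: str) -> bool:
    """Return True if the authority part of the URI has unexpected characters."""
    host_part = urlsplit(uri).netloc
    return bool(SUSPICIOUS.search(host_part))

print(host_part_is_suspicious("http://www.example.com/index.html"))  # False
print(host_part_is_suspicious("http://user:pw@example.com:8080/"))   # False
print(host_part_is_suspicious("http://updated.by=Mis"))              # True
print(host_part_is_suspicious("http://%77ww.example.com/"))          # True
```

As noted, the last case shows that percent-encoded host names would be tagged too, so the ham question still stands.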


Re: SARE_URI_EQUALS false positives

2005-12-27 Thread List Mail User
...
List Mail User wrote on Mon, 26 Dec 2005 16:46:00 -0800 (PST):

 How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F 
 inside of HTML?   i.e. http://www.cnn.com/2003/ - from a phishing spam, 
 the full line was:

You mean it displayed like this in the mail agent *after* Q decoding and 
worked?

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com


I'm old fashioned and use a command line MUA, but a quick check
by copying the old email to a XP box and loading it into Outlook, shows
that, yes, it does work there.

Paul Shupak
[EMAIL PROTECTED]


Re: SARE_URI_EQUALS false positives

2005-12-26 Thread mouss
List Mail User a écrit :
   updated.by - check http://www.tld.by/cgi-bin/registry.cgi
 
   You'll see that update.by is a registered domain!  Therefore
 updated.by is indeed a URI. QED

the question is: if foo.example-DEMUNGED is listed in uribl/surbl, does
that make it a bad string in mail?

If it appears as http://somethin.foo.example-DEMUNGED, or even as a
textual www.foo.example-DEMUNGED, we may consider it risky

But if it appears as:
telnet smtp.foo.example-DEMUNGED
or
Dec 26 23:41:53 bobo postfix/smtpd[7560]: connect from
foo.example-DEMUNGED[192.0.2.56]
...

then checking *BLs is questionable. There are more chances to block
someone reporting a spammy session or asking for help than seeing a
spammer advertize his site via a log line...

I believe this is the most important issue that uribl encounters: is the
URI used to advertize or is it an example/report/...? if we solve this,
we'll feel very happy.


Re: SARE_URI_EQUALS false positives

2005-12-26 Thread mouss
Loren Wilton a écrit :
Does anyone have any suggestions, apart from simply reducing the score
for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
guarantee that only real uris are parsed as such?
 
 
 Several.
 
 1.Change your report generator to remove the extraneous dot between
 updated and by.  Or change it to the more common underscore, if you insist
 on these words being connected for some reason.
 
 2.Put spaces around the equal sign.
 
 3.If you are reluctant for the correct fix, drop the score on the
 uri_equals rule to 4 or maybe 3, depending on what else your report manages
 to hit.
 
 4.You could submit a Bugzilla on the parsing of that phrase.  But
 frankly I consider the bug in the report generation, not SA's parsing of
 strange syntax.
 

Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would
a MUA show that as a URI followed by bar?

I think that SA should provide an option to enable/disable:
uri_broken_mua, so that people not caring for broken MUAs can avoid
such false positives.


Re: SARE_URI_EQUALS false positives

2005-12-26 Thread List Mail User
...
Mouss,

List Mail User a écrit :
  updated.by - check http://www.tld.by/cgi-bin/registry.cgi
 
  You'll see that update.by is a registered domain!  Therefore
 updated.by is indeed a URI. QED

the question is: if foo.example-DEMUNGED is listed in uribl/surbl, does
that make it a bad string in mail?

If it appears as http://somethin.foo.example-DEMUNGED, or even as a
textual www.foo.example-DEMUNGED, we may consider it risky

But if it appears as:
   telnet smtp.foo.example-DEMUNGED
or
   Dec 26 23:41:53 bobo postfix/smtpd[7560]: connect from
foo.example-DEMUNGED[192.0.2.56]
...

then checking *BLs is questionable. There are more chances to block
someone reporting a spammy session or asking for help than seeing a
spammer advertize his site via a log line...

I believe this is the most important issue that uribl encounters: is the
URI used to advertize or is it an example/report/...? if we solve this,
we'll feel very happy.


There are several parts to the answer, but the first and most
important part can be phrased as: barring a special case, yes, a spam
domain in mail is bad (period).

Now, there are more than a few special cases.  One immediate
case is that no abuse@ email account should be doing content filtering.
Another obvious case is that any person or mailing list which discusses
spam needs to be whitelisted, set up to avoid filtering, or have some other
action taken to configure it not to trip spam filters.  The case you listed of
an example/report would/should always come under these situations, but
there are still others;  If you file a complaint with any party about an
abuse situation, you should be prepared to have your own message quoted
back to you (this one has to include organizations like ICANN, the internic,
ARIN, RIPE, etc.).  If you discuss spam or abuse with another person or
on a list, again you should be prepared to be answered similarly (this
case I have been guilty of forgetting more than once).  There are still
more possible cases that can be hard to expect - e.g. I recently got
an email from a hosting service that I have locally BL'd which was sent
addressed to customers (I am *not* one), but which I was copied on (I have
spoken by telephone and email to the business' managers and staff on a few
occasions) - fortunately they sent it to an account which is only used for
certain types of complaints and communication, and which bypasses the BLs
at the MTA level (still hits SA).  Also, there are some companies/newsletters
which may be on quite a few BLs, but are solicited mail at my site, so they
*must* be whitelisted (at the MTA, in SA, in DCC, etc.).  If you accept
requests for help (with abuse issues, or even allowing such things), you
should either be using a dedicated account or be prepared to FP on the
emails.  (Yes, I know not everyone controls one or more domains and can
not create special purpose accounts trivially.)

Even the simplest case of a bare domain name is clearly bad.  How
can you distinguish (without building/writing a natural language parser)
the difference between saying "I got spam from example.com for ..." and
"Copy example.com into your browser to see our specials..."?  The second
format is fairly common in spam.  You could try to somehow score a bare
name differently, but then what if it is embedded in a scripting language,
HTML or obfuscated with character translations (e.g. %45xample%2Eco%4D or
similar);  This kind of style can still be dangerous.
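
As a small illustration of the character-translation case, Python's standard urllib.parse.unquote turns the obfuscated string back into the bare domain. This is only a sketch of the decoding step, not of how SA handles it:

```python
from urllib.parse import unquote

obfuscated = "%45xample%2Eco%4D"

decoded = unquote(obfuscated)
print(decoded)          # Example.coM
print(decoded.lower())  # example.com
```

Any filter that scores bare names differently would have to normalise such encodings first, or the obfuscated form slips through.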

There are many examples of non-distributed rules (i.e. not part of
SA distributions) which conflict with common styles of email writing and
quoting (e.g. the SARE chicken-pox rules and large chunks of source code
is a common example).  Most of the standard SA rules are safe under
normal conditions, but if some automated tool creates text containing a
string which happens to be formatted the same as a spam domain, there will
be a conflict (e.g. if updated.by were spammy - or even if a local rule
penalized non-RFC compliant TLDs, since .by doesn't have a whois server,
so any string with .by, .my, .de, .mx, etc. at its end could cause
problems).

I don't think you can find any way to tell if something is actually
advertising even if you did have a natural language parser.  Consider the
case where the mail contains an image of a watch, pills or scantily clad
young woman, random text (not random words, but literary chaff) and a bare
domain name.  To a human it may be obvious what is happening, but you'd need
a very complex recognizer to get a computer to know it was advertising;
It could be a picture of your cousin with the poem she won a prize for and
the domain it is published at, sent by a relative (example from mail I've
actually received) or it could be an advertisement for child pornography;
How can you tell (especially when it comes from a DUL host via a cable ISP)?

Not an easy case, and not one I expect to be solved in my lifetime.


Paul 

Re: SARE_URI_EQUALS false positives

2005-12-26 Thread List Mail User
...
Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would
a MUA show that as a URI followed by bar?

I think that SA should provide an option to enable/disable:
uri_broken_mua, so that people not caring for broken MUAs can avoid
such false positives.


How about the case of http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F
inside of HTML?   i.e. http://www.cnn.com/2003/ - from a phishing spam,
the full line was:

=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E

which itself was a continuation of a previous line.  If you allow for more
than just ASCII or UTF-8, there are quite a few words that can be built
from the first six letters of the alphabet - and a much greater amount if
we include elite-speak.  The above example need not have been a phish
using cnn.com, but just as easily could have been a spamvertised domain or
have been valid non-spam HTML.  Unfortunately the case of MUAs accepting
non-standard (re. illegal) HTML constructs is the most common case (e.g.
Outlook and OE as well as many more MUAs which *need* to be able to read
the same emails under MS Win*).  And still more cases of URIs exist, which
are not parsed by SA, but can have constructs like these with embedded
domain names (e.g. Message-ID: lines).  Life would be much easier if all
URIs were contained within '<' and '>' (as at least one standard requires).

The problem is that sometimes '=' is a word break, and sometimes
it is used as either a continuation or meta-character.  Find a rule with
a very good rate at disambiguating these cases (for example, an '=' as the
final character on a line can probably almost never be ignored), and file
a Bugzilla;  I'm sure the developers would at least look at whatever you
come up with.  Remember to also handle '%', '#' and '$' while you're at
it:-)


Paul Shupak
[EMAIL PROTECTED]


Re: SARE_URI_EQUALS false positives

2005-12-23 Thread jdow

From: Chris Lear [EMAIL PROTECTED]


I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
therefore skewing the scoring of some mail quite badly.
The weird thing is that the uris that spamassassin is complaining about
aren't uris at all. The mail in question is auto-created reports of cvs
diffs, so it's slightly unusual.
I've tried to condense the debug information. Here it is:

This is some of the output from spamassassin -D false_positive

[16733] dbg: uri: parsed uri found, updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, updated.by=Mis
[16733] dbg: uri: parsed uri found, http://updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
[16733] dbg: uri: parsed uri found, updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
[16733] dbg: uri: parsed uri found, http://updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated

These parsed uris are not links in the e-mail. They are just text.

I've had a bit of a look at the regexps that spamassassin uses to work
out what is a uri, and it seems that updated.by=Updated is treated as
a uri because .by is a valid tld and spamassassin looks for schemeless
uris, then prepends http:// for the tests.
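
The behaviour described above can be approximated with the following Python sketch. It is not SpamAssassin's actual regexp (SA is written in Perl); the TLD list, the pattern and the function names are all made up for illustration:

```python
import re

# Tiny illustrative TLD list; a real list would be much longer.
TLDS = ("by", "com", "net", "org")

# A dotted name ending in a known TLD, plus whatever non-space text follows.
SCHEMELESS = re.compile(
    r"(?<![\w@/.])((?:[a-z0-9-]+\.)+(?:%s))(?![a-z0-9-])(\S*)" % "|".join(TLDS),
    re.IGNORECASE,
)

def find_schemeless_uris(text):
    """Return schemeless matches with 'http://' prepended for later tests."""
    return ["http://" + m.group(1) + m.group(2) for m in SCHEMELESS.finditer(text)]

def uri_contains_equals(uri):
    """Stand-in for a rule that fires when a parsed uri contains '='."""
    return "=" in uri

print(find_schemeless_uris("updated.by=Updated"))    # ['http://updated.by=Updated']
print(find_schemeless_uris("updated.by = Updated"))  # ['http://updated.by']
print(find_schemeless_uris("updated_by=Updated"))    # []
```

In this sketch a properties line such as updated.by=Updated becomes http://updated.by=Updated and would trip the '=' check. Putting spaces around the '=' avoids that check but still leaves http://updated.by as a parsed uri.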

I'm running spamassassin 3.1.0 on perl 5.8.2.

Does anyone have any suggestions, apart from simply reducing the score
for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
guarantee that only real uris are parsed as such?


Before you drop the score precipitously check if there is some other
characteristic of the emails that trigger falsely which can be used to
apply a negative score. If there is such a characteristic then generate
the appropriate negative score. If not weigh how effective the rule is
for you. The version of sa-stats.pl that is on the SARE site helps
figure this out nicely.

That said it's close to a 50/50 rule that hits on very few messages
here so should have a low score. (It hit on 6 messages out of 75000.)
Cutting it out completely here seems like it would be effective TODAY.
That could change. At one time it was quite necessary. Spammer fads
change.

{^_^}



Re: SARE_URI_EQUALS false positives

2005-12-23 Thread Chris Lear
* jdow wrote (23/12/05 11:26):
 From: Chris Lear [EMAIL PROTECTED]
 
 I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
 therefore skewing the scoring of some mail quite badly.
 The weird thing is that the uris that spamassassin is complaining about
 aren't uris at all. The mail in question is auto-created reports of cvs
 diffs, so it's slightly unusual.

[...]
 
 I've had a bit of a look at the regexps that spamassassin uses to work
 out what is a uri, and it seems that updated.by=Updated is treated as
 a uri because .by is a valid tld and spamassassin looks for schemeless
 uris, then prepends http:// for the tests.
 
 I'm running spamassassin 3.1.0 on perl 5.8.2.
 
 Does anyone have any suggestions, apart from simply reducing the score
 for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
 guarantee that only real uris are parsed as such?
 
 Before you drop the score precipitously check if there is some other
 characteristic of the emails that trigger falsely which can be used to
 apply a negative score. If there is such a characteristic then generate
 the appropriate negative score. If not weigh how effective the rule is
 for you. The version of sa-stats.pl that is on the SARE site helps
 figure this out nicely.
 
 That said it's close to a 50/50 rule that hits on very few messages
 here so should have a low score. (It hit on 6 messages out of 75000.)
 Cutting it out completely here seems like it would be effective TODAY.
 That could change. At one time it was quite necessary. Spammer fads
 change.

I've reduced the score, and a quick check shows that that rule hits
almost nothing anyway, so it's not a big problem. The bayes rules were
keeping the false positives from doing much damage, anyway.
But spamassassin uses uris for lots of things, and if it's commonly
parsing (reasonably) normal text as uris, I would expect that to be a
problem in more rules than just SARE_URI_EQUALS.

Chris


Re: SARE_URI_EQUALS false positives

2005-12-23 Thread jdow

From: Chris Lear [EMAIL PROTECTED]

* jdow wrote (23/12/05 11:26):

From: Chris Lear [EMAIL PROTECTED]


I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
therefore skewing the scoring of some mail quite badly.
The weird thing is that the uris that spamassassin is complaining about
aren't uris at all. The mail in question is auto-created reports of cvs
diffs, so it's slightly unusual.


[...]


I've had a bit of a look at the regexps that spamassassin uses to work
out what is a uri, and it seems that updated.by=Updated is treated as
a uri because .by is a valid tld and spamassassin looks for schemeless
uris, then prepends http:// for the tests.

I'm running spamassassin 3.1.0 on perl 5.8.2.

Does anyone have any suggestions, apart from simply reducing the score
for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
guarantee that only real uris are parsed as such?


Before you drop the score precipitously check if there is some other
characteristic of the emails that trigger falsely which can be used to
apply a negative score. If there is such a characteristic then generate
the appropriate negative score. If not weigh how effective the rule is
for you. The version of sa-stats.pl that is on the SARE site helps
figure this out nicely.

That said it's close to a 50/50 rule that hits on very few messages
here so should have a low score. (It hit on 6 messages out of 75000.)
Cutting it out completely here seems like it would be effective TODAY.
That could change. At one time it was quite necessary. Spammer fads
change.


I've reduced the score, and a quick check shows that that rule hits
almost nothing anyway, so it's not a big problem. The bayes rules were
keeping the false positives from doing much damage, anyway.
But spamassassin uses uris for lots of things, and if it's commonly
parsing (reasonably) normal text as uris, I would expect that to be a
problem in more rules than just SARE_URI_EQUALS.


That is a standalone rule.

And I do note that many of the SARE rules have severe problems in very
specific cases. There are some mailing lists that are not well filtered
for spam which have postings which trigger some of the too effective
to toss SARE rules. I've developed some massive meta rules to at least
partially get a handle on the problem. (A number of times XXX hit option
would be nice to have for this.)

{^_^}



Re: SARE_URI_EQUALS false positives

2005-12-23 Thread Chris Lear
* jdow wrote (23/12/05 12:06):
 From: Chris Lear [EMAIL PROTECTED]
* jdow wrote (23/12/05 11:26):
 From: Chris Lear [EMAIL PROTECTED]
 
 I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
 therefore skewing the scoring of some mail quite badly.
 The weird thing is that the uris that spamassassin is complaining about
 aren't uris at all. The mail in question is auto-created reports of cvs
 diffs, so it's slightly unusual.
 
 [...]
 
 I've had a bit of a look at the regexps that spamassassin uses to work
 out what is a uri, and it seems that updated.by=Updated is treated as
 a uri because .by is a valid tld and spamassassin looks for schemeless
 uris, then prepends http:// for the tests.
 
 I'm running spamassassin 3.1.0 on perl 5.8.2.
 
 Does anyone have any suggestions, apart from simply reducing the score
 for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
 guarantee that only real uris are parsed as such?
 
 Before you drop the score precipitously check if there is some other
 characteristic of the emails that trigger falsely which can be used to
 apply a negative score. If there is such a characteristic then generate
 the appropriate negative score. If not weigh how effective the rule is
 for you. The version of sa-stats.pl that is on the SARE site helps
 figure this out nicely.
 
 That said it's close to a 50/50 rule that hits on very few messages
 here so should have a low score. (It hit on 6 messages out of 75000.)
 Cutting it out completely here seems like it would be effective TODAY.
 That could change. At one time it was quite necessary. Spammer fads
 change.
 
 I've reduced the score, and a quick check shows that that rule hits
 almost nothing anyway, so it's not a big problem. The bayes rules were
 keeping the false positives from doing much damage, anyway.
 But spamassassin uses uris for lots of things, and if it's commonly
 parsing (reasonably) normal text as uris, I would expect that to be a
 problem in more rules than just SARE_URI_EQUALS.
 
 That is a standalone rule.
 
 And I do note that many of the SARE rules have severe problems in very
 specific cases. There are some mailing lists that are not well filtered
 for spam which have postings which trigger some of the too effective
 to toss SARE rules. I've developed some massive meta rules to at least
 partially get a handle on the problem. (A number of times XXX hit option
 would be nice to have for this.)

Sorry to go on, but I wonder whether you've missed my point. The
SARE_URI_EQUALS rule is working fine. It just looks in the uris that
spamassassin gives it, and complains when they contain =.
The problem is that spamassassin is treating things that aren't uris as
uris. So SARE_URI_EQUALS is working on dud data.

In this specific case, the e-mail contains the text
updated.by=Updated. This is not a uri, and nor should it be treated as
one. But spamassassin thinks it is (because .by is a valid tld), so, as
far as I can tell, *all* uri rules will check it. It so happens that
SARE_URI_EQUALS hits in this case, but other uri rules are vulnerable to
false positives if the uri parsing is wrong, aren't they?

Chris


Re: SARE_URI_EQUALS false positives

2005-12-23 Thread List Mail User
updated.by - check http://www.tld.by/cgi-bin/registry.cgi

You'll see that update.by is a registered domain!  Therefore
updated.by is indeed a URI. QED

Paul Shupak
[EMAIL PROTECTED]


Re: SARE_URI_EQUALS false positives

2005-12-23 Thread Robert Menschel
Hello Chris,

Friday, December 23, 2005, 3:04:29 AM, you wrote:

CL I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
CL therefore skewing the scoring of some mail quite badly. ...

CL Does anyone have any suggestions, apart from simply reducing the
CL score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there
CL no way to guarantee that only real uris are parsed as such?

Send me a couple of sample emails with this problem so I can add them
to my ham corpus, and SARE_URI_EQUALS will automagically drop its
score to 1.666 (or lower).  No SARE rule with ham scores more than
1.666.

Best is to put them into an mbox file, zip, and email.  Thanks.

Bob Menschel





Re: SARE_URI_EQUALS false positives

2005-12-23 Thread Loren Wilton
 Does anyone have any suggestions, apart from simply reducing the score
 for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
 guarantee that only real uris are parsed as such?

Several.

1.Change your report generator to remove the extraneous dot between
updated and by.  Or change it to the more common underscore, if you insist
on these words being connected for some reason.

2.Put spaces around the equal sign.

3.If you are reluctant for the correct fix, drop the score on the
uri_equals rule to 4 or maybe 3, depending on what else your report manages
to hit.

4.You could submit a Bugzilla on the parsing of that phrase.  But
frankly I consider the bug in the report generation, not SA's parsing of
strange syntax.

Loren