sa-learn and modern spam sizes

2011-12-16 Thread Jonas
Hi all

 

I found out today that the reason a spammer was giving me pain when I tried 
to learn his spam mails as spam for Bayes is that they are too big.

 

There are a couple of hits on it on Google from various people having the same 
problem. I didn't find many answers, but it appears that the hardcoded limit is 
somewhere between 256 and 512 kilobytes.

 

These spam mails are all between 1-2 MB, so is there no way to learn them as 
spam with Bayes?

 

On top of that he's sending via Gmail, making it hard to use RBLs.

 

Looking forward to suggestions and clarification of the sa-learn limit (if it's 
hardcoded, I would strongly suggest making it configurable).

 

 

Med venlig hilsen / Best regards

 

Jonas Akrouh Larsen

 

TechBiz ApS

Laplandsgade 4, 2. sal

2300 København S

 

Office: 7020 0979

Direct: 3336 9974

Mobile: 5120 1096

Fax: 7020 0978

Web: http://www.techbiz.dk

 

 



Re: sa-learn and modern spam sizes

2011-12-16 Thread Robert Schetterer
On 16.12.2011 13:30, Jonas wrote:
 Hi all
 
  
 
 I found out today that the reason a spammer was giving me pain when I
 tried to learn his spam mails as spam for Bayes is that they are too big.
 
  
 
 There are a couple of hits on it on Google from various people having
 the same problem. I didn't find many answers, but it appears that the
 hardcoded limit is somewhere between 256 and 512 kilobytes.

man spamc

 -s max_size, --max-size=max_size
     Set the maximum message size which will be sent to spamd -- any
     bigger than this threshold and the message will be returned
     unprocessed (default: 500 KB).  If spamc gets handed a message
     bigger than this, it won't be passed to spamd.  The maximum
     message size is 256 MB.


 
  
 
 These spam mails are all between 1-2MB, so is there no way to learn this
 as spam with bayes?
 

I guess this isn't a big problem; it may consume machine power and time,
but the gurus should know more.

  
 
 On top of that he's sending via Gmail, making it hard to use RBLs.
 
  
 
 Looking forward to suggestions and clarification of the sa-learn limit
 (if it's hardcoded, I would strongly suggest making it configurable).
 
  
 
  
 
 


-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


Re: Apply Bayes learning to all users?

2011-12-16 Thread RW
On Fri, 16 Dec 2011 08:54:36 +0100
Benny Pedersen wrote:

 On Fri, 16 Dec 2011 06:30:31 +, Martin Hepworth wrote:
  Create a shared IMAP or similar email account with a spam and a ham
  folder for users to drag email into (not forward, as that breaks
  headers in things like Outlook)
 
 yes, here I found dovecot-antispam helpful in that way 

I think you've both misread the question. The OP wants to use spamtrap
mail to train the individual user Bayes accounts.


The best way to do this would be to use the global database to adjust
the probabilities for low count tokens in the user database. Nothing
like that is supported.
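For illustration only, such a global-to-user blend might look like Robinson's degree-of-belief weighting. Nothing like this exists in SpamAssassin; the function name and parameters below are hypothetical:

```python
def blended_spam_probability(user_prob: float, user_count: int,
                             global_prob: float, strength: float = 1.0) -> float:
    """Shrink a per-user token probability toward the global estimate.

    With few user observations the global probability dominates; as
    user_count grows, the user's own estimate takes over (the classic
    s/(s+n) degree-of-belief weighting).
    """
    return (strength * global_prob + user_count * user_prob) / (strength + user_count)

# A token seen only once by the user stays close to the global estimate:
print(blended_spam_probability(0.99, 1, 0.50))    # 0.745
# A well-observed token is dominated by the user's own data:
print(blended_spam_probability(0.99, 100, 0.50))  # ~0.985
```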

Doing it via sa-learn sounds like more trouble than it's worth. It's
probably a good thing for high volume accounts, but swamping low
volume accounts may make things worse. 


RE: sa-learn and modern spam sizes

2011-12-16 Thread Jonas
  There are a couple of hits on it on Google from various people having
  the same problem. I didn't find many answers, but it appears that the
  hardcoded limit is somewhere between 256 and 512 kilobytes.
 
 man spamc
 
  -s max_size, --max-size=max_size
      Set the maximum message size which will be sent to spamd -- any
      bigger than this threshold and the message will be returned
      unprocessed (default: 500 KB).  If spamc gets handed a message
      bigger than this, it won't be passed to spamd.  The maximum
      message size is 256 MB.
 
 
I do not use spamd/spamc, but the SpamAssassin Perl module.

So is there no way to get around this?


Med venlig hilsen / Best regards
 
Jonas Akrouh Larsen
 
TechBiz ApS
Laplandsgade 4, 2. sal
2300 København S
 
Office: 7020 0979
Direct: 3336 9974
Mobile: 5120 1096
Fax:    7020 0978
Web: www.techbiz.dk




SA Sorbs Usage/Rules

2011-12-16 Thread Lutz Petersen

I know some of the discussions in the past about usage of Sorbs RBLs
in Spamassassin. The scores today are as follows:

score RCVD_IN_SORBS_BLOCK 0 # n=0 n=1 n=2 n=3
score RCVD_IN_SORBS_DUL 0 0.001 0 0.001 # n=0 n=2
score RCVD_IN_SORBS_HTTP 0 2.499 0 0.001 # n=0 n=2
score RCVD_IN_SORBS_MISC 0 # n=0 n=1 n=2 n=3
score RCVD_IN_SORBS_SMTP 0 # n=0 n=1 n=2 n=3
score RCVD_IN_SORBS_SOCKS 0 2.443 0 1.927 # n=0 n=2
score RCVD_IN_SORBS_WEB 0 0.614 0 0.770 # n=0 n=2
score RCVD_IN_SORBS_ZOMBIE 0 # n=0 n=1 n=2 n=3

The zero scores for DUL were set because a lot of people thought there
were too many false positives from it (I don't see it that way, but OK).
Another argument for zero-scoring or not using SORBS was that the RBL
contains a lot of old (no longer current) entries in the spam
section (given the delist policy). OK.

But today I took a deeper look at the SORBS RBLs and found that
there is a very simple misconfiguration in the SA rules. The RBL
check is done against the big 'dnsbl.sorbs.net' zone:
eval:check_rbl('sorbs', 'dnsbl.sorbs.net.')

And _that_, in my opinion, is wrong. The RBL lookup should be done
against 'safe.dnsbl.sorbs.net' instead. This RBL is a compilation of
most of the same SORBS partial lists as dnsbl.sorbs.net, with one
simple difference: unlike dnsbl.sorbs.net, it does not contain the
'recent.spam' and 'old.spam' partial lists. The only spam listed in
'safe.dnsbl.sorbs.net' is spam from the last 24 hours, so the
arguments against using SORBS specifically because of its spam
delisting policy do not apply. One could simply change the RBL lookup
to the right zone and then also give spam hits in that RBL a (low)
score.

A description of the different SORBS partial zones and the
aggregate zones is here: https://www.sorbs.net/using.shtml
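For concreteness, the change being proposed might look roughly like this in the SA rules. This is only a sketch: the rule names and the DUL return code (127.0.0.10) are taken from the stock 20_dnsbl.cf conventions and should be verified against the current rule set and SORBS documentation before use:

```
header   __RCVD_IN_SORBS    eval:check_rbl('sorbs-lastexternal', 'safe.dnsbl.sorbs.net.')
header   RCVD_IN_SORBS_DUL  eval:check_rbl_sub('sorbs-lastexternal', '127.0.0.10')
describe RCVD_IN_SORBS_DUL  SORBS: sent directly from dynamic IP address
tflags   RCVD_IN_SORBS_DUL  net
```

The only change from the stock rules is the zone in check_rbl; the per-return-code sub-rules would be unaffected.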



Re: SA Sorbs Usage/Rules

2011-12-16 Thread Kevin A. McGrail

  
  
Interesting.  Will cross-post to dev and see if anyone has some input.

On 12/16/2011 12:22 PM, Lutz Petersen wrote:

  
I know some of the discussions in the past about usage of Sorbs RBLs
in Spamassassin. [...]




-- 
Kevin A. McGrail
President

Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422

http://www.pccc.com/

703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
kmcgr...@pccc.com



Re: sa-learn and modern spam sizes

2011-12-16 Thread RW
On Fri, 16 Dec 2011 12:06:15 -0500
Kevin A. McGrail wrote:

 
  There are a couple of hits on it on google with various people
  having the same problem, I didn't find much answers but it
  appears that the hardcoded limit is somewhere between
  256-512kilobyte?
  man spamc
 
-s max_size, --max-size=max_size
  Set the maximum message size which will be sent to
  spamd -- any bigger than this threshold and the message will be
  returned unprocessed (default: 500 KB).  If spamc gets handed a
  message bigger than this, it won't be passed to spamd.  The
  maximum message size is 256 MB.
 
 
  I do not use spamd/spamc. But the perl module of spamassassin.
 
  So is there no way to get around this?
 Hmm.  I didn't think SA had a limit internally.  Normally, you
 utilize a limit on spamc (-s/--max-size) or in procmail, such as a
 "* < 524288" condition.
 
 But if you call SA directly as an API, I don't think there is a
 limit. You might want to post this on the dev list.

It's an optional limit in ArchiveIterator.pm. It's turned on in
sa-learn.
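One pragmatic workaround, if the limit can't be lifted: truncate oversized messages below the cutoff before training, since Bayes draws most of its tokens from the headers and early body. A sketch only; the 256 KB figure and the idea of feeding the result to sa-learn are assumptions drawn from this thread, not official SpamAssassin guidance:

```python
MAX_BYTES = 256 * 1024  # assumed sa-learn/ArchiveIterator cutoff


def truncate_for_learning(raw: bytes, limit: int = MAX_BYTES) -> bytes:
    """Keep the full header block and as much body as fits under the limit."""
    if len(raw) <= limit:
        return raw
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:  # fall back for bare-LF messages
        head, sep, body = raw.partition(b"\n\n")
    # Always keep the complete headers; trim only the body to fit.
    budget = max(0, limit - len(head) - len(sep))
    return head + sep + body[:budget]
```

The truncated bytes could then be written to a temp file and handed to `sa-learn --spam` as usual.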


Re: SA Sorbs Usage/Rules

2011-12-16 Thread darxus
On 12/16, Lutz Petersen wrote:
 
 I know some of the discussions in the past about usage of Sorbs RBLs
 in Spamassassin. [...]

After digging into this a bit, I believe your entire objection is to the
default rule set not handling the 127.0.0.6 return code, used by the
following lists?

  new.spam.dnsbl.sorbs.net         127.0.0.6
  recent.spam.dnsbl.sorbs.net      127.0.0.6
  old.spam.dnsbl.sorbs.net         127.0.0.6
  spam.dnsbl.sorbs.net             127.0.0.6
  escalations.dnsbl.sorbs.net      127.0.0.6
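For anyone checking these by hand: a DNSBL query reverses the IP's octets and prepends them to the zone, and the answer encodes the listing reason. The sketch below builds the query name and carries the code-to-list mapping discussed in this thread (the mapping is an assumption here and should be verified against SORBS documentation):

```python
SORBS_RETURN_CODES = {  # assumed mapping, per this thread
    "127.0.0.6": "spam (new.spam / recent.spam / escalations in the safe zone)",
    "127.0.0.10": "DUL (dynamic IP space)",
}


def dnsbl_query_name(ip: str, zone: str = "safe.dnsbl.sorbs.net") -> str:
    """Reverse the octets of an IPv4 address and append the DNSBL zone."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone


# The conventional DNSBL test address:
print(dnsbl_query_name("127.0.0.2"))  # 2.0.0.127.safe.dnsbl.sorbs.net
```

The resulting name can be resolved with any DNS tool; an answer of 127.0.0.6 would indicate a hit on one of the spam sub-lists above.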

The rule for that return code is commented out in the default rule set with
this comment:

# delist: $50 fee for RCVD_IN_SORBS_SPAM, others have free retest on request

Which seems likely to have resulted from this bug:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=2221

Lists returning the 127.0.0.6 code in the safe.dnsbl.sorbs.net aggregate
zone are:

new.spam.dnsbl.sorbs.net
recent.spam.dnsbl.sorbs.net
escalations.dnsbl.sorbs.net

new.spam is only hosts from the last 48 hours.
recent.spam is hosts from the last 28 days.
escalations doesn't seem to have a time limit.

So it seems your statement that "the only spam listed in
'safe.dnsbl.sorbs.net' contains spam of the last 24 hours" is incorrect.

Basically, without evidence that money is not charged to be delisted from
any of those three lists, they're going to stay out of the default rule set.


With the currently enabled default rules, there would be *no* difference
if you changed from dnsbl.sorbs.net to safe.dnsbl.sorbs.net, because we're
not using the lists as an aggregate (we don't have only a RCVD_IN_SORBS
rule) but have separate rules for each of the return codes.  And there
is no difference in which lists provide which return codes between
those two aggregate lists, other than the 127.0.0.6 (spam) value (which is
disabled).


Also, I wouldn't say the 0 scores were set because "a lot of people thought
there were too many false positives".  The scores are flagged
as mutable, meaning optimal scores are generated daily
using masscheck data.  Related statistics can be seen here:
http://ruleqa.spamassassin.org/?daterev=20111210&rule=%2Fsorbs
RCVD_IN_SORBS_DUL seems to have a decent hit rate for both spam and ham, so
somehow the score generator just decided the most spams would be caught
without exceeding 1 false positive in 2500 hams with that score.  It's not
always clear what exactly it's thinking.  It could be, for example,
that almost all of the spam hits from RCVD_IN_SORBS_DUL overlapped with
another blacklist, and the SORBS_DUL list caused more false positives
than that other blacklist, so that other blacklist got a decent score,
and SORBS_DUL didn't.  But these scores do not come from the whims
of humans.

-- 
Anarchy is based on the observation that since few are fit to rule
themselves, even fewer are fit to rule others. -Edward Abbey
http://www.ChaosReigns.com


Re: sa-learn and modern spam sizes

2011-12-16 Thread Joseph Brennan



The maximum message size is 256 MB.



I've never seen spam larger than 3 MB.

Joseph Brennan
Lead Email Systems Engineer
Columbia University Information Technology



Re: SA Sorbs Usage/Rules

2011-12-16 Thread Noel Butler
On Fri, 2011-12-16 at 13:57 -0500, dar...@chaosreigns.com wrote:


 Basically, without evidence money is not charged to be delisted from any
 of those three lists, they're going to stay out of the default rule set.
 


Plenty of people can attest to the fact that there is no payment taking
place; it's just a scare tactic to coerce admins to act rather than
ignore it and hope it sorts itself out. I don't use DNSBLs in SA myself; I
use them in the MTA (frankly, where they belong).
At least under the control of its original owner there wasn't, anyway.
Yes, we, like most large ISPs, had the odd outbound SMTP server listed
with them a couple of times. Typically we were alerted to the listing
quickly (by use of mon); a login to the SORBS site gave us the info, the
culprit was identified, and we were delisted within hours. Only once did
it take about 24 hours, and IIRC that was a holiday season. Happy to say
I've not had any of my servers listed anywhere that I know of since 2005.

Lastly, I would have thought the SA dev team would like to see hard
evidence that someone was _forced_ to pay the $50 donation to be
delisted, because all I hear is "the web site says it", which frankly
doesn't cut it with me. We were nobody special to SORBS, so I can't see
why they'd remove us for free but forcibly demand payments from others;
the only common ground we had with Matt back then was that we were both
located in the same city, along with 2 million others.





Re: sa-learn and modern spam sizes

2011-12-16 Thread Noel Butler
On Fri, 2011-12-16 at 18:17 -0500, Joseph Brennan wrote:

  The maximum message size is 256 MB.
 
 
 I've never seen spam larger than 3 MB.
 


About 3 years ago(?), remember all the PDF spam? SA caught some that
were about 5 MB, but yes, on the whole it is rather rare for spam to be
more than a few KB.






Re: Apply Bayes learning to all users?

2011-12-16 Thread Steve Freitas

On 12/16/11 05:53, RW wrote:

[...]

I think you've both misread the question. The OP wants to use spamtrap
mail to train the individual user Bayes accounts.


The best way to do this would be to use the global database to adjust
the probabilities for low count tokens in the user database. Nothing
like that is supported.

Doing it via sa-learn sounds like more trouble than it's worth. It's
probably a good thing for high volume accounts, but swamping low
volume accounts may make things worse.


Thanks RW, you understood the question correctly. I'll take a look at 
those suggestions.


Steve