Re: spamassassin learn spam

2020-05-08 Thread support
Hi Harald,

Yes, I execute this as the root user.

But as which user do I have to execute sa-learn --spam if I use amavisd?

By default you can't switch to the amavis user to execute the learn command.


cheers


Re: spamassassin learn spam

2020-05-08 Thread Matus UHLAR - fantomas

On 08.05.20 09:27, supp...@mmarzouki.de wrote:

I have SpamAssassin on my CentOS 7 system.
Sometimes I receive spam mails and I would like to train on these mails as spam
with sa-learn --spam.

But it doesn't seem to work, because the spam score is the same before and after.

What I did:

I have a spam mail in my inbox in Maildir format.
When I check the spam score with spamassassin < $spam_mail, I get a score
of 1.0.

I learned this mail as spam with sa-learn --spam $spam_mail and the system
confirmed this.
If I check the mail again with spamassassin, I get the same score of 1.0.

Is this normal? I think the score should be over the required spam score.


You need to train at least 200 spams and 200 hams before Bayes starts kicking in.

Your system may use a different Bayes database, e.g. systems using amavis
use the Bayes database in the amavis user's directory.

One spam sometimes may not be enough to change the resulting score.

Post your X-Spam headers.

I have added these lines to my user_prefs file:

add_header  all Report  _REPORT_
add_header  all Languages   _LANGUAGES_
add_header  all tokens-spam _SPAMMYTOKENS(25,short)_
add_header  all tokens-ham  _HAMMYTOKENS(25,short)_
add_header  all tokens-sum  _TOKENSUMMARY_
add_header  all countries   _RELAYCOUNTRY_

My X-Spam-Report contains lines like these, which show how my Bayes works:

   * -0.0 BAYES_20 BODY: Bayes spam probability is 5 to 20%
   *  [score: 0.1545]
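
To check whether a database has reached the 200 spam / 200 ham minimum mentioned above, one can read the nspam and nham counters from the output of sa-learn --dump magic. The following is a minimal Python sketch, assuming the "non-token data" line layout that sa-learn prints; the function names and the sample counts are made up for illustration and are not part of SpamAssassin.

```python
import re

MIN_MESSAGES = 200  # per-class minimum before Bayes scoring is used, per the thread

# Matches lines such as: "0.000  0 995764  0  non-token data: nspam"
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)

def parse_magic_counts(dump_text):
    """Extract the nspam/nham counters from `sa-learn --dump magic` output."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

def bayes_is_active(counts, minimum=MIN_MESSAGES):
    """True once both classes have reached the minimum number of learned messages."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum

# Hypothetical sample output (the counts are invented for this example).
sample = """0.000  0 3  0  non-token data: bayes db version
0.000  0 150  0  non-token data: nspam
0.000  0 420  0  non-token data: nham
"""

counts = parse_magic_counts(sample)
print(counts, "Bayes active:", bayes_is_active(counts))
```

Run the dump as the same user that the scanner uses (for example the amavis user), otherwise the counters come from a different database than the one that scores the mail.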

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
On the other hand, you have different fingers.


Re: Spamassassin Learn

2006-02-08 Thread jdow

From: Gene Heskett [EMAIL PROTECTED]



On Tuesday 07 February 2006 15:27, Clay Davis wrote:

Does anyone have any good techniques for capturing a sample of ham
that can be used as the ham corpus.  I'm in a corporate environment
and am not keen on the idea of intercepting non-spam messages.  I
will if I have to, but was hoping someone had a better idea.

I wouldn't have too guilty a conscience on that subject because 
generally, you won't be reading very much other than intercepted spam.  
There may be an FP in there occasionally, but you'll soon learn to 
catch those and feed them to the ham learner, and hence move them to the 
correct mailbox folder.  In other words, to make an omelette, you 
normally have to break a few eggs.  What you accidentally read in an FP 
should be treated with the usual amount of salt and otherwise 
forgotten.


Intercept some ham, feed it through SpamAssassin's sa-learn, forget to store
it on the way out. You don't have to know WHAT you trained with. You just
have to know it's ham.

Now, if you are in a corporate environment and don't have a strong email
policy you'd best do that first. Then you can sample the email, with
some discretion, legally and properly to get a test set of ham messages.
It MAY even be good corporate policy to save, for at least a short time,
all incoming and outgoing emails. 3 months to 6 months may be OK. This
will be handy if an employee is caught engaging in illegal activities
and must be terminated for cause, for example. Just make sure that the
company has a firm and clear email policy with regard to permissible
uses and notify the employees that the company reserves the right to
read emails in and out. If you don't, your company could face some
interesting times if the fit hits the shan.

{^_^}


Regards,
Clay


On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED],
Matt Kettler


[EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote:

Can you just feed spamassassin spam or do you need to give it ham
also?

I read the docs and it didn't say you had to feed it ham.

I then read another doc and it said you should feed it equal
amounts of spam and ham.


Yes, you really should feed it both. You also should strive for a
1:1 ratio of
spam and nonspam, but don't kill yourself to get there.

SA's use of chi-squared combining makes it very tolerant of wild
imbalances in
training. However, the closer you are to a 1:1 ratio the better SA
will be able
to distinguish tokens that are present in both kinds of mail and
ignore them. So
this is a worthwhile goal to strive for as long as it doesn't become
a burden.

My current training ratio is about 7:1 spam:nonspam, but in the past
it's been
as bad as 20:1. Both of those are very far off from equal amounts,
but the imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:
0.000  0 995764  0  non-token data: nspam
0.000  0 145377  0  non-token data: nham

That works out to a ratio of 6.85:1


--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules.  I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.


Re: Spamassassin Learn

2006-02-08 Thread Matt Kettler
jdow wrote:
 From: Matt Kettler [EMAIL PROTECTED]

 (Note I use mailscanner, hence the odd log syntax)

 grep "is spam," /var/log/maillog |wc -l
   3434
 grep "is spam," /var/log/maillog|grep autolearn=spam |wc -l
   2766
 grep "is spam," /var/log/maillog|grep "autolearn=not spam" | wc -l
  0

snip
 
 I wonder if he has greylisting turned on.

I do, I don't know about Jim.

(Note: my greylisting configuration isn't entirely conventional. I only greylist
certain hosts using regex rules in milter-greylist's ACLs. I greylist APNIC,
LACNIC, dynamic-looking hostnames, and hosts with no RDNS.)


Re: Spamassassin Learn

2006-02-08 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 10:05:05PM -0500, Matt Kettler wrote:
 For reference, these are the only rules in a stock  SA 3.1.0 that can
 give you a negative learning score:
 
 score HABEAS_ACCREDITED_COI 0 -8.0 0 -8.0
 score RCVD_IN_BSP_TRUSTED 0 -4.3 0 -4.3
 score HABEAS_ACCREDITED_SOI 0 -4.3 0 -4.3
 score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
 score RCVD_IN_IADB_VOUCHED 0 -1.825 0 -2.200
 score HABEAS_CHECKED 0 -0.2 0 -0.2
 score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
 score NO_RELAYS -0.001
 score NO_RECEIVED -0.001
 score DK_VERIFIED -0.001
 score SPF_PASS -0.001
 score SPF_HELO_PASS -0.001
 
 score HASHCASH_20 -0.500
 score HASHCASH_21 -0.700
 score HASHCASH_22 -1.000
 score HASHCASH_23 -2.000
 score HASHCASH_24 -3.000
 score HASHCASH_25 -4.000
 score HASHCASH_HIGH -5.000

The hashcash scores don't seem to be triggering learning for me, for
some reason...

X-Spam-Status: No, score=-5.0 required=5.0 tests=AWL,BAYES_00,
    FORGED_RCVD_HELO,HASHCASH_HIGH autolearn=no version=3.1.0

grep threshold .spamassassin/user_prefs|grep -v '#'
bayes_auto_learn_threshold_spam 5.0
bayes_auto_learn_threshold_nonspam 0.0

Or is that because of the AWL rule?
-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?


Re: Spamassassin Learn

2006-02-08 Thread Matt Kettler
Jim C. Nasby wrote:
 
 The hashcash scores don't seem to be triggering learning for me, for
 some reason...

They generally won't. Three things must happen for hashcash to fire:

1) you need a loadplugin Mail::SpamAssassin::Plugin::Hashcash command in your
init.pre

2) you need a hashcash_accept command with the recipient address in your config
files.

3) the sender needs to generate a hashcash hash when sending the message.

It sounds like you've done 1, but you probably haven't done 2.

 
 X-Spam-Status: No, score=-5.0 required=5.0 tests=AWL,BAYES_00,
     FORGED_RCVD_HELO,HASHCASH_HIGH autolearn=no version=3.1.0
 
 grep threshold .spamassassin/user_prefs|grep -v '#'
 bayes_auto_learn_threshold_spam 5.0
 bayes_auto_learn_threshold_nonspam 0.0
 
 Or is that because of the AWL rule?

Hmm, well, you have to ignore the AWL and BAYES scores when figuring out the
autolearner.. So you have FORGED_RCVD_HELO and HASHCASH_HIGH..

That would leave a score of -4.865, which confused me for a bit...

However, looking in the config files, HASHCASH rules have the userconf flag.
This means that the Autolearner will ignore these rules too, as SA will
treat them as a user-configured whitelist.


So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
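
The arithmetic above can be reproduced with a small Python sketch. It is a simplified model of the autolearner and not SpamAssassin's actual implementation: it skips AWL, BAYES_* and HASHCASH_* rules by name instead of consulting the real rule flags. The 0.135 score for FORGED_RCVD_HELO is inferred from the -4.865 total quoted in this message, and the function names are invented for the example.

```python
# Rule scores taken from this thread. FORGED_RCVD_HELO's 0.135 is inferred
# from the -4.865 total (0.135 + -5.0), not read from a rules file.
SCORES = {
    "FORGED_RCVD_HELO": 0.135,
    "HASHCASH_HIGH": -5.0,
}

def ignored_for_autolearn(rule):
    """Simplified stand-in for SA's rule flags: rules the autolearner skips."""
    return (
        rule == "AWL"
        or rule.startswith("BAYES_")
        or rule.startswith("HASHCASH_")  # userconf-flagged rules in this thread
    )

def autolearn_score(tests, scores=SCORES):
    """Sum the scores of the rules that still count towards autolearning."""
    return sum(
        scores.get(rule, 0.0) for rule in tests if not ignored_for_autolearn(rule)
    )

tests = ["AWL", "BAYES_00", "FORGED_RCVD_HELO", "HASHCASH_HIGH"]
print(round(autolearn_score(tests), 3))
```

With bayes_auto_learn_threshold_nonspam set to 0.0, a learner score of +0.135 is above the ham threshold, which is consistent with the autolearn=no in the header.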


Re: Spamassassin Learn

2006-02-08 Thread Jim C. Nasby
On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
 However, looking in the config files, HASHCASH rules have the userconf flag.
 This means that the Autolearner will also ignore these rules too, as SA will
 treat it as a user configured whitelist.
 
 
 So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.

Ahh, so hashcash scores don't actually count towards learning. Should
maybe be changed...?

BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803
last night, and I'm wondering if there's been any progress on a way to
enable hashcash without requiring users to supply emails they receive
stamps for?


Re: Spamassassin Learn

2006-02-08 Thread Matt Kettler
Jim C. Nasby wrote:
 On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
 However, looking in the config files, HASHCASH rules have the userconf flag.
 This means that the Autolearner will also ignore these rules too, as SA will
 treat it as a user configured whitelist.


 So, this message had an autolearner score of +0.135 from the 
 FORGED_RCVD_HELO.
 
 Ahh, so hashcash scores don't actually count towards learning. Should
 maybe be changed...?

I'm not entirely sure.. Part of me thinks it's a good idea to not count it,
since it does effectively behave a bit like a user-configured whitelist.

I mean, if you start accepting hashcash for learning, then you probably should
also accept whitelist_from_spf.

Realistically, hashcash doesn't provide any proof the sender isn't a spammer. It
merely provides proof they are willing to burn some CPU time to send you an 
email.

In the era of spammers using enormous botnets a little CPU time really costs a
spammer very little. They're much more limited by network bandwidth than
available CPU power when they control 10,000+ infected PCs each with a cable/dsl
uplink speed of 128k-1mbit to send spam with.

 
 BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803
 last night, and I'm wondering if there's been any progress on a way to
 enable hashcash without requiring users to supply emails they receive
 stamps for?

The hashcash_accept command accepts file-glob style wildcards, so this should 
work:

hashcash_accept *

or safer:

hashcash_accept [EMAIL PROTECTED]

The problem with wildcards is that a spammer doesn't need to compute a hash on a
per-recipient basis. They merely need to do it on a per-message basis, which
makes it much less expensive for a spammer to use.
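
To illustrate why a bare wildcard is so permissive, here is a rough Python sketch of file-glob matching against the address a stamp was minted for. It uses Python's fnmatch module, which only approximates however SpamAssassin evaluates hashcash_accept patterns; the function name and the example.com / example.org addresses are hypothetical.

```python
from fnmatch import fnmatchcase

def stamp_accepted(stamp_recipient, accept_patterns):
    """True if the address a stamp was minted for matches any accept pattern."""
    address = stamp_recipient.lower()
    return any(
        fnmatchcase(address, pattern.lower()) for pattern in accept_patterns
    )

# A bare "*" accepts a stamp minted for any address at all, so one stamp
# could be reused for every recipient of a message.
print(stamp_accepted("anyone@example.org", ["*"]))

# A domain-scoped pattern only accepts stamps minted for that domain.
print(stamp_accepted("bob@example.org", ["*@example.com"]))
```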


Re: Spamassassin Learn (hashcash)

2006-02-08 Thread Matt Kettler
Matt Kettler wrote:
 Jim C. Nasby wrote:
 On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
 However, looking in the config files, HASHCASH rules have the userconf flag.
 This means that the Autolearner will also ignore these rules too, as SA will
 treat it as a user configured whitelist.


 So, this message had an autolearner score of +0.135 from the 
 FORGED_RCVD_HELO.
 Ahh, so hashcash scores don't actually count towards learning. Should
 maybe be changed...?
 
 I'm not entirely sure.. Part of me thinks it's a good idea to not count it,
 since it does effectively behave a bit like a user-configured whitelist.

Also, let's face it.. Hashcash is almost completely unused, so this is a lot of
worry over something very rare.

Since 1/1/2006 I have received mail with hashcash signatures from exactly 5
persons. Only 2 of those persons sent mail directly to me and had hashcash
signatures for my address.

Summary of persons who have used hashcash posting to lists: (names and public
list they were on only, don't want to re-post people's email addresses on lists
they don't subscribe to)

Alex B. (uribl-discuss)
John D. (uribl-discuss)
Andrew D. (sa-talk)
Jim N. (sa-talk)
rogelio a. (dansguardian)

Direct to me:
Andrew D.
Jim N.

Both of the above were sending emails regarding sa-talk postings.

Since the only people who sent me emails with hashcash for my address were
discussing spamassassin, it would have been counterproductive for me to use
hashcash in autolearning.

Since SA discussions often contain spam quotes, it's best not to intentionally
take steps that will learn such messages as ham. The benefit you get from
learning it will ultimately be counter-balanced by the mis-learning of the
occasional spam quote.






Re: Spamassassin Learn

2006-02-08 Thread Jim C. Nasby
On Wed, Feb 08, 2006 at 11:49:09AM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
  On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
  However, looking in the config files, HASHCASH rules have the userconf 
  flag.
  This means that the Autolearner will also ignore these rules too, as SA 
  will
  treat it as a user configured whitelist.
 
 
  So, this message had an autolearner score of +0.135 from the 
  FORGED_RCVD_HELO.
  
  Ahh, so hashcash scores don't actually count towards learning. Should
  maybe be changed...?
 
 I'm not entirely sure.. Part of me thinks it's a good idea to not count it,
 since it does effectively behave a bit like a user-configured whitelist.
 
 I mean, if you start accepting hashcash for learning, then you probably should
 also accept whitelist_from_spf.
 
 Realistically, hashcash doesn't provide any proof the sender isn't a spammer. 
 It
 merely provides proof they are willing to burn some CPU time to send you an 
 email.

Sure, but I think it warrants a small negative learn score. I'd expect
that real spam would have plenty enough positive score to ensure that it
didn't get learned. Of course I guess part of this is that the default
learn ham score of 0.1 is probably too high...

 In the era of spammers using enormous botnets a little CPU time really costs a
 spammer very little. They're much more limited by network bandwidth than
 available CPU power when they control 10,000+ infected PCs each with a 
 cable/dsl
 uplink speed of 128k-1mbit to send spam with.

True, but if they start burning that kind of CPU generating postage the
owner of the machine is more likely to notice something's wrong...

  
  BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803
  last night, and I'm wondering if there's been any progress on a way to
  enable hashcash without requiring users to supply emails they receive
  stamps for?
 
 The hashcash_accept command accepts file-glob style wildcards, so this should 
 work:
 
 hashcash_accept *
 
 or safer:
 
 hashcash_accept [EMAIL PROTECTED]
 
 The problem with wildcards is that a spammer doesn't need to compute a hash 
 on a
 per-recipient basis. They merely need to do it on a per-message basis, which
 makes it much less expensive for a spammer to use.

Yeah, I was specifically wondering about getting it into the default
config. It seems like it would be a very useful tool if more people used
it, and having it work by default in SA would undoubtedly go a long way
towards getting people to use it.

BTW, there were 3 proposals in that thread to combat generating one
stamp per email.


Re: Spamassassin Learn

2006-02-08 Thread Matt Kettler
Jim C. Nasby wrote:
 On Wed, Feb 08, 2006 at 11:49:09AM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
 On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
 However, looking in the config files, HASHCASH rules have the userconf 
 flag.
 This means that the Autolearner will also ignore these rules too, as SA 
 will
 treat it as a user configured whitelist.


 So, this message had an autolearner score of +0.135 from the 
 FORGED_RCVD_HELO.
 Ahh, so hashcash scores don't actually count towards learning. Should
 maybe be changed...?
 I'm not entirely sure.. Part of me thinks it's a good idea to not count it,
 since it does effectively behave a bit like a user-configured whitelist.

 I mean, if you start accepting hashcash for learning, then you probably 
 should
 also accept whitelist_from_spf.

 Realistically, hashcash doesn't provide any proof the sender isn't a 
 spammer. It
 merely provides proof they are willing to burn some CPU time to send you an 
 email.
 
 Sure, but I think it warrants a small negative learn score. 

Does it? A negative learning score is a VERY powerful thing. VERY powerful.

Someone who can forge a negative learning score can poison your bayes database
rather quickly.

Currently SA only accepts negative learning scores for things which actually
attest to the fact that this specific sender is not a spammer. SA doesn't even
trust the user's own whitelists for this purpose, because too many users do
whitelist_from *


 In the era of spammers using enormous botnets a little CPU time really costs 
 a
 spammer very little. They're much more limited by network bandwidth than
 available CPU power when they control 10,000+ infected PCs each with a 
 cable/dsl
 uplink speed of 128k-1mbit to send spam with.
 
 True, but if they start burning that kind of CPU generating postage the
 owner of the machine is more likely to notice something's wrong...


Surely you're joking.

The average user would only notice if their computer became sluggish and
unresponsive. If you do the hashes in a low-priority thread the user interface
responsiveness will never be affected. Take the distributed.net client as an
example. It burns tons of CPU, and the average user wouldn't realize it was 
there.

Sure the user could detect it with a processor usage monitor. However, if they
were clueful enough to detect CPU load by using the task manager, they'd be
clueful enough to avoid infection in the first place, or at least realize they'd
infected themselves and clean it up asap.


Remember, the bot nets are largely built from users who are infected by email
viruses. Thus for the most part we are dealing with users that will open a .pif
file attached to an email with a body saying nothing but "Please read the
document." and a subject "Re: document" (a netsky/somefool variant)








Re: Spamassassin Learn

2006-02-08 Thread Justin Mason


Jim C. Nasby writes:
 On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
  However, looking in the config files, HASHCASH rules have the userconf flag.
  This means that the Autolearner will also ignore these rules too, as SA will
  treat it as a user configured whitelist.
  
  
  So, this message had an autolearner score of +0.135 from the 
  FORGED_RCVD_HELO.
 
 Ahh, so hashcash scores don't actually count towards learning. Should
 maybe be changed...?

Nah.   The idea is that rules where users can conceivably configure
SpamAssassin to induce FNs or FPs should be ignored for purposes of
auto-learning;  we've seen *many* cases where an accidental whitelisting
of spam (for example) polluted the Bayes db.   This was put in place to
avoid that problem.


 BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803
 last night, and I'm wondering if there's been any progress on a way to
 enable hashcash without requiring users to supply emails they receive
 stamps for?

Not yet; none of us are keen to get into that argument^Wdiscussion ;)

--j.



RE: Spamassassin Learn

2006-02-07 Thread Bowie Bailey
[EMAIL PROTECTED] wrote:
 Can you just feed spamassassin spam or do you need to give it ham
 also? 
 
 I read the docs and it didn't say you had to feed it ham.
 
 I then read another doc and it said you should feed it equal amounts
 of spam and ham.

You need to feed it both.  I wouldn't worry too much about the ratios,
but the Bayes scoring won't take effect until you have learned at least
200 ham and 200 spam.

-- 
Bowie


Re: Spamassassin Learn

2006-02-07 Thread mike
You need 200 of each before the sa-learn training even starts to take effect.  I then 
feed it representative amounts of ham and spam, in the ratio they come in.


[EMAIL PROTECTED] wrote:


Can you just feed spamassassin spam or do you need to give it ham also?

I read the docs and it didn't say you had to feed it ham.

I then read another doc and it said you should feed it equal amounts of
spam and ham.


 





Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
[EMAIL PROTECTED] wrote:
 Can you just feed spamassassin spam or do you need to give it ham also?
 
 I read the docs and it didn't say you had to feed it ham.
 
 I then read another doc and it said you should feed it equal amounts of
 spam and ham.

Yes, you really should feed it both. You also should strive for a 1:1 ratio of
spam and nonspam, but don't kill yourself to get there.

SA's use of chi-squared combining makes it very tolerant of wild imbalances in
training. However, the closer you are to a 1:1 ratio the better SA will be able
to distinguish tokens that are present in both kinds of mail and ignore them. So
this is a worthwhile goal to strive for as long as it doesn't become a burden.

My current training ratio is about 7:1 spam:nonspam, but in the past it's been
as bad as 20:1. Both of those are very far off from equal amounts, but the
imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:
0.000  0 995764  0  non-token data: nspam
0.000  0 145377  0  non-token data: nham

That works out to a ratio of 6.85:1
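
For anyone who wants to repeat the calculation on their own counters, a small Python sketch follows. The two numbers are the nspam and nham values quoted above; the function name is only illustrative.

```python
def training_ratio(nspam, nham):
    """Return the number of learned spam messages per learned ham message."""
    if nham == 0:
        raise ValueError("no ham learned yet; the ratio is undefined")
    return nspam / nham

ratio = training_ratio(995764, 145377)
print(f"{ratio:.2f}:1 spam:nonspam")
```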






Re: Spamassassin Learn

2006-02-07 Thread Clay Davis
Does anyone have any good techniques for capturing a sample of ham that can be 
used as the ham corpus?  I'm in a corporate environment and am not keen on the 
idea of intercepting non-spam messages.  I will if I have to, but was hoping 
someone had a better idea.

Regards,
Clay


 On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler
[EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Can you just feed spamassassin spam or do you need to give it ham also?
 
 I read the docs and it didn't say you had to feed it ham.
 
 I then read another doc and it said you should feed it equal amounts of
 spam and ham.
 
 Yes, you really should feed it both. You also should strive for a 1:1 ratio 
 of
 spam and nonspam, but don't kill yourself to get there.
 
 SA's use of chi-squared combining makes it very tolerant of wild imbalances 
 in
 training. However, the closer you are to a 1:1 ratio the better SA will be 
 able
 to distinguish tokens that are present in both kinds of mail and ignore 
 them. So
 this is a worthwhile goal to strive for as long as it doesn't become a 
 burden.
 
 My current training ratio is about 7:1 spam:nonspam, but in the past it's 
 been
 as bad as 20:1. Both of those are very far off from equal amounts, but the
 imbalance has never caused me any problems.
 
 From my sa-learn --dump magic output as of today:
 0.000  0 995764  0  non-token data: nspam
 0.000  0 145377  0  non-token data: nham
 
 That works out to a ratio of 6.85:1



Re: Spamassassin Learn

2006-02-07 Thread jdow

This is what automatic training attempts to solve.

If you are reliably nailing spam with your current setup you can experiment
with the automatic learning. But I'd widen the score ranges a little, as
far as is practical for your mail mix.

{^_^}
- Original Message - 
From: Clay Davis [EMAIL PROTECTED]



Does anyone have any good techniques for capturing a sample of ham that can be used as the 
ham corpus.  I'm in a corporate environment and am not keen on the idea of intercepting 
non-spam messages.  I will if I have to, but was hoping someone had a better idea.


Regards,
Clay



On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler

[EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote:

Can you just feed spamassassin spam or do you need to give it ham also?

I read the docs and it didn't say you had to feed it ham.

I then read another doc and it said you should feed it equal amounts of
spam and ham.


Yes, you really should feed it both. You also should strive for a 1:1 ratio
of
spam and nonspam, but don't kill yourself to get there.

SA's use of chi-squared combining makes it very tolerant of wild imbalances
in
training. However, the closer you are to a 1:1 ratio the better SA will be
able
to distinguish tokens that are present in both kinds of mail and ignore
them. So
this is a worthwhile goal to strive for as long as it doesn't become a
burden.

My current training ratio is about 7:1 spam:nonspam, but in the past it's
been
as bad as 20:1. Both of those are very far off from equal amounts, but the
imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:
0.000  0 995764  0  non-token data: nspam
0.000  0 145377  0  non-token data: nham

That works out to a ratio of 6.85:1 




Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
 My current training ratio is about 7:1 spam:nonspam, but in the past it's been
 as bad as 20:1. Both of those are very far off from equal amounts, but the
 imbalance has never caused me any problems.
 
 From my sa-learn --dump magic output as of today:
 0.000  0 995764  0  non-token data: nspam
 0.000  0 145377  0  non-token data: nham

Interesting... it appears I actually need to do a better job of training
spam!
sa-learn --dump magic|grep am
0.000  0  98757  0  non-token data: nspam
0.000  0 255134  0  non-token data: nham

I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what
that does...


Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
 My current training ratio is about 7:1 spam:nonspam, but in the past it's 
 been
 as bad as 20:1. Both of those are very far off from equal amounts, but the
 imbalance has never caused me any problems.

 From my sa-learn --dump magic output as of today:
 0.000  0 995764  0  non-token data: nspam
 0.000  0 145377  0  non-token data: nham
 
 Interesting... it appears I actually need to do a better job of training
 spam!
 sa-learn --dump magic|grep am
 0.000  0  98757  0  non-token data: nspam
 0.000  0 255134  0  non-token data: nham
 
 I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what
 that does...

Actually, you can't ever set the threshold below 6.0. SA has a hard-coded
requirement of at least 3.0 header points, and 3.0 body points before it will
autolearn as spam. Therefore, any setting below 6 is moot, because the two 3.0
requirements can't both be met without a score of at least 6.
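
As a rough illustration of that reasoning, the Python sketch below models the check as described here: the total must reach the configured threshold, and header and body points must each reach 3.0. It assumes every point is either a header point or a body point, which is a simplification, and the names are invented for the example rather than taken from SpamAssassin's code.

```python
MIN_HEADER_POINTS = 3.0  # hard-coded header minimum described above
MIN_BODY_POINTS = 3.0    # hard-coded body minimum described above

def would_autolearn_spam(header_points, body_points, threshold):
    """Simplified model: all three conditions must hold to autolearn as spam."""
    total = header_points + body_points
    return (
        total >= threshold
        and header_points >= MIN_HEADER_POINTS
        and body_points >= MIN_BODY_POINTS
    )

def effective_spam_threshold(threshold):
    """Lowest total that can pass, given the two 3.0 minimums."""
    return max(threshold, MIN_HEADER_POINTS + MIN_BODY_POINTS)

# With a configured threshold of 5.0, a 5.5 point message still fails
# because its body points are below 3.0.
print(would_autolearn_spam(4.0, 1.5, 5.0))
print(effective_spam_threshold(5.0))
```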

I would also check to make sure you don't have a lot of spam coming in that's
getting autolearned as ham. (note: the learner's idea of score is very different
than the final message score, so a message CAN be tagged as spam, and still get
autolearned as ham)




Re: Spamassassin Learn

2006-02-07 Thread jdow

From: Jim C. Nasby [EMAIL PROTECTED]


On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:

My current training ratio is about 7:1 spam:nonspam, but in the past it's been
as bad as 20:1. Both of those are very far off from equal amounts, but the
imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:
0.000  0 995764  0  non-token data: nspam
0.000  0 145377  0  non-token data: nham


Interesting... it appears I actually need to do a better job of training
spam!
sa-learn --dump magic|grep am
0.000  0  98757  0  non-token data: nspam
0.000  0 255134  0  non-token data: nham

I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what
that does...


If you have the option, manually train the spam for a while. If the threshold
is set too low for autolearning spam you will find yourself with a mangled
database that has a high percentage of actual ham learned as spam. That is
not a good thing. You might actually lower the ham threshold, as well. It
looks like you might be at risk of learning spam as ham. (And in fact may
have done this already to a high degree.)

{^_^}


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
 I would also check to make sure you don't have a lot of spam coming in that's
 getting autolearned as ham. (note: the learner's idea of score is very 
 different
 than the final message score, so a message CAN be tagged as spam, and still 
 get
 autolearned as ham)
 
What would be the easiest way to do that? Grep through my caughtspam
maildir?


Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
 I would also check to make sure you don't have a lot of spam coming in that's
 getting autolearned as ham. (note: the learner's idea of score is very 
 different
 than the final message score, so a message CAN be tagged as spam, and still 
 get
 autolearned as ham)
  
 What would be the easiest way to do that? Grep through my caughtspam
 maildir?

That would be the way I'd check.. grep for autolearn=ham


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
  On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
  I would also check to make sure you don't have a lot of spam coming in 
  that's
  getting autolearned as ham. (note: the learner's idea of score is very 
  different
  than the final message score, so a message CAN be tagged as spam, and 
  still get
  autolearned as ham)
   
  What would be the easiest way to do that? Grep through my caughtspam
  maildir?
 
 That would be the way I'd check.. grep for autolearn=ham
 
Nothing autolearned. Interesting... I know I've fed my sent mail as ham,
but I'm pretty sure I only did that once or twice...

Guess I'll see how the numbers change with the low autolearn
threshold...


Re: Spamassassin Learn

2006-02-07 Thread Mike Jackson
Does anyone have any good techniques for capturing a sample of ham that 
can be used as the ham corpus.  I'm in a corporate environment and am not 
keen on the idea of intercepting non-spam messages.  I will if I have to, 
but was hoping someone had a better idea.


Depending on your MTA/MDA, you might be able to do it on the fly so that an 
actual copy of the message isn't necessary. For instance, if the messages 
pass through procmail, learn them just before delivery if the X-Spam-Status 
header isn't set to yes. Oh, and make sure you pass the --no-sync flag to 
sa-learn, then schedule the syncing for sometime during off-peak hours. 
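
The on-the-fly approach described above could look roughly like the following Python sketch. It is only an illustration: the helper names are made up, the check on X-Spam-Status is a simplification, and a real setup would hook this into procmail or whatever MDA is in use. The --ham, --no-sync and --sync options are the documented sa-learn ones.

```python
from email import message_from_string


def should_learn_as_ham(raw_message):
    """Return True unless the X-Spam-Status header starts with 'Yes'."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    return not status.strip().lower().startswith("yes")


def ham_learn_command(path):
    """Build the sa-learn call; --no-sync skips the slow sync step."""
    # 'path' is a placeholder for wherever the message is available on disk.
    return ["sa-learn", "--ham", "--no-sync", path]


# Run this separately, for example from cron during off-peak hours.
SYNC_COMMAND = ["sa-learn", "--sync"]
```

The commands are returned as argument lists so they can be handed to subprocess.run without going through a shell.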



Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
 I would also check to make sure you don't have a lot of spam coming in
 that's getting autolearned as ham. (note: the learner's idea of score is
 very different than the final message score, so a message CAN be tagged as
 spam, and still get autolearned as ham)
  
 What would be the easiest way to do that? Grep through my caughtspam
 maildir?
 That would be the way I'd check.. grep for autolearn=ham
  
 Nothing autolearned. 

Nothing autolearned at all? Or nothing autolearned as ham?

Are there any autolearn strings? Are they all autolearn=no? Is there a
decent number that are autolearn=failed or autolearn=disabled?


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 01:45:48PM -0800, jdow wrote:
 From: Jim C. Nasby [EMAIL PROTECTED]
 
 On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
 My current training ratio is about 7:1 spam:nonspam, but in the past it's
 been as bad as 20:1. Both of those are very far off from equal amounts, but
 the imbalance has never caused me any problems.
 
 From my sa-learn --dump magic output as of today:
 0.000  0 995764  0  non-token data: nspam
 0.000  0 145377  0  non-token data: nham
 
 Interesting... it appears I actually need to do a better job of training
 spam!
 sa-learn --dump magic|grep am
 0.000  0  98757  0  non-token data: nspam
 0.000  0 255134  0  non-token data: nham
 
 I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what
 that does...
 
 If you have the option, manually train the spam for a while. If the threshold
 is set too low for autolearning spam you will find yourself with a mangled
 database that has a high percentage of actual ham learned as spam. That is
 not a good thing. You might actually lower the ham threshold, as well. It
 looks like you might be at risk of learning spam as ham. (And in fact may
 have done this already to a high degree.)

See my other reply, which showed stats for all spam over 5 this month.
The stats for last month are:
grep -r autolearn oldspam/ | grep -v 'Binary file' |
  sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
5862 no
1225 spam
  24 unavailable

So based on this, I'd think it's not learning spam as ham...
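
For anyone who prefers not to chain grep, sed, sort and uniq, the same tally can be sketched in Python. This is an illustrative equivalent, not what was run above: the function name is invented, and it takes the first autolearn= value on each line, whereas the greedy sed expression takes the last.

```python
import re
from collections import Counter

# Matches the value after "autolearn=" in an X-Spam-Status header line.
AUTOLEARN_RE = re.compile(r"autolearn=(\S+)")


def tally_autolearn(lines):
    """Count autolearn verdicts (ham, spam, no, ...) across header lines."""
    counts = Counter()
    for line in lines:
        match = AUTOLEARN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "X-Spam-Status: Yes, score=14.2 required=5.0 tests=X autolearn=spam",
    "\tautolearn=no version=3.1.0",
    "Subject: no verdict on this line",
]
counts = tally_autolearn(sample)
```

Feeding it the lines of every file in a maildir gives the same kind of breakdown as the pipeline; an unexpected number of "ham" entries in a spam folder is the thing to look for.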

BTW, autolearn ham should be at its default setting...

What's interesting is that I get about 10-20 spams a day that are scored
below 3, and another 30-50 a day that are between 3 and 5 (which go to
my 'probablespam' folder). I send all of these to sa via spamassassin
-r, so I would have thought that I'd have far more spam in the database
than ham...
-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 Are there any autolearn strings? Are they all autolearn=no? are there any
 decent number that are autolearn=failed or autolearn=disabled?

 
 grep -r autolearn caughtspam/ | grep -v 'Binary file' |
   sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
 1545 no
  140 spam
    4 unavailable

Fair enough, that at least suggests that the autolearner is working. However,
that learning ratio is pretty low.

Are you using network tests? Without DNSBLs it's often hard to get enough header
points to cause spam learning..

(Note I use mailscanner, hence the odd log syntax)

 grep "is spam," /var/log/maillog | wc -l
   3434
 grep "is spam," /var/log/maillog | grep autolearn=spam | wc -l
   2766
 grep "is spam," /var/log/maillog | grep "autolearn=not spam" | wc -l
      0

So I'm autolearning about 80% of my tagged spam as spam, and none as ham.

I'm also autolearning about 38% of my nonspam as ham.

I'm using the default bayes_auto_learn_threshold_spam (12.0)

I'm also using a modified bayes_auto_learn_threshold_nonspam (-0.01). I use
this coupled with a series of custom rules with tiny negative scores (all > -0.1).
This makes nonspam learning something that has to be minimally earned, not just
granted by virtue of a low score.
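
As a quick check of the ratios being compared in this exchange, here is a small Python sketch. The function name is made up for illustration; the figures are the ones posted in the thread (2766 of 3434 tagged spams here, and 140 "spam" verdicts out of 1545 + 140 + 4 messages in the caughtspam folder).

```python
def learn_rate(learned, total):
    """Fraction of messages that were autolearned; 0.0 for an empty set."""
    return learned / total if total else 0.0


# Figures from the maillog greps above.
tagged_spam_rate = learn_rate(2766, 3434)

# Figures from the caughtspam/ tally quoted above.
caughtspam_rate = learn_rate(140, 1545 + 140 + 4)
```

The first works out to a little over 80%, the second to a little over 8%, which is the gap behind the remark that the learning ratio is pretty low.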




Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 05:47:36PM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
  On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote:
  Jim C. Nasby wrote:
  On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
  I would also check to make sure you don't have a lot of spam coming in
  that's getting autolearned as ham. (note: the learner's idea of score is
  very different than the final message score, so a message CAN be tagged as
  spam, and still get autolearned as ham)
   
  What would be the easiest way to do that? Grep through my caughtspam
  maildir?
  That would be the way I'd check.. grep for autolearn=ham
   
  Nothing autolearned. 
 
 Nothing autolearned at all? or nothing autolearned as ham?
 
 Are there any autolearn strings? Are they all autolearn=no? are there any
 decent number that are autolearn=failed or autolearn=disabled?
 

grep -r autolearn caughtspam/ | grep -v 'Binary file' |
  sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
1545 no
 140 spam
   4 unavailable

-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
  Are there any autolearn strings? Are they all autolearn=no? are there any
  decent number that are autolearn=failed or autolearn=disabled?
 
  
  grep -r autolearn caughtspam/ | grep -v 'Binary file' |
    sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
  1545 no
   140 spam
     4 unavailable
 
 Fair enough, that at least suggests that the autolearner is working. However,
 that learning ratio is pretty low.
 
 Are you using network tests? Without DNSBLs it's often hard to get enough
 header points to cause spam learning..

I believe so...

grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
# loadplugin Mail::SpamAssassin::Plugin::RelayCountry
loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
loadplugin Mail::SpamAssassin::Plugin::Hashcash
loadplugin Mail::SpamAssassin::Plugin::SPF

grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 5.0


This is basically a stock FreeBSD install from ports, if you're
familiar...
-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread mike

Probably would work if you were running Linux.


Chupacabra


Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 Are you using network tests? Without DNSBLs it's often hard to get enough
 header points to cause spam learning..
 
 I believe so...
 
 grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
 # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
 loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
 loadplugin Mail::SpamAssassin::Plugin::Hashcash
 loadplugin Mail::SpamAssassin::Plugin::SPF
 

None of that will tell you if DNSBLs are enabled. The DNSBLs aren't a plugin,
they're a built-in that auto-enables itself if you have perl's Net::DNS
installed. Try running spamassassin --lint -D and look for these lines:

[18000] dbg: dns: is Net::DNS::Resolver available? yes
[18000] dbg: dns: Net::DNS version: 0.48
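
If the debug output is long, the two lines above can be picked out programmatically. The Python sketch below is only an illustration: the function name is invented, and it assumes the debug text has already been captured into a string (for example from running spamassassin --lint -D and collecting its output, which may arrive on stderr rather than stdout).

```python
import re


def net_dns_status(debug_output):
    """Extract Net::DNS availability and version from SpamAssassin debug text."""
    available = re.search(
        r"dbg: dns: is Net::DNS::Resolver available\? (\w+)", debug_output
    )
    version = re.search(r"dbg: dns: Net::DNS version: (\S+)", debug_output)
    return {
        "available": bool(available) and available.group(1) == "yes",
        "version": version.group(1) if version else None,
    }


sample = (
    "[18000] dbg: dns: is Net::DNS::Resolver available? yes\n"
    "[18000] dbg: dns: Net::DNS version: 0.48\n"
)
status = net_dns_status(sample)
```

If "available" comes back False, the DNSBL tests are not running, no matter which plugins init.pre loads.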

 This is basically a stock FreeBSD install from ports, if you're
 familiar...

Nope. I personally dislike distro packages and ports of any sort for tools that
are rapidly updated.


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 05:36:56PM -0600, Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
  Jim C. Nasby wrote:
   Are there any autolearn strings? Are they all autolearn=no? are there
   any decent number that are autolearn=failed or autolearn=disabled?

   grep -r autolearn caughtspam/ | grep -v 'Binary file' |
     sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
   1545 no
    140 spam
      4 unavailable

  Fair enough, that at least suggests that the autolearner is working.
  However, that learning ratio is pretty low.

  Are you using network tests? Without DNSBLs it's often hard to get enough
  header points to cause spam learning..
 
 I believe so...
 
 grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
 # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
 loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
 loadplugin Mail::SpamAssassin::Plugin::Hashcash
 loadplugin Mail::SpamAssassin::Plugin::SPF
 
 grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
 bayes_auto_learn 1
 bayes_auto_learn_threshold_spam 5.0

Hmm... here's something interesting...

grep -r autolearn pgsql/ | grep -v 'Binary file' |
  sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
2010 ham
 198 no
  17 unavailable

So a big chunk of [EMAIL PROTECTED] email is being learned as ham.
Looking further, I see...

X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham
version=3.1.0

ISTM that having the thresholds set up so that BAYES_00 scores low enough
to autolearn is a BadThing, as it creates a positive feedback loop. :)
I've added bayes_auto_learn_threshold_nonspam -2.6 to my personal
config; we'll see if that helps.
-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 06:47:06PM -0500, Matt Kettler wrote:
 Jim C. Nasby wrote:
  Are you using network tests? Without DNSBLs it's often hard to get enough
  header points to cause spam learning..
  
  I believe so...
  
  grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
  # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
  loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
  loadplugin Mail::SpamAssassin::Plugin::Hashcash
  loadplugin Mail::SpamAssassin::Plugin::SPF
  
 
 None of that will tell you if DNSBLs are enabled. The DNSBLs aren't a plugin,
 they're a built-in that auto-enables itself if you have perl's Net::DNS
 installed. Try running spamassassin --lint -D and look for these lines:
 
 [18000] dbg: dns: is Net::DNS::Resolver available? yes
 [18000] dbg: dns: Net::DNS version: 0.48

spamassassin --lint -D | grep Net::DNS | grep -i version
[50306] dbg: dns: Net::DNS version: 0.55
[50306] dbg: diag: module installed: Net::DNS, version 0.55

-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 05:45:54PM -0600, mike wrote:
 Probably would work if you were running Linux.

The problem isn't that it isn't working, the problem is that it's
working too well. I guess maybe that's something you're not used to. :P
-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread mike

Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 05:45:54PM -0600, mike wrote:
  Probably would work if you were running Linux.

 The problem isn't that it isn't working, the problem is that it's
 working too well. I guess maybe that's something you're not used to. :P

Something tells me if that were true you would not be in here asking
questions but demoing howtos, i.e. how to make SA work too well. Whatever
that is supposed to mean.



Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 05:36:56PM -0600, Jim C. Nasby wrote:
   
 On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
 
 Jim C. Nasby wrote:
   
 Are there any autolearn strings? Are they all autolearn=no? are there
 any decent number that are autolearn=failed or autolearn=disabled?

 grep -r autolearn caughtspam/ | grep -v 'Binary file' |
   sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
 1545 no
  140 spam
    4 unavailable

 Fair enough, that at least suggests that the autolearner is working.
 However, that learning ratio is pretty low.

 Are you using network tests? Without DNSBLs it's often hard to get enough
 header points to cause spam learning..
   
 I believe so...

 grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
 # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
 loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
 loadplugin Mail::SpamAssassin::Plugin::Hashcash
 loadplugin Mail::SpamAssassin::Plugin::SPF

 grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
 bayes_auto_learn 1
 bayes_auto_learn_threshold_spam 5.0
 

 Hmm... here's something interesting...

 grep -r autolearn pgsql/ | grep -v 'Binary file' |
   sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
 2010 ham
  198 no
   17 unavailable

 So a big chunk of [EMAIL PROTECTED] email is being learned as ham.
 Looking further, I see...

 X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham
 version=3.1.0

 ISTM that having the thresholds setup so that BAYES_00 scores low enough
 to autolearn is a BadThing, as it creates a positive feedback loop. :)
 I've added bayes_auto_learn_threshold_nonspam -2.6 to my personal
 config; we'll see if that helps.
   

Jim,

Bayes is NOT used when calculating the autolearning score; that would
promote self-feedback. As I said before, the autolearner's concept of
score is VERY different from the final message score. Score
contributions from bayes, white/blacklists, and the AWL are all ignored
by the autolearner. It also looks up the individual rule scores from set
0 or 1 instead of 2 or 3. This is a MASSIVE difference.


However, the default ham autolearn threshold
(bayes_auto_learn_threshold_nonspam) is 0.1. That's a POSITIVE
threshold. To the autolearner that message scored 0 points. 0 is less
than 0.1, so it learned as HAM.

I'd suggest re-adjusting your threshold, as a default SpamAssassin config
will only VERY rarely generate a negative score to the autolearner. The
only rules that can do it are BondedSender, Habeas COI/SOI and hashcash.
Hashcash is so rare it may as well not exist at present. BondedSender
and Habeas are only used by large legitimate mailers, so none of your
person-to-person mail will ever get autolearned in your current setup
unless you know someone who uses hashcash.
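
The decision described above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not SpamAssassin's actual code: the function names and the prefix list used to skip Bayes, white/blacklist and AWL hits are illustrative assumptions, the exact boundary handling at each threshold is assumed, and the real autolearner applies further conditions that are left out here. The two default thresholds (0.1 and 12.0) are the ones mentioned in this thread.

```python
HAM_THRESHOLD = 0.1    # default bayes_auto_learn_threshold_nonspam
SPAM_THRESHOLD = 12.0  # default bayes_auto_learn_threshold_spam

# Illustrative name prefixes for rule families the autolearner ignores.
IGNORED_PREFIXES = ("BAYES_", "USER_IN_", "AWL")


def learning_score(hits):
    """Sum rule scores, skipping Bayes, white/blacklist and AWL hits."""
    return sum(
        score for name, score in hits.items()
        if not name.startswith(IGNORED_PREFIXES)
    )


def autolearn_verdict(hits, ham_threshold=HAM_THRESHOLD,
                      spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam' or 'no' for a dict of rule name -> score."""
    score = learning_score(hits)
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"


# The message from the thread: only BAYES_00 hit, final score -2.6.
default_verdict = autolearn_verdict({"BAYES_00": -2.6})
tightened_verdict = autolearn_verdict({"BAYES_00": -2.6}, ham_threshold=-2.6)
```

With the default threshold the message has a learning score of 0 and is learned as ham; with the threshold at -2.6 the same message is no longer learned, because BAYES_00 contributes nothing to the learning score.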




Re: Spamassassin Learn

2006-02-07 Thread jdow

From: Matt Kettler [EMAIL PROTECTED]


Jim C. Nasby wrote:

Are there any autolearn strings? Are they all autolearn=no? are there any
decent number that are autolearn=failed or autolearn=disabled?



grep -r autolearn caughtspam/ | grep -v 'Binary file' |
  sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
1545 no
 140 spam
   4 unavailable


Fair enough, that at least suggests that the autolearner is working. However,
that learning ratio is pretty low.

Are you using network tests? Without DNSBLs it's often hard to get enough header
points to cause spam learning..

(Note I use mailscanner, hence the odd log syntax)

grep "is spam," /var/log/maillog | wc -l
  3434
grep "is spam," /var/log/maillog | grep autolearn=spam | wc -l
  2766
grep "is spam," /var/log/maillog | grep "autolearn=not spam" | wc -l
     0

So I'm autolearning about 80% of my tagged spam as spam, and none as ham.

I'm also autolearning about 38% of my nonspam as ham.

I'm using the default bayes_auto_learn_threshold_spam (12.0)

I'm also using a modified bayes_auto_learn_threshold_nonspam (-0.01). I use
this coupled with a series of custom rules with tiny negative scores (all > -0.1).
This makes nonspam learning something that has to be minimally earned, not just
granted by virtue of a low score.


I wonder if he has greylisting turned on.
{^_^}


Re: Spamassassin Learn

2006-02-07 Thread Jim C. Nasby
On Tue, Feb 07, 2006 at 07:59:37PM -0500, Matt Kettler wrote:
 Jim,
 
 Bayes is NOT used when calculating the autolearning score; that would
 promote self-feedback. As I said before, the autolearner's concept of
 score is VERY different from the final message score. Score
 contributions from bayes, white/blacklists, and the AWL are all ignored
 by the autolearner. It also looks up the individual rule scores from set
 0 or 1 instead of 2 or 3. This is a MASSIVE difference.

 However, the default ham autolearn threshold
 (bayes_auto_learn_threshold_nonspam) is 0.1. That's a POSITIVE
 threshold. To the autolearner that message scored 0 points. 0 is less
 than 0.1, so it learned as HAM.

 I'd suggest re-adjusting your threshold, as a default SpamAssassin config
 will only VERY rarely generate a negative score to the autolearner. The
 only rules that can do it are BondedSender, Habeas COI/SOI and hashcash.
 Hashcash is so rare it may as well not exist at present. BondedSender
 and Habeas are only used by large legitimate mailers, so none of your
 person-to-person mail will ever get autolearned in your current setup
 unless you know someone who uses hashcash.

Ahh, got it. Makes much more sense. :)

So I guess either 0 or -0.1 makes the most sense?
-- 
Jim C. Nasby, Database Architect[EMAIL PROTECTED] 


Re: Spamassassin Learn

2006-02-07 Thread Matt Kettler
Jim C. Nasby wrote:
 On Tue, Feb 07, 2006 at 07:59:37PM -0500, Matt Kettler wrote:
   
 Jim,

 Bayes is NOT used when calculating the autolearning score; that would
 promote self-feedback. As I said before, the autolearner's concept of
 score is VERY different from the final message score. Score
 contributions from bayes, white/blacklists, and the AWL are all ignored
 by the autolearner. It also looks up the individual rule scores from set
 0 or 1 instead of 2 or 3. This is a MASSIVE difference.

 However, the default ham autolearn threshold
 (bayes_auto_learn_threshold_nonspam) is 0.1. That's a POSITIVE
 threshold. To the autolearner that message scored 0 points. 0 is less
 than 0.1, so it learned as HAM.

 I'd suggest re-adjusting your threshold, as a default SpamAssassin config
 will only VERY rarely generate a negative score to the autolearner. The
 only rules that can do it are BondedSender, Habeas COI/SOI and hashcash.
 Hashcash is so rare it may as well not exist at present. BondedSender
 and Habeas are only used by large legitimate mailers, so none of your
 person-to-person mail will ever get autolearned in your current setup
 unless you know someone who uses hashcash.
 

 Ahh, got it. Makes much more sense. :)

 So I guess either 0 or -0.1 makes the most sense?
   
0 makes the most sense, unless you add on negative-scoring rules.  With
a default SA there's really no difference in autolearning threshold
between -1.3 and -0.1, and very little difference between -0.001 and -100.0.

Ignoring hashcash due to its rarity (bayes, the AWL, and all whitelists
can't count, so they are omitted):

There are 0 rules in SA that can get you a learning score at or below -8.001.
There are only 3 rules in SA that can get you a learning score at or
below -2.3.
There are only 7 rules in SA that can get you a learning score at or
below -0.1.
There are only 12 rules in SA that can get you a learning score at or
below -0.001.

The differences between the 4 cases are more or less moot. You won't
learn much ham at all.

Even if you consider hashcash, that's only another 5 rules, and only
applies when senders realize what hashcash even is.

I run my boxes with -0.01 as a threshold, but I've added on about 30
simple body-text rules looking for industry terminology for my
company's business and assigning -0.02 scores to them. This way I
autolearn any business-related mail without any real chance of a spammer
abusing them to whitelist himself. Even if a spam hit every single one of
my rules, it would only knock 0.6 points off the spam score.

For reference, these are the only rules in a stock SA 3.1.0 that can
give you a negative learning score:

score HABEAS_ACCREDITED_COI 0 -8.0 0 -8.0
score RCVD_IN_BSP_TRUSTED 0 -4.3 0 -4.3
score HABEAS_ACCREDITED_SOI 0 -4.3 0 -4.3
score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
score RCVD_IN_IADB_VOUCHED 0 -1.825 0 -2.200
score HABEAS_CHECKED 0 -0.2 0 -0.2
score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
score NO_RELAYS -0.001
score NO_RECEIVED -0.001
score DK_VERIFIED -0.001
score SPF_PASS -0.001
score SPF_HELO_PASS -0.001

score HASHCASH_20 -0.500
score HASHCASH_21 -0.700
score HASHCASH_22 -1.000
score HASHCASH_23 -2.000
score HASHCASH_24 -3.000
score HASHCASH_25 -4.000
score HASHCASH_HIGH -5.000
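
The counts quoted earlier in this message can be reproduced from the list with a short Python sketch. It assumes the autolearner reads score set 1 (the second of the four columns, i.e. network tests on and Bayes off); rules listed with a single value use that value in every set. Hashcash is left out, as in the counts above, and the function name is made up for illustration.

```python
# Score set 1 values copied from the non-hashcash rules listed above.
SET1_SCORES = {
    "HABEAS_ACCREDITED_COI": -8.0,
    "RCVD_IN_BSP_TRUSTED": -4.3,
    "HABEAS_ACCREDITED_SOI": -4.3,
    "ALL_TRUSTED": -1.440,
    "RCVD_IN_IADB_VOUCHED": -1.825,
    "HABEAS_CHECKED": -0.2,
    "RCVD_IN_BSP_OTHER": -0.1,
    "NO_RELAYS": -0.001,
    "NO_RECEIVED": -0.001,
    "DK_VERIFIED": -0.001,
    "SPF_PASS": -0.001,
    "SPF_HELO_PASS": -0.001,
}


def rules_at_or_below(threshold, scores=SET1_SCORES):
    """Names of rules that alone give a learning score <= threshold."""
    return sorted(name for name, score in scores.items() if score <= threshold)


counts = {t: len(rules_at_or_below(t)) for t in (-8.001, -2.3, -0.1, -0.001)}

# Ceiling for the custom rules described above: about 30 rules at -0.02 each.
max_custom_discount = round(30 * 0.02, 2)
```

Under that assumption the four thresholds give 0, 3, 7 and 12 rules, matching the figures above, and the custom body rules can take at most 0.6 points off a score.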





Re: Spamassassin Learn

2006-02-07 Thread Gene Heskett
On Tuesday 07 February 2006 15:27, Clay Davis wrote:
Does anyone have any good techniques for capturing a sample of ham
 that can be used as the ham corpus.  I'm in a corporate environment
 and am not keen on the idea of intercepting non-spam messages.  I
 will if I have to, but was hoping someone had a better idea.

I wouldn't have too guilty a conscience on that subject because
generally, you won't be reading very much other than intercepted spam.
There may be an FP in there occasionally, but you'll soon learn to
catch those and feed them to the ham learner, and hence move them to the
correct mailbox folder.  In other words, to make an omelet, you
normally have to break a few eggs.  What you accidentally read in an FP
should be treated with the usual amount of salt and otherwise
forgotten.

Regards,
Clay

 On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler
 [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Can you just feed spamassassin spam or do you need to give it ham
 also?

 I read the docs and it didn't say you had to feed it ham.

 I then read another doc and it said you should feed it equal
 amounts of spam and ham.

 Yes, you really should feed it both. You also should strive for a 1:1 ratio
 of spam and nonspam, but don't kill yourself to get there.

 SA's use of chi-squared combining makes it very tolerant of wild imbalances
 in training. However, the closer you are to a 1:1 ratio the better SA will
 be able to distinguish tokens that are present in both kinds of mail and
 ignore them. So this is a worthwhile goal to strive for as long as it
 doesn't become a burden.

 My current training ratio is about 7:1 spam:nonspam, but in the past it's
 been as bad as 20:1. Both of those are very far off from equal amounts, but
 the imbalance has never caused me any problems.

 From my sa-learn --dump magic output as of today:
 0.000  0 995764  0  non-token data: nspam
 0.000  0 145377  0  non-token data: nham

 That works out to a ratio of 6.85:1
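
The ratio can be computed straight from the sa-learn --dump magic output. The Python sketch below is illustrative only: the function name is invented, and it assumes the whitespace-separated column layout shown in the two lines quoted above, with the count in the third column.

```python
def parse_magic_counts(dump_text):
    """Pull the nspam and nham counts out of 'sa-learn --dump magic' text."""
    counts = {}
    for line in dump_text.splitlines():
        fields = line.split()
        if (len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]
                and fields[6] in ("nspam", "nham")):
            counts[fields[6]] = int(fields[2])
    return counts


sample = (
    "0.000  0 995764  0  non-token data: nspam\n"
    "0.000  0 145377  0  non-token data: nham\n"
)
counts = parse_magic_counts(sample)
ratio = counts["nspam"] / counts["nham"]
```

For the figures quoted above, the ratio rounds to 6.85 spam per ham. The other database posted in this thread (98757 spam, 255134 ham) comes out the other way round, at roughly 0.39.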

-- 
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules.  I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.