Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Bowie Bailey


On 8/21/2012 5:51 PM, Ben Johnson wrote:


On 8/21/2012 5:19 PM, John Hardin wrote:

On Tue, 21 Aug 2012, Ben Johnson wrote:


Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks

---8--
# sa-learn --username=amavis --dump magic

Run that with --debug and verify that the filenames match.


Sure enough, they don't match:

---8--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        307          0  non-token data: nham
0.000          0      62301          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345579297          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8--

So, I suppose that I didn't actually resolve the problem from yesterday,
which was that I cannot seem to train under the amavis user due to the
ownership/permissions on the /var/vmail directory.

What good is the --username switch, then?

Why does this command train the root user's database?

# sa-learn --username=amavis --spam /path/to/spam

And why does this command dump the root user's database?

# sa-learn --username=amavis --dump magic

Thanks very much,


As has already been mentioned, the '--username' option is only useful if 
you're using SQL.  You should set your bayes_path so there is no confusion.


Since you have been training the root database, you may want to copy 
that one over.


$ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/

Then fix the permissions and ownership back to what they should be for 
the amavis user.


Then set the bayes path in your local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

(Don't double the 'bayes' at the end as was suggested previously unless 
you want to move the bayes files into a 'bayes' directory)


Restart amavis and try again...

--
Bowie

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread John Hardin


On Wed, 22 Aug 2012, Bowie Bailey wrote:


On 8/21/2012 5:51 PM, Ben Johnson wrote:


 What good is the --username switch, then?


See other responses.


 Why does this command train the root user's database?


Because you ran the command as root.

I apologize, I didn't provide sufficient details. When I said train as 
the user who runs SA I meant su to that OS user ID before running the 
sa-learn command.


You can either override the default Bayes database files path to 
explicitly specify a shared global database as has been suggested, or run 
sa-learn as the amavis user via su or a cron job. Defining a global bayes 
database is probably a better solution overall, but bear in mind if you 
have to wipe and retrain you need to check the permissions on the new 
database files after you run sa-learn the first time.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...every time I sit down in front of a Windows machine I feel as
  if the computer is just a place for the manufacturers to put their
  advertising. -- fwadling on Y! SCOX
---
 2 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson



On 8/22/2012 9:05 AM, Bowie Bailey wrote:
 On 8/21/2012 5:51 PM, Ben Johnson wrote:

 On 8/21/2012 5:19 PM, John Hardin wrote:
 On Tue, 21 Aug 2012, Ben Johnson wrote:

 Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
 /var/lib/amavis/.spamassassin/bayes_toks

 ---8--
 # sa-learn --username=amavis --dump magic
 Run that with --debug and verify that the filenames match.

 Sure enough, they don't match:

 ---8--
 [...]
 dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
 dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
 Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
 0.000          0          3          0  non-token data: bayes db version
 0.000          0         95          0  non-token data: nspam
 0.000          0        307          0  non-token data: nham
 0.000          0      62301          0  non-token data: ntokens
 0.000          0 1345469997          0  non-token data: oldest atime
 0.000          0 1345579297          0  non-token data: newest atime
 0.000          0          0          0  non-token data: last journal sync atime
 0.000          0          0          0  non-token data: last expiry atime
 0.000          0          0          0  non-token data: last expire atime delta
 0.000          0          0          0  non-token data: last expire reduction count
 ---8--

 So, I suppose that I didn't actually resolve the problem from yesterday,
 which was that I cannot seem to train under the amavis user due to the
 ownership/permissions on the /var/vmail directory.

 What good is the --username switch, then?

 Why does this command train the root user's database?

 # sa-learn --username=amavis --spam /path/to/spam

 And why does this command dump the root user's database?

 # sa-learn --username=amavis --dump magic

 Thanks very much,
 
 As has already been mentioned, the '--username' option is only useful if
 you're using SQL.  You should set your bayes_path so there is no confusion.

Thank you Axb and Bowie for clarifying this point. Perhaps the sa-learn
documentation should be updated to eliminate the ambiguity around this
switch. In particular, I am referring to this page:
http://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html , which
states only the following:

If specified this username will override the username taken from the
runtime environment. You can use this option to specify users in a
virtual user configuration.

Maybe adding the SQL keyword will make the virtual user
configuration distinction more evident.

 Since you have been training the root database, you may want to copy
 that one over.
 
 $ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/
 
 Then fix the permissions and ownership back to what they should be for
 the amavis user.

I did think to do this, but I approached it a bit differently, and used
sa-learn --backup (and --restore), under the amavis user account,
which mitigated the need to modify the permissions on the database.
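
For anyone following along, a minimal sketch of that backup-and-restore route. The wrapper names `backup_bayes` and `restore_bayes`, the `SA_LEARN` variable, and the file path in the usage note are made up for illustration; only `sa-learn --backup` and `sa-learn --restore` themselves come from this thread.

```shell
#!/bin/sh
# Move a file-based Bayes database between users with
# sa-learn --backup / --restore instead of copying the DBM files.
# SA_LEARN is a hypothetical override so another binary can be substituted.
SA_LEARN=${SA_LEARN:-sa-learn}

# Write a text dump of the current user's Bayes database to the file in $1.
backup_bayes() {
    "$SA_LEARN" --backup > "$1"
}

# Load a dump created by backup_bayes into the current user's database.
restore_bayes() {
    "$SA_LEARN" --restore "$1"
}
```

The idea would be to run `backup_bayes /tmp/bayes-backup.txt` as the user who owns the trained database, then `restore_bayes /tmp/bayes-backup.txt` as the amavis user, so the new files are created with that user's ownership.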

 Then set the bayes path in your local.cf:
 
 bayes_path /var/lib/amavis/.spamassassin/bayes
 
 (Don't double the 'bayes' at the end as was suggested previously unless
 you want to move the bayes files into a 'bayes' directory)
 
 Restart amavis and try again...
 

Again, thanks to Axb and Bowie for making this suggestion. Hard-coding
the bayes_path was the missing link for me; this is what allowed me to
train under the amavis user while having root (or vmail)
privileges, which on Debian, are necessary to read mail during training.

I think I'm sorted here; thanks again, guys!

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson



On 8/22/2012 9:43 AM, John Hardin wrote:
 On Wed, 22 Aug 2012, Bowie Bailey wrote:
 
 On 8/21/2012 5:51 PM, Ben Johnson wrote:

  What good is the --username switch, then?

Thanks for the follow-up, John!

 See other responses.
 
  Why does this command train the root user's database?
 
 Because you ran the command as root.
 
 I apologize, I didn't provide sufficient details. When I said train as
 the user who runs SA I meant su to that OS user ID before running the
 sa-learn command.

No apology necessary; I knew what you meant, and did indeed try running
the sa-learn command as the amavis user, initially, but the problem then
was a lack of access to the mail directories. On Debian/Ubuntu systems, when
using Dovecot, all mail directories are vmail:vmail owned, with 700
permissions, which prevents the amavis user from having access to
them. (This is by design, I'm sure, and makes sense.)

 You can either override the default Bayes database files path to
 explicitly specify a shared global database as has been suggested, or
 run sa-learn as the amavis user via su or a cron job.

I did end up overriding the bayes_path, which provided a workaround for
the permissions issues. Cheers to the suggestion.

Defining a global
 bayes database is probably a better solution overall, but bear in mind
 if you have to wipe and retrain you need to check the permissions on the
 new database files after you run sa-learn the first time.
 

This is an important point; thanks for articulating it.

All appears to be well in SpamAssassin Town for the time being (don't
think you've heard the last of me, though!). Thanks to everyone who
shared his or her expertise.

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Axb


On 08/22/2012 04:10 PM, Ben Johnson wrote:


I did end-up overriding the bayes_path, which provided a workaround for
the permissions issues. Cheers to the suggestion.


This is not a workaround, it's common practice in many types of setups 
and documented, but due to numerous reasons can't be set as a default.
If the install routine would require/create a 
/etc/mail/spamassassin/bayes path it could bite other systems than 
standard Linux distros.

(note to myself: discuss this in dev list)

As so often, the clue is diagnostics.
In this case, I think we all worked backwards, first answering your 
questions before getting the big picture.



Defining a global

bayes database is probably a better solution overall, but bear in mind
if you have to wipe and retrain you need to check the permissions on the
new database files after you run sa-learn the first time.

This is an important point; thanks for articulating it.


Once you start seeing bayes hits, I'd switch to autolearn, disable auto 
expiration and set a weekly cron job to do the expiration.
That way Bayes keeps itself busy and you only have to train low scored 
stuff on a daily basis (cron job as the amavis/imap user) or rsync the 
imap folder content out and sa-learn from the target path.
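
A sketch of what that weekly job could look like. The `local.cf` values in the comments, the `SA_LEARN` variable, and the function name are assumptions for illustration; `bayes_auto_learn`, `bayes_auto_expire` and `sa-learn --force-expire` are existing SpamAssassin settings and options, but check the documentation for your version before relying on them.

```shell
#!/bin/sh
# Weekly expiry for a file-based Bayes database.
# Assumed local.cf settings to go with it (not shown in this thread):
#   bayes_auto_learn  1
#   bayes_auto_expire 0
# SA_LEARN is a hypothetical override so another binary can be substituted.
SA_LEARN=${SA_LEARN:-sa-learn}

# Ask sa-learn to run an expiry pass now instead of waiting for a scan
# to trigger one.
expire_bayes() {
    "$SA_LEARN" --force-expire
}
```

A script that defines this and then calls `expire_bayes` could be run from the crontab of whichever user owns the database, for example once a week during a quiet hour.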



All appears to be well in SpamAssassin Town for the time being (don't
think you've heard the last of me, though!). Thanks to everyone who
shared his or her expertise.


Learning SA seems like a never ending process - the deeper you go, the 
more of its beauty comes to light.


Axb

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson



On 8/22/2012 10:26 AM, Axb wrote:
 On 08/22/2012 04:10 PM, Ben Johnson wrote:

 I did end-up overriding the bayes_path, which provided a workaround for
 the permissions issues. Cheers to the suggestion.
 
 This is not a workaround, it's common practice in many types of setups
 and documented, but due to numerous reasons can't be set as a default.
 If the install routine would require/create a
 /etc/mail/spamassassin/bayes path it could bite other systems than
 standard Linux distros.
 (note to myself: discuss this in dev list)

Right; it makes sense that this path cannot have a default value (other
than ~/...).

That said, it seems that for some users (myself included), setting this
path manually is a critical step in creating a maximally functional
(that is, Bayes-enabled) SpamAssassin installation. This would be
especially true if the SA developers were to change the
bayes_auto_learn default value to zero, or lower the default value for
bayes_auto_learn_threshold_nonspam (as a result of my incident here).

For this reason, it seems prudent for developers/contributors to take
one of two actions (or both):

1.)

Add the bayes_path directive to the default/stock local.cf that
ships with SpamAssassin, in a commented-out state. I realize that this
file may be maintainer/distribution specific, and that there are
attendant challenges associated with such a change.

This measure would underscore the directive's importance for the
administrator who is configuring the software.

2.)

Where possible, modify the SpamAssassin installer package to prompt the
user for the bayes_path during installation. These types of prompts
are common among related packages. For example, Postfix asks for all
kinds of information during its installation (on Debian-based systems,
anyway).

Again, I realize that the SA developers likely have no control over how
the software is packaged and delivered, so if this point seems valid, I
am happy to open distro-specific bug reports (or feature requests).

Thanks, Axb.

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Ben Johnson

On 8/20/2012 2:47 PM, Ben Johnson wrote:
 I was able to resolve the issue by adding the --username switch to the
 'sa-learn' executable:
 
 # sa-learn --username=amavis --spam
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur
 
 Thanks for all of the hints, folks!

So, I've been training SpamAssassin like a mad-man for a couple of days.

I don't have over 200 spams and 200 hams, so I don't expect Bayes to be
used yet (and it's not), but the following output is puzzling
(particularly, "only 0 spam(s) in bayes DB < 200"):

---8--
# su amavis -c 'spamassassin -D -t < /usr/share/doc/spamassassin/examples/sample-spam.txt 2>&1' \
    | egrep '(bayes:|whitelist:|AWL)'

Aug 21 13:08:33.717 [23714] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x213613f8),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
Aug 21 13:08:33.728 [23714] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2153b400)
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.730 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.730 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.730 [23714] dbg: bayes: not available for scanning, only
0 spam(s) in bayes DB < 200
Aug 21 13:08:33.730 [23714] dbg: bayes: untie-ing
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.733 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.733 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.733 [23714] dbg: bayes: not available for scanning, only
0 spam(s) in bayes DB < 200
Aug 21 13:08:33.733 [23714] dbg: bayes: untie-ing
---8--

Restarting Amavis does not change the output above.

And the output below seems to contradict the above (95 spams and 300 hams):

---8--
# sa-learn --username=amavis --dump magic

0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        300          0  non-token data: nham
0.000          0      59420          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345577900          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8--
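
A minimal sketch of checking a dump like this against the 200-message minimum that the debug log refers to. The function name `bayes_ready` is hypothetical, and the sketch assumes the usual `--dump magic` layout in which the count is the third whitespace-separated column.

```shell
#!/bin/sh
# bayes_ready [MIN]
# Read `sa-learn --dump magic` output on stdin, print the nspam and nham
# counts, and succeed only if both reach MIN (200 unless given).
# Assumes the count is the third whitespace-separated column.
bayes_ready() {
    min=${1:-200}
    awk -v min="$min" '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}
```

Piping the dump through it, as in `sa-learn --dump magic | bayes_ready`, would report the counts for whichever database that command actually opened, which is the thing to double-check here.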

Am I doing something silly?

Thanks for any help,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread John Hardin


On Tue, 21 Aug 2012, Ben Johnson wrote:


Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks

---8--
# sa-learn --username=amavis --dump magic


Run that with --debug and verify that the filenames match.
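
One way to do that comparison, as a sketch: pull the database paths out of the debug output of each command and look at them side by side. The function name `bayes_db_paths` is made up for illustration, and the pattern assumes each "tie-ing to DB file" message arrives on a single line, unlike the wrapped copies in this mail.

```shell
#!/bin/sh
# bayes_db_paths
# Read SpamAssassin debug output on stdin and print the unique Bayes DB
# file paths named in "tie-ing to DB file" messages, one per line.
bayes_db_paths() {
    sed -n 's/.*bayes: tie-ing to DB file [^ ]* \(.*\)$/\1/p' | sort -u
}
```

For example, `sa-learn --debug --dump magic 2>&1 | bayes_db_paths` for the training side, and the same filter on the `spamassassin -D` output for the scanning side. If the two lists differ, training and scanning are using different databases.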

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #20: The faster you finish the fight,
  the less shot you will get.
---
 3 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Axb


On 08/21/2012 11:19 PM, John Hardin wrote:

On Tue, 21 Aug 2012, Ben Johnson wrote:


Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks

---8--
# sa-learn --username=amavis --dump magic


Run that with --debug and verify that the filenames match.




it never hurts to define the bayes path in local.cf
(keeps you from guessing where it will land)


bayes_path /var/lib/amavis/.spamassassin/bayes/bayes

yes! , bayes twice!!! - make sure the path is as above

mkdir /var/lib/amavis/.spamassassin/bayes

/var/lib/amavis/.spamassassin/bayes_seen doesn't seem right
afaik, normally it would be

/var/lib/amavis/.spamassassin/bayes/bayes_seen

try this, relearn as much as you can  and run a --dump magic

Axb

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Jonas Eckerman


On 2012-08-15 20:56, Ben Johnson wrote:


On 8/15/2012 2:24 PM, John Hardin wrote:

You may also want to set up some mechanism for users to submit
misclassified messages for training.



That sounds like a good idea.
[...] this server runs Ubuntu 10.04 with Dovecot


Since you're using Dovecot you might be able to use the antispam plugin 
for dovecot. It lets you specify a special spam folder, and when users 
move mail into or out of that folder they are spooled or piped for 
retraining as spam or ham.


This way, the user running sa-learn does not need access to the users 
maildirs.


http://wiki2.dovecot.org/Plugins/Antispam
http://johannes.sipsolutions.net/Projects/dovecot-antispam

Regards
/Jonas
--
Jonas Eckerman
http://www.truls.org/

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Ben Johnson



On 8/21/2012 5:19 PM, John Hardin wrote:
 On Tue, 21 Aug 2012, Ben Johnson wrote:
 
 Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
 /var/lib/amavis/.spamassassin/bayes_toks

 ---8--
 # sa-learn --username=amavis --dump magic
 
 Run that with --debug and verify that the filenames match.
 

Sure enough, they don't match:

---8--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        307          0  non-token data: nham
0.000          0      62301          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345579297          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8--

So, I suppose that I didn't actually resolve the problem from yesterday,
which was that I cannot seem to train under the amavis user due to the
ownership/permissions on the /var/vmail directory.

What good is the --username switch, then?

Why does this command train the root user's database?

# sa-learn --username=amavis --spam /path/to/spam

And why does this command dump the root user's database?

# sa-learn --username=amavis --dump magic

Thanks very much,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Axb


On 08/21/2012 11:51 PM, Ben Johnson wrote:



On 8/21/2012 5:19 PM, John Hardin wrote:

On Tue, 21 Aug 2012, Ben Johnson wrote:


Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks

---8--
# sa-learn --username=amavis --dump magic


Run that with --debug and verify that the filenames match.



Sure enough, they don't match:

---8--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen


if you add bayes_path in local.cf it should find the right path, no 
matter what user you run it as. (works for me)

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Axb


On 08/21/2012 11:51 PM, Ben Johnson wrote:



On 8/21/2012 5:19 PM, John Hardin wrote:

On Tue, 21 Aug 2012, Ben Johnson wrote:


Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks

---8--
# sa-learn --username=amavis --dump magic


Run that with --debug and verify that the filenames match.



Sure enough, they don't match:

---8--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        307          0  non-token data: nham
0.000          0      62301          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345579297          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8--

So, I suppose that I didn't actually resolve the problem from yesterday,
which was that I cannot seem to train under the amavis user due to the
ownership/permissions on the /var/vmail directory.

What good is the --username switch, then?

Why does this command train the root user's database?

# sa-learn --username=amavis --spam /path/to/spam

And why does this command dump the root user's database?

# sa-learn --username=amavis --dump magic


because:

-u username, --username=username
   Override username taken from the runtime
   environment, used with SQL

and *not* for file based Bayes

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson



On 8/17/2012 11:28 AM, John Hardin wrote:
 On Fri, 17 Aug 2012, Ben Johnson wrote:
 
 On 8/16/2012 2:00 PM, Ben Johnson wrote:
 Basically, I need to do something about the spam inundation, as soon as
 possible.

 Is there any reason that I should NOT be performing the sa-learn
 training under the amavis user account?
 
 In general, all training should be done as the user that SA (in your
 case, SA via Amavis) is running as.

I have tried to do this, but to no avail:

---
# su amavis -c 'sa-learn --spam
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
archive-iterator: unable to open
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
---

This seems to occur because the virtual mail directory permissions do
not provide the amavis user with the required access level. The
vmail user is the only user with any type of access to
/var/vmail/example.com/user/Maildir. I suspect that there is a good
reason for this and that the ownership/permissions should not be changed.

I've done some research on this issue and there isn't much to be found.
This archived thread ( http://marc.info/?l=amavis-user&m=116457786312019
) discusses overriding the Bayes user with bayes_sql_override_username
amavis, but that doesn't solve the problem (obviously). I still see the
same permission errors, although the need to use the 'su' wrapper does
go away.

Is there a conventional means by which to deal with this issue?

 If you have your system configured for per-user Bayes databases, then
 you'd need to train as the user whose database you want to affect.

The system in question leverages ISPConfig 3, which implements virtual
users/mailboxes, although, I don't know if ISPConfig configures Amavis
to utilize individual Bayes databases or if there's an individual
database for the amavis user. I can check with the developers.

 What is your bayes_path config?

I don't see this directive anywhere on the system in question; perhaps a
default value is being used. The only instance of that string exists in
a source file:

/usr/share/perl5/Mail/SpamAssassin/Conf.pm:=item bayes_path
/path/filename  (default: ~/.spamassassin/bayes)

So, presumably, bayes_path is equating to ~/.spamassassin/bayes, or
in my case, /var/lib/amavis/.spamassassin.
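
To make the "path/filename" part concrete, here is a small sketch of how a bayes_path value appears to map to files on disk: it acts as a filename prefix, not a directory. The function name `bayes_files` is hypothetical. The `_toks` and `_seen` suffixes match the debug output seen elsewhere in this thread; `_journal` is an assumption based on the DBM store also keeping a journal file.

```shell
#!/bin/sh
# bayes_files PREFIX
# Print the DBM file names derived from a bayes_path value.
# bayes_path is treated as a filename prefix; suffixes are appended to it.
bayes_files() {
    prefix=$1
    for suffix in toks seen journal; do
        printf '%s_%s\n' "$prefix" "$suffix"
    done
}
```

So `bayes_files /var/lib/amavis/.spamassassin/bayes` lists files directly inside `.spamassassin`, whereas a value ending in `bayes/bayes` would place them in a `bayes` subdirectory that has to exist first.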

 Would doing so preclude me from creating training folders for
 individual IMAP users in the future?
 
 They're not related. Per-user ham and spam training folders don't
 preclude using those messages for training a global Bayes database.

Understood.

 You actually may want to implement a hybrid folder model: per-user ham
 training folders and a global spam training folder. Misclassified ham
 could potentially be private messages that the recipient doesn't want
 other users to see, but for misclassified spam who cares?

Right, that makes sense.

 Or can I train under the amavis user for now and then layer-on
 training for individual IMAP users in the future without undesirable
 consequences?
 
 As stated above, if you're not enabling per-user Bayes *databases*, the
 question is meaningless. Are you going to configure per-user Bayes
 databases? Or (as I suspect is more likely) perform global database
 training from individual users whose judgement you trust?
 

I suppose that I need to determine whether or not ISPConfig implements
per-user Bayes databases already. I'll report back for those who may be
curious.

Thanks again,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Axb


On 08/20/2012 06:42 PM, Ben Johnson wrote:



On 8/17/2012 11:28 AM, John Hardin wrote:

On Fri, 17 Aug 2012, Ben Johnson wrote:


On 8/16/2012 2:00 PM, Ben Johnson wrote:
Basically, I need to do something about the spam inundation, as soon as
possible.

Is there any reason that I should NOT be performing the sa-learn
training under the amavis user account?


In general, all training should be done as the user that SA (in your
case, SA via Amavis) is running as.


I have tried to do this, but to no avail:

---
# su amavis -c 'sa-learn --spam
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
archive-iterator: unable to open
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
---


~/Maildir/* assumes 1 file=1 mail

pls try

su amavis -c 'sa-learn --spam --progress --dir 
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'


or wherever the messages are stored

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Bowie Bailey


On 8/20/2012 12:46 PM, Axb wrote:

On 08/20/2012 06:42 PM, Ben Johnson wrote:


On 8/17/2012 11:28 AM, John Hardin wrote:

On Fri, 17 Aug 2012, Ben Johnson wrote:


On 8/16/2012 2:00 PM, Ben Johnson wrote:
Basically, I need to do something about the spam inundation, as soon as
possible.

Is there any reason that I should NOT be performing the sa-learn
training under the amavis user account?

In general, all training should be done as the user that SA (in your
case, SA via Amavis) is running as.

I have tried to do this, but to no avail:

---
# su amavis -c 'sa-learn --spam
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
archive-iterator: unable to open
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
---

~/Maildir/* assumes 1 file=1 mail

pls try

su amavis -c 'sa-learn --spam --progress --dir
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'

or wherever the message are stored


But first, you need access to the files.  The simplest way is probably 
to add the amavis user account to the group used by the mail directories.


Assuming the group is vmail, the command should look like this (on 
RedHat/CentOS):


$ usermod -a -G vmail amavis

This command will probably need to be run as root.  If you are using a 
different distro, you will need to look up the command to add the amavis 
user to the vmail group.


--
Bowie

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson



On 8/20/2012 12:56 PM, Bowie Bailey wrote:
 On 8/20/2012 12:46 PM, Axb wrote:
 On 08/20/2012 06:42 PM, Ben Johnson wrote:

 On 8/17/2012 11:28 AM, John Hardin wrote:
 On Fri, 17 Aug 2012, Ben Johnson wrote:

 On 8/16/2012 2:00 PM, Ben Johnson wrote:
 Basically, I need to do something about the spam inundation, as
 soon as
 possible.

 Is there any reason that I should NOT be performing the sa-learn
 training under the amavis user account?
 In general, all training should be done as the user that SA (in your
 case, SA via Amavis) is running as.
 I have tried to do this, but to no avail:

 ---
 # su amavis -c 'sa-learn --spam
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

 archive-iterator: no access to
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
 /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
 archive-iterator: no access to
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
 /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
 archive-iterator: unable to open
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
 ---
 ~/Maildir/* assumes 1 file=1 mail

 pls try

 su amavis -c 'sa-learn --spam --progress --dir
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'

 or wherever the message are stored
 
 But first, you need access to the files.  The simplest way is probably
 to add the amavis user account to the group used by the mail directories.
 
 Assuming the group is vmail, the command should look like this (on
 RedHat/CentOS):
 
 $ usermod -a -G vmail amavis

Thanks, guys. I did consider adding the amavis user to the vmail
group, but the default permissions on the directories within Maildir
are 700 (with vmail:vmail ownership).

So, I'd have to fiddle with the permissions on the entire directory
tree, for each user, which seems like a bad idea.
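
To illustrate why group membership alone does not help with mode 700, a small sketch that checks whether an octal mode gives the group read and execute on a directory. The function name `group_can_traverse` is made up for illustration, and it only looks at the mode digits, not at ACLs or parent directories.

```shell
#!/bin/sh
# group_can_traverse MODE
# Succeed if an octal mode such as 700 or 750 grants the group both
# read (4) and execute (1) permission, which a directory listing needs.
group_can_traverse() {
    mode=$1
    # The group digit is the second-to-last character of the mode.
    group_digit=$(printf '%s' "$mode" | sed 's/.*\(.\).$/\1/')
    [ $(( group_digit & 5 )) -eq 5 ]
}
```

With 700 the check fails and with 750 it succeeds, so adding amavis to the vmail group would only matter if every directory in the tree were also loosened to something like 750.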

Furthermore, ISPconfig handles the creation (and deletion) of these
directories, so I hesitate to change anything manually and muck-up the
installation.

While there may be a permissions mask that is applied, modifying it seems
risky.

I wonder what the rest of the Dovecot + Amavis + SA world is doing about
this. Maybe I should ask on the Amavis mailing list.

If anyone has other suggestions, by all means, please do share.

 This command will probably need to be run as root.  If you are using a
 different distro, you will need to look up the command to add the amavis
 user to the vmail group.
 

Much thanks,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson



On 8/20/2012 2:02 PM, Ben Johnson wrote:
 
 
 On 8/20/2012 12:56 PM, Bowie Bailey wrote:
 On 8/20/2012 12:46 PM, Axb wrote:
 On 08/20/2012 06:42 PM, Ben Johnson wrote:

 On 8/17/2012 11:28 AM, John Hardin wrote:
 On Fri, 17 Aug 2012, Ben Johnson wrote:

 On 8/16/2012 2:00 PM, Ben Johnson wrote:
 Basically, I need to do something about the spam inundation, as
 soon as
 possible.

 Is there any reason that I should NOT be performing the sa-learn
 training under the amavis user account?
 In general, all training should be done as the user that SA (in your
 case, SA via Amavis) is running as.
 I have tried to do this, but to no avail:

 ---
 # su amavis -c 'sa-learn --spam
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

 archive-iterator: no access to
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
 /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
 archive-iterator: no access to
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
 /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
 archive-iterator: unable to open
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
 ---
 ~/Maildir/* assumes 1 file=1 mail

 pls try

 su amavis -c 'sa-learn --spam --progress --dir
 /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'

 or wherever the messages are stored

 But first, you need access to the files.  The simplest way is probably
 to add the amavis user account to the group used by the mail directories.

 Assuming the group is vmail, the command should look like this (on
 RedHat/CentOS):

 $ usermod -a -G vmail amavis
 
 Thanks, guys. I did consider adding the amavis user to the vmail
 group, but the default permissions on the directories within Maildir
 are 700 (with vmail:vmail ownership).
 
 So, I'd have to fiddle with the permissions on the entire directory
 tree, for each user, which seems like a bad idea.
 
 Furthermore, ISPconfig handles the creation (and deletion) of these
 directories, so I hesitate to change anything manually and muck-up the
 installation.
 
 While there may be a permissions mask that is applied, modifying it seems
 risky.
 
 I wonder what the rest of the Dovecot + Amavis + SA world is doing about
 this. Maybe I should ask on the Amavis mailing list.
 
 If anyone has other suggestions, by all means, please do share.
 
 This command will probably need to be run as root.  If you are using a
 different distro, you will need to look up the command to add the amavis
 user to the vmail group.

 
 Much thanks,
 
 -Ben
 

I was able to resolve the issue by adding the --username switch to the
'sa-learn' executable:

# sa-learn --username=amavis --spam
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur

Thanks for all of the hints, folks!

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Axb


On 08/20/2012 08:02 PM, Ben Johnson wrote:

Furthermore, ISPconfig handles the creation (and deletion) of these
directories, so I hesitate to change anything manually and muck-up the
installation.

While there may be a permissions mask that is applied, modifying it seems
risky.


IDEA:

I have a small homebrew Python IMAP client that picks up messages from an
IMAP server and stores them in a regular directory.
At the moment it doesn't delete messages after pickup, but I could change
it so that it does.


You could run that as the amavis user and store to ~/spam-dump/*.eml
Permissions would then be OK for sa-learn to read the messages.

If you want it, please contact me off-list.

Axb

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread Ben Johnson

On 8/16/2012 2:00 PM, Ben Johnson wrote:
In any event, at this point, I'm confused as to which user account I
should be using when executing sa-learn --spam, for example.

As a bit of background, I'm using ISPConfig 3, which implements virtual
mailbox users via MySQL.

I dug through the mailing list archive and found
http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html
, which seems to be relevant.

Ultimately, I'm wondering if I should be using the amavis user to
learn ham/spam, or individual mailbox user accounts.

If it is possible to use either, are there pros and cons of which one
should be aware before settling on an approach?

As I mentioned previously, I would like to set-up LearnHam and
LearnSpam folders for each IMAP user, eventually, so perhaps this
answers my question?

Thanks again for all the help!

John Hardin, sorry to bust you up here... just curious whether or not
you saw the rest of my previous note. If you didn't address these
questions intentionally, then please ignore me. :)

Basically, I need to do something about the spam inundation, as soon as
possible.

Is there any reason that I should NOT be performing the sa-learn
training under the amavis user account? Would doing so preclude me
from creating training folders for individual IMAP users in the future?
Or can I train under the amavis user for now and then layer-on
training for individual IMAP users in the future without undesirable
consequences?

Thanks again,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread Bowie Bailey

On 8/17/2012 10:56 AM, Ben Johnson wrote:

On 8/16/2012 2:00 PM, Ben Johnson wrote:

In any event, at this point, I'm confused as to which user account I
should be using when executing sa-learn --spam, for example.

As a bit of background, I'm using ISPConfig 3, which implements virtual
mailbox users via MySQL.

I dug through the mailing list archive and found
http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html
, which seems to be relevant.

Ultimately, I'm wondering if I should be using the amavis user to
learn ham/spam, or individual mailbox user accounts.

If it is possible to use either, are there pros and cons of which one
should be aware before settling on an approach?

As I mentioned previously, I would like to set-up LearnHam and
LearnSpam folders for each IMAP user, eventually, so perhaps this
answers my question?

Thanks again for all the help!

John Hardin, sorry to bust you up here... just curious whether or not
you saw the rest of my previous note. If you didn't address these
questions intentionally, then please ignore me. :)

Basically, I need to do something about the spam inundation, as soon as
possible.

The quickest way I know of to reduce spam is to reject mail at the MTA
based on the zen.spamhaus.org blacklist. I have been using this for a
few years now. It blocks lots of spam and I haven't had any problems
with it.

You can also implement graylisting, although it will slow down mail
delivery from new senders, which may or may not be an issue for you. I
haven't tried it, but lots of people swear by it.

As for Bayes, Amavis uses a single user for spam scanning. This means
that Bayes will use a single database (under the amavis user). You
may be able to get individual databases via SQL, but I'm not sure about
that.

--
Bowie

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread John Hardin


On Fri, 17 Aug 2012, Ben Johnson wrote:


On 8/16/2012 2:00 PM, Ben Johnson wrote:
Basically, I need to do something about the spam inundation, as soon as
possible.

Is there any reason that I should NOT be performing the sa-learn
training under the amavis user account?


In general, all training should be done as the user that SA (in your case, 
SA via Amavis) is running as.


If you have your system configured for per-user Bayes databases, then 
you'd need to train as the user whose database you want to affect.


What is your bayes_path config?
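
One way to answer that question is to search the SpamAssassin config files for the setting. The following is a minimal sketch; the function name show_bayes_path is hypothetical, and the file locations in the usage note are assumptions that vary by distro and by how Amavis is set up.

```shell
#!/bin/sh
# Hypothetical helper: print every uncommented bayes_path setting found
# in the given SpamAssassin config files, prefixed with the file name.
show_bayes_path() {
    for cf in "$@"; do
        [ -r "$cf" ] || continue
        # Matches lines of the form: bayes_path /some/dir/bayes
        awk -v f="$cf" '$1 == "bayes_path" { print f ": " $2 }' "$cf"
    done
}
```

For example: `show_bayes_path /etc/spamassassin/local.cf /etc/mail/spamassassin/local.cf`. Note that bayes_path is a filename prefix, so a value of /var/lib/amavis/.spamassassin/bayes corresponds to the bayes_toks and bayes_seen files in that directory. No output suggests the per-user default (~/.spamassassin/bayes) is in effect.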

Would doing so preclude me from creating training folders for individual 
IMAP users in the future?


They're not related. Per-user ham and spam training folders don't 
preclude using those messages for training a global Bayes database.


You actually may want to implement a hybrid folder model: per-user ham 
training folders and a global spam training folder. Misclassified ham 
could potentially be private messages that the recipient doesn't want 
other users to see, but for misclassified spam who cares?



Or can I train under the amavis user for now and then layer-on
training for individual IMAP users in the future without undesirable
consequences?


As stated above, if you're not enabling per-user Bayes *databases*, the 
question is meaningless. Are you going to configure per-user Bayes 
databases? Or (as I suspect is more likely) perform global database 
training from individual users whose judgement you trust?


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ignorance is no excuse for a law.
---
 7 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread Bowie Bailey


On 8/17/2012 11:28 AM, John Hardin wrote:

On Fri, 17 Aug 2012, Ben Johnson wrote:


Would doing so preclude me from creating training folders for individual
IMAP users in the future?

They're not related. Per-user ham and spam training folders don't
preclude using those messages for training a global Bayes database.

You actually may want to implement a hybrid folder model: per-user ham
training folders and a global spam training folder. Misclassified ham
could potentially be private messages that the recipient doesn't want
other users to see, but for misclassified spam who cares?


I have individual Spam and Ham training folders.  Then a cronjob moves 
everything to the main Spam and Ham directories for learning on a 
regular basis.  The main directories are not related to the mail server, 
so there is no real privacy concern.
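
A collection job of this kind could look roughly like the sketch below. It is not Bowie's actual cron job; the function name, the Maildir++ folder naming (.LearnSpam, .LearnHam) and the paths in the usage note are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a collection job: move every message found in
# the per-user training folders into one central directory.
#   $1 = mail root, $2 = training folder name, $3 = destination directory
collect_training() {
    mkdir -p "$3" || return 1
    find "$1" -type f -path "*/Maildir/.$2/cur/*" -exec mv -- {} "$3"/ \;
}
```

A cron entry might then run `collect_training /var/vmail LearnSpam /srv/sa-train/spam` followed by `sa-learn --spam /srv/sa-train/spam` as the user SA runs under. Because the central directory is owned by that user, the permissions on the mail store no longer matter for training.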


--
Bowie

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread John Hardin


On Fri, 17 Aug 2012, Bowie Bailey wrote:


On 8/17/2012 10:56 AM, Ben Johnson wrote:

 Basically, I need to do something about the spam inundation, as soon as
 possible.


The quickest way I know of to reduce spam is to reject mail at the MTA based 
on the zen.spamhaus.org blacklist.  I have been using this for a few years 
now.  It blocks lots of spam and I haven't had any problems with it.


+1 for zen.spamhaus.org DNSBL at SMTP time.
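
For anyone unfamiliar with how a DNSBL lookup works: the IPv4 octets are reversed and the list's zone is appended, and the resulting name is resolved like any other hostname. A minimal sketch (the function name dnsbl_query_name is made up for this example):

```shell
#!/bin/sh
# Build the DNS name used to look up an IPv4 address in a DNSBL:
# reverse the four octets and append the list's zone.
dnsbl_query_name() {
    printf '%s\n' "$1" | awk -F. -v zone="$2" \
        'NF == 4 { print $4 "." $3 "." $2 "." $1 "." zone }'
}
```

A manual check could then be `dig +short "$(dnsbl_query_name 127.0.0.2 zen.spamhaus.org)"`, where 127.0.0.2 is the conventional DNSBL test address. An answer in 127.0.0.0/8 means "listed"; no answer means "not listed". In production the MTA performs this lookup itself at SMTP time.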

You can also implement graylisting, although it will slow down mail delivery 
from new senders, which may or may not be an issue for you.  I haven't tried 
it, but lots of people swear by it.


As for Greylisting, a lot of spam is least-effort one-shot no-retry 
delivery, but not all. It won't reduce spam that is sent via a proper 
MTA or via a spambot that does retry-until-successful. You can set a short 
delay period to block the one-attempt-gush spammers, or a longer delay 
period to give new spamvertised domain names a chance to appear in URIBLs 
for the spammers who retry. And, of course, you have to balance this 
against your users' expectations for delivery time, and perhaps do some 
education to set those expectations more realistically.


I use greylisting, with whitelists for regular correspondents.

There are some other MTA SMTP-time methods to pluck the low-hanging fruit:

Publishing an SPF record. There's anecdotal evidence that it cuts down on 
joe-job attempts.


Even if you publish an SPF record, you might want to explicitly reject 
From addresses in your domain if the message is received from the 
Internet. This can be done using SPF, but you may not be comfortable doing 
SMTP-time rejects based on SPF failures.


Something I have fairly good results with is rejecting mail from the 
Internet where the HELO is not a fully-qualified domain name.
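
As a rough illustration of what "not a fully-qualified domain name" means in practice, the sketch below applies a simple pattern test to a HELO string. It is only a heuristic written for this example, not the rule that milter-regex or any particular MTA applies.

```shell
#!/bin/sh
# Rough heuristic (an assumption, not any MTA's actual rule): treat a
# HELO name as fully qualified if it has two or more dot-separated
# labels of letters, digits and hyphens, and the last label starts
# with a letter (so bare hostnames and IP literals fail).
helo_looks_fqdn() {
    printf '%s\n' "$1" |
        grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z][A-Za-z0-9-]*$'
}
```

Here `helo_looks_fqdn mail.example.com` succeeds, while `helo_looks_fqdn localhost` and `helo_looks_fqdn 192.0.2.1` fail.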


Since my MTA is the only valid source for email from my domain, I also 
reject messages where the HELO is in my domain. You will, of course, have 
to carve out exceptions to this rule for valid outbound mail. On a 
multihomed MTA or an MTA where outbound mail is submitted via an SSL 
tunnel this is pretty easy.


For the above, if you have Sendmail I recommend milter-regex; my 
milter-regex.conf is available here:


  http://www.impsec.org/~jhardin/antispam/milter-regex.conf

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ignorance is no excuse for a law.
---
 7 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson



On 8/15/2012 4:05 PM, John Hardin wrote:
 On Wed, 15 Aug 2012, Ben Johnson wrote:
 
 On 8/15/2012 2:24 PM, John Hardin wrote:
 On Wed, 15 Aug 2012, Ben Johnson wrote:

 Some 99% of the spam that I receive, which is grossly spammy (we're
 talking auto loans, cash advances, dink pills, the whole lot) contains
 BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

 Might anyone know why?

 Poor training.

 John, I can't thank you enough for the thoroughness of your response.
 
 I like to show off. :)
 
 Apart from the Bayes score, what kind of scores are those spams getting?

 Here are a few examples (the first two of which are two of VERY few in
 which the BAYES_* value is over 00):

 -
 No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
 SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

 No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
 SPF_PASS=-0.001] autolearn=no

 No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
 RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

 No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
 RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
 URIBL_RHS_DOB=1.514] autolearn=no
 -
 
 It might be interesting to see some log entries where autolearn=yes...

Here you go:

No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham
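
As a sanity check on entries like these, the per-rule scores inside tests=[...] can be added up and compared with the reported score. A minimal sketch, assuming the log format shown above (the function name sum_tests is made up):

```shell
#!/bin/sh
# Read one Amavis/SA status entry on stdin (possibly wrapped over several
# lines) and print the sum of the rule scores listed in tests=[...].
sum_tests() {
    tr '\n' ' ' |
        sed -n 's/.*tests=\[\([^]]*\)\].*/\1/p' |
        tr ',' '\n' |
        awk -F= 'NF == 2 { total += $2 } END { printf "%.3f\n", total }'
}
```

Feeding it the first entry above gives -4.200, of which -2.3 comes from RCVD_IN_DNSWL_MED alone. SpamAssassin's autolearn decision uses its own score calculation, so this sum is only a rough guide to why a message was learned as ham.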

 It bears mention that the RCVD_IN_DNSWL_MED test is having even more
 of a negative impact (pardon the pun) than BAYES_*. I am already
 working with the dnswl.org folks (off-list, for privacy reasons) to
 get to the bottom of that issue.
 
 This might be a major contributing factor. If your system was taught
 from scratch by autolearn, and DNSWL (which is fairly well trusted) has
 been pushing a lot of spams to low scores...

It looks as though this is exactly what happened. I'll post back once
I've done some more troubleshooting with the folks at dnswl.org.

 You might want to set:
 bayes_auto_learn_threshold_nonspam -3

Done.

 That won't _fix_ the problem (at least not quickly) or avoid the need to
 wipe and retrain, but it might keep things from getting worse.

I disabled auto-learn and executed sa-learn --clear, too. So, I should
be starting with a clean slate, right?

I have also disabled the DNSWL rules, until the issue can be resolved,
and will begin manual training immediately.

 See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
 
 Most of the list is probably laughing, but given the complexity of Spam
 Assassin, this crucial requirement was lost on me, amidst the sea of
 information and instructions. For example, there is no mention of the
 fact that SA is essentially useless without Bayesian training on
 http://wiki.apache.org/spamassassin/StartUsing .
 
 That's because that shouldn't be the case. The base ruleset + URIBL
 should be very effective pretty much out-of-the-box.
 
 What version of SA is this?

 # spamassassin --version
 SpamAssassin version 3.3.1
  running on Perl version 5.10.1
 
 A little stale, but not bad.

'Tis the major drawback with using LTS Linux distros and managing
software via packages, I suppose.

 You may also want to set up some mechanism for users to submit
 misclassified messages for training. Depending on how much you trust
 their judgement the learning from these can be automatic or can go
 through you as a reviewer.

 That sounds like a good idea. Is there a particular HOW TO or tutorial
 that you recommend? If it depends on the environment/configuration, this
 server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.
 
 I'm not sure, I don't lurk the Wiki much. About the best I can suggest
 is search the SA users mailing list archives for training dovecot.
 

Thanks, I'll look into setting-up IMAP folders for individual users in
some programmatic way.

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson



On 8/16/2012 10:14 AM, Ben Johnson wrote:
 
 
 On 8/15/2012 4:05 PM, John Hardin wrote:
 On Wed, 15 Aug 2012, Ben Johnson wrote:

 On 8/15/2012 2:24 PM, John Hardin wrote:
 On Wed, 15 Aug 2012, Ben Johnson wrote:

 Some 99% of the spam that I receive, which is grossly spammy (we're
 talking auto loans, cash advances, dink pills, the whole lot) contains
 BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

 Might anyone know why?

 Poor training.

 John, I can't thank you enough for the thoroughness of your response.

 I like to show off. :)

 Apart from the Bayes score, what kind of scores are those spams getting?

 Here are a few examples (the first two of which are two of VERY few in
 which the BAYES_* value is over 00):

 -
 No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
 SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

 No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
 SPF_PASS=-0.001] autolearn=no

 No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
 RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

 No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
 RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
 URIBL_RHS_DOB=1.514] autolearn=no
 -

 It might be interesting to see some log entries where autolearn=yes...
 
 Here you go:
 
 No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham
 
 No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
 SPF_PASS=-0.001] autolearn=ham
 
 No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
 URIBL_DBL_SPAM=1.7] autolearn=ham
 
 No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
 SPF_PASS=-0.001] autolearn=ham
 
 It bears mention that the RCVD_IN_DNSWL_MED test is having even more
 of a negative impact (pardon the pun) than BAYES_*. I am already
 working with the dnswl.org folks (off-list, for privacy reasons) to
 get to the bottom of that issue.

 This might be a major contributing factor. If your system was taught
 from scratch by autolearn, and DNSWL (which is fairly well trusted) has
 been pushing a lot of spams to low scores...
 
 It looks as though this is exactly what happened. I'll post back once
 I've done some more troubleshooting with the folks at dnswl.org.
 
 You might want to set:
 bayes_auto_learn_threshold_nonspam -3
 
 Done.
 
 That won't _fix_ the problem (at least not quickly) or avoid the need to
 wipe and retrain, but it might keep things from getting worse.
 
 I disabled auto-learn and executed sa-learn --clear, too. So, I should
 be starting with a clean slate, right?
 
 I have also disabled the DNSWL rules, until the issue can be resolved,
 and will begin manual training immediately.
 
 See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.

 Most of the list is probably laughing, but given the complexity of Spam
 Assassin, this crucial requirement was lost on me, amidst the sea of
 information and instructions. For example, there is no mention of the
 fact that SA is essentially useless without Bayesian training on
 http://wiki.apache.org/spamassassin/StartUsing .

 That's because that shouldn't be the case. The base ruleset + URIBL
 should be very effective pretty much out-of-the-box.

 What version of SA is this?

 # spamassassin --version
 SpamAssassin version 3.3.1
  running on Perl version 5.10.1

 A little stale, but not bad.
 
 'Tis the major drawback with using LTS Linux distros and managing
 software via packages, I suppose.
 
 You may also want to set up some mechanism for users to submit
 misclassified messages for training. Depending on how much you trust
 their judgement the learning from these can be automatic or can go
 through you as a reviewer.

 That sounds like a good idea. Is there a particular HOW TO or tutorial
 that you recommend? If it depends on the environment/configuration, this
 server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.

 I'm not sure, I don't lurk the Wiki much. About the best I can suggest
 is search the SA users mailing list archives for training dovecot.

 
 Thanks, I'll look into setting-up IMAP folders for individual users in
 some programmatic way.
 
 -Ben
 

So, after disabling auto-learn (for now) and executing sa-learn
--clear, and restarting Amavis, I'm still seeing this:

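No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled

Why BAYES_00 still? Am I running the wrong command to clear the database?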

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread John Hardin


On Thu, 16 Aug 2012, Ben Johnson wrote:


So, after disabling auto-learn (for now) and executing sa-learn
--clear, and restarting Amavis, I'm still seeing this:

No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled

Why BAYES_00 still? Am I running the wrong command to clear the database?


That's the correct command. Be sure that you're running it as the same user that 
amavis+SA is running as, otherwise you're clearing the wrong files.


You might want to run sa-learn --dump magic afterwards to verify the 
database is cleared.
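
To confirm which files a given sa-learn invocation actually touches, the debug output can be filtered for the database paths. This is a minimal sketch that relies on the debug message wording seen elsewhere in this thread ("tie-ing to DB file" and "no dbs present, cannot tie DB"); other SA versions may word these differently, and the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn -D ...` debug output on stdin and
# print the Bayes database files it reports opening (or failing to open).
bayes_db_files() {
    sed -n \
        -e 's/.*bayes: tie-ing to DB file R\/[OW] *//p' \
        -e 's/.*bayes: no dbs present, cannot tie DB R\/[OW]: *//p' |
        sort -u
}
```

Usage, with debug output redirected from stderr: `su amavis -c 'sa-learn -D --dump magic' 2>&1 | bayes_db_files`. If the paths printed are not under the home directory of the user SA runs as, the command is operating on the wrong database.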


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...the good of having the government prohibited from doing harm
  far outweighs the harm of having it obstructed from doing good.
   -- Mike@mike-istan
---
 8 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson



On 8/16/2012 11:38 AM, John Hardin wrote:
 On Thu, 16 Aug 2012, Ben Johnson wrote:
 
 So, after disabling auto-learn (for now) and executing sa-learn
 --clear, and restarting Amavis, I'm still seeing this:

 No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
 URIBL_DBL_SPAM=1.7] autolearn=disabled

 Why BAYES_00 still? Am I running the wrong command to clear the database?
 
 That's correct. Be sure that you're running it as the same user that
 amavis+SA is running as, otherwise you're clearing the wrong files.
 
 You might want to run sa-learn --dump magic afterwards to verify the
 database is cleared.
 

John,

You were exactly right; I forgot to execute sa-learn --clear as the
amavis user.

What is the expected output of sa-learn --dump magic once the database
has been cleared successfully?

# su amavis -c 'sa-learn --dump magic'

ERROR: Bayes dump returned an error, please re-run with -D for more
information

# su amavis -c 'sa-learn -D --dump magic'

[...]
dbg: bayes: no dbs present, cannot tie DB R/O:
/var/lib/amavis/.spamassassin/bayes_toks
[...]

Is this to be expected? Or did I muck-up the works?

Thanks again,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Alex

Hi,

 What will probably end up happening is this:
 (1) wipe your Bayes database
 (2) turn off autolearn
 (3) collect several hundred hams and spams for an initial training corpus
 (4) train using that corpus
 (5) evaluate results

 Depending on your mail volume, once Bayes is working well after manual
 training, you may then want to reenable autolearn; I personally suggest
 it only where the volume is high enough and/or the character of mail is
 varied enough to prohibit manual training. You might also want to adjust
 the autolearn thresholds.
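
The command-line part of the steps quoted above can be sketched as a dry run that only prints what would be executed. The function name and directory arguments are hypothetical. Step (2) is a config change rather than a command (bayes_auto_learn 0 in the SpamAssassin config), so it is not included.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for a wipe and a
# manual retrain. $1 = account SA runs as, $2 = ham dir, $3 = spam dir.
print_retrain_plan() {
    printf "su %s -c 'sa-learn --clear'\n" "$1"
    printf "su %s -c 'sa-learn --ham %s'\n" "$1" "$2"
    printf "su %s -c 'sa-learn --spam %s'\n" "$1" "$3"
    printf "su %s -c 'sa-learn --dump magic'\n" "$1"
}
```

For example, `print_retrain_plan amavis /srv/sa-train/ham /srv/sa-train/spam` lists the four commands in order, ending with the dump that shows the new nspam and nham counts.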

What effect do whitelist entries have on autolearning and scores?

In other words, my whitelist_from_rcvd entries add -100 to the score,
which would be way beyond the -3 I have required for autolearn.

I asked this question some years ago, but thought it was worthwhile to
involve this factor again, and just make sure my understanding was
correct.

Thanks,
Alex

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread John Hardin


On Thu, 16 Aug 2012, Ben Johnson wrote:


On 8/16/2012 11:38 AM, John Hardin wrote:

On Thu, 16 Aug 2012, Ben Johnson wrote:


So, after disabling auto-learn (for now) and executing sa-learn
--clear, and restarting Amavis, I'm still seeing this:

No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=disabled

Why BAYES_00 still? Am I running the wrong command to clear the database?


That's correct. Be sure that you're running it as the same user that
amavis+SA is running as, otherwise you're clearing the wrong files.

You might want to run sa-learn --dump magic afterwards to verify the
database is cleared.


John,

You were exactly right; I forgot to execute sa-learn --clear as the
amavis user.

What is the expected output of sa-learn --dump magic once the database
has been cleared successfully?

# su amavis -c 'sa-learn --dump magic'

ERROR: Bayes dump returned an error, please re-run with -D for more
information

# su amavis -c 'sa-learn -D --dump magic'

[...]
dbg: bayes: no dbs present, cannot tie DB R/O:
/var/lib/amavis/.spamassassin/bayes_toks
[...]

Is this to be expected? Or did I muck-up the works?


Heh. I was expecting zeroes, but "no dbs present" is also a good 
confirmation that the Bayes database has been reset... :)


You might need to restart amavis now, too.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The United States has become a place where entertainers and
  professional athletes are mistaken for people of importance.
-- Maureen Johnson Smith Long
---
 8 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson



On 8/16/2012 12:32 PM, John Hardin wrote:
 On Thu, 16 Aug 2012, Ben Johnson wrote:
 
 On 8/16/2012 11:38 AM, John Hardin wrote:
 On Thu, 16 Aug 2012, Ben Johnson wrote:

 So, after disabling auto-learn (for now) and executing sa-learn
 --clear, and restarting Amavis, I'm still seeing this:

 No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
 HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
 URIBL_DBL_SPAM=1.7] autolearn=disabled

 Why BAYES_00 still? Am I running the wrong command to clear the
 database?

 That's correct. Be sure that you're running it as the same user that
 amavis+SA is running as, otherwise you're clearing the wrong files.

 You might want to run sa-learn --dump magic afterwards to verify the
 database is cleared.

 John,

 You were exactly right; I forgot to execute sa-learn --clear as the
 amavis user.

 What is the expected output of sa-learn --dump magic once the database
 has been cleared successfully?

 # su amavis -c 'sa-learn --dump magic'

 ERROR: Bayes dump returned an error, please re-run with -D for more
 information

 # su amavis -c 'sa-learn -D --dump magic'

 [...]
 dbg: bayes: no dbs present, cannot tie DB R/O:
 /var/lib/amavis/.spamassassin/bayes_toks
 [...]

 Is this to be expected? Or did I muck-up the works?
 
 Heh. I was expecting zeroes, but no dbs present is also a good
 confirmation that the Bayes database has been reset... :)
 
 You might need to restart amavis now, too.
 

So, I preemptively restarted Amavis, per your suggestion (without
executing su amavis -c 'sa-learn -D --dump magic' first), and when I
executed the aforementioned command after the restart, I received the
expected output:

# su amavis -c 'sa-learn --dump magic'
0.000  0  3  0  non-token data: bayes db version
0.000  0  0  0  non-token data: nspam
0.000  0  0  0  non-token data: nham
0.000  0  0  0  non-token data: ntokens
0.000  0  0  0  non-token data: oldest atime
0.000  0  0  0  non-token data: newest atime
0.000  0  0  0  non-token data: last journal sync atime
0.000  0  0  0  non-token data: last expiry atime
0.000  0  0  0  non-token data: last expire atime delta
0.000  0  0  0  non-token data: last expire reduction count

All looks well. (I'm performing these actions in a test/development
environment, by the way.)
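
The nspam and nham rows are the ones to watch during retraining, since SpamAssassin does not use Bayes at all until both reach a minimum (200 each by default, via bayes_min_spam_num and bayes_min_ham_num). A minimal sketch for reading them out of the dump; the function name and the status words are made up for this example.

```shell
#!/bin/sh
# Read `sa-learn --dump magic` output on stdin and report the learned
# spam/ham counts (third column of the nspam and nham rows).
#   $1 = minimum required of each; defaults to 200, the SA default.
bayes_counts() {
    awk -v min="${1:-200}" '
        /non-token data: nspam/ { nspam = $3 + 0 }
        /non-token data: nham/  { nham  = $3 + 0 }
        END {
            status = (nspam >= min + 0 && nham >= min + 0) ? "enough" : "too-few"
            printf "nspam=%d nham=%d %s\n", nspam, nham, status
        }'
}
```

Usage: `su amavis -c 'sa-learn --dump magic' | bayes_counts`. Right after a reset this reports zero for both counts and "too-few", which is the expected starting point for manual training.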

So, I went to follow the same procedure in production:

# su amavis -c 'sa-learn --clear'

# service amavis restart

# su amavis -c 'sa-learn -D --dump magic'

Yet this yields that familiar message:

ERROR: Bayes dump returned an error, please re-run with -D for more
information

I waited a little while (at least an hour) and tried again. Same thing.
I restarted Amavis again, same thing.

A few minutes later, I decided to give it one last shot, and sure
enough, I received the expected output with all zeros.

It may be academic at this point, but I'm now curious as to what causes
the DB file to be recreated, if not restarting Amavis. (It bears mention
that plenty of mail came in between using the --clear switch and when
using the --dump switch began to produce valid [all-zero] output. In
other words, the DB didn't seem to be recreated when the first message
was received after clearing the old DB and restarting Amavis.)

In any event, at this point, I'm confused as to which user account I
should be using when executing sa-learn --spam, for example.

As a bit of background, I'm using ISPConfig 3, which implements virtual
mailbox users via MySQL.

I dug through the mailing list archive and found
http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html
, which seems to be relevant.

Ultimately, I'm wondering if I should be using the amavis user to
learn ham/spam, or individual mailbox user accounts.

If it is possible to use either, are there pros and cons of which one
should be aware before settling on an approach?

As I mentioned previously, I would like to set-up LearnHam and
LearnSpam folders for each IMAP user, eventually, so perhaps this
answers my question?

Thanks again for all the help!

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread John Hardin


On Thu, 16 Aug 2012, Ben Johnson wrote:


It may be academic at this point, but I'm now curious as to what causes
the DB file to be recreated, if not restarting Amavis. (It bears mention
that plenty of mail came in between using the --clear switch and when
using the --dump switch began to produce valid [all-zero] output. In
other words, the DB didn't seem to be recreated when the first message
was received after clearing the old DB and restarting Amavis.)


That I couldn't say, as I have no experience with Amavis. Somebody else 
may chime in, or you could ask that on the Amavis list. There might be an 
Amavis FAQ on how to properly reset the bayes database when using Amavis.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #20: The faster you finish the fight,
  the less shot you will get.
---
 8 days until the 1933rd anniversary of the destruction of Pompeii

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread RW

On Thu, 16 Aug 2012 12:18:44 -0400
Alex wrote:

 Hi,
 
  What will probably end up happening is this:
  (1) wipe your Bayes database
  (2) turn off autolearn
  (3) collect several hundred hams and spams for an initial training
  corpus
  (4) train using that corpus
  (5) evaluate results
 
  Depending on your mail volume, once Bayes is working well after
  manual training, you may then want to reenable autolearn; I
  personally suggest it only where the volume is high enough and/or
  the character of mail is varied enough to prohibit manual
  training. You might also want to adjust the autolearn thresholds.
 
 What effect do whitelist entries have on autolearning?

 None at all because they are marked as userconf.

 In other words, my whitelist_from_rcvd entries add -100 to the score,
 which would be way beyond the -3 I have required for autolearn.

Setting a threshold of -3 is a bad idea unless you are going to write a
lot of local rules with negative scores. The OP would be much better
off zeroing the scores of the offending DNSWL rules.

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread John Hardin


On Thu, 16 Aug 2012, RW wrote:


On Thu, 16 Aug 2012 12:18:44 -0400
Alex wrote:


What effect do whitelist entries have on autolearning?


None at all because they are marked as userconf.


bummer.


In other words, my whitelist_from_rcvd entries add -100 to the score,
which would be way beyond the -3 I have required for autolearn.


Setting a threshold of -3 is a bad idea unless you are going to write a
lot of local rules with negative scores. The OP would be much better
off zeroing the scores of the offending DNSWL rules.


Then we get to the situation where the administrator has to know to do
that, or SA goes off the rails.


It seems that the proper approach is to set tflags noautolearn on any 
DNS-based base rule that has a negative score...



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Jari Fredriksson

On 15.08.2012 20:36, Ben Johnson wrote:
 Hello,

 Some 99% of the spam that I receive, which is grossly spammy (we're
 talking auto loans, cash advances, dink pills, the whole lot) contains
 BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

 Might anyone know why? This is a stock installation (Ubuntu package on
 10.04).

 local.cf contains

 #   Bayesian classifier auto-learning (default: 1)
 #
 # bayes_auto_learn 1

 and I have not overridden the default elsewhere. So, presumably,
 auto-learning is enabled (if that's even relevant).

 While I have not trained the Bayesian filter manually to date, how is it
 that the spammiest of the spam is being classified with BAYES_00
 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the
 message is almost certainly not spam?
How could the Bayes classifier know that it is spammy, if no one has made
it learn what spam looks like?

Start training it now.


 Others have run into this same problem, but I see no resolution; here is
 one such example:

 http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/

 Outside of the above forum post, search query results for this issue are
 scant.

 Thanks for any help,

 -Ben



-- 

Never thought the space i Program Files would be a problem in Linux

Husse Apr 9 2007





Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Ben Johnson wrote:


Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot) contains
BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

Might anyone know why?


Poor training.

Apart from the Bayes score, what kind of scores are those spams 
getting?



While I have not trained the Bayesian filter manually to date,


Is there any provision for any manual training in your environment? Have 
you set up training folders where your users can submit messages for 
training? Do you run sa-learn at all?


how is it that the spammiest of the spam is being classified with 
BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that 
the message is almost certainly not spam?


BAYES_00 implies that the message in question looks very similar to 
messages the Bayes system has been told are not spam. It depends solely on 
how it has been trained.


I wasn't aware that autolearning could do a cold-start of Bayes, can 
anyone confirm whether this is the case?


If it can't then someone somewhere trained bayes up to the default minimum 
200 hams and 200 spams needed for it to start classifying.


Before we offer suggestions, some more data from you please:

What version of SA is this?

What does sa-learn --dump magic report about your current Bayes 
database?


What are all of the bayes_* configuration options in your local config?


What will probably end up happening is this:
(1) wipe your Bayes database
(2) turn off autolearn
(3) collect several hundred hams and spams for an initial training corpus
(4) train using that corpus
(5) evaluate results

Depending on your mail volume, once Bayes is working well after manual 
training, you may then want to reenable autolearn; I personally suggest it 
only where the volume is high enough and/or the character of mail is 
varied enough to prohibit manual training. You might also want to adjust 
the autolearn thresholds.


You may also want to set up some mechanism for users to submit 
misclassified messages for training. Depending on how much you trust their 
judgement the learning from these can be automatic or can go through you 
as a reviewer.


Recommendation: keep your manual training corpus around in case you need 
to do the above again for some reason.
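
The wipe-and-retrain steps above could be scripted roughly as follows. This is only a sketch: the corpus directories are hypothetical, step (2) is a local.cf setting (bayes_auto_learn 0) and is not handled here, and the commands have to run as whichever user owns the Bayes database that the scanner actually reads (on an Amavis setup, presumably the amavis user). Only sa-learn options already used in this thread or listed in its man page (--clear, --ham, --spam, --dump magic) appear.

```python
import subprocess


def training_plan(ham_dir, spam_dir):
    """Return the sa-learn command lines for a wipe-and-retrain cycle."""
    return [
        ["sa-learn", "--clear"],           # (1) wipe the Bayes database
        ["sa-learn", "--ham", ham_dir],    # (4) train on hand-sorted ham
        ["sa-learn", "--spam", spam_dir],  # (4) train on hand-sorted spam
        ["sa-learn", "--dump", "magic"],   # (5) check the nspam/nham counters
    ]


def run_plan(plan):
    """Run each command in order, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical corpus locations; print the plan instead of running it.
    for argv in training_plan("/srv/corpus/ham", "/srv/corpus/spam"):
        print(" ".join(argv))
```

Printing the plan first makes it easy to review before calling run_plan() under the right account.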



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Jari Fredriksson wrote:


On 15.08.2012 20:36, Ben Johnson wrote:

While I have not trained the Bayesian filter manually to date, how is it
that the spammiest of the spam is being classified with BAYES_00
(thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the
message is almost certainly not spam?


How could the Bayes classifier know that it is spammy, if no one has made
it learn what spam looks like?

Start training it now.


If he's getting BAYES_00 hits _something_ has trained it.


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Jeff Mincy

   From: Ben Johnson b...@indietorrent.org
   Date: Wed, 15 Aug 2012 13:36:08 -0400

   Some 99% of the spam that I receive, which is grossly spammy (we're
   talking auto loans, cash advances, dink pills, the whole lot) contains
   BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

   Might anyone know why? This is a stock installation (Ubuntu package on
   10.04).

Most likely you've let autolearn learn a large number of spam messages
as ham.  Any autolearn mistakes need to be corrected.

One or two spam messages with BAYES_00 is not a problem, but a large
number of them indicates a serious problem with learning.   If you
have the old spam messages then you can retrain correctly.  Otherwise
it would probably be best to start over by deleting the bayes database.

   local.cf contains

   #   Bayesian classifier auto-learning (default: 1)
   #
   # bayes_auto_learn 1

   and I have not overridden the default elsewhere. So, presumably,
   auto-learning is enabled (if that's even relevant).

   While I have not trained the Bayesian filter manually to date, how is it
   that the spammiest of the spam is being classified with BAYES_00
   (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the
   message is almost certainly not spam?

Yes, BAYES_00 says the spam probability is between 0 and 1%.
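
To make the bucket names concrete, here is a small sketch that maps a Bayes probability to a BAYES_nn rule name. The bucket boundaries are an assumption written from memory of the stock 3.3 rules, not something stated in this thread, so check them against the Bayes rule file shipped with your installation.

```python
# Exclusive upper bound of each bucket and the rule it maps to.
# Assumed boundaries; verify against the installed rule set.
BUCKETS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
]


def bayes_rule(probability):
    """Return the BAYES_nn rule name for a spam probability in [0, 1]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, name in BUCKETS:
        if probability < upper:
            return name
    return "BAYES_99"


print(bayes_rule(0.005))  # a message Bayes considers almost certainly ham
```

A spam that lands in the lowest bucket therefore means the classifier is confident in the wrong direction, which points at what it was trained on rather than at the rule itself.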

   http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/

   Outside of the above forum post, search query results for this issue are
   scant.

There have been numerous posts on BAYES.

-jeff

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Ben Johnson

On 8/15/2012 2:24 PM, John Hardin wrote:
 On Wed, 15 Aug 2012, Ben Johnson wrote:
 
 Some 99% of the spam that I receive, which is grossly spammy (we're
 talking auto loans, cash advances, dink pills, the whole lot) contains
 BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

 Might anyone know why?
 
 Poor training.

John, I can't thank you enough for the thoroughness of your response.

 Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two of which are two of VERY few in
which the BAYES_* value is over 00):

-
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of
a negative impact (pardon the pun) than BAYES_*. I am already working
with the dnswl.org folks (off-list, for privacy reasons) to get to the
bottom of that issue.
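
To see how much each rule contributes, the tests=[...] list can be summed with and without the suspect rules. The sketch below uses the third status line quoted above; the helper names are made up for illustration. Dropping a rule from the sum is only an approximation, since SpamAssassin uses a different score set when Bayes is disabled.

```python
import re


def parse_tests(status):
    """Return {rule_name: score} from the tests=[...] part of a status line."""
    match = re.search(r"tests=\[(.*?)\]", status, re.DOTALL)
    if not match:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        name, _, value = item.strip().partition("=")
        tests[name] = float(value)
    return tests


def total_without(tests, *prefixes):
    """Sum the scores, skipping rules whose names start with a given prefix."""
    return round(sum(score for name, score in tests.items()
                     if not name.startswith(prefixes)), 3)


STATUS = """No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no"""

tests = parse_tests(STATUS)
print(total_without(tests))                             # every rule counted
print(total_without(tests, "RCVD_IN_DNSWL"))            # DNSWL ignored
print(total_without(tests, "RCVD_IN_DNSWL", "BAYES_"))  # DNSWL and Bayes ignored
```

For this message the three sums are -0.836, 1.464 and 3.364, so without the two negative rules it would have crossed the tag2=3 level shown in the header.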

 While I have not trained the Bayesian filter manually to date,
 
 Is there any provision for any manual training in your environment? Have
 you set up training folders where your users can submit messages for
 training? Do you run sa-learn at all?

No, there is no provision. No, I have not set up training folders, and
no, I have not run sa-learn manually at all.

Most of the list is probably laughing, but given the complexity of Spam
Assassin, this crucial requirement was lost on me, amidst the sea of
information and instructions. For example, there is no mention of the
fact that SA is essentially useless without Bayesian training on
http://wiki.apache.org/spamassassin/StartUsing .

 how is it that the spammiest of the spam is being classified with
 BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply
 that the message is almost certainly not spam?
 
 BAYES_00 implies that the message in question looks very similar to
 messages the Bayes system has been told are not spam. It depends solely
 on how it has been trained.
 
 I wasn't aware that autolearning could do a cold-start of Bayes, can
 anyone confirm whether this is the case?
 
 If it can't then someone somewhere trained bayes up to the default
 minimum 200 hams and 200 spams needed for it to start classifying.
 
 Before we offer suggestions, some more data from you please:
 
 What version of SA is this?

# spamassassin --version
SpamAssassin version 3.3.1
  running on Perl version 5.10.1

 What does sa-learn --dump magic report about your current Bayes database?

# sa-learn --dump magic
ERROR: Bayes dump returned an error, please re-run with -D for more
information

# su amavis -c 'sa-learn --dump magic'

# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
0.000          0 1344331893          0  non-token data: oldest atime
0.000          0 1345056746          0  non-token data: newest atime
0.000          0 1345053771          0  non-token data: last journal sync atime
0.000          0 1345023550          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0       6482          0  non-token data: last expire reduction count
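
For anyone who wants to check these counters from a script, here is a rough sketch that pulls nspam and nham out of the dump and compares them with the 200/200 minimum mentioned earlier in the thread. The function names are made up for illustration, and the parser assumes each counter sits on a single unwrapped line.

```python
def magic_counters(dump):
    """Extract the non-token counters from `sa-learn --dump magic` output."""
    counters = {}
    for line in dump.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith("non-token data:"):
            name = fields[4][len("non-token data:"):].strip()
            counters[name] = int(fields[2])
    return counters


def bayes_active(counters, minimum=200):
    """True once both the ham and spam counts have reached the minimum."""
    return (counters.get("nspam", 0) >= minimum
            and counters.get("nham", 0) >= minimum)


DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
"""

counters = magic_counters(DUMP)
print(counters["nspam"], counters["nham"], bayes_active(counters))
```

With no manual training at all, counts this large can only have come from autolearn, which fits the diagnosis given elsewhere in the thread.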

 What are all of the bayes_* configuration options in your local config?

None are defined there. There are a few defaults/examples, but they are
commented-out.

 
 What will probably end up happening is this:
 (1) wipe your Bayes database
 (2) turn off autolearn
 (3) collect several hundred hams and spams for an initial training corpus
 (4) train using that corpus
 (5) evaluate results
 
 Depending on your mail volume, once Bayes is working well after manual
 training, you may then want to reenable autolearn; I personally suggest
 it only where the volume is high enough and/or the character of mail is
 varied enough to prohibit manual training. You might also want to adjust
 the autolearn thresholds.

That makes sense; thank you for the suggestion.

 You may also want to set up some mechanism for users to submit
 misclassified messages for training. Depending on how much you trust
 their judgement the

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Ben Johnson wrote:


On 8/15/2012 2:24 PM, John Hardin wrote:

On Wed, 15 Aug 2012, Ben Johnson wrote:


Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot) contains
BAYES_00=-1.9 in the tests portion of the X-Spam-Status header.

Might anyone know why?


Poor training.


John, I can't thank you enough for the thoroughness of your response.


I like to show off. :)


Apart from the Bayes score, what kind of scores are those spams getting?


Here are a few examples (the first two of which are two of VERY few in
which the BAYES_* value is over 00):

-
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-


It might be interesting to see some log entries where autolearn=yes...

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of 
a negative impact (pardon the pun) than BAYES_*. I am already working 
with the dnswl.org folks (off-list, for privacy reasons) to get to the 
bottom of that issue.


This might be a major contributing factor. If your system was taught from 
scratch by autolearn, and DNSWL (which is fairly well trusted) has been 
pushing a lot of spams to low scores...


You might want to set:
bayes_auto_learn_threshold_nonspam -3

That won't _fix_ the problem (at least not quickly) or avoid the need to 
wipe and retrain, but it might keep things from getting worse.


See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.


Most of the list is probably laughing, but given the complexity of Spam
Assassin, this crucial requirement was lost on me, amidst the sea of
information and instructions. For example, there is no mention of the
fact that SA is essentially useless without Bayesian training on
http://wiki.apache.org/spamassassin/StartUsing .


That's because that shouldn't be the case. The base ruleset + URIBL should 
be very effective pretty much out-of-the-box.



What version of SA is this?


# spamassassin --version
SpamAssassin version 3.3.1
 running on Perl version 5.10.1


A little stale, but not bad.


You may also want to set up some mechanism for users to submit
misclassified messages for training. Depending on how much you trust
their judgement the learning from these can be automatic or can go
through you as a reviewer.


That sounds like a good idea. Is there a particular HOW TO or tutorial
that you recommend? If it depends on the environment/configuration, this
server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.


I'm not sure, I don't lurk the Wiki much. About the best I can suggest is to
search the SA users mailing list archives for "training dovecot".



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Kris Deugau

John Hardin wrote:
 I wasn't aware that autolearning could do a cold-start of Bayes, can
 anyone confirm whether this is the case?

If you let it run long enough to pass the 200/200 ham/spam thresholds,
yes;  there's no distinction I've ever met about where the learning came
from.

That said, I wouldn't trust a pure autolearn setup with stock autolearn
thresholds - all too much spam will get learned scoring under 0.1.  :(

-kgd

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Ben Johnson

On 8/15/2012 4:19 PM, Kris Deugau wrote:
 John Hardin wrote:
 I wasn't aware that autolearning could do a cold-start of Bayes, can
 anyone confirm whether this is the case?
 
 If you let it run long enough to pass the 200/200 ham/spam thresholds,
 yes;  there's no distinction I've ever met about where the learning came
 from.
 
 That said, I wouldn't trust a pure autolearn setup with stock autolearn
 thresholds - all too much spam will get learned scoring under 0.1.  :(
 
 -kgd
 

It's a bit disappointing to learn this (pardon the pun), given:

a.) This exchange between John Hardin and me, which occurred previously
in this thread:

---8--

Me:

 Most of the list is probably laughing, but given the complexity of Spam
 Assassin, this crucial requirement was lost on me, amidst the sea of
 information and instructions. For example, there is no mention of the
 fact that SA is essentially useless without Bayesian training on
 http://wiki.apache.org/spamassassin/StartUsing .

John:

That's because that shouldn't be the case. The base ruleset + URIBL
should be very effective pretty much out-of-the-box.

---8--

b.) The default value for bayes_auto_learn is 1 (on). (At least in my
particular distribution.)

Correct me if I'm wrong, but this issue's root cause seems to be that
bayes_auto_learn was on, out-of-the-box, yet I was not complementing its
efficacy via sa-learn.

Is this an accurate summary? Because if so, it seems prudent to change
the default bayes_auto_learn value to zero, and scorn any package
maintainer or developer who modifies it, or, alternatively, put a
banner, at font-size 100em, on the SpamAssassin homepage that issues an
unmistakable warning about Bayesian training's importance.

(John, I'll respond to your most recent message tomorrow most likely;
had enough for one day!)

Thank you,

-Ben

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Kris Deugau wrote:


John Hardin wrote:

I wasn't aware that autolearning could do a cold-start of Bayes, can
anyone confirm whether this is the case?


If you let it run long enough to pass the 200/200 ham/spam thresholds,
yes;  there's no distinction I've ever met about where the learning came
from.

That said, I wouldn't trust a pure autolearn setup with stock autolearn
thresholds - all too much spam will get learned scoring under 0.1.  :(


Right. It might be prudent to review the defaults before the next major 
release.



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Kevin A. McGrail


On 8/15/2012 5:00 PM, John Hardin wrote:


Right. It might be prudent to review the defaults before the next 
major release. 
I wonder if we shouldn't disable auto-learning by default (assuming it's 
on by default)...


Bayes should really be trained.

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Kevin A. McGrail wrote:


On 8/15/2012 5:00 PM, John Hardin wrote:


 Right. It might be prudent to review the defaults before the next major
 release. 


I wonder if we shouldn't disable auto-learning by default (assuming it's on 
by default)...


It is.


Bayes should really be trained.


I might not go so far as to say autolearn should be disabled by default, 
as it is a major good if well trained; but setting the defaults extreme 
enough that it is reliably, if slowly, initially trained seems to me a 
fair middle ground. Setting the ham default threshold to -3 or even -5 
seems prudent (_much_ better than the current 0.1), then someone who 
actually wants to configure it can adjust based on how well it's 
performing and whether they want autolearn on at all.



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, John Hardin wrote:

I might not go so far as to say autolearn should be disabled by default, 
as it is a major good if well trained;


Sorry, poor wording, I meant to say as _Bayes_ is a major good if well 
trained.



Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Kevin A. McGrail


  
  
On 8/15/2012 5:18 PM, John Hardin wrote:

On Wed, 15 Aug 2012, Kevin A. McGrail wrote:

  On 8/15/2012 5:00 PM, John Hardin wrote:

  Right. It might be prudent to review the defaults before the next
  major release.

I wonder if we shouldn't disable auto-learning by default
(assuming it's on by default)...

  It is.

  Bayes should really be trained.

  I might not go so far as to say autolearn should be disabled by
  default, as it is a major good if well trained; but setting the
  defaults extreme enough that it is reliably, if slowly, initially
  trained seems to me a fair middle ground. Setting the ham default
  threshold to -3 or even -5 seems prudent (_much_ better than the
  current 0.1), then someone who actually wants to configure it can
  adjust based on how well it's performing and whether they want
  autolearn on at all.

Can you open a bug about that and let's see if we can get that done?
I agree that a slower training threshold makes sense.


-- 
  Kevin A. McGrail
  President
  
Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422
  
http://www.pccc.com/
  
703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
kmcgr...@pccc.com

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread JP Kelly

Dumb question:
How can I set the autolearn thresholds?

On Aug 15, 2012, at 15 2:18 PM, John Hardin jhar...@impsec.org wrote:

 Setting the ham default threshold to -3 or even -5 seems prudent (_much_ 
 better than the current 0.1)

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Axb


On 08/15/2012 11:28 PM, JP Kelly wrote:

Dumb question:
How can I set the autolearn thresholds?

On Aug 15, 2012, at 15 2:18 PM, John Hardin jhar...@impsec.org wrote:


Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better 
than the current 0.1)





In local.cf

bayes_auto_learn_threshold_nonspam -3.0

# uncomment and change below if you want to raise or lower the spam
# learning threshold

#bayes_auto_learn_threshold_spam 15.0

reload spamd or your glue.

h2h

Axb

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Kevin A. McGrail


On 8/15/2012 5:28 PM, JP Kelly wrote:

Dumb question:
How can I set the autolearn thresholds?

perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold



 bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
   The score threshold below which a mail has to score, to be fed into
   SpamAssassin's learning systems automatically as a non-spam
   message.

 bayes_auto_learn_threshold_spam n.nn  (default: 12.0)
   The score threshold above which a mail has to score, to be fed into
   SpamAssassin's learning systems automatically as a spam message.

   Note: SpamAssassin requires at least 3 points from the header, and
   3 points from the body to auto-learn as spam.  Therefore, the
   minimum working value for this option is 6.

Regards,
KAM
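
Read literally, the two options quoted above amount to the decision sketched below. This is a simplified illustration, not SpamAssassin's actual code: the function name and the header_points/body_points arguments are made up, and the score passed in is meant to be the autolearn score, which is not necessarily the number shown in X-Spam-Status.

```python
def autolearn_decision(score, header_points=0.0, body_points=0.0,
                       nonspam_threshold=0.1, spam_threshold=12.0):
    """Return "ham", "spam" or None for a message's autolearn score."""
    if score < nonspam_threshold:
        return "ham"
    # Learning as spam also needs at least 3 points each from header
    # and body rules, per the note in the perldoc.
    if score > spam_threshold and header_points >= 3 and body_points >= 3:
        return "spam"
    return None


# A message scoring 0.05 is learned as ham with the stock threshold,
# but left alone once the ham threshold is lowered to -3.0.
print(autolearn_decision(0.05))
print(autolearn_decision(0.05, nonspam_threshold=-3.0))
```

The second call shows why lowering the non-spam threshold slows ham learning: only messages with clearly negative scores still qualify.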

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread RW

On Wed, 15 Aug 2012 17:05:00 -0400
Kevin A. McGrail wrote:

 On 8/15/2012 5:00 PM, John Hardin wrote:
 
  Right. It might be prudent to review the defaults before the next 
  major release. 
 I wonder if we shouldn't disable auto-learning by default (assuming
 it's on by default)...
 
 Bayes should really be trained.

It seems to me that bug 6344 from 2010 has some merit. (I was about to
file something similar myself.)

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

This suggests that lists like RCVD_IN_DNSWL_* should be marked as
noautolearn so that when they fail they don't screw up autolearning,
which is what appears to have happened here. This is exacerbated by the
fact that autolearning won't learn against a strong Bayes result (quite
rightly), so damage can become permanent.
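
A rough sketch of why the flag matters: the AutoLearnThreshold documentation says rules whose tflags include learn, userconf or noautolearn are ignored when deciding whether to train on a message. The message below is hypothetical (rule scores borrowed from the headers posted in this thread, the BAYES_50 score and all tflags assumed), and the real calculation also switches to the non-Bayes score set, which is left out here.

```python
IGNORED_TFLAGS = {"learn", "userconf", "noautolearn"}


def autolearn_score(hits):
    """Sum rule scores, skipping rules whose tflags exclude them."""
    return round(sum(score for _name, score, tflags in hits
                     if not IGNORED_TFLAGS & set(tflags)), 3)


def sample_hits(dnswl_tflags):
    """Hypothetical spam relayed through a whitelisted host."""
    return [
        ("BAYES_50", 0.8, ("learn",)),
        ("HTML_MESSAGE", 0.001, ()),
        ("RCVD_IN_DNSWL_MED", -2.3, dnswl_tflags),
        ("RDNS_NONE", 0.793, ()),
        ("SPF_PASS", -0.001, ("nice",)),
    ]


# DNSWL counted (assumed current flags) versus DNSWL marked noautolearn.
print(autolearn_score(sample_hits(("nice", "net"))))
print(autolearn_score(sample_hits(("nice", "net", "noautolearn"))))
```

With DNSWL counted the sum is -1.507, below the default 0.1 ham threshold, so the spam would be learned as ham; with noautolearn set it is 0.793 and nothing is learned.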

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread John Hardin


On Wed, 15 Aug 2012, Kevin A. McGrail wrote:


On 8/15/2012 5:18 PM, John Hardin wrote:

 I might not go so far as to say autolearn should be disabled by default,
 as it is a major good if well trained; but setting the defaults extreme
 enough that it is reliably, if slowly, initially trained seems to me a
 fair middle ground. Setting the ham default threshold to -3 or even -5
 seems prudent (_much_ better than the current 0.1), then someone who
 actually wants to configure it can adjust based on how well it's
 performing and whether they want autolearn on at all.


Can you open a bug about that and let's see if we can get that done? I agree 
that a slower training threshold makes sense.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6828

