Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/21/2012 5:51 PM, Ben Johnson wrote: On 8/21/2012 5:19 PM, John Hardin wrote: On Tue, 21 Aug 2012, Ben Johnson wrote: Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks ---8-- # sa-learn --username=amavis --dump magic Run that with --debug and verify that the filenames match. Sure enough, they don't match: ---8-- [...] dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3 0.000 0 3 0 non-token data: bayes db version 0.000 0 95 0 non-token data: nspam 0.000 0307 0 non-token data: nham 0.000 0 62301 0 non-token data: ntokens 0.000 0 1345469997 0 non-token data: oldest atime 0.000 0 1345579297 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 0 0 non-token data: last expiry atime 0.000 0 0 0 non-token data: last expire atime delta 0.000 0 0 0 non-token data: last expire reduction count ---8-- So, I suppose that I didn't actually resolve the problem from yesterday, which was that I cannot seem to train under the amavis user due to the ownership/permissions on the /var/vmail directory. What good is the --username switch, then? Why does this command train the root user's database? # sa-learn --username=amavis --spam /path/to/spam And why does this command dump the root user's database? # sa-learn --username=amavis --dump magic Thanks very much, As has already been mentioned, the '--username' option is only useful if you're using SQL. You should set your bayes_path so there is no confusion. Since you have been training the root database, you may want to copy that one over. $ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/ Then fix the permissions and ownership back to what they should be for the amavis user. 
Then set the bayes path in your local.cf:

    bayes_path /var/lib/amavis/.spamassassin/bayes

(Don't double the 'bayes' at the end, as was suggested previously, unless you want to move the bayes files into a 'bayes' directory.)

Restart amavis and try again...

-- 
Bowie
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 22 Aug 2012, Bowie Bailey wrote:
> On 8/21/2012 5:51 PM, Ben Johnson wrote:
>> What good is the --username switch, then?

See other responses.

>> Why does this command train the root user's database?

Because you ran the command as root.

I apologize, I didn't provide sufficient details. When I said train as the user who runs SA I meant su to that OS user ID before running the sa-learn command.

You can either override the default Bayes database files path to explicitly specify a shared global database as has been suggested, or run sa-learn as the amavis user via su or a cron job.

Defining a global bayes database is probably a better solution overall, but bear in mind if you have to wipe and retrain you need to check the permissions on the new database files after you run sa-learn the first time.

-- 
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
...every time I sit down in front of a Windows machine I feel as if the computer is just a place for the manufacturers to put their advertising. -- fwadling on Y! SCOX
---
2 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 9:05 AM, Bowie Bailey wrote: On 8/21/2012 5:51 PM, Ben Johnson wrote: On 8/21/2012 5:19 PM, John Hardin wrote: On Tue, 21 Aug 2012, Ben Johnson wrote: Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks ---8-- # sa-learn --username=amavis --dump magic Run that with --debug and verify that the filenames match. Sure enough, they don't match: ---8-- [...] dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3 0.000 0 3 0 non-token data: bayes db version 0.000 0 95 0 non-token data: nspam 0.000 0307 0 non-token data: nham 0.000 0 62301 0 non-token data: ntokens 0.000 0 1345469997 0 non-token data: oldest atime 0.000 0 1345579297 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 0 0 non-token data: last expiry atime 0.000 0 0 0 non-token data: last expire atime delta 0.000 0 0 0 non-token data: last expire reduction count ---8-- So, I suppose that I didn't actually resolve the problem from yesterday, which was that I cannot seem to train under the amavis user due to the ownership/permissions on the /var/vmail directory. What good is the --username switch, then? Why does this command train the root user's database? # sa-learn --username=amavis --spam /path/to/spam And why does this command dump the root user's database? # sa-learn --username=amavis --dump magic Thanks very much, As has already been mentioned, the '--username' option is only useful if you're using SQL. You should set your bayes_path so there is no confusion. Thank you Axb and Bowie for clarifying this point. Perhaps the sa-learn documentation should be updated to eliminate the ambiguity around this switch. 
In particular, I am referring to this page: http://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html , which states only the following: If specified this username will override the username taken from the runtime environment. You can use this option to specify users in a virtual user configuration. Maybe adding the SQL keyword will make the virtual user configuration distinction more evident. Since you have been training the root database, you may want to copy that one over. $ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/ Then fix the permissions and ownership back to what they should be for the amavis user. I did think to do this, but I approached it a bit differently, and used sa-learn --backup (and --restore), under the amavis user account, which mitigated the need to modify the permissions on the database. Then set the bayes path in your local.cf: bayes_path /var/lib/amavis/.spamassassin/bayes (Don't double the 'bayes' at the end as was suggested previously unless you want to move the bayes files into a 'bayes' directory) Restart amavis and try again... Again, thanks to Axb and Bowie for making this suggestion. Hard-coding the bayes_path was the missing link for me; this is what allowed me to train under the amavis user while having root (or vmail) privileges, which on Debian, are necessary to read mail during training. I think I'm sorted here; thanks again, guys! -Ben
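For anyone following along, the backup/restore route described above might look roughly like the sketch below. It makes several assumptions: the dump path is a placeholder, the SA_LEARN variable and the two helper names are made up for illustration, and bayes_path is not yet set, so each account still resolves its own ~/.spamassassin database. Per the sa-learn documentation, --restore replaces whatever is already in the target database, so it is worth running only when that is intended.

```sh
#!/bin/sh
# Hypothetical helpers; SA_LEARN lets you point at a non-default sa-learn binary.
SA_LEARN=${SA_LEARN:-sa-learn}

# Export the Bayes data of the account running this to a plain-text dump file.
bayes_export() {
    "$SA_LEARN" --backup > "$1"
}

# Import a dump into the database of the account running this.
# Per the sa-learn docs, existing data in that database is wiped first.
bayes_import() {
    "$SA_LEARN" --restore "$1"
}

# Usage sketch (paths are placeholders; the dump must be readable by amavis):
#   as root:    bayes_export /tmp/bayes-root.dump
#   as amavis:  bayes_import /tmp/bayes-root.dump
```

Because the restore runs as the amavis account, the database files it creates belong to that account, which is what removes the need to fix ownership afterwards.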
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 9:43 AM, John Hardin wrote: On Wed, 22 Aug 2012, Bowie Bailey wrote: On 8/21/2012 5:51 PM, Ben Johnson wrote: What good is the --username switch, then? Thanks for the follow-up, John! See other responses. Why does this command train the root user's database? Because you ran the command as root. I apologize, I didn't provide sufficient details. When I said train as the user who runs SA I meant su to that OS user ID before running the sa-learn command. No apology necessary; I knew what you meant, and did indeed try running the sa-learn command as root, initially, but the problem then was a lack of access to the mail directories. On Debian/Ubuntu systems, when using Dovecot, all mail directories are vmail:vmail owned, with 700 permissions, which prevents the amavis user from having access to them. (This is by design, I'm sure, and makes sense.) You can either override the default Bayes database files path to explicitly specify a shared global database as has been suggested, or run sa-learn as the amavis user via su or a cron job. I did end-up overriding the bayes_path, which provided a workaround for the permissions issues. Cheers to the suggestion. Defining a global bayes database is probably a better solution overall, but bear in mind if you have to wipe and retrain you need to check the permissions on the new database files after you run sa-learn the first time. This is an important point; thanks for articulating it. All appears to be well in SpamAssassin Town for the time being (don't think you've heard the last of me, though!). Thanks to everyone who shared his or her expertise. -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/22/2012 04:10 PM, Ben Johnson wrote:
> I did end-up overriding the bayes_path, which provided a workaround for the permissions issues. Cheers to the suggestion.

This is not a workaround; it's common practice in many types of setups and documented, but for numerous reasons it can't be set as a default. If the install routine were to require/create a /etc/mail/spamassassin/bayes path, it could bite systems other than standard Linux distros. (note to myself: discuss this on the dev list)

As so often, the clue is diagnostics. In this case, I think we all worked backwards, first answering your questions before getting the big picture.

>> Defining a global bayes database is probably a better solution overall, but bear in mind if you have to wipe and retrain you need to check the permissions on the new database files after you run sa-learn the first time.
>
> This is an important point; thanks for articulating it.

Once you start seeing bayes hits, I'd switch to autolearn, disable auto expiration and set a weekly cron job to do the expiration. That way Bayes keeps itself busy and you only have to train low-scored stuff on a daily basis (cron job as amavis imap user), or rsync the imap folder content out and sa-learn from the target path.

> All appears to be well in SpamAssassin Town for the time being (don't think you've heard the last of me, though!). Thanks to everyone who shared his or her expertise.

Learning SA seems like a never-ending process - the deeper you go, the more of its beauty comes to light.

Axb
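A rough sketch of what the autolearn-plus-weekly-expiry setup could look like. The option names are standard SpamAssassin settings and --force-expire is a documented sa-learn switch, but the cron file name, the schedule, and the choice to run the job as the amavis user are assumptions for illustration only.

```
# local.cf
bayes_auto_learn  1    # let clearly spammy/hammy mail train Bayes automatically
bayes_auto_expire 0    # do not expire tokens opportunistically during scans

# /etc/cron.d/sa-bayes-expire (hypothetical file): Sundays at 03:30, run as amavis
30 3 * * 0  amavis  sa-learn --force-expire
```

With bayes_path set in local.cf, the cron job operates on the same database that Amavis scans with.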
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 10:26 AM, Axb wrote: On 08/22/2012 04:10 PM, Ben Johnson wrote: I did end-up overriding the bayes_path, which provided a workaround for the permissions issues. Cheers to the suggestion. This is not a workaround, it's common practice in many types of setups and documented, but due to numerous reasons can't be set as a default. If the install routine would require/create a /etc/mail/spamassassin/bayes path it could bite other systems than standard Linux distros. (note to myself: discuss this in dev list) Right; it makes sense that this path cannot have a default value (other than ~/...). That said, it seems that for some users (myself included), setting this path manually is a critical step in creating a maximally functional (that is, Bayes-enabled) SpamAssassin installation. This would be especially true if the SA developers were to change the bayes_auto_learn default value to zero, or lower the default value for bayes_auto_learn_threshold_nonspam (as a result of my incident here). For this reason, it seems prudent for developers/contributors to take one of two actions (or both): 1.) Add the bayes_path directive to the default/stock local.cf that ships with SpamAssassin, in a commented-out state. I realize that this file may be maintainer/distribution specific, and that there are attendant challenges associated with such a change. This measure would underscore the directive's importance for the administrator who is configuring the software. 2.) Where possible, modify the SpamAssassin installer package to prompt the user for the bayes_path during installation. These types of prompts are common among related packages. For example, Postfix asks for all kinds of information during its installation (on Debian-based systems, anyway). Again, I realize that the SA developers likely have no control over how the software is packaged and delivered, so if this point seems valid, I am happy to open distro-specific bug reports (or feature requests). Thanks, Axb. -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 2:47 PM, Ben Johnson wrote:
> I was able to resolve the issue by adding the --username switch to the 'sa-learn' executable:
>
> # sa-learn --username=amavis --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur
>
> Thanks for all of the hints, folks!

So, I've been training SpamAssassin like a mad-man for a couple of days. I don't have over 200 spams and 200 hams, so I don't expect Bayes to be used yet (and it's not), but the following output is puzzling (particularly, only 0 spam(s) in bayes DB < 200):

---8<---
# su amavis -c spamassassin -D -t /usr/share/doc/spamassassin/examples/sample-spam.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
Aug 21 13:08:33.717 [23714] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x213613f8), bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
Aug 21 13:08:33.728 [23714] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2153b400)
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.730 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.730 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.730 [23714] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
Aug 21 13:08:33.730 [23714] dbg: bayes: untie-ing
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.733 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.733 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.733 [23714] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
Aug 21 13:08:33.733 [23714] dbg: bayes: untie-ing
---8<---

Restarting Amavis does not change the output above.

And the output below seems to contradict the above (300 spams and 95 hams):

---8<---
# sa-learn --username=amavis --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 95 0 non-token data: nspam
0.000 0 300 0 non-token data: nham
0.000 0 59420 0 non-token data: ntokens
0.000 0 1345469997 0 non-token data: oldest atime
0.000 0 1345577900 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 0 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
---8<---

Am I doing something silly?

Thanks for any help,

-Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On Tue, 21 Aug 2012, Ben Johnson wrote:
> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
> ---8<---
> # sa-learn --username=amavis --dump magic

Run that with --debug and verify that the filenames match.

-- 
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
USMC Rules of Gunfighting #20: The faster you finish the fight, the less shot you will get.
---
3 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/21/2012 11:19 PM, John Hardin wrote:
> On Tue, 21 Aug 2012, Ben Johnson wrote:
>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
>> ---8<---
>> # sa-learn --username=amavis --dump magic
>
> Run that with --debug and verify that the filenames match.

it never hurts to define the bayes path in local.cf (keeps you from guessing where it will land)

    bayes_path /var/lib/amavis/.spamassassin/bayes/bayes

yes!, bayes twice!!! - make sure the path is as above

    mkdir /var/lib/amavis/.spamassassin/bayes

/var/lib/amavis/.spamassassin/bayes_seen doesn't seem right

afaik, normally it would be /var/lib/amavis/.spamassassin/bayes/bayes_seen

try this, relearn as much as you can and run a --dump magic

Axb
Re: Very spammy messages yield BAYES_00 (-1.9)
On 2012-08-15 20:56, Ben Johnson wrote:
> On 8/15/2012 2:24 PM, John Hardin wrote:
>> You may also want to set up some mechanism for users to submit misclassified messages for training.
>
> That sounds like a good idea.
> [...]
> this server runs Ubuntu 10.04 with Dovecot

Since you're using Dovecot, you might be able to use the antispam plugin for Dovecot. It lets you specify a special spam folder, and when users move mail into or out of that folder, the messages are spooled or piped for retraining as spam or ham. This way, the user running sa-learn does not need access to the users' maildirs.

http://wiki2.dovecot.org/Plugins/Antispam
http://johannes.sipsolutions.net/Projects/dovecot-antispam

Regards
/Jonas
-- 
Jonas Eckerman
http://www.truls.org/
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/21/2012 5:19 PM, John Hardin wrote: On Tue, 21 Aug 2012, Ben Johnson wrote: Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks ---8-- # sa-learn --username=amavis --dump magic Run that with --debug and verify that the filenames match. Sure enough, they don't match: ---8-- [...] dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3 0.000 0 3 0 non-token data: bayes db version 0.000 0 95 0 non-token data: nspam 0.000 0307 0 non-token data: nham 0.000 0 62301 0 non-token data: ntokens 0.000 0 1345469997 0 non-token data: oldest atime 0.000 0 1345579297 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 0 0 non-token data: last expiry atime 0.000 0 0 0 non-token data: last expire atime delta 0.000 0 0 0 non-token data: last expire reduction count ---8-- So, I suppose that I didn't actually resolve the problem from yesterday, which was that I cannot seem to train under the amavis user due to the ownership/permissions on the /var/vmail directory. What good is the --username switch, then? Why does this command train the root user's database? # sa-learn --username=amavis --spam /path/to/spam And why does this command dump the root user's database? # sa-learn --username=amavis --dump magic Thanks very much, -Ben
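One way to make that comparison less error-prone is to pull just the database file names out of the debug output and compare them between accounts. A minimal sketch, assuming the debug lines look like the ones quoted above; the function name bayes_db_files is made up for illustration:

```sh
#!/bin/sh
# Print the unique Bayes DB files named in SpamAssassin debug output (read from stdin).
# Matches lines such as:
#   dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
bayes_db_files() {
    sed -n 's/.*dbg: bayes: tie-ing to DB file R\/[OW] \(.*\)$/\1/p' | sort -u
}

# Usage sketch (debug output goes to stderr, hence the redirect):
#   sa-learn --dump magic -D 2>&1 | bayes_db_files
```

Running it once as root and once as the account Amavis uses should show whether training and scanning touch the same files.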
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/21/2012 11:51 PM, Ben Johnson wrote:
> On 8/21/2012 5:19 PM, John Hardin wrote:
>> On Tue, 21 Aug 2012, Ben Johnson wrote:
>>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
>>> ---8<---
>>> # sa-learn --username=amavis --dump magic
>>
>> Run that with --debug and verify that the filenames match.
>
> Sure enough, they don't match:
>
> ---8<---
> [...]
> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen

if you add bayes_path in local.cf it should find the right path, no matter what user you run it as.

(works for me)
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/21/2012 11:51 PM, Ben Johnson wrote: On 8/21/2012 5:19 PM, John Hardin wrote: On Tue, 21 Aug 2012, Ben Johnson wrote: Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks ---8-- # sa-learn --username=amavis --dump magic Run that with --debug and verify that the filenames match. Sure enough, they don't match: ---8-- [...] dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3 0.000 0 3 0 non-token data: bayes db version 0.000 0 95 0 non-token data: nspam 0.000 0307 0 non-token data: nham 0.000 0 62301 0 non-token data: ntokens 0.000 0 1345469997 0 non-token data: oldest atime 0.000 0 1345579297 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 0 0 non-token data: last expiry atime 0.000 0 0 0 non-token data: last expire atime delta 0.000 0 0 0 non-token data: last expire reduction count ---8-- So, I suppose that I didn't actually resolve the problem from yesterday, which was that I cannot seem to train under the amavis user due to the ownership/permissions on the /var/vmail directory. What good is the --username switch, then? Why does this command train the root user's database? # sa-learn --username=amavis --spam /path/to/spam And why does this command dump the root user's database? # sa-learn --username=amavis --dump magic because: -u username, --username=username Override username taken from the runtime environment, used with SQL and *not* for file based Bayes
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/17/2012 11:28 AM, John Hardin wrote: On Fri, 17 Aug 2012, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as. I have tried to do this, but to no avail: --- # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 --- This seems to occur because the virtual mail directory permissions do not provide the amavis user with the required access level. The vmail user is the only user with any type of access to /var/vmail/example.com/user/Maildir. I suspect that there is a good reason for this and that the ownership/permissions should not be changed. I've done some research on this issue and there isn't much to be found. This archived thread ( http://marc.info/?l=amavis-userm=116457786312019 ) discusses overriding the Bayes user with bayes_sql_override_username amavis, but that doesn't solve the problem (obviously). I still see the same permission errors, although the need to use the 'su' wrapper does go away. Is there a conventional means by which to deal with this issue? If you have your system configured for per-user Bayes databases, then you'd need to train as the user whose database you want to affect. 
The system in question leverages ISPConfig 3, which implements virtual users/mailboxes, although, I don't know if ISPConfig configures Amavis to utilize individual Bayes databases or if there's an individual database for the amavis user. I can check with the developers. What is your bayes_path config? I don't see this directive anywhere on the system in question; perhaps a default value is being used. The only instance of that string exists in a source file: /usr/share/perl5/Mail/SpamAssassin/Conf.pm:=item bayes_path /path/filename (default: ~/.spamassassin/bayes) So, presumably, bayes_path is equating to ~/.spamassassin/bayes, or in my case, /var/lib/amavis/.spamassassin. Would doing so preclude me from creating training folders for individual IMAP users in the future? They're not related. Per-user ham and spam training folders doesn't preclude using those messages for training a global Bayes database. Understood. You actually may want to implement a hybrid folder model: per-user ham training folders and a global spam training folder. Misclassified ham could potentially be private messages that the recipient doesn't want other users to see, but for misclassified spam who cares? Right, that makes sense. Or can I train under the amavis user for now and then layer-on training for individual IMAP users in the future without undesirable consequences? As stated above, if you're not enabling per-user Bayes *databases*, the question is meaningless. Are you going to configure per-user Bayes databases? Or (as I suspect is more likely) perform global database training from individual users whose judgement you trust? I suppose that I need to determine whether or not ISPConfig implements per-user Bayes database already. I'll report-back for those who may be curious. Thanks again, -Ben
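The default can be worked out per account. With no bayes_path set, the Conf.pm default of ~/.spamassassin/bayes is relative to the home directory of whoever runs the command, and the value is a filename prefix (hence bayes_toks and bayes_seen) rather than a directory. A small sketch that derives it from a passwd-format file; the function name and the optional second argument are illustrative, and it ignores any HOME override in the environment:

```sh
#!/bin/sh
# Print the default bayes_path prefix for a given account, based on the home
# directory recorded in a passwd-format file (defaults to /etc/passwd).
# Returns non-zero if the account is not found.
default_bayes_path() {
    user=$1
    passwd_file=${2:-/etc/passwd}
    awk -F: -v u="$user" '
        $1 == u { print $6 "/.spamassassin/bayes"; found = 1 }
        END     { exit !found }
    ' "$passwd_file"
}

# Usage sketch:
#   default_bayes_path amavis
#   default_bayes_path root
```

Comparing the two results shows why training as one account and scanning as another ends up in different databases unless bayes_path is set explicitly.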
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/20/2012 06:42 PM, Ben Johnson wrote:
> On 8/17/2012 11:28 AM, John Hardin wrote:
>> On Fri, 17 Aug 2012, Ben Johnson wrote:
>>> On 8/16/2012 2:00 PM, Ben Johnson wrote:
>>> Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account?
>>
>> In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as.
>
> I have tried to do this, but to no avail:
>
> ---
> # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'
> archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
> archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
> archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
> ---

~/Maildir/* assumes 1 file = 1 mail

pls try

    su amavis -c 'sa-learn --spam --progress --dir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'

or wherever the messages are stored
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 12:46 PM, Axb wrote: On 08/20/2012 06:42 PM, Ben Johnson wrote: On 8/17/2012 11:28 AM, John Hardin wrote: On Fri, 17 Aug 2012, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as. I have tried to do this, but to no avail: --- # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 --- ~/Maildir/* assumes 1 file=1 mail pls try su amavis -c 'sa-learn --spam --progress --dir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/' or wherever the message are stored But first, you need access to the files. The simplest way is probably to add the amavis user account to the group used by the mail directories. Assuming the group is vmail, the command should look like this (on RedHat/CentOS): $ usermod -a -G vmail amavis This command will probably need to be run as root. If you are using a different distro, you will need to look up the command to add the amavis user to the vmail group. -- Bowie
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 12:56 PM, Bowie Bailey wrote: On 8/20/2012 12:46 PM, Axb wrote: On 08/20/2012 06:42 PM, Ben Johnson wrote: On 8/17/2012 11:28 AM, John Hardin wrote: On Fri, 17 Aug 2012, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as. I have tried to do this, but to no avail: --- # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 --- ~/Maildir/* assumes 1 file=1 mail pls try su amavis -c 'sa-learn --spam --progress --dir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/' or wherever the message are stored But first, you need access to the files. The simplest way is probably to add the amavis user account to the group used by the mail directories. Assuming the group is vmail, the command should look like this (on RedHat/CentOS): $ usermod -a -G vmail amavis Thanks, guys. I did consider adding the amavis user to the vmail group, but the default permissions on the directories within Maildir are 700 (with vmail:vmail ownership). So, I'd have to fiddle with the permissions on the entire directory tree, for each user, which seems like a bad idea. Furthermore, ISPconfig handles the creation (and deletion) of these directories, so I hesitate to change anything manually and muck-up the installation. 
While there may be permissions mask that is applied, modifying it seems risky. I wonder what the rest of the Dovecot + Amavis + SA world is doing about this. Maybe I should ask on the Amavis mailing list. If anyone has other suggestions, by all means, please do share. This command will probably need to be run as root. If you are using a different distro, you will need to look up the command to add the amavis user to the vmail group. Much thanks, -Ben
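Before pointing sa-learn at a Maildir folder, it can help to confirm that the account running it can actually traverse and list the message directories. A small sketch, with a made-up function name, that checks only the cur/ and new/ subdirectories of the given folder as the current user:

```sh
#!/bin/sh
# Return 0 if the current user can list both cur/ and new/ under the given
# Maildir folder; otherwise report the first directory that fails and return 1.
can_read_maildir() {
    for sub in cur new; do
        d=$1/$sub
        if [ ! -d "$d" ] || [ ! -r "$d" ] || [ ! -x "$d" ]; then
            echo "cannot read $d" >&2
            return 1
        fi
    done
    return 0
}

# Usage sketch, run as the account that will do the training:
#   can_read_maildir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam
```

With 700 permissions and vmail:vmail ownership anywhere along the path, this fails for the amavis account, which matches the "no access" errors from archive-iterator.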
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 2:02 PM, Ben Johnson wrote: On 8/20/2012 12:56 PM, Bowie Bailey wrote: On 8/20/2012 12:46 PM, Axb wrote: On 08/20/2012 06:42 PM, Ben Johnson wrote: On 8/17/2012 11:28 AM, John Hardin wrote: On Fri, 17 Aug 2012, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as. I have tried to do this, but to no avail: --- # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 --- ~/Maildir/* assumes 1 file=1 mail pls try su amavis -c 'sa-learn --spam --progress --dir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/' or wherever the message are stored But first, you need access to the files. The simplest way is probably to add the amavis user account to the group used by the mail directories. Assuming the group is vmail, the command should look like this (on RedHat/CentOS): $ usermod -a -G vmail amavis Thanks, guys. I did consider adding the amavis user to the vmail group, but the default permissions on the directories within Maildir are 700 (with vmail:vmail ownership). So, I'd have to fiddle with the permissions on the entire directory tree, for each user, which seems like a bad idea. Furthermore, ISPconfig handles the creation (and deletion) of these directories, so I hesitate to change anything manually and muck-up the installation. 
While there may be permissions mask that is applied, modifying it seems risky. I wonder what the rest of the Dovecot + Amavis + SA world is doing about this. Maybe I should ask on the Amavis mailing list. If anyone has other suggestions, by all means, please do share. This command will probably need to be run as root. If you are using a different distro, you will need to look up the command to add the amavis user to the vmail group. Much thanks, -Ben I was able to resolve the issue by adding the --username switch to the 'sa-learn' executable: # sa-learn --username=amavis --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur Thanks for all of the hints, folks! -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/20/2012 08:02 PM, Ben Johnson wrote: Furthermore, ISPconfig handles the creation (and deletion) of these directories, so I hesitate to change anything manually and muck-up the installation. While there may be a permissions mask that is applied, modifying it seems risky. IDEA: I have a little homebrew Python IMAP client which picks up messages from an IMAP server and stores them in a regular directory. At the moment it doesn't delete messages after pickup, but I could change it so that it does. You could run that as the amavis user and store to ~/spam-dump/*.eml. Permissions would then be OK for sa-learn to read the messages. If you want it, please contact me off-list. Axb
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 2:00 PM, Ben Johnson wrote: In any event, at this point, I'm confused as to which user account I should be using when executing sa-learn --spam, for example. As a bit of background, I'm using ISPConfig 3, which implements virtual mailbox users via MySQL. I dug through the mailing list archive and found http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html , which seems to be relevant. Ultimately, I'm wondering if I should be using the amavis user to learn ham/spam, or individual mailbox user accounts. If it is possible to use either, are there pros and cons of which one should be aware before settling on an approach? As I mentioned previously, I would like to set-up LearnHam and LearnSpam folders for each IMAP user, eventually, so perhaps this answers my question? Thanks again for all the help! John Hardin, sorry to bust you up here... just curious whether or not you saw the rest of my previous note. If you didn't address these questions intentionally, then please ignore me. :) Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? Would doing so preclude me from creating training folders for individual IMAP users in the future? Or can I train under the amavis user for now and then layer-on training for individual IMAP users in the future without undesirable consequences? Thanks again, -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/17/2012 10:56 AM, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: In any event, at this point, I'm confused as to which user account I should be using when executing sa-learn --spam, for example. As a bit of background, I'm using ISPConfig 3, which implements virtual mailbox users via MySQL. I dug through the mailing list archive and found http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html , which seems to be relevant. Ultimately, I'm wondering if I should be using the amavis user to learn ham/spam, or individual mailbox user accounts. If it is possible to use either, are there pros and cons of which one should be aware before settling on an approach? As I mentioned previously, I would like to set-up LearnHam and LearnSpam folders for each IMAP user, eventually, so perhaps this answers my question? Thanks again for all the help! John Hardin, sorry to bust you up here... just curious whether or not you saw the rest of my previous note. If you didn't address these questions intentionally, then please ignore me. :) Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? Would doing so preclude me from creating training folders for individual IMAP users in the future? Or can I train under the amavis user for now and then layer-on training for individual IMAP users in the future without undesirable consequences? The quickest way I know of to reduce spam is to reject mail at the MTA based on the zen.spamhaus.org blacklist. I have been using this for a few years now. It blocks lots of spam and I haven't had any problems with it. You can also implement graylisting, although it will slow down mail delivery from new senders, which may or may not be an issue for you. I haven't tried it, but lots of people swear by it. As for Bayes, Amavis uses a single user for spam scanning. 
This means that Bayes will use a single database (under the amavis user). You may be able to get individual databases via SQL, but I'm not sure about that. -- Bowie
Re: Very spammy messages yield BAYES_00 (-1.9)
On Fri, 17 Aug 2012, Ben Johnson wrote: On 8/16/2012 2:00 PM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the amavis user account? In general, all training should be done as the user that SA (in your case, SA via Amavis) is running as. If you have your system configured for per-user Bayes databases, then you'd need to train as the user whose database you want to affect. What is your bayes_path config? Would doing so preclude me from creating training folders for individual IMAP users in the future? They're not related. Per-user ham and spam training folders don't preclude using those messages for training a global Bayes database. You actually may want to implement a hybrid folder model: per-user ham training folders and a global spam training folder. Misclassified ham could potentially be private messages that the recipient doesn't want other users to see, but for misclassified spam who cares? Or can I train under the amavis user for now and then layer-on training for individual IMAP users in the future without undesirable consequences? As stated above, if you're not enabling per-user Bayes *databases*, the question is meaningless. Are you going to configure per-user Bayes databases? Or (as I suspect is more likely) perform global database training from individual users whose judgement you trust? -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Ignorance is no excuse for a law. --- 7 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/17/2012 11:28 AM, John Hardin wrote: On Fri, 17 Aug 2012, Ben Johnson wrote: Would doing so preclude me from creating training folders for individual IMAP users in the future? They're not related. Per-user ham and spam training folders don't preclude using those messages for training a global Bayes database. You actually may want to implement a hybrid folder model: per-user ham training folders and a global spam training folder. Misclassified ham could potentially be private messages that the recipient doesn't want other users to see, but for misclassified spam who cares? I have individual Spam and Ham training folders. Then a cronjob moves everything to the main Spam and Ham directories for learning on a regular basis. The main directories are not related to the mail server, so there is no real privacy concern. -- Bowie
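The cronjob approach described above could look roughly like the following. This is only a sketch, not Bowie's actual script: it assumes the /var/vmail/<domain>/<user>/Maildir/<folder>/ layout seen earlier in the thread, and the function name collect_training, the folder name .LearnSpam and the destination directory are all hypothetical.

```shell
#!/bin/sh
# collect_training SRC_ROOT FOLDER DEST  (hypothetical helper)
# Move every message file found in
#   SRC_ROOT/<domain>/<user>/Maildir/FOLDER/cur/ and .../new/
# into the single directory DEST, creating DEST if needed.
collect_training() {
    src_root=$1 folder=$2 dest=$3
    mkdir -p "$dest" || return 1
    for f in "$src_root"/*/*/Maildir/"$folder"/cur/* \
             "$src_root"/*/*/Maildir/"$folder"/new/*; do
        [ -f "$f" ] || continue      # skip patterns that matched nothing
        mv -- "$f" "$dest/" || return 1
    done
}

# Example cron usage (paths are assumptions, left as a comment):
#   collect_training /var/vmail .LearnSpam /var/spool/sa-train/spam
```

The script has to run as a user that can read the mailboxes, and the destination should be readable by the account that later runs sa-learn. Maildir file names are normally unique, so collisions between users are unlikely, but the sketch does not guard against them.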
Re: Very spammy messages yield BAYES_00 (-1.9)
On Fri, 17 Aug 2012, Bowie Bailey wrote: On 8/17/2012 10:56 AM, Ben Johnson wrote: Basically, I need to do something about the spam inundation, as soon as possible. The quickest way I know of to reduce spam is to reject mail at the MTA based on the zen.spamhaus.org blacklist. I have been using this for a few years now. It blocks lots of spam and I haven't had any problems with it. +1 for zen.spamhaus.org DNSBL at SMTP time. You can also implement graylisting, although it will slow down mail delivery from new senders, which may or may not be an issue for you. I haven't tried it, but lots of people swear by it. As for Greylisting, a lot of spam is least-effort one-shot no-retry delivery, but not all. It won't reduce spam that is sent via a proper MTA or via a spambot that does retry-until-successful. You can set a short delay period to block the one-attempt-gush spammers, or a longer delay period to give new spamvertised domain names a chance to appear in URIBLs for the spammers who retry. And, of course, you have to balance this against your users' expectations for delivery time, and perhaps do some education to set those expectations more realistically. I use greylisting, with whitelists for regular correspondents. There are some other MTA SMTP-time methods to pluck the low-hanging fruit: Publishing an SPF record. There's anecdotal evidence that it cuts down on joe-job attempts. Even if you publish an SPF record, you might want to explicitly reject From addresses in your domain if the message is received from the Internet. This can be done using SPF, but you may not be comfortable doing SMTP-time rejects based on SPF failures. Something I have fairly good results with is rejecting mail from the Internet where the HELO is not a fully-qualified domain name. Since my MTA is the only valid source for email from my domain, I also reject messages where the HELO is in my domain. You will, of course, have to carve out exceptions to this rule for valid outbound mail. 
On a multihomed MTA or an MTA where outbound mail is submitted via an SSL tunnel this is pretty easy. For the above, if you have Sendmail I recommend milter-regex; my milter-regex.conf is available here: http://www.impsec.org/~jhardin/antispam/milter-regex.conf -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Ignorance is no excuse for a law. --- 7 days until the 1933rd anniversary of the destruction of Pompeii
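For anyone wondering what a DNSBL check at SMTP time actually looks up: the connecting IPv4 address is reversed octet by octet and queried as a hostname under the list's zone. The sketch below only builds that query name; the function name dnsbl_query_name is an illustration, and how the MTA is told to reject on a hit depends on the MTA and is not shown.

```shell
#!/bin/sh
# dnsbl_query_name IP ZONE  (hypothetical helper)
# Print the DNS name to look up for an IPv4 address in a DNSBL zone:
# a.b.c.d in zone Z becomes d.c.b.a.Z. Prints nothing for input that
# does not have exactly four dot-separated parts.
dnsbl_query_name() {
    ip=$1 zone=$2
    echo "$ip" | awk -F. -v zone="$zone" \
        'NF == 4 { print $4 "." $3 "." $2 "." $1 "." zone }'
}

# Example lookup (needs network access and the host utility, so it is
# left as a comment). 127.0.0.2 is the conventional DNSBL test address:
#   host "$(dnsbl_query_name 127.0.0.2 zen.spamhaus.org)"
```

An answer in 127.0.0.0/8 means the address is listed; a "not found" answer means it is not.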
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 4:05 PM, John Hardin wrote: On Wed, 15 Aug 2012, Ben Johnson wrote: On 8/15/2012 2:24 PM, John Hardin wrote: On Wed, 15 Aug 2012, Ben Johnson wrote: Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why? Poor training. John, I can't thank you enough for the thoroughness of your response. I like to show off. :) Apart from the Bayes score, what kind of scores are those spams getting? Here are a few examples (the first two of which are two of VERY few in which the BAYES_* value is over 00): - No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=no No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_RHS_DOB=1.514] autolearn=no - It might be interesting to see some log entries where autolearn=yes... 
Here you go: No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue. This might be a major contributing factor. If your system was taught from scratch by autolearn, and DNSWL (which is fairly well trusted) has been pushing a lot of spams to low scores... It looks as though this is exactly what happened. I'll post back once I've done some more troubleshooting with the folks at dnswl.org. You might want to set: bayes_auto_learn_threshold_nonspam -3 Done. That won't _fix_ the problem (at least not quickly) or avoid the need to wipe and retrain, but it might keep things from getting worse. I disabled auto-learn and executed sa-learn --clear, too. So, I should be starting with a clean slate, right? I have also disabled the DNSWL rules, until the issue can be resolved, and will begin manual training immediately. See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info. Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing . 
That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box. What version of SA is this? # spamassassin --version SpamAssassin version 3.3.1 running on Perl version 5.10.1 A little stale, but not bad. 'Tis the major drawback with using LTS Linux distros and managing software via packages, I suppose. You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement the learning from these can be automatic or can go through you as a reviewer. That sounds like a good idea. Is there a particular HOW TO or tutorial that you recommend? If it depends on the environment/configuration, this server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin. I'm not sure, I don't lurk the Wiki much. About the best I can suggest is search the SA users mailing list archives for training dovecot. Thanks, I'll look into setting-up IMAP folders for individual users in some programmatic way. -Ben
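To see how much each rule contributes, the tests=[...] list from a log line can simply be added up. The following is a rough sketch that assumes the Amavis log format quoted above; the function name sum_tests is invented for illustration.

```shell
#!/bin/sh
# sum_tests (hypothetical helper): read Amavis status lines on stdin,
# take the RULE=score pairs inside tests=[...], and print their sum
# with three decimal places.
sum_tests() {
    sed -n 's/.*tests=\[\([^]]*\)].*/\1/p' |   # keep only the list inside the brackets
    tr ',' '\n' |                              # one RULE=score pair per line
    awk -F= 'NF == 2 { total += $2 } END { printf "%.3f\n", total }'
}

# Example with one of the autolearn=ham lines from this message:
echo 'No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham' | sum_tests
```

In the third example above (score=-0.836), BAYES_00 and RCVD_IN_DNSWL_MED together account for -4.2 points; without them the remaining rules add up to 3.364, which is above the tag2=3 level shown in the same line.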
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 10:14 AM, Ben Johnson wrote: On 8/15/2012 4:05 PM, John Hardin wrote: On Wed, 15 Aug 2012, Ben Johnson wrote: On 8/15/2012 2:24 PM, John Hardin wrote: On Wed, 15 Aug 2012, Ben Johnson wrote: Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why? Poor training. John, I can't thank you enough for the thoroughness of your response. I like to show off. :) Apart from the Bayes score, what kind of scores are those spams getting? Here are a few examples (the first two of which are two of VERY few in which the BAYES_* value is over 00): - No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=no No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_RHS_DOB=1.514] autolearn=no - It might be interesting to see some log entries where autolearn=yes... 
Here you go: No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue. This might be a major contributing factor. If your system was taught from scratch by autolearn, and DNSWL (which is fairly well trusted) has been pushing a lot of spams to low scores... It looks as though this is exactly what happened. I'll post back once I've done some more troubleshooting with the folks at dnswl.org. You might want to set: bayes_auto_learn_threshold_nonspam -3 Done. That won't _fix_ the problem (at least not quickly) or avoid the need to wipe and retrain, but it might keep things from getting worse. I disabled auto-learn and executed sa-learn --clear, too. So, I should be starting with a clean slate, right? I have also disabled the DNSWL rules, until the issue can be resolved, and will begin manual training immediately. See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info. Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing . 
That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box. What version of SA is this? # spamassassin --version SpamAssassin version 3.3.1 running on Perl version 5.10.1 A little stale, but not bad. 'Tis the major drawback with using LTS Linux distros and managing software via packages, I suppose. You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement the learning from these can be automatic or can go through you as a reviewer. That sounds like a good idea. Is there a particular HOW TO or tutorial that you recommend? If it depends on the environment/configuration, this server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin. I'm not sure, I don't lurk the Wiki much. About the best I can suggest is search the SA users mailing list archives for training dovecot. Thanks, I'll look into setting-up IMAP folders for individual users in some programmatic way. -Ben So, after disabling auto-learn (for now) and executing sa-learn --clear, and restarting Amavis, I'm still seeing this: No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RDNS_NONE=0.793,
Re: Very spammy messages yield BAYES_00 (-1.9)
On Thu, 16 Aug 2012, Ben Johnson wrote: So, after disabling auto-learn (for now) and executing sa-learn --clear, and restarting Amavis, I'm still seeing this: No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=disabled Why BAYES_00 still? Am I running the wrong command to clear the database? That's the correct command. Be sure that you're running it as the same user that amavis+SA is running as, otherwise you're clearing the wrong files. You might want to run sa-learn --dump magic afterwards to verify the database is cleared. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- ...the good of having the government prohibited from doing harm far outweighs the harm of having it obstructed from doing good. -- Mike@mike-istan --- 8 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 11:38 AM, John Hardin wrote: On Thu, 16 Aug 2012, Ben Johnson wrote: So, after disabling auto-learn (for now) and executing sa-learn --clear, and restarting Amavis, I'm still seeing this: No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=disabled Why BAYES_00 still? Am I running the wrong command to clear the database? That's correct. Be sure that you're running it as the same user that amavis+SA is running as, otherwise you're clearing the wrong files. You might want to run sa-learn --dump magic afterwards to verify the database is cleared. John, You were exactly right; I forgot to execute sa-learn --clear as the amavis user. What is the expected output of sa-learn --dump magic once the database has been cleared successfully? # su amavis -c 'sa-learn --dump magic' ERROR: Bayes dump returned an error, please re-run with -D for more information # su amavis -c 'sa-learn -D --dump magic' [...] dbg: bayes: no dbs present, cannot tie DB R/O: /var/lib/amavis/.spamassassin/bayes_toks [...] Is this to be expected? Or did I muck-up the works? Thanks again, -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
Hi, What will probably end up happening is this: (1) wipe your Bayes database (2) turn off autolearn (3) collect several hundred hams and spams for an initial training corpus (4) train using that corpus (5) evaluate results Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to prohibit manual training. You might also want to adjust the autolearn thresholds. What effect do whitelist entries have on autolearning and scores? In other words, my whitelist_from_rcvd entries add -100 to the score, which would be way beyond the -3 I have required for autolearn. I asked this question some years ago, but thought it was worthwhile to involve this factor again, and just make sure my understanding was correct. Thanks, Alex
Re: Very spammy messages yield BAYES_00 (-1.9)
On Thu, 16 Aug 2012, Ben Johnson wrote: On 8/16/2012 11:38 AM, John Hardin wrote: On Thu, 16 Aug 2012, Ben Johnson wrote: So, after disabling auto-learn (for now) and executing sa-learn --clear, and restarting Amavis, I'm still seeing this: No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=disabled Why BAYES_00 still? Am I running the wrong command to clear the database? That's the correct command. Be sure that you're running it as the same user that amavis+SA is running as, otherwise you're clearing the wrong files. You might want to run sa-learn --dump magic afterwards to verify the database is cleared. John, You were exactly right; I forgot to execute sa-learn --clear as the amavis user. What is the expected output of sa-learn --dump magic once the database has been cleared successfully? # su amavis -c 'sa-learn --dump magic' ERROR: Bayes dump returned an error, please re-run with -D for more information # su amavis -c 'sa-learn -D --dump magic' [...] dbg: bayes: no dbs present, cannot tie DB R/O: /var/lib/amavis/.spamassassin/bayes_toks [...] Is this to be expected? Or did I muck-up the works? Heh. I was expecting zeroes, but no dbs present is also a good confirmation that the Bayes database has been reset... :) You might need to restart amavis now, too. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- The United States has become a place where entertainers and professional athletes are mistaken for people of importance. -- Maureen Johnson Smith Long --- 8 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 12:32 PM, John Hardin wrote: On Thu, 16 Aug 2012, Ben Johnson wrote: On 8/16/2012 11:38 AM, John Hardin wrote: On Thu, 16 Aug 2012, Ben Johnson wrote: So, after disabling auto-learn (for now) and executing sa-learn --clear, and restarting Amavis, I'm still seeing this: No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=disabled Why BAYES_00 still? Am I running the wrong command to clear the database? That's correct. Be sure that you're running it as the same user that amavis+SA is running as, otherwise you're clearing the wrong files. You might want to run sa-learn --dump magic afterwards to verify the database is cleared. John, You were exactly right; I forgot to execute sa-learn --clear as the amavis user. What is the expected output of sa-learn --dump magic once the database has been cleared successfully? # su amavis -c 'sa-learn --dump magic' ERROR: Bayes dump returned an error, please re-run with -D for more information # su amavis -c 'sa-learn -D --dump magic' [...] dbg: bayes: no dbs present, cannot tie DB R/O: /var/lib/amavis/.spamassassin/bayes_toks [...] Is this to be expected? Or did I muck-up the works? Heh. I was expecting zeroes, but no dbs present is also a good confirmation that the Bayes database has been reset... :) You might need to restart amavis now, too. 
So, I preemptively restarted Amavis, per your suggestion (without executing su amavis -c 'sa-learn -D --dump magic' first), and when I executed the aforementioned command after the restart, I received the expected output: # su amavis -c 'sa-learn --dump magic' 0.000 0 3 0 non-token data: bayes db version 0.000 0 0 0 non-token data: nspam 0.000 0 0 0 non-token data: nham 0.000 0 0 0 non-token data: ntokens 0.000 0 0 0 non-token data: oldest atime 0.000 0 0 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 0 0 non-token data: last expiry atime 0.000 0 0 0 non-token data: last expire atime delta 0.000 0 0 0 non-token data: last expire reduction count All looks well. (I'm performing these actions in a test/development environment, by the way.) So, I went to follow the same procedure in production: # su amavis -c 'sa-learn --clear' # service amavis restart # su amavis -c 'sa-learn -D --dump magic' Yet this yields that familiar message: ERROR: Bayes dump returned an error, please re-run with -D for more information I waited a little while (at least an hour) and tried again. Same thing. I restarted Amavis again, same thing. A few minutes later, I decided to give it one last shot, and sure enough, I received the expected output with all zeros. It may be academic at this point, but I'm now curious as to what causes the DB file to be recreated, if not restarting Amavis. (It bears mention that plenty of mail came in between using the --clear switch and when using the --dump switch began to produce valid [all-zero] output. In other words, the DB didn't seem to be recreated when the first message was received after clearing the old DB and restarting Amavis.) In any event, at this point, I'm confused as to which user account I should be using when executing sa-learn --spam, for example. As a bit of background, I'm using ISPConfig 3, which implements virtual mailbox users via MySQL. 
I dug through the mailing list archive and found http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html , which seems to be relevant. Ultimately, I'm wondering if I should be using the amavis user to learn ham/spam, or individual mailbox user accounts. If it is possible to use either, are there pros and cons of which one should be aware before settling on an approach? As I mentioned previously, I would like to set-up LearnHam and LearnSpam folders for each IMAP user, eventually, so perhaps this answers my question? Thanks again for all the help! -Ben
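When repeating the clear-and-verify steps on several machines, it can help to reduce the --dump magic output to the three counters that matter. A minimal sketch, assuming the column layout shown in the output above (value in the third column, label after "non-token data:"); the function name bayes_counts is made up.

```shell
#!/bin/sh
# bayes_counts (hypothetical helper): read "sa-learn --dump magic" output
# on stdin and print the spam, ham and token counters on one line.
# Counters that do not appear in the input are reported as 0.
bayes_counts() {
    awk '$5 == "non-token" && $6 == "data:" {
             if ($7 == "nspam")   nspam = $3
             if ($7 == "nham")    nham = $3
             if ($7 == "ntokens") ntokens = $3
         }
         END { printf "nspam=%d nham=%d ntokens=%d\n", nspam, nham, ntokens }'
}

# Intended use (needs sa-learn and the amavis account, so left as a comment):
#   su amavis -c 'sa-learn --dump magic' | bayes_counts
```

After a successful --clear all three values should be 0, as in the test-environment output above.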
Re: Very spammy messages yield BAYES_00 (-1.9)
On Thu, 16 Aug 2012, Ben Johnson wrote: It may be academic at this point, but I'm now curious as to what causes the DB file to be recreated, if not restarting Amavis. (It bears mention that plenty of mail came in between using the --clear switch and when using the --dump switch began to produce valid [all-zero] output. In other words, the DB didn't seem to be recreated when the first message was received after clearing the old DB and restarting Amavis.) That I couldn't say, as I have no experience with Amavis. Somebody else may chime in, or you could ask that on the Amavis list. There might be an Amavis FAQ on how to properly reset the bayes database when using Amavis. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- USMC Rules of Gunfighting #20: The faster you finish the fight, the less shot you will get. --- 8 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On Thu, 16 Aug 2012 12:18:44 -0400 Alex wrote: Hi, What will probably end up happening is this: (1) wipe your Bayes database (2) turn off autolearn (3) collect several hundred hams and spams for an initial training corpus (4) train using that corpus (5) evaluate results Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to prohibit manual training. You might also want to adjust the autolearn thresholds. What effect do whitelist entries have on autolearning? None at all because they are marked as userconf. In other words, my whitelist_from_rcvd entries add -100 to the score, which would be way beyond the -3 I have required for autolearn. Setting a threshold of -3 is a bad idea unless you are going to write a lot of local rules with negative scores. The OP would be much better off zeroing the scores of the offending DNSWL rules.
Re: Very spammy messages yield BAYES_00 (-1.9)
On Thu, 16 Aug 2012, RW wrote: On Thu, 16 Aug 2012 12:18:44 -0400 Alex wrote: What effect do whitelist entries have on autolearning? None at all because they are marked as userconf. Bummer. In other words, my whitelist_from_rcvd entries add -100 to the score, which would be way beyond the -3 I have required for autolearn. Setting a threshold of -3 is a bad idea unless you are going to write a lot of local rules with negative scores. The OP would be much better off zeroing the scores of the offending DNSWL rules. Then we get to the situation where the administrator has to know to do that or SA goes off the rails. It seems that the proper approach is to set tflags noautolearn on any DNS-based base rule that has a negative score... -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- An operating system design that requires a system reboot in order to install a document viewing utility does not earn my respect. --- 8 days until the 1933rd anniversary of the destruction of Pompeii
Re: Very spammy messages yield BAYES_00 (-1.9)
On 15.08.2012 20:36, Ben Johnson wrote: Hello, Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why? This is a stock installation (Ubuntu package on 10.04). local.cf contains # Bayesian classifier auto-learning (default: 1) # # bayes_auto_learn 1 and I have not overridden the default elsewhere. So, presumably, auto-learning is enabled (if that's even relevant). While I have not trained the Bayesian filter manually to date, how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam? How could the Bayes classifier know that it is spammy, if no one has taught it what spam looks like? Start training it now. Others have run into this same problem, but I see no resolution; here is one such example: http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/ Outside of the above forum post, search query results for this issue are scant. Thanks for any help, -Ben -- Never thought the space i Program Files would be a problem in Linux Husse Apr 9 2007
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Ben Johnson wrote:

> Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why?

Poor training. Apart from the Bayes score, what kind of scores are those spams getting?

> While I have not trained the Bayesian filter manually to date,

Is there any provision for any manual training in your environment? Have you set up training folders where your users can submit messages for training? Do you run sa-learn at all?

> how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?

BAYES_00 implies that the message in question looks very similar to messages the Bayes system has been told are not spam. It depends solely on how it has been trained.

I wasn't aware that autolearning could do a cold-start of Bayes, can anyone confirm whether this is the case? If it can't then someone somewhere trained bayes up to the default minimum 200 hams and 200 spams needed for it to start classifying.

Before we offer suggestions, some more data from you please:

What version of SA is this?

What does sa-learn --dump magic report about your current Bayes database?

What are all of the bayes_* configuration options in your local config?

What will probably end up happening is this:

(1) wipe your Bayes database
(2) turn off autolearn
(3) collect several hundred hams and spams for an initial training corpus
(4) train using that corpus
(5) evaluate results

Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to prohibit manual training. You might also want to adjust the autolearn thresholds.

You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement the learning from these can be automatic or can go through you as a reviewer.

Recommendation: keep your manual training corpus around in case you need to do the above again for some reason.

-- John Hardin
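The wipe-and-retrain sequence in steps (1) to (5) can be sketched as a small shell script. This is only an illustration: the corpus directories and the `run`/`retrain` helper names are hypothetical, and the script defaults to a dry run that prints the `sa-learn` commands instead of executing them. The commands would need to run as whichever user owns the Bayes database the scanner actually reads (the `amavis` user in this thread).

```shell
#!/bin/sh
# Sketch of a wipe-and-retrain cycle. The corpus paths are hypothetical.
# DRY_RUN=1 (the default) only prints the commands so they can be reviewed.
HAM_DIR="${HAM_DIR:-/path/to/corpus/ham}"     # hypothetical location
SPAM_DIR="${SPAM_DIR:-/path/to/corpus/spam}"  # hypothetical location

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # dry run: show the command only
    else
        "$@"               # real run: execute it
    fi
}

retrain() {
    run sa-learn --clear             # (1) wipe the existing Bayes database
    # (2) set "bayes_auto_learn 0" in local.cf by hand before continuing
    run sa-learn --ham "$HAM_DIR"    # (3)+(4) train on hand-sorted ham
    run sa-learn --spam "$SPAM_DIR"  #         and on hand-sorted spam
    run sa-learn --dump magic        # (5) inspect nspam/nham afterwards
}

retrain
```

Setting `DRY_RUN=0` executes the commands for real; keeping the corpus directories around afterwards makes it possible to repeat the cycle later.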
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Jari Fredriksson wrote:

> 15.08.2012 20:36, Ben Johnson wrote:
>
>> While I have not trained the Bayesian filter manually to date, how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?
>
> How could the Bayes classifier know that it is spammy, if no one makes it learn what spam looks like? Start training it now.

If he's getting BAYES_00 hits, _something_ has trained it.

-- John Hardin
Re: Very spammy messages yield BAYES_00 (-1.9)
From: Ben Johnson b...@indietorrent.org
Date: Wed, 15 Aug 2012 13:36:08 -0400

> Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why?
>
> This is a stock installation (Ubuntu package on 10.04).

Most likely you've let autolearn learn a large number of spam messages as ham. Any autolearn mistakes need to be corrected. One or two spam messages with BAYES_00 is not a problem, but a large number of them indicates a serious problem with learning. If you have the old spam messages then you can retrain correctly. Otherwise it would probably be best to start over by deleting the bayes database.

> local.cf contains
>
>     # Bayesian classifier auto-learning (default: 1)
>     #
>     # bayes_auto_learn 1
>
> and I have not overridden the default elsewhere. So, presumably, auto-learning is enabled (if that's even relevant).
>
> While I have not trained the Bayesian filter manually to date, how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?

Yes, BAYES_00 says the spam probability is between 0 and 1%.

> http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/
>
> Outside of the above forum post, search query results for this issue are scant.

There have been numerous posts on BAYES.

-jeff
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 2:24 PM, John Hardin wrote:

> On Wed, 15 Aug 2012, Ben Johnson wrote:
>
>> Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why?
>
> Poor training.

John, I can't thank you enough for the thoroughness of your response.

> Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two of which are two of VERY few in which the BAYES_* value is over 00):

-

No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_RHS_DOB=1.514] autolearn=no

-

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue.

>> While I have not trained the Bayesian filter manually to date,
>
> Is there any provision for any manual training in your environment? Have you set up training folders where your users can submit messages for training? Do you run sa-learn at all?

No, there is no provision. No, I have not set up training folders, and no, I have not run sa-learn manually at all.

Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing .

>> how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?
>
> BAYES_00 implies that the message in question looks very similar to messages the Bayes system has been told are not spam. It depends solely on how it has been trained.
>
> I wasn't aware that autolearning could do a cold-start of Bayes, can anyone confirm whether this is the case? If it can't then someone somewhere trained bayes up to the default minimum 200 hams and 200 spams needed for it to start classifying.
>
> Before we offer suggestions, some more data from you please:
>
> What version of SA is this?

    # spamassassin --version
    SpamAssassin version 3.3.1
      running on Perl version 5.10.1

> What does sa-learn --dump magic report about your current Bayes database?

    # sa-learn --dump magic
    ERROR: Bayes dump returned an error, please re-run with -D for more information

    # su amavis -c 'sa-learn --dump magic'
    0.000          0          3          0  non-token data: bayes db version
    0.000          0      11499          0  non-token data: nspam
    0.000          0      39412          0  non-token data: nham
    0.000          0     197769          0  non-token data: ntokens
    0.000          0 1344331893          0  non-token data: oldest atime
    0.000          0 1345056746          0  non-token data: newest atime
    0.000          0 1345053771          0  non-token data: last journal sync atime
    0.000          0 1345023550          0  non-token data: last expiry atime
    0.000          0     345600          0  non-token data: last expire atime delta
    0.000          0       6482          0  non-token data: last expire reduction count

> What are all of the bayes_* configuration options in your local config?

None are defined there. There are a few defaults/examples, but they are commented-out.

> What will probably end up happening is this:
>
> (1) wipe your Bayes database
> (2) turn off autolearn
> (3) collect several hundred hams and spams for an initial training corpus
> (4) train using that corpus
> (5) evaluate results
>
> Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to prohibit manual training. You might also want to adjust the autolearn thresholds.

That makes sense; thank you for the suggestion.

> You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement the
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Ben Johnson wrote:

> On 8/15/2012 2:24 PM, John Hardin wrote:
>
>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>
>>> Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains BAYES_00=-1.9 in the tests portion of the X-Spam-Status header. Might anyone know why?
>>
>> Poor training.
>
> John, I can't thank you enough for the thoroughness of your response.

I like to show off. :)

>> Apart from the Bayes score, what kind of scores are those spams getting?
>
> Here are a few examples (the first two of which are two of VERY few in which the BAYES_* value is over 00):
>
> -
>
> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>
> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=no
>
> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>
> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_RHS_DOB=1.514] autolearn=no
>
> -

It might be interesting to see some log entries where autolearn=yes...

> It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue.

This might be a major contributing factor. If your system was taught from scratch by autolearn, and DNSWL (which is fairly well trusted) has been pushing a lot of spams to low scores...

You might want to set:

    bayes_auto_learn_threshold_nonspam -3

That won't _fix_ the problem (at least not quickly) or avoid the need to wipe and retrain, but it might keep things from getting worse. See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.

> Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing .

That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box.

>> What version of SA is this?
>
>     # spamassassin --version
>     SpamAssassin version 3.3.1
>       running on Perl version 5.10.1

A little stale, but not bad.

>> You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement the learning from these can be automatic or can go through you as a reviewer.
>
> That sounds like a good idea. Is there a particular HOW TO or tutorial that you recommend? If it depends on the environment/configuration, this server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.

I'm not sure, I don't lurk the Wiki much. About the best I can suggest is search the SA users mailing list archives for training dovecot.

-- John Hardin
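The arithmetic behind the quoted X-Spam-Status lines can be checked with a short awk sketch. It sums the per-test scores from the third status line (score=-0.836) and then repeats the sum with BAYES_00 and RCVD_IN_DNSWL_MED left out, which shows how much those two rules pull the total down. The `sum_tests` function name is made up for this illustration; only the test list itself comes from the thread.

```shell
#!/bin/sh
# Sum the per-test scores of an X-Spam-Status "tests=[...]" list, optionally
# leaving out rules whose names match a regex. Hypothetical helper name.
sum_tests() {
    # $1 = comma-separated NAME=score list, $2 = regex of rule names to skip
    printf '%s\n' "$1" | awk -v skip="$2" '
        BEGIN { RS = "," }                  # one NAME=score entry per record
        {
            gsub(/[ \n]/, "")               # drop padding around the entry
            split($0, kv, "=")
            if (skip != "" && kv[1] ~ skip) next
            total += kv[2]
        }
        END { printf "%.3f\n", total }'
}

tests='BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122'

sum_tests "$tests" ''                                # prints -0.836
sum_tests "$tests" '^(BAYES_00|RCVD_IN_DNSWL_MED)$'  # prints 3.364
```

Without the two negative rules, the same message lands above the tag2=3 value shown in those headers, which is consistent with the point that the whitelist hit and the mistrained Bayes result together are masking otherwise detectable spam.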
Re: Very spammy messages yield BAYES_00 (-1.9)
John Hardin wrote:

> I wasn't aware that autolearning could do a cold-start of Bayes, can anyone confirm whether this is the case?

If you let it run long enough to pass the 200/200 ham/spam thresholds, yes; there's no distinction I've ever met about where the learning came from.

That said, I wouldn't trust a pure autolearn setup with stock autolearn thresholds - all too much spam will get learned scoring under 0.1. :(

-kgd
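Whether a database has passed the 200/200 minimum mentioned above can be read off the `nspam` and `nham` rows of `sa-learn --dump magic`. The sketch below is a hedged illustration: the `bayes_counts` function name is hypothetical, and the sample input is the set of figures Ben posted earlier in the thread rather than live output.

```shell
#!/bin/sh
# Extract nspam/nham from "sa-learn --dump magic" output and compare both
# with the 200-message minimum discussed in the thread. Hypothetical helper.
bayes_counts() {
    awk -v min=200 '
        /non-token data: nspam/ { nspam = $3 }   # third column holds the count
        /non-token data: nham/  { nham  = $3 }
        END {
            ready = (nspam >= min && nham >= min) ? "yes" : "no"
            printf "nspam=%d nham=%d ready=%s\n", nspam, nham, ready
        }'
}

# Live use would look like: su amavis -c 'sa-learn --dump magic' | bayes_counts
bayes_counts <<'EOF'
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
EOF
```

For the posted figures this reports that both counts are well past the minimum, so the classifier is active; it says nothing about whether what was learned is correct.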
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 4:19 PM, Kris Deugau wrote:

> John Hardin wrote:
>
>> I wasn't aware that autolearning could do a cold-start of Bayes, can anyone confirm whether this is the case?
>
> If you let it run long enough to pass the 200/200 ham/spam thresholds, yes; there's no distinction I've ever met about where the learning came from.
>
> That said, I wouldn't trust a pure autolearn setup with stock autolearn thresholds - all too much spam will get learned scoring under 0.1. :(
>
> -kgd

It's a bit disappointing to learn this (pardon the pun), given:

a.) This exchange between John Hardin and me, which occurred previously in this thread:

---8<---

Me: Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing .

John: That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box.

---8<---

b.) The default value for bayes_auto_learn is 1 (on). (At least in my particular distribution.)

Correct me if I'm wrong, but this issue's root cause seems to be that bayes_auto_learn was on, out-of-the-box, yet I was not complementing its efficacy via sa-learn. Is this an accurate summary?

Because if so, it seems prudent to change the default bayes_auto_learn value to zero, and scorn any package maintainer or developer who modifies it, or, alternatively, put a banner, at font-size 100em, on the SpamAssassin homepage that issues an unmistakable warning about Bayesian training's importance.

(John, I'll respond to your most recent message tomorrow most likely; had enough for one day!)

Thank you,

-Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Kris Deugau wrote:

> John Hardin wrote:
>
>> I wasn't aware that autolearning could do a cold-start of Bayes, can anyone confirm whether this is the case?
>
> If you let it run long enough to pass the 200/200 ham/spam thresholds, yes; there's no distinction I've ever met about where the learning came from.
>
> That said, I wouldn't trust a pure autolearn setup with stock autolearn thresholds - all too much spam will get learned scoring under 0.1. :(

Right. It might be prudent to review the defaults before the next major release.

-- John Hardin
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 5:00 PM, John Hardin wrote:

> Right. It might be prudent to review the defaults before the next major release.

I wonder if we shouldn't disable auto-learning by default (assuming it's on by default)... Bayes should really be trained.
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Kevin A. McGrail wrote:

> On 8/15/2012 5:00 PM, John Hardin wrote:
>
>> Right. It might be prudent to review the defaults before the next major release.
>
> I wonder if we shouldn't disable auto-learning by default (assuming it's on by default)...

It is.

> Bayes should really be trained.

I might not go so far as to say autolearn should be disabled by default, as it is a major good if well trained; but setting the defaults extreme enough that it is reliably, if slowly, initially trained seems to me a fair middle ground.

Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better than the current 0.1), then someone who actually wants to configure it can adjust based on how well it's performing and whether they want autolearn on at all.

-- John Hardin
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, John Hardin wrote:

> I might not go so far as to say autolearn should be disabled by default, as it is a major good if well trained;

Sorry, poor wording, I meant to say "as _Bayes_ is a major good if well trained".

-- John Hardin
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 5:18 PM, John Hardin wrote:

> On Wed, 15 Aug 2012, Kevin A. McGrail wrote:
>
>> On 8/15/2012 5:00 PM, John Hardin wrote:
>>
>>> Right. It might be prudent to review the defaults before the next major release.
>>
>> I wonder if we shouldn't disable auto-learning by default (assuming it's on by default)...
>
> It is.
>
>> Bayes should really be trained.
>
> I might not go so far as to say autolearn should be disabled by default, as it is a major good if well trained; but setting the defaults extreme enough that it is reliably, if slowly, initially trained seems to me a fair middle ground.
>
> Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better than the current 0.1), then someone who actually wants to configure it can adjust based on how well it's performing and whether they want autolearn on at all.

Can you open a bug about that and let's see if we can get that done? I agree that a slower training threshold makes sense.

--
Kevin A. McGrail
President
Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422
http://www.pccc.com/
703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
kmcgr...@pccc.com
Re: Very spammy messages yield BAYES_00 (-1.9)
Dumb question: How can I set the autolearn thresholds?

On Aug 15, 2012, at 2:18 PM, John Hardin jhar...@impsec.org wrote:

> Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better than the current 0.1)
Re: Very spammy messages yield BAYES_00 (-1.9)
On 08/15/2012 11:28 PM, JP Kelly wrote:

> Dumb question: How can I set the autolearn thresholds?
>
> On Aug 15, 2012, at 2:18 PM, John Hardin jhar...@impsec.org wrote:
>
>> Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better than the current 0.1)

In local.cf:

    bayes_auto_learn_threshold_nonspam -3.0

    # uncomment and change the line below if you want to raise or lower the spam learning threshold
    #bayes_auto_learn_threshold_spam 15.0

Then reload spamd or your glue.

h2h

Axb
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 5:28 PM, JP Kelly wrote:

> Dumb question: How can I set the autolearn thresholds?

perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold

bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
    The score threshold below which a mail has to score, to be fed into SpamAssassin's learning systems automatically as a non-spam message.

bayes_auto_learn_threshold_spam n.nn (default: 12.0)
    The score threshold above which a mail has to score, to be fed into SpamAssassin's learning systems automatically as a spam message.

    Note: SpamAssassin requires at least 3 points from the header, and 3 points from the body to auto-learn as spam. Therefore, the minimum working value for this option is 6.

Regards,
KAM
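How the two thresholds quoted from the perldoc interact can be illustrated with a deliberately simplified sketch. It assumes only the "below nonspam threshold" and "above spam threshold" comparisons described there; it ignores the 3-point header and body requirement and the fact that SpamAssassin computes a separate score for the autolearn decision, so the `autolearn_decision` function (a made-up name) is a reading aid, not a model of the plugin. The score of -2.0 is a hypothetical example of a spam dragged down by a whitelist hit.

```shell
#!/bin/sh
# Simplified illustration of the two autolearn thresholds. Hypothetical
# helper: it does not reproduce the real plugin's score calculation.
autolearn_decision() {
    # $1 = score, $2 = nonspam threshold, $3 = spam threshold
    awk -v s="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if      (s + 0 < ham + 0)  print "ham"    # low enough to learn as ham
        else if (s + 0 > spam + 0) print "spam"   # high enough to learn as spam
        else                       print "no"     # in between: not learned
    }'
}

autolearn_decision -2.0 0.1 12.0    # stock thresholds: learned as ham
autolearn_decision -2.0 -3.0 12.0   # nonspam threshold of -3: not learned
```

With the stock 0.1 threshold the hypothetical message would be fed to Bayes as ham; with -3 it falls between the thresholds and is left alone, which is the "slower but safer" behaviour argued for in the thread.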
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012 17:05:00 -0400 Kevin A. McGrail wrote:

> On 8/15/2012 5:00 PM, John Hardin wrote:
>
>> Right. It might be prudent to review the defaults before the next major release.
>
> I wonder if we shouldn't disable auto-learning by default (assuming it's on by default)... Bayes should really be trained.

It seems to me that bug 6344 from 2010 has some merit. (I was about to file something similar myself.)

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

This suggests that lists like RCVD_IN_DNSWL_* should be marked as noautolearn so that when they fail they don't screw up autolearning, which is what appears to have happened here. This is exacerbated by the fact that autolearning won't learn against a strong Bayes result (quite rightly), so damage can become permanent.
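Until the stock rules change, a site could apply either workaround locally. The sketch below only prints a candidate local.cf fragment so it can be reviewed before use; the `dnswl_overrides` function name is made up, and the `nice net` flags on the commented-out `tflags` line are an assumption about the stock rule definition that should be confirmed against the installed rules files.

```shell
#!/bin/sh
# Print two alternative local.cf overrides for the DNSWL rule seen in this
# thread. Review the output, append one option to local.cf, then run
# "spamassassin --lint" before reloading. Hypothetical helper name.
dnswl_overrides() {
    cat <<'EOF'
# Option A: zero the score so the rule no longer contributes at all
score RCVD_IN_DNSWL_MED 0

# Option B: keep the score but exclude the rule from the autolearn decision.
# ASSUMPTION: the stock flags are "nice net"; a tflags line replaces the whole
# flag list, so check the existing flags in your rules directory first.
#tflags RCVD_IN_DNSWL_MED nice net noautolearn
EOF
}

dnswl_overrides
```

Option A is the blunt fix suggested elsewhere in the thread; Option B is closer to what bug 6344 proposes, at the cost of having to track the rule's existing flags by hand.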
Re: Very spammy messages yield BAYES_00 (-1.9)
On Wed, 15 Aug 2012, Kevin A. McGrail wrote:

> On 8/15/2012 5:18 PM, John Hardin wrote:
>
>> I might not go so far as to say autolearn should be disabled by default, as it is a major good if well trained; but setting the defaults extreme enough that it is reliably, if slowly, initially trained seems to me a fair middle ground.
>>
>> Setting the ham default threshold to -3 or even -5 seems prudent (_much_ better than the current 0.1), then someone who actually wants to configure it can adjust based on how well it's performing and whether they want autolearn on at all.
>
> Can you open a bug about that and let's see if we can get that done? I agree that a slower training threshold makes sense.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6828

-- John Hardin