Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/20/2013 3:20 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-20 05:02:
>
>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>
> is amavis have homedir in /var/lib/ ?

The amavis user's home directory is /var/lib/amavis. This seems to be the
default on Ubuntu; I didn't change this path.

> in gentoo its default as /var/amavis where the .spamassassin dir is
> created by amavisd
>
> use user_prefs to set bayes_path does not make sense if sql is used

Thanks; I did comment-out the "bayes_path" directive. I figured that it
wouldn't matter whether it is commented or not, in the presence of
SQL-related directives, but it can't hurt to comment-out this line.

>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis
>
> +1 to this one since amavis cant use multiple sa users very easy, but
> depending on what amavis it being supported with complicated setups :(

I only need one Bayes user, so this setup is adequate.

> i changed away from amavisd to clamav-milter, spampd in postfix after
> queue, this is working very well for me, and i hope sa 3.4.x will not
> break spampd :=)

Hmm, I will consider your sound advice in this regard. amavis does seem
to be overly memory-hungry (despite setting $max_servers = 1 and
$max_requests = 1). If there is a better alternative, I'm all ears (or
eyes, as the case may be).

>> Is something more required to ensure that my mail system, which runs
>> under the "amavis" user, is always reading from and writing to the
>> same DB?
>
> nope just remember that amavis also reads .spamassassin/user_prefs
>
> if you like you can copy that user_prefs to /root/.spamassassin so you
> dont have to remember something :)
>
> user_prefs should ONLY be readable by that user that runs it

Thanks for pointing this out. I will double-check the permissions.

I'll respond to your other email momentarily.

Thanks, Benny!

-Ben
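Benny's point (user_prefs should be readable only by the user that runs SA, since it may hold SQL credentials) can be checked mechanically. A minimal sketch, demonstrated under /tmp; on a real box HOMEDIR would be the amavis user's home, e.g. /var/lib/amavis:

```shell
# Demo paths; substitute the real amavis home directory on a live system.
HOMEDIR=/tmp/amavis-demo
mkdir -p "$HOMEDIR/.spamassassin"
touch "$HOMEDIR/.spamassassin/user_prefs"

# Lock the file down so only its owner can read or write it.
chmod 600 "$HOMEDIR/.spamassassin/user_prefs"

# Verify: should print 600.
stat -c '%a' "$HOMEDIR/.spamassassin/user_prefs"
```

The same chmod would apply to the copy in /root/.spamassassin that Benny suggests keeping.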
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 19:01:

> Welp, that'll do it! How those four files were set to root:root
> ownership is beyond me,

that means that root has been doing some testing :) afterwards amavisd
cant write. you should change to the amavis user before testing:

su amavis -c cmd foo

> but that was certainly a factor. Maybe this was a result of executing
> my training script as root

yep, relearn scripts should be run from cron as the amavisd user, to
keep files owned by amavis; if run as root they would change to be owned
by root. change the setup so cron starts it as amavis, then it works

> (even though I had hard-coded the bayes_path to use
> /var/lib/amavis/.spamassassin/bayes, and when using SQL, hard-coded
> bayes_sql_override_username to use amavis)?

do you want sql bayes? you are still using a dbm-based bayes setup, but
sql bayes does not use bayes_path

> I changed ownership to amavis:amavis and now messages are being scored
> with Bayes (all of them, from what I can tell so far).
>
> Also, I looked into the fact that I was running the cron job that
> trains ham and spam as root. I did this only because the amavis user
> lacks access to /var/vmail, which is where mail is stored on this
> system. (As a corollary, I'm a bit curious as to how amavis is able to
> scan incoming mail, given this lack of access -- maybe it does so
> using a pipe or some other method that does not require access to
> /var/vmail.)

if you use sql-based bayes, then you can change the learn scripting to
be run by the vmail user, problem solved. remember vmail should then
have user_prefs :)

> I think the disconnect was in the fact that I placed my custom
> configuration directives in /etc/spamassassin/local.cf, when I should
> have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
> message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
> training). (Thanks for pointing-out this mistake, Benny P.)
yep, this is a common error. i also remember pyzord was in the latest
ebuild setup to run as root, but hey, it uses a udp port above 1023 :)

> Putting my custom SA configuration directives in both of these files
> was the only way I was able to train mail and score incoming messages
> using the same credentials "across-the-board".

its possible to use dovecot-antispam? then it will call sa-learn per
msg, with the user that owns vmail, but i dropped it since i still have
not upgraded to dovecot 2.x yet

> Once I did this, I was able to use SQL or flat-file DB with the same
> exact results.

progress ?

> Is there a better way to achieve this consistency, aside from putting
> duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
> /root/.spamassassin/user_prefs?

nope, this is the perfect way, also security-wise

> Feels like I'm out of the woods here! Thanks for all the expert help,
> guys.

+1

-- 
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
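Benny's advice above (run the relearn job from cron as the amavis user, never as root) amounts to a one-line crontab change. A hedged sketch; the script path and schedule are hypothetical placeholders, not anything from this thread:

```shell
# Option 1: install the job in the amavis user's own crontab (run as root):
#   crontab -u amavis -e
#   15 3 * * * /usr/local/bin/relearn-bayes.sh     # hypothetical script name
#
# Option 2: keep the entry in root's crontab but drop privileges explicitly:
#   15 3 * * * su amavis -s /bin/sh -c '/usr/local/bin/relearn-bayes.sh'
#
# Either way, any files sa-learn creates or touches stay owned by amavis,
# so amavisd can still read and write them afterwards.
echo "relearn job should run as: amavis"
```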
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 05:02:

> Yes, I believe that me and the system always execute SA commands as the
> "amavis" user. When I was using the SQL setup, I had the following in
> local.cf:
>
> bayes_path /var/lib/amavis/.spamassassin/bayes

is amavis have homedir in /var/lib/ ? in gentoo its default as
/var/amavis where the .spamassassin dir is created by amavisd

use user_prefs to set bayes_path does not make sense if sql is used

> With the DBM setup, I had the following (I have since commented it out,
> while attempting to debug this Bayes issue):
>
> bayes_sql_override_username amavis

+1 to this one since amavis cant use multiple sa users very easy, but
depending on what amavis it being supported with complicated setups :(

i changed away from amavisd to clamav-milter, spampd in postfix after
queue, this is working very well for me, and i hope sa 3.4.x will not
break spampd :=)

> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the
> same DB?

nope just remember that amavis also reads .spamassassin/user_prefs

if you like you can copy that user_prefs to /root/.spamassassin so you
dont have to remember something :)

user_prefs should ONLY be readable by that user that runs it

-- 
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 04:40:

> By "feed it a few thousand NEW spams", do you mean to scrap the
> training corpora that I've hand-sorted in favor of starting over? Or
> do you mean to clear the database and re-run the training script
> against the corpora?

ls /path/to/maildir/spam >/tmp/spam
cd /path/to/maildir/spam
sa-learn --spam --progress -f /tmp/spam

ls /path/to/maildir/ham >/tmp/ham
cd /path/to/maildir/ham
sa-learn --ham --progress -f /tmp/ham

do this for each bayes user, depending on how your setup is. this should
basically be it, to get bayes on track again

-- 
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
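Benny's per-corpus recipe above can be wrapped in a small function. A sketch with the sa-learn call left commented out so the listing step can be demonstrated on its own; all paths are placeholders. Note why the `cd` matters: the list file holds bare filenames, so sa-learn must run from the corpus directory for `-f` to resolve them.

```shell
# Build a file list for one corpus and (on a real system) feed it to sa-learn.
train() {
  dir=$1; class=$2             # class is "spam" or "ham"
  ls "$dir" > "/tmp/$class.list"
  cd "$dir" || return 1
  # sa-learn --"$class" --progress -f "/tmp/$class.list"  # uncomment for real use
  wc -l < "/tmp/$class.list"   # how many messages would be fed in
}

# Demo corpus under /tmp standing in for /path/to/maildir/spam.
mkdir -p /tmp/corpus-spam
touch /tmp/corpus-spam/msg1 /tmp/corpus-spam/msg2 /tmp/corpus-spam/msg3
train /tmp/corpus-spam spam    # prints 3
```

Run once per class (and, in a multi-user setup, once per bayes user).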
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, the problem seems not to be SQL-specific, as it occurs with SQL or
flat-file DB.

Upon following Benny Pedersen's advice (to move SA configuration
directives from /etc/spamassassin/local.cf to
/var/lib/amavis/.spamassassin/user_prefs), I noticed something unusual:

$ ls -lah /var/lib/amavis/.spamassassin/
total 7.6M
drwx------ 2 amavis amavis 4.0K Apr 20 08:54 .
drwxr-xr-x 7 amavis amavis 4.0K Apr 20 08:56 ..
-rw------- 1 root   root   8.0K Apr 20 08:33 bayes_journal
-rw------- 1 root   root   1.3M Apr 20 00:09 bayes_seen
-rw------- 1 root   root   4.8M Apr 20 08:29 bayes_toks
-rw-r--r-- 1 root   root    799 Jun 28  2004 gtube.txt
-rw-r--r-- 1 amavis amavis 2.7K Apr 20 08:55 user_prefs

Welp, that'll do it! How those four files were set to root:root
ownership is beyond me, but that was certainly a factor. Maybe this was
a result of executing my training script as root (even though I had
hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
and when using SQL, hard-coded bayes_sql_override_username to use
amavis)?

I changed ownership to amavis:amavis and now messages are being scored
with Bayes (all of them, from what I can tell so far).

Also, I looked into the fact that I was running the cron job that trains
ham and spam as root. I did this only because the amavis user lacks
access to /var/vmail, which is where mail is stored on this system. (As
a corollary, I'm a bit curious as to how amavis is able to scan incoming
mail, given this lack of access -- maybe it does so using a pipe or some
other method that does not require access to /var/vmail.)

I think the disconnect was in the fact that I placed my custom
configuration directives in /etc/spamassassin/local.cf, when I should
have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
training). (Thanks for pointing-out this mistake, Benny P.)
Putting my custom SA configuration directives in both of these files was
the only way I was able to train mail and score incoming messages using
the same credentials "across-the-board". Once I did this, I was able to
use SQL or flat-file DB with the same exact results.

Is there a better way to achieve this consistency, aside from putting
duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
/root/.spamassassin/user_prefs?

Feels like I'm out of the woods here! Thanks for all the expert help,
guys.

-Ben
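The root-owned bayes files described above are easy to detect before they bite. A hedged sketch, demonstrated under /tmp with the current user standing in for amavis; on a real system the directory and expected owner are as shown in the comments:

```shell
SA_DIR=/tmp/sa-demo            # stand-in for /var/lib/amavis/.spamassassin
WANT_USER=$(id -un)            # on a real box this would be: amavis

# Demo files; on a live system these already exist.
mkdir -p "$SA_DIR"
touch "$SA_DIR/bayes_toks" "$SA_DIR/bayes_seen" "$SA_DIR/bayes_journal"

# List any bayes_* files NOT owned by the expected user (empty output = OK).
find "$SA_DIR" -maxdepth 1 -name 'bayes_*' ! -user "$WANT_USER" -print

# The actual fix, run as root on a real system:
# chown amavis:amavis /var/lib/amavis/.spamassassin/bayes_*
```

Dropping a check like this into the training script would catch a root-owned database immediately after an accidental root-run of sa-learn.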
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for the rapid-fire here folks, but I wanted to correct
something. I had these backwards:

>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>>
>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis

I meant to say that I have *always* had

bayes_path /var/lib/amavis/.spamassassin/bayes

in local.cf, and using the SQL setup, I added

bayes_sql_override_username amavis

Sorry for the confusion!

-Ben

On 4/19/2013 11:02 PM, Ben Johnson wrote:
> On 4/19/2013 1:54 PM, Benny Pedersen wrote:
>> Ben Johnson skrev den 2013-04-19 18:02:
>>
>>> Still stumped here...
>>
>> for amavisd-new, put spamassassin sql setup into user_prefs file for the
>> user amavisd-new runs as might be working better then have insecure sql
>> settings in /etc/mail/spamassassin :)
>>
>> i dont know if this is really that you have another user for amavisd,
>> and test spamassassin -t msg with another user that uses another sql user ?
>>
>> make sure both users is really using same sql user as intended
>
> Benny, thanks for the suggestion regarding moving the SA SQL setup into
> user_prefs. I will look into that soon.
>
> Yes, I believe that me and the system always execute SA commands as the
> "amavis" user. When I was using the SQL setup, I had the following in
> local.cf:
>
> bayes_path /var/lib/amavis/.spamassassin/bayes
>
> With the DBM setup, I had the following (I have since commented it out,
> while attempting to debug this Bayes issue):
>
> bayes_sql_override_username amavis
>
> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the same
> DB?
>
> Best regards,
>
> -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 1:54 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-19 18:02:
>
>> Still stumped here...
>
> for amavisd-new, put spamassassin sql setup into user_prefs file for the
> user amavisd-new runs as might be working better then have insecure sql
> settings in /etc/mail/spamassassin :)
>
> i dont know if this is really that you have another user for amavisd,
> and test spamassassin -t msg with another user that uses another sql user ?
>
> make sure both users is really using same sql user as intended

Benny, thanks for the suggestion regarding moving the SA SQL setup into
user_prefs. I will look into that soon.

Yes, I believe that me and the system always execute SA commands as the
"amavis" user. When I was using the SQL setup, I had the following in
local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

With the DBM setup, I had the following (I have since commented it out,
while attempting to debug this Bayes issue):

bayes_sql_override_username amavis

Is something more required to ensure that my mail system, which runs
under the "amavis" user, is always reading from and writing to the same
DB?

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
>
>> Still stumped here...
>
> do a bayes sa-learn --backup
>
> switch to file based in SDBM format (which is fast)
>
> do a
>
> sa-learn --restore
>
> feed it a few thousand NEW spams
>
> see what happens

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the
corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?

I commented-out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavis-new. I also cleared the existing DB
tokens (with "sa-learn --clear") after amavis had restarted, and then
executed my normal training script against my ham and spam corpora. I'll
keep an eye on incoming messages to see if those that "slip through" and
score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by using utf8 to store the token data
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using utf8_bin (instead of ascii)
introduce a problem for SA (despite no errors during "sa-learn" training
or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.
Since switching back to a DBM Bayes setup, the results look pretty much
as expected (wrapped), and this is the type of thing I expect to see on
every message:

---
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'

dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
  bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
  extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
  get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
  compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5 (0.2%),
  tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%), check_bayes: 26 (0.9%),
  tests_pri_0: 836 (28.6%), dkim_load_modules: 27 (0.9%),
  check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
  check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
  tests_pri_500: 988 (33.8%)
---

I'll wait and see if I receive messages without Bayes results and report
back.

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-19 18:02:

> Still stumped here...

for amavisd-new, putting the spamassassin sql setup into the user_prefs
file for the user amavisd-new runs as might work better than having
insecure sql settings in /etc/mail/spamassassin :)

i dont know if the problem is really that you have another user for
amavisd, and test spamassassin -t msg with another user that uses
another sql user ?

make sure both users is really using same sql user as intended

-- 
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
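For reference, the kind of user_prefs fragment Benny is suggesting might look like the following. The directive names are real SpamAssassin options, but the DSN, database name, and credentials are placeholders; also note that SpamAssassin treats the bayes_sql_* storage options as administrator settings, so verify they are actually honored from user_prefs in your setup (e.g. with spamassassin -D --lint run as the amavis user):

```shell
# Write a hypothetical ~/.spamassassin/user_prefs fragment (to /tmp here,
# purely for demonstration). All values below are placeholders.
cat > /tmp/user_prefs.example <<'EOF'
bayes_store_module          Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn               DBI:mysql:sa_bayes:localhost:3306
bayes_sql_username          sa_user
bayes_sql_password          CHANGE_ME
bayes_sql_override_username amavis
EOF

# Per Benny's earlier advice: only the owner should be able to read the
# credentials.
chmod 600 /tmp/user_prefs.example
```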
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
John Hardin skrev den 2013-04-18 04:15:

>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

unicode is overkill since bayes is just ascii; if unicode is used it
will create a bigger db, and that will slow things down more than ascii

> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

i dont know how bayes in 3.4.x is nowadays, its long since i have seen
the source for it, but i made some changes to bayes mysql so it can be
cleaned up with timed expire of data; this is probably lost in
transition with 3.4.x :(

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.

it was documented ?

-- 
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 04/19/2013 06:02 PM, Ben Johnson wrote:

> Still stumped here...

do a bayes sa-learn --backup

switch to file based in SDBM format (which is fast)

do a

sa-learn --restore

feed it a few thousand NEW spams

see what happens
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
>
>>> Is this normal? If so, what is the explanation for this behavior? I
>>> have marked dozens of nearly-identical messages with the subject
>>> "Garden hose expands up to three times its length" as SPAM (over the
>>> course of several weeks), and yet SA reports "not enough usable
>>> tokens found".
>
> If they are identical, I don't believe it will create new tokens, per se.
>
>>> Is SA referring to the number of tokens in the message? Or the Bayes
>>> DB?
>
> I should also mention that while training a message, use "--progress",
> as such (assuming you're running it on an mbox or message that's in
> mbox format):
>
> # sa-learn --progress --spam --mbox mymboxfile
>
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
>
> add_header all Tok-Stat _TOKENSUMMARY_
>
> If you run spamassassin on a message directly, and add the -t option,
> it will show you the number of different types of tokens found in the
> message:
>
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
>
> Regards,
> Alex

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands-out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB??? Or only the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.
The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that while SA will not insert a token twice, it *will*
increase the token "count". Here's an excerpt from Mark's comment from
that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able to
be inserted once, but their counts cannot increase, leading to terrible
bayes results if the bug is not noticed. Also the conversion from db
fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (wrapped, as the table is wide):

id  username  spam_count  ham_count  token_count  last_expire
1   amavis    6185        2427       120092       1366364379

last_atime_delta  last_expire_reduce  oldest_token_age  newest_token_age
8380417           14747               1357985848        1366386865

The SQL query:

SELECT count( * ) FROM `bayes_token`

returns 120092 rows, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

>> Is this normal? If so, what is the explanation for this behavior? I
>> have marked dozens of nearly-identical messages with the subject
>> "Garden hose expands up to three times its length" as SPAM (over the
>> course of several weeks), and yet SA reports "not enough usable
>> tokens found".

If they are identical, I don't believe it will create new tokens, per se.

>> Is SA referring to the number of tokens in the message? Or the Bayes
>> DB?

I should also mention that while training a message, use "--progress",
as such (assuming you're running it on an mbox or message that's in mbox
format):

# sa-learn --progress --spam --mbox mymboxfile

It will show you how many tokens have been learned during that run. It
might also be a good idea to add the token summary flag to your config:

add_header all Tok-Stat _TOKENSUMMARY_

If you run spamassassin on a message directly, and add the -t option, it
will show you the number of different types of tokens found in the
message:

X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

> Might anyone be in a position to offer an authoritative response to
> these questions?
>
> I continue to see messages that are very similar to dozens of messages
> that have been marked as SPAM slipping through with *no Bayes scoring*
> (this is *after* fixing the SQL syntax error issue):
>
> bayes: cannot use bayes on this message; not enough usable tokens found
> bayes: not scoring message, returning undef

Have you tried to find out how many tokens are in your bayes DB? As the
user specified by bayes_sql_username (actually, it probably doesn't
matter, but you should, to be sure), run the following:

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     466417          0  non-token data: nspam
0.000          0     508868          0  non-token data: nham
0.000          0   10788203          0  non-token data: ntokens
0.000          0 1320901921          0  non-token data: oldest atime
0.000          0 1366385643          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1366348380          0  non-token data: last expiry atime
0.000          0   28651364          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

This should show you the number of spam (nspam) and ham (nham) messages
learned, and the number of tokens (ntokens) in the db.

> Is this normal? If so, what is the explanation for this behavior? I
> have marked dozens of nearly-identical messages with the subject
> "Garden hose expands up to three times its length" as SPAM (over the
> course of several weeks), and yet SA reports "not enough usable tokens
> found".

If they are identical, I don't believe it will create new tokens, per se.

> Is SA referring to the number of tokens in the message? Or the Bayes
> DB?

I believe it would be talking about the database, not the message.

Regards,
Alex
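If checking these counts from a script rather than by eye, the fields can be pulled out of the dump programmatically. A sketch using a captured sample of the output above; on a live system, pipe the real `sa-learn --dump magic` into awk instead of the here-string:

```shell
# Sample `sa-learn --dump magic` lines (taken from the output quoted above).
dump='0.000          0     466417          0  non-token data: nspam
0.000          0     508868          0  non-token data: nham
0.000          0   10788203          0  non-token data: ntokens'

# Field 3 of each line carries the count.
nspam=$(printf '%s\n' "$dump" | awk '/nspam/   {print $3}')
nham=$(printf  '%s\n' "$dump" | awk '/ nham/   {print $3}')
ntok=$(printf  '%s\n' "$dump" | awk '/ntokens/ {print $3}')

echo "nspam=$nspam nham=$nham ntokens=$ntok"
# prints: nspam=466417 nham=508868 ntokens=10788203
```

A very low ntokens (or nspam/nham far below what was trained) would point at the kind of training/ownership mismatch discussed earlier in this thread.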
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/18/2013 12:18 PM, Ben Johnson wrote:
>
> My concern now is that I am on 3.3.1, with little control over upgrades.
> I have read all three bug reports in their entirety and Bug 6624 seems
> to be a very legitimate concern. To quote Mark in the bug description:
>
>> The effect of the bug with SpamAssassin is that tokens are only able
>> to be inserted once, but their counts cannot increase, leading to
>> terrible bayes results if the bug is not noticed. Also the conversion
>> from db fails, as reported by Dave.
>>
>> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
>> provide a workaround for the MySQL server bug, and improved debug
>> logging.
>
> How can I discern whether or not this bug does, in fact, affect me? Are
> my Bayes results being crippled as a result of this bug?
>
>> It's possible that there's a good reason the default script still uses
>> myISAM. If so, the documentation for this fix should at least be easier
>> to find.
>
> In any event, I'm a little concerned because while the majority of
> messages are now tagged with BAYES_* hits, I am now seeing this debug
> output on a significant percentage of messages ("cannot use bayes on
> this message; not enough usable tokens found"):
>
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
>
> --
> Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
> Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
> Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
> Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
> Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
> Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155,
> nham = 2342
> Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
> Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
> message; not enough usable tokens found
> Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning
> undef
> Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
> (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
> poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
> tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
> (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
> tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
> (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
> check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
> (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
> --
>
> I have done some searching-around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.
>
> Thank you,
>
> -Ben

Might anyone be in a position to offer an authoritative response to
these questions?

I continue to see messages that are very similar to dozens of messages
that have been marked as SPAM slipping through with *no Bayes scoring*
(this is *after* fixing the SQL syntax error issue):

bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Is this normal? If so, what is the explanation for this behavior? I have
marked dozens of nearly-identical messages with the subject "Garden hose
expands up to three times its length" as SPAM (over the course of
several weeks), and yet SA reports "not enough usable tokens found".

Is SA referring to the number of tokens in the message? Or the Bayes DB?

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes setup?
>
> Not yet, but that is the ultimate plan (to share the DB across multiple
> servers). Also, I like the idea that the Bayes DB is backed-up
> automatically along with all other databases on the server (we run a
> cron script that performs the dump). Granted, it would be trivial to
> schedule a call to "sa-learn --backup", but storing the data in SQL
> seems more portable and makes it easier to query the data for reporting
> purposes.

I have bayes in MySQL now, and I think it performs better than with just
a flat-file berkeley db. I believe it solved some locking/sharing issues
I was having too.

I converted to it a few months ago (relearned the corpus from scratch to
mysql) with the intention of sharing between three systems, but the
network latency and general performance between the systems for updates
was horrible, so they're all separate databases now. I'm still a mysql
novice, so I don't doubt someone with more mysql networking experience
could figure out how to share them between systems properly. I thought
there would be one master system with two slaves, but instead they all
seemed to be shared interactively for every query or update.

For the InnoDB/MyISAM issue, if I'm understanding it correctly, I just
edited the sql file I used to create the database, and I'm using InnoDB
now without any issues on v3.3.2. I believe I used these instructions,
with the sql modifications from above:

http://www200.pair.com/mecham/spam/debian-spamassassin-sql.html

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/18/2013 12:26 PM, Axb wrote:
> On 04/18/2013 06:18 PM, Ben Johnson wrote:
>> I have done some searching-around on the string "cannot use bayes on
>> this message; not enough usable tokens found" and have not found
>> anything authoritative regarding what this message might mean and
>> whether or not it can be ignored or if it is symptomatic of a larger
>> Bayes problem.
>
> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes setup?

Not yet, but that is the ultimate plan (to share the DB across multiple
servers). Also, I like the idea that the Bayes DB is backed-up
automatically along with all other databases on the server (we run a
cron script that performs the dump). Granted, it would be trivial to
schedule a call to "sa-learn --backup", but storing the data in SQL
seems more portable and makes it easier to query the data for reporting
purposes.

Then again, I retain the corpora, so backing-up the DB is only useful
for when data needs to be moved from one server or database to another
(as moving the corpora seems far less practical).

Are you suggesting that I should scrap SQL and go back to a flat-file
DB? Is that the only path to a fix (short of upgrading SA)?

Thanks for your help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 04/18/2013 06:18 PM, Ben Johnson wrote: I have done some searching-around on the string "cannot use bayes on this message; not enough usable tokens found" and have not found anything authoritative regarding what this message might mean and whether or not it can be ignored or if it is symptomatic of a larger Bayes problem. Curious: what are your reasons for using Bayes in SQL? Are you sharing the DB among several machines? Or is this a single box/global bayes setup?
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 10:15 PM, John Hardin wrote: > On Wed, 17 Apr 2013, Ben Johnson wrote: > >> The first post on that page was the key. In particular, adding the >> following to each MySQL "CREATE TABLE" statement: >> >> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin; > > Please check the SpamAssassin bugzilla to see if this situation is > already mentioned, and if not, add a bug. This seems pretty critical. Mark Martinec opened three reports in relation to this issue (quoted from the archive thread cited in my previous post): [Bug 6624] BayesStore/MySQL.pm fails to update tokens due to MySQL server bug (wrong count of rows affected) https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624 (^^ Fixed in 3.4 ^^) [Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625 (^^ Fixed in 3.4 ^^) [Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626 (^^ Fixed in 3.4 ^^) My concern now is that I am on 3.3.1, with little control over upgrades. I have read all three bug reports in their entirety and Bug 6624 seems to be a very legitimate concern. To quote Mark in the bug description: > The effect of the bug with SpamAssassin is that tokens are only able > to be inserted once, but their counts cannot increase, leading to > terrible bayes results if the bug is not noticed. Also the conversion > form db fails, as reported by Dave. > > Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to > provide a workaround for the MySQL server bug, and improved debug logging. How can I discern whether or not this bug does, in fact, affect me? Are my Bayes results being crippled as a result of this bug? > It's possible that there's a good reason the default script still uses > myISAM. If so, the documentation for this fix should at least be easier > to find. 
> If there is a good reason, I have yet to discern what it might be. The third bug from above (Mark's comments, specifically) implies that there is no particular reason for using MyISAM. I have good reason for wanting to use the InnoDB storage engine, and I have seen no performance hit as a result of doing so. (In fact, performance seems better than with MyISAM in my scripted, once-a-day training setup.) The perfectly acceptable performance I'm observing could be because a) the InnoDB-related resources allocated to MySQL are more than sufficient, b) the schema that I used has a newly-added INDEX whereas those prior to it did not, or c) I was sure to use the "MySQL" module instead of the "SQL" module with my InnoDB setup: bayes_store_module Mail::SpamAssassin::BayesStore::MySQL The bottom line seems to be that for those who have settings like these in their MySQL configurations > default_storage_engine=InnoDB > skip-character-set-client-handshake > collation_server=utf8_unicode_ci > character_set_server=utf8 it is absolutely necessary to include ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin; at the end of each CREATE TABLE statement (otherwise, a MySQL syntax error results and all Bayes SELECT statements fail). 
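The CREATE TABLE edit described above can be applied mechanically before loading the schema. A sketch only: the printf line is a one-line stand-in for the real bayes_mysql.sql shipped with SA 3.3.x, and the substitution is the engine/charset/collation clause discussed in bugs 6625/6626.

```shell
#!/bin/sh
# Stand-in for one table definition from the stock 3.3.x schema (the real
# file is bayes_mysql.sql in the SpamAssassin distribution).
schema=$(mktemp)
printf 'CREATE TABLE bayes_token (\n  token binary(5) NOT NULL\n) TYPE=MyISAM;\n' > "$schema"

# Replace the old TYPE=MyISAM clause (rejected by newer MySQL, bug 6626)
# with the explicit engine/charset/collation clause from the fix above.
fixed=$(sed 's/TYPE=MyISAM;/ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;/' "$schema")
echo "$fixed"
rm -f "$schema"
```

Point the sed at your actual schema file and pipe the result into mysql when creating the database.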
In any event, I'm a little concerned because while the majority of messages are now tagged with BAYES_* hits, I am now seeing this debug output on a significant percentage of messages ("cannot use bayes on this message; not enough usable tokens found"): # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' -- Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778) Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3 Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1 Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham = 2342 Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176 Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this message; not enough usable tokens found Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830 (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%), poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%), tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19 (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%), tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018 (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%), check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razo
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 5:39 PM, Tom Hendrikx wrote: > On 17-04-13 21:40, Ben Johnson wrote: >> Ideally, using the above directives will tell us whether we're >> experiencing timeouts, or these spam messages are simply not in the >> Pyzor or Razor2 databases. >> >> Off the top of your head, do you happen to know what will happen if one >> or both of the Pyzor/Razor2 tests timeout? Will some indication that the >> tests were at least *started* still be added to the SA header? > > The razor client (don't know about pyzor) logs its activity to some > logfile in ~razor. There you can see what (or what not) is happening. > > It's also possible to raise logfile verbosity by changing the razor > config file. See the man page for details. > > -- > Tom > Tom, thanks for the excellent tip regarding Razor's own log file. Tailing that log will make this kind of debugging much simpler in the future. Much appreciated. One of the reasons for which I also like the idea of using Daniel McDonald's include-scores-in-header rule (for Pyzor and Razor) is that the data is embedded right in the message, which can be useful. For one, this makes the scoring data more "portable" (it stays with the message to which it applies). Secondly, when tailing a log, it can be difficult to determine where the data relevant to one message ends and another begins. Thanks again, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 17 Apr 2013, Ben Johnson wrote: The first post on that page was the key. In particular, adding the following to each MySQL "CREATE TABLE" statement: ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin; Please check the SpamAssassin bugzilla to see if this situation is already mentioned, and if not, add a bug. This seems pretty critical. It's possible that there's a good reason the default script still uses MyISAM. If so, the documentation for this fix should at least be easier to find. -- John Hardin KA7OHZ  http://www.impsec.org/~jhardin/  jhar...@impsec.org  FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Our government should bear in mind the fact that the American Revolution was touched off by the then-current government attempting to confiscate firearms from the people. --- 2 days until the 238th anniversary of The Shot Heard 'Round The World
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 6:47 PM, Ben Johnson wrote: > > > On 4/17/2013 5:05 PM, Kris Deugau wrote: >> Ben Johnson wrote: >>> Is there anything else that would cause Bayes tests not be performed? I >>> ask because other types of tests are disabled automatically under >>> certain circumstances (e.g., network tests), and I'm wondering if there >>> is some obscure combination of factors that causes Bayes tests not to be >>> performed. >> >> Do you have bayes_sql_override_username set? (This forces use of a >> single Bayes DB for all SA calls that reference this configuration file >> set.) >> >> If not, you may be getting a Bayes DB for each user on your system; >> IIRC this is supported (sort of) and default with Amavis. >> >> -kgd >> > > Thanks for jumping-in here, Kris. > > Yes, I do have the following in my SA local.cf: > > bayes_sql_override_username amavis > > So, all users are sharing the same Bayes DB. I train Bayes daily and the > token count, etc., etc. all look good and correct. > > Just a quick update to my previous post. > > The Pyzor and Razor2 score information is indeed coming through for the > handful of messages that have landed since I made those configuration > changes. So, all seems to be well on the Pyzor / Razor2 front. > > However, I still don't see any evidence that Bayes testing was performed > on the messages that are "slipping through". > > It bears mention that *most* messages do indeed show evidence of Bayes > scoring. > > --- OH, SNAP! I found the root cause. --- > > Well, when I went to confirm the above statement, regarding most > messages showing evidence of Bayes scoring, I realized that *none* show > evidence of it since 3/23! No wonder all of this garbage is slipping > through! > > I recognized the date 3/23 immediately; it was the date on which we > upgraded ISPConfig from 3.0.4.6 to 3.0.5. 
(For those who have no > knowledge of ISPConfig, it is basically a FOSS solution to managing vast > numbers of websites, domains, mailboxes, etc., as the name implies.) > > We also updated OS packages (security only) on that day. > > After diff-ing all of the relevant service configuration files > (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any > discrepancies. > > Then, I tried: > > - > # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' > > Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new > self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508), > bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL > Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis > Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got > store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358) > Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established > Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3 > Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1 > Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham > = 2334 > Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163 > Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal > mix of collations for operation ' IN ' > Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this > message; none of the tokens were found in the database > Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef > Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804 > (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%), > poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%), > tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18 > (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%), > tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804 > (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%), > 
check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211 > (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%) > - > > Check-out the message buried half-way down: > > bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' > IN ' > > I have run into this unsightly message before, but in that case, I could > see the entire query, which enabled me to change the collations accordingly. > > In this case, I have no idea what the original query might have been. > > Further, I have no idea what changed that introduced this problems on 3/23. > > Was it a MySQL upgrade? Was it an ISPConfig change? > > Has anybody else run into this? > > Thanks again, > > -Ben > I managed to fix this issue. The date on which Bayes stopped "working" was relevant only in as much it was the first date on which MySQL had been restarted in months. The software updates had nothing to do with the issue. The critical change was that sometime between the last MySQL start/stop, I had added a handful of [m
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 5:05 PM, Kris Deugau wrote: > Ben Johnson wrote: >> Is there anything else that would cause Bayes tests not be performed? I >> ask because other types of tests are disabled automatically under >> certain circumstances (e.g., network tests), and I'm wondering if there >> is some obscure combination of factors that causes Bayes tests not to be >> performed. > > Do you have bayes_sql_override_username set? (This forces use of a > single Bayes DB for all SA calls that reference this configuration file > set.) > > If not, you may be getting a Bayes DB for each user on your system; > IIRC this is supported (sort of) and default with Amavis. > > -kgd > Thanks for jumping-in here, Kris. Yes, I do have the following in my SA local.cf: bayes_sql_override_username amavis So, all users are sharing the same Bayes DB. I train Bayes daily and the token count, etc., etc. all look good and correct. Just a quick update to my previous post. The Pyzor and Razor2 score information is indeed coming through for the handful of messages that have landed since I made those configuration changes. So, all seems to be well on the Pyzor / Razor2 front. However, I still don't see any evidence that Bayes testing was performed on the messages that are "slipping through". It bears mention that *most* messages do indeed show evidence of Bayes scoring. --- OH, SNAP! I found the root cause. --- Well, when I went to confirm the above statement, regarding most messages showing evidence of Bayes scoring, I realized that *none* show evidence of it since 3/23! No wonder all of this garbage is slipping through! I recognized the date 3/23 immediately; it was the date on which we upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no knowledge of ISPConfig, it is basically a FOSS solution to managing vast numbers of websites, domains, mailboxes, etc., as the name implies.) We also updated OS packages (security only) on that day. 
After diff-ing all of the relevant service configuration files (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any discrepancies. Then, I tried: - # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358) Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3 Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1 Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham = 2334 Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163 Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' IN ' Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this message; none of the tokens were found in the database Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804 (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%), poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%), tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18 (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%), tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804 (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%), check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211 (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%) - Check-out the message buried half-way down: bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' IN ' I have run into this unsightly message before, 
but in that case, I could see the entire query, which enabled me to change the collations accordingly. In this case, I have no idea what the original query might have been. Further, I have no idea what changed that introduced this problem on 3/23. Was it a MySQL upgrade? Was it an ISPConfig change? Has anybody else run into this? Thanks again, -Ben
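For what it's worth, "Illegal mix of collations" generally means the Bayes tables carry a different collation than the server/connection now negotiates (e.g. after collation_server and skip-character-set-client-handshake take effect on a restart, as happened here). One hedged way to realign them is to convert the tables to match. This sketch only *prints* the ALTER statements so they can be reviewed before being fed to the mysql client; the table list matches the stock Bayes schema, but treat this as an illustration, not a tested migration.

```shell
#!/bin/sh
# Emit ALTER statements converting the Bayes tables to utf8/utf8_bin, so
# the table collation matches a utf8-configured server. Review the output
# before piping it into mysql -- this is a sketch, not a tested migration.
stmts=$(for t in bayes_expire bayes_global_vars bayes_seen bayes_token bayes_vars; do
    printf 'ALTER TABLE %s CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;\n' "$t"
done)
echo "$stmts"
```

Back up the Bayes DB first; converting the character set rewrites the tables.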
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 17-04-13 21:40, Ben Johnson wrote: > Ideally, using the above directives will tell us whether we're > experiencing timeouts, or these spam messages are simply not in the > Pyzor or Razor2 databases. > > Off the top of your head, do you happen to know what will happen if one > or both of the Pyzor/Razor2 tests timeout? Will some indication that the > tests were at least *started* still be added to the SA header? The razor client (don't know about pyzor) logs its activity to some logfile in ~razor. There you can see what (or what not) is happening. It's also possible to raise logfile verbosity by changing the razor config file. See the man page for details. -- Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson wrote: > Is there anything else that would cause Bayes tests not be performed? I > ask because other types of tests are disabled automatically under > certain circumstances (e.g., network tests), and I'm wondering if there > is some obscure combination of factors that causes Bayes tests not to be > performed. Do you have bayes_sql_override_username set? (This forces use of a single Bayes DB for all SA calls that reference this configuration file set.) If not, you may be getting a Bayes DB for each user on your system; IIRC this is supported (sort of) and default with Amavis. -kgd
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Daniel, thanks for the quick reply. I'll reply inline, below. On 4/16/2013 5:01 PM, Daniel McDonald wrote: > > > > On 4/16/13 2:59 PM, "Ben Johnson" wrote: > >> Are there any normal circumstances under which Bayes tests are not run? > Yes, if USE_BAYES = 0 is included in the local.cf file. I checked in /etc/spamassassin/local.cf, and find the following: use_bayes 1 So, that seems not to be the issue. >> >> If not, are there circumstances under which Bayes tests are run but >> their results are not included in the message headers? (I have tag_level >> set to -999, so SA headers are always added.) > > That sounds like an amavisd command, you may want to check in > ~amavisd/.spamassassin/user_prefs as well I checked in the equivalent path on my system (/var/lib/amavis/.spamassassin/user_prefs) and the entire file is commented-out. So, that seems not to be the issue, either. Is there anything else that would cause Bayes tests not be performed? I ask because other types of tests are disabled automatically under certain circumstances (e.g., network tests), and I'm wondering if there is some obscure combination of factors that causes Bayes tests not to be performed. >> >> Likewise, for the vast majority of spam messages that slip-through, I >> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed >> that this observation indicates that the network tests were performed, >> but did not contribute to the SA score. Is this assumption valid? > Yes. Okay, very good. It occurred to me that perhaps the Pyzor and/or Razor2 tests are timing-out (both timeouts are set to 15 seconds) some percentage of the time, which may explain why these tests do not contribute to a given message's score. That's why I asked about forcing the results into the SA header. >> >> Also, is there some means by which to *force* Pyzor and Razor2 scores to >> be added to the SA header, even if they did not contribute to the score? 
> > I imagine you would want something like this: > > full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50') > tflags RAZOR2_CF_RANGE_0_50 net > reuse RAZOR2_CF_RANGE_0_50 > describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50% > score RAZOR2_CF_RANGE_0_50 0.01 > > full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50') > tflags RAZOR2_CF_RANGE_E4_0_50 net > reuse RAZOR2_CF_RANGE_E4_0_50 > describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level > below 50% > score RAZOR2_CF_RANGE_E4_0_50 0.01 > > full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50') > tflags RAZOR2_CF_RANGE_E8_0_50 net > reuse RAZOR2_CF_RANGE_E8_0_50 > describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level > below 50% > score RAZOR2_CF_RANGE_E8_0_50 0.01 This seems to work brilliantly. I can't thank you enough; I never would have figured this out. Ideally, using the above directives will tell us whether we're experiencing timeouts, or these spam messages are simply not in the Pyzor or Razor2 databases. Off the top of your head, do you happen to know what will happen if one or both of the Pyzor/Razor2 tests time out? Will some indication that the tests were at least *started* still be added to the SA header?
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/16/13 2:59 PM, "Ben Johnson" wrote: >Are there any normal circumstances under which Bayes tests are not run? Yes, if USE_BAYES = 0 is included in the local.cf file. > > If not, are there circumstances under which Bayes tests are run but > their results are not included in the message headers? (I have tag_level > set to -999, so SA headers are always added.) That sounds like an amavisd command, you may want to check in ~amavisd/.spamassassin/user_prefs as well > > Likewise, for the vast majority of spam messages that slip through, I > see no evidence of Pyzor or Razor2 activity. I have heretofore assumed > that this observation indicates that the network tests were performed, > but did not contribute to the SA score. Is this assumption valid? Yes. > > Also, is there some means by which to *force* Pyzor and Razor2 scores to > be added to the SA header, even if they did not contribute to the score? I imagine you would want something like this:

full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
tflags RAZOR2_CF_RANGE_0_50 net
reuse RAZOR2_CF_RANGE_0_50
describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
score RAZOR2_CF_RANGE_0_50 0.01

full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
tflags RAZOR2_CF_RANGE_E4_0_50 net
reuse RAZOR2_CF_RANGE_E4_0_50
describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level below 50%
score RAZOR2_CF_RANGE_E4_0_50 0.01

full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
tflags RAZOR2_CF_RANGE_E8_0_50 net
reuse RAZOR2_CF_RANGE_E8_0_50
describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level below 50%
score RAZOR2_CF_RANGE_E8_0_50 0.01

> > To refresh folks' memories, we have verified that Bayes is setup > correctly (database was wiped and now training is done manually and is > supervised), and that network tests are being performed when messages > are scanned. > > Thanks for sticking with me through all of this, guys! 
> > -Ben -- Daniel J McDonald, CCIE # 2495, CISSP # 78281
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for resurrecting the thread, but I never did receive a response to this particular aspect of the problem (asked on Jan 18, 2013, 8:51 AM). This is probably because I replied to my own post before anyone else did, and changed the subject slightly. We are being hammered pretty hard with spam (again), and as I inspect messages whose score is below tag2_level, BAYES_* is conspicuously absent from the headers. To reiterate my question: >> Are there any normal circumstances under which Bayes tests are not run? If not, are there circumstances under which Bayes tests are run but their results are not included in the message headers? (I have tag_level set to -999, so SA headers are always added.) Likewise, for the vast majority of spam messages that slip-through, I see no evidence of Pyzor or Razor2 activity. I have heretofore assumed that this observation indicates that the network tests were performed, but did not contribute to the SA score. Is this assumption valid? Also, is there some means by which to *force* Pyzor and Razor2 scores to be added to the SA header, even if they did not contribute to the score? To refresh folks' memories, we have verified that Bayes is setup correctly (database was wiped and now training is done manually and is supervised), and that network tests are being performed when messages are scanned. Thanks for sticking with me through all of this, guys! -Ben On 1/18/2013 11:51 AM, Ben Johnson wrote: > So, I've been keeping an eye on things again today. > > Overall, things look pretty good, and most spam is being blocked > outright at the MTA and scored appropriately in SA if not. > > I've been inspecting the X-Spam-Status headers for the handful of > messages that do slip through and noticed that most of them lack any > evidence of the BAYES_* tests. 
Here's one such header: > > No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392, > SPF_PASS=-0.001] autolearn=disabled > > The messages that were delivered just before and after this one do have > evidence of BAYES_* tests, so, it's not as though something is > completely broken. > > Are there any normal circumstances under which Bayes tests are not run? > Do I need to turn debugging back on and wait until this happens again? > > Thanks for all the help, everyone! > > -Ben >
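The "conspicuously absent" BAYES_* hits described above can be quantified by sweeping a mail folder for X-Spam-Status headers that carry no Bayes test. A rough sketch, with the folder path as an example; the greps are deliberately crude (X-Spam-Status wraps across lines, so BAYES_ is matched anywhere in the file).

```shell
#!/bin/sh
# Print files that carry an X-Spam-Status header but no BAYES_* test.
# Crude on purpose: the header wraps across lines, so BAYES_ is matched
# anywhere in the file rather than only on the header line.
find_unbayesed() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        grep -q 'X-Spam-Status' "$f" || continue
        grep -q 'BAYES_' "$f" || echo "$f"
    done
}
```

For example, `find_unbayesed /var/vmail/example.com/user/Maildir/cur` would list the messages that slipped through without Bayes scoring (path hypothetical).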
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 6 Feb 2013, Ben Johnson wrote: On 2/1/2013 7:58 PM, John Hardin wrote: That latter brings up another concern for the vetted-corpora model: if a message is *removed* from a training corpora mailbox rather than reclassified, you'd have to wipe and retrain your database from scratch to remove that message's effects. So, you need *three* vetted corpus mailboxes: spam, ham, and should-not-have-been-trained (forget). Rather than deleting a message from the ham or spam corpus mailbox you move it to the forget mailbox and the in next training pass sa-learn forgets the message and removes it from the forget mailbox. This would be some special scripting, because you can't just "sa-learn --forget" a whole mailbox. There would also need to be an audit process to detect whether the same message_id is in both the ham and spam corpus mailboxes, so that the admin can delete (NOT forget) the incorrect classification, or forget the message if neither classification is reasonable. You reveal some crucial information with regard to corpora management here, John. I've taken your good advice and created a third mailbox (well, a third "folder" within the same mailbox), named "Forget". It sounds as though the key here is never to delete messages from either corpus -- unless the same message exists in both corpora, in which case the misclassified message should be deleted. If neither classification is reasonable and the message should instead be forgotten, what's the order of operations? Should a copy of the message be created in the "Forget" corpus and then the message deleted from both the "Ham" and "Spam" corpora? I would suggest: *move* one to the Forget folder and delete the other. I am assuming that learning from the vetted corpora folders is on a schedule rather than in real-time, so that you have a liberal window for completing these operations. With regard to the specialized scripting required to "forget" messages, this sounds cumbersome Yeah. 
because you can't just "sa-learn --forget" a whole mailbox. Is there a non-obvious reason for this? Would the logic behind a recursive --forget switch not be the same or similar as with the existing --ham and --spam switches? Oh, the logic would be the same, it's just not implemented. That's why you can't do it. :) Finally, when a user submits a message to be classified as ham or spam, how should I be sorting the messages? I see the following scenarios: 1.) I agree with the end-user's classification. 2.) I disagree with the end-user's classification. a.) Because the message was submitted as ham but is really spam (or vice versa) b.) Because neither classification is reasonable In case 1.), should I *copy* the message from the submission inbox's Ham folder to the permanent Ham corpus folder? Or should I *move* the message? I'm trying to discern whether or not there's value in retaining end-user submissions *as they were classified upon submission*. I don't see any value to retaining them in the public submission folders. In fact, you may want to make the ham submission folder write-only (if that's possible) in order to help preserve your individual users' privacy. In case 2.), should I simply delete the message from the submission folder? Or is there some reason to retain the message (i.e., move it into an "Erroneous" folder within the submission mailbox)? You might want to do that if you intend to approach the user and train them about why it wasn't a correct submission and you want evidence - for example, to say that this looks like a message from a legitimate mailing list that they intentionally subscribed to at some point, and the unsubscribe link is right there (points at screen). Apart from that, I don't see a reason to keep erroneous submissions either. I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora , but it doesn't address these issues, specifically. 
Yeah, that assumes familiarity with these issues, and managing masscheck corpora is a slightly different task than managing user-fed Bayes training corpora. -- John Hardin KA7OHZ  http://www.impsec.org/~jhardin/  jhar...@impsec.org  FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- ...we talk about creating "millions of shovel-ready jobs" for a society that doesn't really encourage people to pick up a shovel. -- Mike Rowe, testifying before Congress --- 6 days until Abraham Lincoln's and Charles Darwin's 204th Birthdays
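Pending a recursive --forget switch, the per-message forget pass discussed in this exchange could be scripted along these lines. A sketch only: the Maildir layout is an example, and the SA_LEARN variable is a hypothetical override so the loop can be dry-run without SpamAssassin installed.

```shell
#!/bin/sh
# Forget every message in a Maildir "Forget" folder: one sa-learn --forget
# call per message, removing the file only if the forget succeeded.
# SA_LEARN exists only so this sketch can be dry-run without SpamAssassin.
SA_LEARN="${SA_LEARN:-sa-learn}"

forget_folder() {
    for msg in "$1"/cur/* "$1"/new/*; do
        [ -f "$msg" ] || continue
        "$SA_LEARN" --forget "$msg" && rm -f "$msg"
    done
}
```

Run it from the same scheduled training pass that feeds --spam and --ham, as the same user, so all three operate on the same Bayes DB.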
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 7:58 PM, John Hardin wrote: > On Sat, 2 Feb 2013, RW wrote: > >> ALLOWING APPENDS >>By appends we mean the case of mail moving when the source folder is >>unknown, e.g. when you move from some other account or with tools >>like offlineimap. You should be careful with allowing APPENDs to >>SPAM folders. The reason for possibly allowing it is to allow >>not-SPAM --> SPAM transitions to work and be trained. However, >>because the plugin cannot know the source of the message (it is >>assumed to be from OTHER folder), multiple bad scenarios can happen: >> >>1. SPAM --> SPAM transitions cannot be recognised and are trained; >>2. TRASH --> SPAM transitions cannot be recognised and are trained; >>3. SPAM --> not-SPAM transitions cannot be recognised therefore >> training good messages will never work with APPENDs. >> >> >> I presume that the plugin works by monitoring COPY commands and so >> can't work properly when a move is done by FETCH-APPEND-DELETE. >> >> For sa-learn the problem would be 3, but I don't see how that is >> affected by allowing appends on the spam folder. > > Yeah, all of that sounds like they're talking about non-vetted training > mailboxes where the users are effectively talking directly to sa-learn. > > I think I may see at least part of what they are driving at. > > If one user trains a message as ham and another user who got a copy of > the same message trains it as spam, who wins? > > Absent some conflict-detection mechanism, the last mailbox trained > (either spam or ham) wins. > > As for the other two: > > spam -> spam transitions don't matter, sa-learn recognises message-IDs > and won't learn from the same message in the same corpus more than once > (i.e. having the same message in the spam corpus multiple times does not > "weight" the tokens learned from that message). So (1) may be a > performance concern but it won't affect the database. > > trash -> spam transition being learned is a problem how? 
> That latter brings up another concern for the vetted-corpora model: if a
> message is *removed* from a training corpus mailbox rather than
> reclassified, you'd have to wipe and retrain your database from scratch
> to remove that message's effects.
>
> So, you need *three* vetted corpus mailboxes: spam, ham, and
> should-not-have-been-trained (forget). Rather than deleting a message
> from the ham or spam corpus mailbox, you move it to the forget mailbox,
> and in the next training pass sa-learn forgets the message and removes
> it from the forget mailbox. This would require some special scripting,
> because you can't just "sa-learn --forget" a whole mailbox.
>
> There would also need to be an audit process to detect whether the same
> message-ID is in both the ham and spam corpus mailboxes, so that the
> admin can delete (NOT forget) the incorrect classification, or forget
> the message if neither classification is reasonable.

You reveal some crucial information with regard to corpora management here, John. I've taken your good advice and created a third mailbox (well, a third "folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either corpus -- unless the same message exists in both corpora, in which case the misclassified message should be deleted.

If neither classification is reasonable and the message should instead be forgotten, what's the order of operations? Should a copy of the message be created in the "Forget" corpus and then the message deleted from both the "Ham" and "Spam" corpora?

With regard to the specialized scripting required to "forget" messages, this sounds cumbersome:

> because you can't just "sa-learn --forget" a whole mailbox.

Is there a non-obvious reason for this? Would the logic behind a recursive --forget switch not be the same as, or similar to, that of the existing --ham and --spam switches?
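Since sa-learn cannot --forget a whole mailbox, the training pass has to feed it one message file at a time. A minimal sketch of such a forget pass over a Maildir-style Forget folder (the path and the FORGET_DIR/SA_LEARN variables are illustrative assumptions, not from this thread):

```shell
#!/bin/sh
# Hypothetical forget pass: walk the Forget Maildir and forget messages
# one file at a time, removing each file only after a successful forget.
# FORGET_DIR and SA_LEARN are illustrative defaults; adjust to taste.
FORGET_DIR="${FORGET_DIR:-/var/vmail/example.com/admin/.Training.Forget}"
SA_LEARN="${SA_LEARN:-sa-learn}"

forget_pass() {
    for msg in "$1"/cur/* "$1"/new/*; do
        [ -f "$msg" ] || continue            # skip unmatched globs
        if "$SA_LEARN" --forget "$msg"; then
            rm -f "$msg"                     # remove once forgotten
        else
            echo "failed to forget: $msg" >&2
        fi
    done
}

forget_pass "$FORGET_DIR"
```

Run it from cron as the same user that owns the Bayes database, after the ham and spam training passes.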
Finally, when a user submits a message to be classified as ham or spam, how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.
2.) I disagree with the end-user's classification.
    a.) Because the message was submitted as ham but is really spam (or vice versa)
    b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham folder to the permanent Ham corpus folder? Or should I *move* the message? I'm trying to discern whether or not there's value in retaining end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission folder? Or is there some reason to retain the message (i.e., move it into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora , but it doesn't address these issues specifically.

Thanks again!

-Ben
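One way the sorting just described could be scripted, sketched as a hypothetical reviewer helper (the folder names and the CORPUS_ROOT variable are assumptions, not from this thread). Moving rather than copying keeps exactly one authoritative copy of each message, which sidesteps the duplicate-message-ID audit problem discussed earlier:

```shell
#!/bin/sh
# Hypothetical reviewer helper: after manually reviewing a user
# submission, file it into the permanent corpus (or the Forget folder).
# CORPUS_ROOT and the folder names are assumptions for illustration.
CORPUS_ROOT="${CORPUS_ROOT:-/var/vmail/example.com/admin}"

classify() {  # usage: classify <message-file> <ham|spam|forget>
    case "$2" in
        ham)    dest="$CORPUS_ROOT/.Corpus.Ham/cur" ;;
        spam)   dest="$CORPUS_ROOT/.Corpus.Spam/cur" ;;
        forget) dest="$CORPUS_ROOT/.Corpus.Forget/cur" ;;
        *)      echo "usage: classify <msg> <ham|spam|forget>" >&2
                return 1 ;;
    esac
    # Move (not copy): each message lives in exactly one corpus folder.
    mkdir -p "$dest" && mv "$1" "$dest/"
}
```

With this design the submission folder empties as you review, so anything left over is simply work not yet done.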
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 12:00 PM, John Hardin wrote: > On Fri, 1 Feb 2013, Ben Johnson wrote: > >> John, thanks for pointing-out the problems associated with re-sending >> the messages via sendmail. >> >> I threw a line out to the Dovecot users group and learned how to move >> messages without going through the MTA. Dovecot has a utility >> executable, "deliver", which is well-suited to the task. >> >> For those who may have a similar need, here's the Dovecot Antispam pipe >> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users >> mailing list: >> >> --- >> #!/bin/bash >> >> mode= >> for opt; do >> if test "x$*" == "x--ham"; then >> mode=HAM >> break >> elif test "x$*" == "x--spam"; then >> mode=SPAM >> break >> fi >> done >> >> if test -n "$mode"; then >> # options from http://wiki1.dovecot.org/LDA >> /usr/lib/dovecot/deliver -d u...@example.com -m Training.$mode >> fi >> >> exit 0 >> --- > > That seems a lot better. > >> Regarding the second point, I'm not sure I understand the problem. If >> someone drags a message from Trash to SPAM, shouldn't it be submitted >> for learning as spam? >> >> The last sentence sounds like somewhat of a deal-breaker. Doesn't my >> whole strategy go somewhat limp if ham cannot be submitted for training? >> >> John and RW, do you recommend enabling or disabling the append option, >> given the way I'm reviewing the submissions and sorting them manually? > > I think they're proceeding from the assumption of *un-reviewed* > training, i.e. blind trust in the reliability of the users. > > If it's possible to enable IMAP Append on a per-folder basis then > enabling it only on your training inbox folders shouldn't be an issue - > the messages won't be trained until you've reviewed them. > > Without that level of fine-grain control I still don't see an issue from > this if you can prevent the users from adding content directly to the > folders that sa-learn actually processes. 
> If IMAP Append only applies to "shared" folders then there shouldn't be
> a problem - configure sa-learn to learn from folders in *your account*,
> that nobody else can access directly.

Thanks, John.

If I'm understanding you correctly, your assessment is that enabling IMAP append in the Antispam plug-in configuration (not the default, by the way) shouldn't cause problems for my Bayes training setup, primarily because users cannot train Bayes unsupervised.

If that is so, what is the real benefit of enabling this "feature", which is off by default? Is it that users can submit messages for training while "offline", so that when they reconnect, the plug-in is triggered and the messages are copied to the training mailbox?

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Sat, 2 Feb 2013, RW wrote:

> ALLOWING APPENDS
>
> By appends we mean the case of mail moving when the source folder is
> unknown, e.g. when you move from some other account or with tools
> like offlineimap. You should be careful with allowing APPENDs to
> SPAM folders. The reason for possibly allowing it is to allow
> not-SPAM --> SPAM transitions to work and be trained. However,
> because the plugin cannot know the source of the message (it is
> assumed to be from OTHER folder), multiple bad scenarios can happen:
>
> 1. SPAM --> SPAM transitions cannot be recognised and are trained;
> 2. TRASH --> SPAM transitions cannot be recognised and are trained;
> 3. SPAM --> not-SPAM transitions cannot be recognised therefore
>    training good messages will never work with APPENDs.
>
> I presume that the plugin works by monitoring COPY commands and so
> can't work properly when a move is done by FETCH-APPEND-DELETE.
>
> For sa-learn the problem would be 3, but I don't see how that is
> affected by allowing appends on the spam folder.

Yeah, all of that sounds like they're talking about non-vetted training mailboxes where the users are effectively talking directly to sa-learn.

I think I may see at least part of what they are driving at.

If one user trains a message as ham and another user who got a copy of the same message trains it as spam, who wins? Absent some conflict-detection mechanism, the last mailbox trained (either spam or ham) wins.

As for the other two:

spam -> spam transitions don't matter; sa-learn recognises message-IDs and won't learn from the same message in the same corpus more than once (i.e. having the same message in the spam corpus multiple times does not "weight" the tokens learned from that message). So (1) may be a performance concern but it won't affect the database.

trash -> spam transition being learned is a problem how?
That latter brings up another concern for the vetted-corpora model: if a message is *removed* from a training corpus mailbox rather than reclassified, you'd have to wipe and retrain your database from scratch to remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and should-not-have-been-trained (forget). Rather than deleting a message from the ham or spam corpus mailbox, you move it to the forget mailbox, and in the next training pass sa-learn forgets the message and removes it from the forget mailbox. This would require some special scripting, because you can't just "sa-learn --forget" a whole mailbox.

There would also need to be an audit process to detect whether the same message-ID is in both the ham and spam corpus mailboxes, so that the admin can delete (NOT forget) the incorrect classification, or forget the message if neither classification is reasonable.

--
John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
When designing software, any time you think to yourself "a user would
never be stupid enough to do *that*", you're wrong.
---
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Fri, 1 Feb 2013 09:00:48 -0800 (PST) John Hardin wrote: > On Fri, 1 Feb 2013, Ben Johnson wrote: > > > John, thanks for pointing-out the problems associated with > > re-sending the messages via sendmail. > > > > I threw a line out to the Dovecot users group and learned how to > > move messages without going through the MTA. Dovecot has a utility > > executable, "deliver", which is well-suited to the task. > > > > For those who may have a similar need, here's the Dovecot Antispam > > pipe script that I'm using, courtesy of Steffen Kaiser on the > > Dovecot Users mailing list: > > > > --- > > #!/bin/bash > > > > mode= > > for opt; do > > if test "x$*" == "x--ham"; then > > mode=HAM > > break > > elif test "x$*" == "x--spam"; then > > mode=SPAM > > break > > fi > > done > > > > if test -n "$mode"; then > > # options from http://wiki1.dovecot.org/LDA > > /usr/lib/dovecot/deliver -d u...@example.com -m > > Training.$mode fi > > > > exit 0 > > --- > > That seems a lot better. > > > Regarding the second point, I'm not sure I understand the problem. > > If someone drags a message from Trash to SPAM, shouldn't it be > > submitted for learning as spam? > > > > The last sentence sounds like somewhat of a deal-breaker. Doesn't my > > whole strategy go somewhat limp if ham cannot be submitted for > > training? > > > > John and RW, do you recommend enabling or disabling the append > > option, given the way I'm reviewing the submissions and sorting > > them manually? > > I think they're proceeding from the assumption of *un-reviewed* > training, i.e. blind trust in the reliability of the users. > > If it's possible to enable IMAP Append on a per-folder basis then > enabling it only on your training inbox folders shouldn't be an issue > - the messages won't be trained until you've reviewed them. 
> > Without that level of fine-grain control I still don't see an issue > from this if you can prevent the users from adding content directly > to the folders that sa-learn actually processes. If IMAP Append only > applies to "shared" folders then there shouldn't be a problem - > configure sa-learn to learn from folders in *your account*, that > nobody else can access directly. This is what it says: antispam_allow_append_to_spam (boolean) Specifies whether to allow appending mails to the spam folder from the unknown source. See the ALLOWING APPENDS section below for the details on why it is not advised to turn this option on. Optional, default = NO. ... ALLOWING APPENDS By appends we mean the case of mail moving when the source folder is unknown, e.g. when you move from some other account or with tools like offlineimap. You should be careful with allowing APPENDs to SPAM folders. The reason for possibly allowing it is to allow not-SPAM --> SPAM transitions to work and be trained. However, because the plugin cannot know the source of the message (it is assumed to be from OTHER folder), multiple bad scenarios can happen: 1. SPAM --> SPAM transitions cannot be recognised and are trained; 2. TRASH --> SPAM transitions cannot be recognised and are trained; 3. SPAM --> not-SPAM transitions cannot be recognised therefore training good messages will never work with APPENDs. I presume that the plugin works by monitoring COPY commands and so can't work properly when a move is done by FETCH-APPEND-DELETE. For sa-learn the problem would be 3, but I don't see how that is affected by allowing appends on the spam folder.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Fri, 1 Feb 2013, Ben Johnson wrote:

> John, thanks for pointing-out the problems associated with re-sending
> the messages via sendmail.
>
> I threw a line out to the Dovecot users group and learned how to move
> messages without going through the MTA. Dovecot has a utility
> executable, "deliver", which is well-suited to the task.
>
> For those who may have a similar need, here's the Dovecot Antispam pipe
> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
> mailing list:
>
> ---
> #!/bin/bash
>
> mode=
> for opt; do
>     if test "x$*" == "x--ham"; then
>         mode=HAM
>         break
>     elif test "x$*" == "x--spam"; then
>         mode=SPAM
>         break
>     fi
> done
>
> if test -n "$mode"; then
>     # options from http://wiki1.dovecot.org/LDA
>     /usr/lib/dovecot/deliver -d u...@example.com -m Training.$mode
> fi
>
> exit 0
> ---

That seems a lot better.

> Regarding the second point, I'm not sure I understand the problem. If
> someone drags a message from Trash to SPAM, shouldn't it be submitted
> for learning as spam?
>
> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
> whole strategy go somewhat limp if ham cannot be submitted for training?
>
> John and RW, do you recommend enabling or disabling the append option,
> given the way I'm reviewing the submissions and sorting them manually?

I think they're proceeding from the assumption of *un-reviewed* training, i.e. blind trust in the reliability of the users.

If it's possible to enable IMAP Append on a per-folder basis then enabling it only on your training inbox folders shouldn't be an issue - the messages won't be trained until you've reviewed them.

Without that level of fine-grain control I still don't see an issue from this if you can prevent the users from adding content directly to the folders that sa-learn actually processes. If IMAP Append only applies to "shared" folders then there shouldn't be a problem - configure sa-learn to learn from folders in *your account*, that nobody else can access directly.
--
John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Gun Control laws aren't enacted to control guns, they are enacted to
control people: catholics (1500s), japanese peasants (1600s), blacks
(1860s), italian immigrants (1911), the irish (1920s), jews (1930s),
blacks (1960s), the poor (always)
---
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/31/2013 5:50 PM, RW wrote: > On Thu, 31 Jan 2013 12:12:15 -0800 (PST) > John Hardin wrote: > >> On Thu, 31 Jan 2013, Ben Johnson wrote: >> > >>> So, I finally got around to tackling this change. >>> >>> With a couple of simple modifications, I was able to achieve the >>> desired result with the Dovecot Antispam plug-in. >>> >>> Basically, I changed the last two directive values from the switches >>> that are normally passed to the "sa-learn" binary (--spam and >>> --ham) to destination email addresses that are passed to "sendmail" >>> in my revised pipe script. >> >> Passing the messages through sendmail again isn't optimal as that >> will make further changes to the headers. This may have effects on >> the quality of the learning, unless the original message is attached >> as an RFC-822 attachment to the message being sent to the corpus >> mailbox, which of course means you then can't just run sa-learn >> directly against that mailbox - the review process would involve >> moving the attachment as a standalone message to the spam or ham >> learning mailbox. >> >> Ideally you want to just move the messages between mailboxes without >> involving another delivery processing. I don't know enough about >> Dovecot or your topology to say whether that's going to be as easy as >> using sendmail to mail the message to you. > > Actually that's the way that the dovecot plugin works. I think that the > sendmail option is mainly a way to get training done on a remote > machine - it's a standard feature of DSPAM for which the plugin was > originally developed. > > When I looked at the plugin it seemed to have quite a serious flaw. > IIRC it disables IMAP APPENDs on the Spam folder which makes it > incompatible with synchronisation tools like OfflineImap and probably > some IMAP clients that implement offline support in the same way. > John, thanks for pointing-out the problems associated with re-sending the messages via sendmail. 
I threw a line out to the Dovecot users group and learned how to move messages without going through the MTA. Dovecot has a utility executable, "deliver", which is well-suited to the task.

For those who may have a similar need, here's the Dovecot Antispam pipe script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users mailing list:

---
#!/bin/bash

mode=
for opt; do
    if test "x$*" == "x--ham"; then
        mode=HAM
        break
    elif test "x$*" == "x--spam"; then
        mode=SPAM
        break
    fi
done

if test -n "$mode"; then
    # options from http://wiki1.dovecot.org/LDA
    /usr/lib/dovecot/deliver -d u...@example.com -m Training.$mode
fi

exit 0
---

And here are the Antispam plug-in options:

---
# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = --spam
antispam_mail_notspam = --ham
---

RW, thank you for underscoring the issue with IMAP appends. It looks as though a configuration directive exists to control this behavior:

# Whether to allow APPENDing to SPAM folders or not. Must be set to
# "yes" (case insensitive) to be activated. Before activating, please
# read the discussion below.
#
antispam_allow_append_to_spam = no

Unfortunately, I don't fully understand the implications of enabling or disabling this option. Here's the "discussion below" that is referenced in the above comment:

---
ALLOWING APPENDS?

You should be careful with allowing APPENDs to SPAM folders. The reason for possibly allowing it is to allow not-SPAM --> SPAM transitions to work with offlineimap. However, because with APPEND the plugin cannot know the source of the message, multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained
2. the same holds for Trash --> SPAM transitions

Additionally, because we cannot recognise SPAM --> not-SPAM transitions, training good messages will never work with APPEND.
--- In consideration of the first point, what is a "SPAM --> SPAM transition"? Is that when the mailbox contains more than one "spam folder", e.g., "JUNK" and "SPAM", and the user drags a message from one to the other? Regarding the second point, I'm not sure I understand the problem. If someone drags a message from Trash to SPAM, shouldn't it be submitted for learning as spam? The last sentence sounds like somewhat of a deal-breaker. Doesn't my whole strategy go somewhat limp if ham cannot be submitted for training? John and RW, do you recommend enabling or disabling the append option, given the way I'm reviewing the submissions and sorting them manually? Sorry for all the questions! And thanks! -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 31 Jan 2013 12:12:15 -0800 (PST) John Hardin wrote: > On Thu, 31 Jan 2013, Ben Johnson wrote: > > > So, I finally got around to tackling this change. > > > > With a couple of simple modifications, I was able to achieve the > > desired result with the Dovecot Antispam plug-in. > > > > Basically, I changed the last two directive values from the switches > > that are normally passed to the "sa-learn" binary (--spam and > > --ham) to destination email addresses that are passed to "sendmail" > > in my revised pipe script. > > Passing the messages through sendmail again isn't optimal as that > will make further changes to the headers. This may have effects on > the quality of the learning, unless the original message is attached > as an RFC-822 attachment to the message being sent to the corpus > mailbox, which of course means you then can't just run sa-learn > directly against that mailbox - the review process would involve > moving the attachment as a standalone message to the spam or ham > learning mailbox. > > Ideally you want to just move the messages between mailboxes without > involving another delivery processing. I don't know enough about > Dovecot or your topology to say whether that's going to be as easy as > using sendmail to mail the message to you. Actually that's the way that the dovecot plugin works. I think that the sendmail option is mainly a way to get training done on a remote machine - it's a standard feature of DSPAM for which the plugin was originally developed. When I looked at the plugin it seemed to have quite a serious flaw. IIRC it disables IMAP APPENDs on the Spam folder which makes it incompatible with synchronisation tools like OfflineImap and probably some IMAP clients that implement offline support in the same way.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 31 Jan 2013, Ben Johnson wrote: On 1/15/2013 5:22 PM, John Hardin wrote: Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. That would be a very good idea if the number of users doing training is small. At the very least, the messages should be captured to a permanent corpus mailbox. Good idea! I'll see if I can set this up. So, I finally got around to tackling this change. With a couple of simple modifications, I was able to achieve the desired result with the Dovecot Antispam plug-in. Basically, I changed the last two directive values from the switches that are normally passed to the "sa-learn" binary (--spam and --ham) to destination email addresses that are passed to "sendmail" in my revised pipe script. Passing the messages through sendmail again isn't optimal as that will make further changes to the headers. This may have effects on the quality of the learning, unless the original message is attached as an RFC-822 attachment to the message being sent to the corpus mailbox, which of course means you then can't just run sa-learn directly against that mailbox - the review process would involve moving the attachment as a standalone message to the spam or ham learning mailbox. Ideally you want to just move the messages between mailboxes without involving another delivery processing. I don't know enough about Dovecot or your topology to say whether that's going to be as easy as using sendmail to mail the message to you. 
--
John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
If guns kill people, then...
-- pencils miss spel words.
-- cars make people drive drunk.
-- spoons make people fat.
---
Tomorrow: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote:

>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam
>>>> plug-in. They do so unsupervised. Why this could be a problem is
>>>> obvious. And no, I don't retain their submissions. I probably
>>>> should. I wonder if I can make a few slight modifications to the
>>>> shell script that Antispam calls, such that it simply sends a copy
>>>> of the message to an administrator rather than calling sa-learn on
>>>> the message.
>>>
>>> That would be a very good idea if the number of users doing training
>>> is small. At the very least, the messages should be captured to a
>>> permanent corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.

So, I finally got around to tackling this change. With a couple of simple modifications, I was able to achieve the desired result with the Dovecot Antispam plug-in.

In dovecot.conf:

---
plugin {
    # [...]

    # For Dovecot < 2.0.
    antispam_spam_pattern_ignorecase = SPAM;JUNK
    antispam_mail_tmpdir = /tmp
    antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
    antispam_mail_spam = proposed-s...@example.com
    antispam_mail_notspam = proposed-...@example.com
}
---

Basically, I changed the last two directive values from the switches that are normally passed to the "sa-learn" binary (--spam and --ham) to destination email addresses that are passed to "sendmail" in my revised pipe script.

Here is the full pipe script, /usr/bin/sa-learn-pipe.sh (apologies for the wrapping; the original commands are commented-out with two pound symbols [##]):

---
#!/bin/sh

# Add "starting now" string to log.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log

# Copy the message contents to a temporary text file.
cat <&0 >> /tmp/sendmail-msg-$$.txt

CURRENT_USER=$(whoami)

##echo "Calling (as user $CURRENT_USER) '/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log
echo "Calling (as user $CURRENT_USER) 'sendmail $* < /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log

# Execute sa-learn, with the passed ham/spam argument, and the temporary
# message contents. Send the output to the log file while redirecting
# stderr to stdout (so we capture debug output).
##/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1
sendmail $* < /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1

# Remove the temporary message.
rm -f /tmp/sendmail-msg-$$.txt

# Add "ending now" string to log.
echo "$$-end" >> /tmp/sa-learn-pipe.log

# Exit with "success" status code.
exit 0
---

It seems as though creating a temporary copy of the message is not strictly necessary, as the message contents could be passed to the "sendmail" command via standard input (stdin), but creating the copy could be useful in debugging.

>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different,
>> I suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.
The steps I've taken above will allow me to review submissions and educate users who engage in this practice. Thanks again for elucidating this scenario. I hope that this approach to user-based SpamAssassin training is useful to others. Best regards, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, I've been keeping an eye on things again today. Overall, things look pretty good: most spam is being blocked outright at the MTA, and what does get through is scored appropriately in SA.

I've been inspecting the X-Spam-Status headers for the handful of messages that do slip through and noticed that most of them lack any evidence of the BAYES_* tests. Here's one such header:

No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
SPF_PASS=-0.001] autolearn=disabled

The messages that were delivered just before and after this one do have evidence of BAYES_* tests, so it's not as though something is completely broken. Are there any normal circumstances under which Bayes tests are not run? Do I need to turn debugging back on and wait until this happens again?

Thanks for all the help, everyone!

-Ben
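One normal circumstance under which BAYES_* rules do not fire at all: SpamAssassin will not consult Bayes until it has learned at least 200 hams and 200 spams (the bayes_min_ham_num / bayes_min_spam_num defaults). Since the rules fire on some messages here, that is probably not the cause; intermittent absence more likely points to a transient failure to open the Bayes store, which running `spamassassin -D bayes < message` on a saved copy can confirm. A small sketch for checking the training counts from `sa-learn --dump magic` output (the 200/200 thresholds are SA's defaults; the awk parsing assumes the standard dump format):

```shell
# Helper: given `sa-learn --dump magic` output on stdin, exit 0 if the
# Bayes DB has enough training for the BAYES_* rules to fire at all.
# 200/200 are SpamAssassin's default bayes_min_ham_num/bayes_min_spam_num.
bayes_ready() {
    awk '/nspam/ { ns = $3 } /nham/ { nh = $3 }
         END { exit !(ns >= 200 && nh >= 200) }'
}
```

Usage would be something like `sa-learn --dump magic | bayes_ready && echo "Bayes has enough training"` -- run as the amavis user so it sees the same database.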
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 2:22 PM, Bowie Bailey wrote: > On 1/16/2013 1:18 PM, Ben Johnson wrote: >> >> On 1/16/2013 11:00 AM, John Hardin wrote: >>> On Wed, 16 Jan 2013, Ben Johnson wrote: >>> Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. >>> That shouldn't be the case. SA and sa-learn both use a shared-access >>> database; if you're training the database that SA is learning, the >>> results of training should be effective immediately. >>> >> Okay, good. Bowie's response to this question differed (he suggested >> that Amavis would need to be restarted for Bayes to be updated), but I'm >> pretty sure that restarting Amavis is not necessary. It seems unlikely >> that Amavis would copy the entire Bayes DB (which is stored in MySQL on >> this server) into memory every time that the Amavis service is started. >> To do so seems self-defeating: more RAM usage, worse performance, etc. > > Actually, I was making a general observation. > > For cases where you would normally need to restart spamd, you will need > to restart amavis. This includes things like rule and configuration > changes. > > Bayes data is read dynamically from your MySQL database and thus does > not require a restart of amavis/spamd when updated. > My apologies, Bowie. I misinterpreted your response. Thank you very much for the follow-up and for the clear explanation. Best regards, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

>>> smtpd_recipient_restrictions =
>>>     reject_rbl_client bl.spamcop.net,
>>>     reject_rbl_client list.dsbl.org,
>>>     reject_rbl_client sbl-xbl.spamhaus.org,
>>>     reject_rbl_client cbl.abuseat.org,
>>>     reject_rbl_client dul.dnsbl.sorbs.net,
>>
>> Several of those are combined into ZEN. If you use Zen instead you'll
>> save some DNS queries. See the Spamhaus link I provided earlier for
>> details; I don't offhand remember which ones go into ZEN.
>
> Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
> acted upon your mutual suggestion regarding ZEN:
>
>     reject_rbl_client bl.spamcop.net,
>     reject_rbl_client zen.spamhaus.org,
>     reject_rbl_client dnsbl.sorbs.net,

I've also started using the following, but it could be specific to postfix v2.9:

    reject_rhsbl_reverse_client zen.spamhaus.org,
    reject_rhsbl_sender zen.spamhaus.org,
    reject_rhsbl_helo zen.spamhaus.org,

Are you using rbl_reply_maps? Prior to postscreen, I was using it in this way:

    rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps

I'm not sure it's necessary in your situation. You can find more about this here: http://www.postfix.org/STRESS_README.html

No doubt the guys on this list have been incredibly helpful in the past. I'd like to thank them again as well.

> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated), but I'm
> pretty sure that restarting Amavis is not necessary. It seems unlikely
> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
> this server) into memory every time that the Amavis service is started.
> To do so seems self-defeating: more RAM usage, worse performance, etc.

I also don't believe it's necessary to restart amavisd when changes are made to bayes. I'm also using mysql. I just wish replication was faster, or I would use it across my multiple mail servers.
Instead, I have to have multiple separate mysql bayes databases, each with their own tokens, corpus that's used for training, etc, despite it all being for a single domain. Regarding restarting amavisd, this is always frustrating to me. I'm sometimes making changes very frequently, and amavisd doesn't always restart reliably. Despite a "service amavisd stop" on fedora, it doesn't completely stop, but instead just goes catatonic and requires me to manually kill it. I've asked on the amavisd list, but no one has been able to help. I've tried just issuing a "reload" but that doesn't always work either. Does anyone know if it's possible to send it a signal or a way to more reliably signal amavisd? Thanks, Alex
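Alex mentions rbl_reply_maps above; for readers unfamiliar with it, a minimal sketch of the lookup table might look like the following. The reply text is purely illustrative; $client is one of the template attributes Postfix expands in rbl_reply templates:

```
# /etc/postfix/rbl_reply_maps -- compile with "postmap /etc/postfix/rbl_reply_maps".
# Key is the DNSBL domain; value is the SMTP reply template to use for it.
zen.spamhaus.org  554 5.7.1 Service unavailable; client [$client] blocked using zen.spamhaus.org
```

After editing the source file, run postmap and reload Postfix so the hash: database is rebuilt and picked up.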
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 1:18 PM, Ben Johnson wrote: On 1/16/2013 11:00 AM, John Hardin wrote: On Wed, 16 Jan 2013, Ben Johnson wrote: Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. That shouldn't be the case. SA and sa-learn both use a shared-access database; if you're training the database that SA is learning, the results of training should be effective immediately. Okay, good. Bowie's response to this question differed (he suggested that Amavis would need to be restarted for Bayes to be updated), but I'm pretty sure that restarting Amavis is not necessary. It seems unlikely that Amavis would copy the entire Bayes DB (which is stored in MySQL on this server) into memory every time that the Amavis service is started. To do so seems self-defeating: more RAM usage, worse performance, etc. Actually, I was making a general observation. For cases where you would normally need to restart spamd, you will need to restart amavis. This includes things like rule and configuration changes. Bayes data is read dynamically from your MySQL database and thus does not require a restart of amavis/spamd when updated. -- Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 16 Jan 2013, Ben Johnson wrote: On 1/16/2013 11:00 AM, John Hardin wrote: That's odd. That suggests your SA wasn't looking up those DNSBLs, or they would have contributed to the score. Check your trusted networks setting. One difference between SMTP-time and SA-time DNSBL checks is that SMTP-time checks the IP address of the client talking to the MTA, while SA-time can go back up the relay chain if necessary (e.g. to check the client IP submitting to your ISP if your ISP's MTA is between your MTA and the Internet, rather than always checking your ISP's MTA IP address). Are you referring to SA's "trusted_networks" directive? Yes. If so, it is commented-out (presumably by default). Does this need to be set? I've read the info re: trusted_networks at http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html , but I'm struggling to understand it. It means "which MTAs are trusted to not forge Received headers". There is a related one: internal_networks, which lists networks that are considered "internal" to your inbound mail topology. Sorry I missed that one in my first message. This one you'd set if you were retrieving your email from your ISP rather than directly exposing an MTA to the Internet. If the info is helpful, I have a very simple setup here: a single server with a single public IP address and a single MTA. That's the assumed default environment. If you aren't explicitly setting trusted_networks and internal_networks you should be okay. That said, the Bayes scores seem to be much more accurate now, too. I was hardly ever seeing BAYES_99 before, but now almost all spam messages have BAYES_99. Odd. SMTP-time hard rejects shouldn't change that. That's what I figured. I wonder if feeding all of the messages that I "auto-learned manually" -- messages that were tagged as spam (but for reasons unrelated to Bayes) -- contributed significantly to this change. Quite possibly. That shouldn't be the case. 
SA and sa-learn both use a shared-access database; if you're training the database that SA is learning, the results of training should be effective immediately. Okay, good. Bowie's response to this question differed (he suggested that Amavis would need to be restarted for Bayes to be updated), No, he didn't, he said that in a situation where you'd have to restart spamd, you instead need to restart amavisd. One such situation is after running sa-update and getting updated rules. but I'm pretty sure that restarting Amavis is not necessary. It seems unlikely that Amavis would copy the entire Bayes DB (which is stored in MySQL on this server) into memory every time that the Amavis service is started. To do so seems self-defeating: more RAM usage, worse performance, etc. Right. So, I emptied the Bayes DB and re-trained ham and spam on my hand-sorted corpus. The net result was to discard all previous end-user training, if I understand correctly. That is correct. Everything still looks good; mostly BAYES_99 on the messages that are and should be marked as spam, and no false-positives at all. yay! I've disabled the Antispam plug-in for now, for the reasons we've already discussed. I have asked the Dovecot mailing list for suggestions regarding how best to pre-screen end-user training submissions. I think I'm in pretty good shape here, unless setting trusted_networks is a must, in which case I could use some guidance. No, sounds like you're good for that. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- It is criminal to teach a man not to defend himself when he is the constant victim of brutal attacks. -- Malcolm X (1964) --- Tomorrow: Benjamin Franklin's 307th Birthday
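For reference, and only if the defaults did guess wrong about one's topology: a local.cf sketch for the single-MTA setup described above. The 192.0.2.0/24 range is a placeholder (RFC 5737 documentation space), not a recommendation:

```
# local.cf -- only needed when SA's auto-detection gets your topology wrong.
# MTAs/relays trusted not to forge Received headers:
trusted_networks 192.0.2.0/24
# internal_networks defaults to the value of trusted_networks when unset,
# which is correct for a single MTA exposed directly to the Internet.
```

For the simple case in this thread (one server, one public IP, one MTA), leaving both directives commented out is the right move, as John says.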
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 11:00 AM, John Hardin wrote: > On Wed, 16 Jan 2013, Ben Johnson wrote: > >> On 1/15/2013 5:22 PM, John Hardin wrote: >>> On Tue, 15 Jan 2013, Ben Johnson wrote: Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! >>> >>> Which ones are you using now? There are DNSBLs that are good, but not >>> quite good enough to trust as hard-reject SMTP-time filters. That's why >>> SA does scored DNSBL checks. >> >> smtpd_recipient_restrictions = >> reject_rbl_client bl.spamcop.net, >> reject_rbl_client list.dsbl.org, >> reject_rbl_client sbl-xbl.spamhaus.org, >> reject_rbl_client cbl.abuseat.org, >> reject_rbl_client dul.dnsbl.sorbs.net, > > Several of those are combined into ZEN. If you use Zen instead you'll > save some DNS queries. See the Spamhaus link I provided earlier for > details, I don't offhand remember which ones go into ZEN. Per Noel's advice, I have shortened the list (dsbl.org is defunct) and acted upon your mutual suggestion regarding ZEN: reject_rbl_client bl.spamcop.net, reject_rbl_client zen.spamhaus.org, reject_rbl_client dnsbl.sorbs.net, Indeed, block entries for all three lists are being registered in the mail log. Very nice. It seems as though adding these SMTP-time rejects has blocked about 1/2 of the spam that was coming through previously. Awesome. >> These are "hard rejects", right? So if this change has reduced spam, >> said spam would not be accepted for delivery at all; it would be >> rejected outright. Correct? (And if I understand you, this is part of >> your concern.) > > Correct. 
> >> The reason I ask, and a point that I should have clarified in my last >> post, is that the *volume* of spam didn't drop by 90% (although, it may >> have dropped by some measure), but rather the accuracy with which SA >> tagged spam was 90% higher. > > That's odd. That suggests your SA wasn't looking up those DNSBLs, or they > would have contributed to the score. > > Check your trusted networks setting. One difference between SMTP-time > and SA-time DNSBL checks is that SMTP-time checks the IP address of the > client talking to the MTA, while SA-time can go back up the relay chain > if necessary (e.g. to check the client IP submitting to your ISP if your > ISP's MTA is between your MTA and the Internet, rather than always > checking your ISP's MTA IP address). Are you referring to SA's "trusted_networks" directive? If so, it is commented-out (presumably by default). Does this need to be set? I've read the info re: trusted_networks at http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html , but I'm struggling to understand it. If the info is helpful, I have a very simple setup here: a single server with a single public IP address and a single MTA. >> Ultimately, I'm wondering if the observed change was simply a product of >> these message "campaigns" being black-listed after a few days of >> circulation, and not the Postfix configuration change. > > Maybe. > >> At this point, the vast majority of X-Spam-Status headers include Razor2 >> and Pyzor tests that contribute significantly to the score. I should >> have mentioned earlier that I installed Razor2 and Pyzor after making my >> initial post. The only reasons I didn't are that a) they didn't seem to >> be making a significant difference for the first day or so after I >> installed them (this could be for the snowshoe reasons we've already >> discussed), and b) the low Bayes scores seemed to be the real problem >> anyway. 
>> >> That said, the Bayes scores seem to be much more accurate now, too. I >> was hardly ever seeing BAYES_99 before, but now almost all spam messages >> have BAYES_99. > > Odd. SMTP-time hard rejects shouldn't change that. That's what I figured. I wonder if feeding all of the messages that I "auto-learned manually" -- messages that were tagged as spam (but for reasons unrelated to Bayes) -- contributed significantly to this change. I did this late yesterday afternoon and when I took a status check this morning, I was seeing BAYES_99 for almost every message. >> Is it possible that the training I've been doing over the last week or >> so wasn't *effective* until recently, say, after restarting some >> component of the mail stack? My understanding is that calling SA via >> Amavis, which does not need/use the spamd daemon, forces all Bayes data >> to be up-to-date on each call to spamassassin. > > That shouldn't be the case. SA and sa-learn both use a shared-access > database; if you're training the database that SA is learning, the > results of training should be effective immediately. > Okay, good. Bowie's response to this question differed (he suggested that Amavis would need to be restarted for Bayes to be updated),
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 9:49 AM, Ben Johnson wrote: > smtpd_recipient_restrictions = > reject_rbl_client bl.spamcop.net, spamcop has a reputation of being somewhat aggressive on blocking, and their website recommends using it in a scoring system (e.g. SpamAssassin) rather than for outright blocking. That said, many folks (including me) use it anyway and find it acceptable. See the spamcop website for details, and make your own choice. > reject_rbl_client list.dsbl.org, list.dsbl.org is no longer active. Remove this line. > reject_rbl_client sbl-xbl.spamhaus.org, > reject_rbl_client cbl.abuseat.org, The spamhaus lists are now consolidated in zen.spamhaus.org; replace the above two lines. See the spamhaus web site for details. > reject_rbl_client dul.dnsbl.sorbs.net, This one is OK. Again, you should check their website and review their published listing policy to see if this is something you want to block. Blocking mail is a very site-specific choice. Use the advice you get as a starting point and make your own decision about how aggressive you want to be. -- Noel Jones
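Applying Noel's advice to the list Ben posted would leave something like the sketch below. The permit/reject ordering mirrors the main.cf shown earlier in the thread; whether to keep spamcop and SORBS as hard rejects is, as Noel says, a site-specific choice:

```
smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net,
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client dul.dnsbl.sorbs.net
```

Note that reject_unauth_destination (or an equivalent) must appear before the RBL checks are of any consequence for relaying; Postfix will refuse to run without a relay-protecting restriction.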
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 10:49 AM, Ben Johnson wrote: On 1/15/2013 5:22 PM, John Hardin wrote: On Tue, 15 Jan 2013, Ben Johnson wrote: Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! Which ones are you using now? There are DNSBLs that are good, but not quite good enough to trust as hard-reject SMTP-time filters. That's why SA does scored DNSBL checks. smtpd_recipient_restrictions = reject_rbl_client bl.spamcop.net, reject_rbl_client list.dsbl.org, reject_rbl_client sbl-xbl.spamhaus.org, reject_rbl_client cbl.abuseat.org, reject_rbl_client dul.dnsbl.sorbs.net, I acquired this list from the article that I cited a few responses back. It is quite possible that some of these are obsolete, as the article is from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is obsolete, but now I can't find the source. I'm not sure if it is considered "obsolete", but it has been generally replaced by zen.spamhaus.org instead. Zen incorporates SBL, XBL, CSS, and PBL. (See http://www.spamhaus.org/zen/) These are "hard rejects", right? So if this change has reduced spam, said spam would not be accepted for delivery at all; it would be rejected outright. Correct? (And if I understand you, this is part of your concern.) Exactly. The reason I ask, and a point that I should have clarified in my last post, is that the *volume* of spam didn't drop by 90% (although, it may have dropped by some measure), but rather the accuracy with which SA tagged spam was 90% higher. These rejects will drop the total volume of spam. SA's accuracy may appear to go up if some of the more difficult spams are now being blocked by the blacklists. 
Ultimately, I'm wondering if the observed change was simply a product of these message "campaigns" being black-listed after a few days of circulation, and not the Postfix configuration change. At this point, the vast majority of X-Spam-Status headers include Razor2 and Pyzor tests that contribute significantly to the score. I should have mentioned earlier that I installed Razor2 and Pyzor after making my initial post. The only reasons I didn't are that a) they didn't seem to be making a significant difference for the first day or so after I installed them (this could be for the snowshoe reasons we've already discussed), and b) the low Bayes scores seemed to be the real problem anyway. That said, the Bayes scores seem to be much more accurate now, too. I was hardly ever seeing BAYES_99 before, but now almost all spam messages have BAYES_99. Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. Amavis incorporates the SA code into itself. So in any instance where you would normally need to restart spamd, you should instead restart Amavis. In effect, Amavis is its own spamd daemon. -- Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 16 Jan 2013, Ben Johnson wrote: On 1/15/2013 5:22 PM, John Hardin wrote: On Tue, 15 Jan 2013, Ben Johnson wrote: Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! Which ones are you using now? There are DNSBLs that are good, but not quite good enough to trust as hard-reject SMTP-time filters. That's why SA does scored DNSBL checks. smtpd_recipient_restrictions = reject_rbl_client bl.spamcop.net, reject_rbl_client list.dsbl.org, reject_rbl_client sbl-xbl.spamhaus.org, reject_rbl_client cbl.abuseat.org, reject_rbl_client dul.dnsbl.sorbs.net, Several of those are combined into ZEN. If you use Zen instead you'll save some DNS queries. See the Spamhaus link I provided earlier for details, I don't offhand remember which ones go into ZEN. These are "hard rejects", right? So if this change has reduced spam, said spam would not be accepted for delivery at all; it would be rejected outright. Correct? (And if I understand you, this is part of your concern.) Correct. The reason I ask, and a point that I should have clarified in my last post, is that the *volume* of spam didn't drop by 90% (although, it may have dropped by some measure), but rather the accuracy with which SA tagged spam was 90% higher. That's odd. That suggests your SA wasn't looking up those DNSBLs, or they would have contributed to the score. Check your trusted networks setting. One difference between SMTP-time and SA-time DNSBL checks is that SMTP-time checks the IP address of the client talking to the MTA, while SA-time can go back up the relay chain if necessary (e.g. to check the client IP submitting to your ISP if your ISP's MTA is between your MTA and the Internet, rather than always checking your ISP's MTA IP address). 
Ultimately, I'm wondering if the observed change was simply a product of these message "campaigns" being black-listed after a few days of circulation, and not the Postfix configuration change. Maybe. At this point, the vast majority of X-Spam-Status headers include Razor2 and Pyzor tests that contribute significantly to the score. I should have mentioned earlier that I installed Razor2 and Pyzor after making my initial post. The only reasons I didn't are that a) they didn't seem to be making a significant difference for the first day or so after I installed them (this could be for the snowshoe reasons we've already discussed), and b) the low Bayes scores seemed to be the real problem anyway. That said, the Bayes scores seem to be much more accurate now, too. I was hardly ever seeing BAYES_99 before, but now almost all spam messages have BAYES_99. Odd. SMTP-time hard rejects shouldn't change that. Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. That shouldn't be the case. SA and sa-learn both use a shared-access database; if you're training the database that SA is learning, the results of training should be effective immediately. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- One difference between a liberal and a pickpocket is that if you demand your money back from a pickpocket he will not question your motives. -- William Rusher --- Tomorrow: Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 2:02 AM, Tom Hendrikx wrote: > On 1/15/13 5:26 PM, Ben Johnson wrote: > >> >> In postfix's main.cf: >> > >> >> Hmm, very interesting. No, I have no greylisting in place as yet, and >> no, my userbase doesn't demand immediate delivery. I will look into >> greylisting further. > > If you're running postfix, consider using postscreen. It's a recent > addition to postfix that also can behave in a greylisting alike way, and > much more. > > Read: http://www.postfix.org/POSTSCREEN_README.html > > -- > Tom > Thanks for the suggestion, Tom! Unfortunately, I'm stuck on Postfix 2.7 for a while yet, and Postscreen is available for versions >= 2.8 only. I will definitely look into it once I'm on 2.8+, however. Cheers, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote: > On Tue, 15 Jan 2013, Ben Johnson wrote: > >> >> >> On 1/15/2013 1:55 PM, John Hardin wrote: >>> On Tue, 15 Jan 2013, Ben Johnson wrote: >>> On 1/14/2013 8:16 PM, John Hardin wrote: > On Mon, 14 Jan 2013, Ben Johnson wrote: > > Question: do you have any SMTP-time hard-reject DNSBL tests in > place? Or > are they all performed by SA? In postfix's main.cf: smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, reject_unauth_destination, reject_rbl_client bl.spamcop.net Do you recommend something more? >>> >>> Unfortunately I have no experience administering Postfix. Perhaps one of >>> the other listies can help. >> >> Wow! Adding several more reject_rbl_client entries to the >> smtpd_recipient_restrictions directive in the Postfix configuration >> seems to be having a tremendous impact. The amount of spam coming >> through has dropped by 90% or more. This was a HUGELY helpful >> suggestion, John! > > Which ones are you using now? There are DNSBLs that are good, but not > quite good enough to trust as hard-reject SMTP-time filters. That's why > SA does scored DNSBL checks. smtpd_recipient_restrictions = reject_rbl_client bl.spamcop.net, reject_rbl_client list.dsbl.org, reject_rbl_client sbl-xbl.spamhaus.org, reject_rbl_client cbl.abuseat.org, reject_rbl_client dul.dnsbl.sorbs.net, I acquired this list from the article that I cited a few responses back. It is quite possible that some of these are obsolete, as the article is from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is obsolete, but now I can't find the source. These are "hard rejects", right? So if this change has reduced spam, said spam would not be accepted for delivery at all; it would be rejected outright. Correct? (And if I understand you, this is part of your concern.) 
The reason I ask, and a point that I should have clarified in my last post, is that the *volume* of spam didn't drop by 90% (although, it may have dropped by some measure), but rather the accuracy with which SA tagged spam was 90% higher. Ultimately, I'm wondering if the observed change was simply a product of these message "campaigns" being black-listed after a few days of circulation, and not the Postfix configuration change. At this point, the vast majority of X-Spam-Status headers include Razor2 and Pyzor tests that contribute significantly to the score. I should have mentioned earlier that I installed Razor2 and Pyzor after making my initial post. The only reasons I didn't are that a) they didn't seem to be making a significant difference for the first day or so after I installed them (this could be for the snowshoe reasons we've already discussed), and b) the low Bayes scores seemed to be the real problem anyway. That said, the Bayes scores seem to be much more accurate now, too. I was hardly ever seeing BAYES_99 before, but now almost all spam messages have BAYES_99. Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. It bears mention that I haven't yet dumped the Bayes DB and retrained using my corpus. I'll do that next and see where we land once the DB is repopulated. Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. 
>>> >>> That would be a very good idea if the number of users doing training is >>> small. At the very least, the messages should be captured to a permanent >>> corpus mailbox. >> >> Good idea! I'll see if I can set this up. >> >>> Do your users also train ham? Are the procedures similar enough that >>> your users could become easily confused? >> >> They do. The procedure is implemented via Dovecot's Antispam plug-in. >> Basically, moving mail from Inbox to Junk trains it as spam, and moving >> mail from Junk to Inbox trains it as ham. I really like this setup >> (Antispam + calling SA through Amavis [i.e. not using spamd]) because >> the results are effective immediately, which seems to be crucial for >> combating this snowshoe spam (performance and scalability aside). >> >> I don't find that procedure to be confusing, but people are different, I >> suppose. > > Hm. One thing I would watch out for in that environment is people who > have intentionally subscribed to some sort of mailing list deciding they > don't want to receive it any longer and just junking the messages rather > than unsubscribing.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/13 5:26 PM, Ben Johnson wrote: > > In postfix's main.cf: > > > Hmm, very interesting. No, I have no greylisting in place as yet, and > no, my userbase doesn't demand immediate delivery. I will look into > greylisting further. If you're running postfix, consider using postscreen. It's a recent addition to postfix that also can behave in a greylisting-like way, and much more. Read: http://www.postfix.org/POSTSCREEN_README.html -- Tom
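A minimal postscreen sketch for Postfix 2.8+, using parameter names from the POSTSCREEN_README; the DNSBL weights and threshold here are illustrative values, not recommendations:

```
# main.cf
postscreen_greet_action    = enforce
postscreen_dnsbl_sites     = zen.spamhaus.org*2, bl.spamcop.net*1
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_action    = enforce
```

master.cf also needs the public "smtp inet" service pointed at postscreen, with smtpd run as a pass service behind it (plus the dnsblog and tlsproxy helpers); the README walks through that part.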
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 17:23, John Hardin wrote: On Tue, 15 Jan 2013, jdow wrote: On 2013/01/15 08:26, Ben Johnson wrote: Based on my responses, what's the next move? Backup the Bayes DB, wipe it, and feed my corpus through the ol' chipper? (Sure to infuriate BUT - read the WHOLE note.) Are you sure your Bayes database is well trained? But let's change that to, "Is the Bayes database SpamAssassin is using when receiving email the same as the Bayes database you are training with sa-learn?" Yeah, we already checked that possibility. OK, then I shut my fat mouth. {^_-}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, jdow wrote: On 2013/01/15 08:26, Ben Johnson wrote: Based on my responses, what's the next move? Backup the Bayes DB, wipe it, and feed my corpus through the ol' chipper? (Sure to infuriate BUT - read the WHOLE note.) Are you sure your Bayes database is well trained? But let's change that to, "Is the Bayes database SpamAssassin is using when receiving email the same as the Bayes database you are training with sa-learn?" Yeah, we already checked that possibility. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Gun Control enables genocide while doing little to reduce crime. --- 2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 08:26, Ben Johnson wrote: Based on my responses, what's the next move? Backup the Bayes DB, wipe it, and feed my corpus through the ol' chipper? (Sure to infuriate BUT - read the WHOLE note.) Are you sure your Bayes database is well trained? But let's change that to, "Is the Bayes database SpamAssassin is using when receiving email the same as the Bayes database you are training with sa-learn?" If you are training a per user database and do not have that enabled in SpamAssassin then the training is pretty useless. Worst case waste some CPU and disk cycles to find every SpamAssassin related Bayes database on your system. If you find more than one and shouldn't then ask yourself why and sort out that problem. {^_^}
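jdow's "find every Bayes database on your system" suggestion can be sketched as below. The search roots are typical Linux home locations and are assumptions; adjust for your layout. DBM-backed Bayes databases use files named bayes_toks and bayes_seen, usually under a .spamassassin directory; SQL-backed Bayes will not show up in this sweep.

```shell
# Sweep for DBM-style Bayes files. More than one hit, or a hit under an
# unexpected home directory, points at the training/scanning mismatch
# jdow describes (sa-learn writing one DB, the scanner reading another).
found=$(find /root /home /var/lib \( -name 'bayes_toks' -o -name 'bayes_seen' \) 2>/dev/null || true)
echo "Bayes DBM files found: ${found:-none}"
```

If this reports nothing and you know you are on DBM Bayes, check the bayes_path setting and the home directory of the user the scanner runs as.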
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 07:27, Ben Johnson wrote: On 1/14/2013 7:48 PM, Noel wrote: On 1/14/2013 2:59 PM, Ben Johnson wrote: jdow, Noel, and John, I can't thank you enough for your very thorough responses. Your time is valuable and I sincerely appreciate your willingness to help. Glad it was even marginally helpful. Ben, do be aware that sometimes you draw the short straw and sit at the very start of the spam distribution cycle. In those cases the BLs will generally not have been alerted yet so they may not trigger. For those situations the rules should be your friends. (I still use my treasured set of SARE rules and personally hand crafted rules my partner and I have created that fit OUR needs but may not be good general purpose rules.) This makes perfect sense and underscores the importance of a finely-tuned rule-set. It's become apparent just how dynamic and capable a monster the spam industry is. No one approach will ever be a panacea, it seems. The advice from your second email is well-received, too. Especially the part about not killing anybody. ;) I do hope fighting spam becomes fun for me, because so far, it's been an uphill battle! Hehe. Noel, thanks for excellent responses to my questions. It got fun enough in the old days with more spam than I'm getting now to taunt the spammers who monitored this list. "Gee, , you only managed a 95 on that last spam I got. Surely you can do better and make it to 100 on small scoring rules." He did. You actually get to the point you can recognize the style of various spam programs and often relate them back to the spammer using spamhaus. These days of full automation might make that harder. But, still, you can probably start recognizing stylistic elements of the various programs soon enough. {^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, Ben Johnson wrote: On 1/15/2013 1:55 PM, John Hardin wrote: On Tue, 15 Jan 2013, Ben Johnson wrote: On 1/14/2013 8:16 PM, John Hardin wrote: On Mon, 14 Jan 2013, Ben Johnson wrote: Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or are they all performed by SA? In postfix's main.cf: smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, reject_unauth_destination, reject_rbl_client bl.spamcop.net Do you recommend something more? Unfortunately I have no experience administering Postfix. Perhaps one of the other listies can help. Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! Which ones are you using now? There are DNSBLs that are good, but not quite good enough to trust as hard-reject SMTP-time filters. That's why SA does scored DNSBL checks. Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. That would be a very good idea if the number of users doing training is small. At the very least, the messages should be captured to a permanent corpus mailbox. Good idea! I'll see if I can set this up. Do your users also train ham? Are the procedures similar enough that your users could become easily confused? They do. The procedure is implemented via Dovecot's Antispam plug-in. 
Basically, moving mail from Inbox to Junk trains it as spam, and moving mail from Junk to Inbox trains it as ham. I really like this setup (Antispam + calling SA through Amavis [i.e. not using spamd]) because the results are effective immediately, which seems to be crucial for combating this snowshoe spam (performance and scalability aside). I don't find that procedure to be confusing, but people are different, I suppose. Hm. One thing I would watch out for in that environment is people who have intentionally subscribed to some sort of mailing list deciding they don't want to receive it any longer and just junking the messages rather than unsubscribing. However, your problem is FN Bayes scores... The extremely odd thing is that you say you sometimes train a message as spam, and its Bayes score goes *down*. Are you training a message and then running it through spamc to see if the score changed, or is this about _similar_ messages rather than _that_ message? Sorry for the ambiguity. This is about *similar* messages. Identical messages, at least visually speaking (I realize that there is a lot more to it than the visual component). For example, yesterday, I saw several Canadian Pharmacy emails, all of which were identical with respect to appearance. I classified each as spam, yet the Bayes score didn't budge more than a few percent for the first three, and went *down* for the 4th. I have to assume that while the messages (HTML-formatted) *appear* to be identical, the underlying code has some pseudo-random element that is designed very specifically to throw Bayes classifiers. Out of curiosity, does the Bayes engine (or some other element of SpamAssassin) have the ability to "see" rendered HTML messages, by appearance, and not by source code? If it could, it would be far more effective, it seems. That I don't know. That, and configure the user-based training to at the very least capture what they submit to a corpus so you can review it.
Whether you do that review pre-training or post-bayes-is-insane is up to you. Right, right, that makes sense. I hope I can modify the Antispam plug-in to accommodate this requirement. Well, I can't thank you enough here, John and everyone else. I seem to be on the right track; all is not lost. That said, it seems clear that SA is nowhere near as effective as it can be when an off-the-shelf configuration is used (and without configuring the MTA to do some of the blocking). I'll keep the list posted (pardon the pun) with regard to configuring Antispam to fire-off a copy of any message that is submitted for training. Ideally, whether the message is reviewed before or after sa-learn is called will be configurable. Great! Thanks! -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Your mouse has moved. Your Windows Operating System mu
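The "capture what users submit before training" idea discussed above can be sketched as a small wrapper script. This is a hypothetical sketch, not dovecot-antispam's actual backend contract: it assumes the plug-in pipes the raw message on stdin and passes the class ("spam" or "ham") as the first argument, and the corpus path is invented. The only SpamAssassin specifics relied on are sa-learn's real --spam/--ham flags.

```shell
#!/bin/sh
# Hypothetical wrapper for a dovecot-antispam "mailtrain"-style backend:
# keep a permanent copy of every user-trained message before feeding it
# to sa-learn, so Bayes can later be reviewed, wiped, and retrained.
# ASSUMPTION: the plug-in pipes the message on stdin and passes the class
# ("spam" or "ham") as $1 -- verify against your backend's documentation.

# Fall back to a no-op if sa-learn is not installed, so the sketch runs anywhere.
SA_LEARN="$(command -v sa-learn || echo true)"

CORPUS_DIR="${CORPUS_DIR:-/tmp/antispam-corpus}"   # e.g. /var/lib/amavis/corpus
mkdir -p "$CORPUS_DIR"

train() {
    class="$1"                        # "spam" or "ham"
    tmp=$(mktemp) || exit 75          # EX_TEMPFAIL, so delivery is retried
    cat > "$tmp"                      # capture the message from stdin
    # mbox append: each message needs a "From " separator line before it.
    { echo "From antispam $(date -u)"; cat "$tmp"; echo; } \
        >> "$CORPUS_DIR/$class.mbox"
    "$SA_LEARN" --"$class" "$tmp"     # sa-learn's real flags: --spam / --ham
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Demo: capture and "train" one fake message as spam.
printf 'Subject: test\n\nbody\n' | train spam
```

Run as the same user that trains Bayes (here, amavis), so the site-wide database stays consistent.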
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:39 PM, Bowie Bailey wrote: > On 1/15/2013 4:27 PM, Ben Johnson wrote: >> On 1/15/2013 4:05 PM, Bowie Bailey wrote: >>> On 1/15/2013 3:47 PM, Ben Johnson wrote: One final question on this subject (sorry...). Is there value in training Bayes on messages that SA classified as spam *due to other test scores*? In other words, if a message is classified as SPAM due to a block-list test, but the message is new enough for Bayes to assign a zero score, should that message be kept and fed to sa-learn so that Bayes can soak-up all the tokens from a message that is almost certainly spam (based on the other tests)? Am I making any sense? >>> It is always worthwhile to train Bayes. In an ideal world, you would >>> hand-sort and train every email that comes through your system. The >>> more mail Bayes sees the more accurate it can be. >>> >> Thanks, Bowie. Given your response, would it then be prudent to call >> "sa-learn --spam" on any message that *other tests* (non-Bayes tests) >> determine to be spam (given some score threshold)? > > That is exactly what the autolearn setting does. I let my system run > with the default autolearn settings. Some people adjust the thresholds > and some people prefer to turn off autolearn and do purely manual training. > >> The crux of my question/point is that I don't want to have to feed >> messages that Bayes "misses" but that other tests identify *correctly* >> as spam to "sa-learn --spam". > > At one point, I had a script running on my server that looked for > messages that were marked as spam with a low Bayes rating (BAYES_00 to > BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60 > to BAYES_99). I was then able to check the messages and learn them > properly. This let me learn from the edge cases that were not being > scored properly by Bayes while still making it to the correct folder due > to other rules. 
> > If you do this, you MUST check the messages yourself prior to learning > since there is no other way to know whether they should be learned as > ham or spam. > >> Is there value in implementing something like this? Or is there some >> caveat that would make doing so self-defeating? > > I find that Bayes autolearn works quite well for me, but others have had > problems with it. > Ah... I get it. Finally. :) Excellent info here; thanks again! You guys are heroes... seriously. Best regards, -Ben
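The edge-case finder Bowie describes (spam verdict with a low Bayes rule, or ham verdict with a high one) can be sketched roughly as below. Assumptions: Maildir storage, SpamAssassin's default X-Spam-Status header, and invented demo messages and paths; a real script would need to handle wrapped headers.

```shell
#!/bin/sh
# Rough sketch of the review script Bowie describes: list messages whose
# overall verdict disagrees with their Bayes rule, so they can be reviewed
# and hand-trained. ASSUMPTIONS: Maildir storage and SA's default
# X-Spam-Status header; the demo messages and paths are invented.
MAILDIR="${MAILDIR:-/tmp/demo-maildir}"

# Demo setup so the sketch is self-contained:
mkdir -p "$MAILDIR/cur"
printf 'X-Spam-Status: Yes, score=6.1 tests=BAYES_00,RCVD_IN_XBL\n\nbody\n' \
    > "$MAILDIR/cur/msg1"
printf 'X-Spam-Status: No, score=1.2 tests=BAYES_99\n\nbody\n' \
    > "$MAILDIR/cur/msg2"

# Marked spam overall, but Bayes leaned ham (BAYES_00..BAYES_40):
grep -rlE 'X-Spam-Status: Yes.*BAYES_(00|05|20|40)' "$MAILDIR/cur"
# Marked ham overall, but Bayes leaned spam (BAYES_60..BAYES_99):
grep -rlE 'X-Spam-Status: No.*BAYES_(60|80|95|99)' "$MAILDIR/cur"
# Note: real X-Spam-Status headers often wrap across lines; a production
# version should unfold headers before grepping.
```

As Bowie stresses, the matched messages must be checked by hand before being fed to sa-learn.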
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:27 PM, Ben Johnson wrote: On 1/15/2013 4:05 PM, Bowie Bailey wrote: On 1/15/2013 3:47 PM, Ben Johnson wrote: One final question on this subject (sorry...). Is there value in training Bayes on messages that SA classified as spam *due to other test scores*? In other words, if a message is classified as SPAM due to a block-list test, but the message is new enough for Bayes to assign a zero score, should that message be kept and fed to sa-learn so that Bayes can soak-up all the tokens from a message that is almost certainly spam (based on the other tests)? Am I making any sense? It is always worthwhile to train Bayes. In an ideal world, you would hand-sort and train every email that comes through your system. The more mail Bayes sees the more accurate it can be. Thanks, Bowie. Given your response, would it then be prudent to call "sa-learn --spam" on any message that *other tests* (non-Bayes tests) determine to be spam (given some score threshold)? That is exactly what the autolearn setting does. I let my system run with the default autolearn settings. Some people adjust the thresholds and some people prefer to turn off autolearn and do purely manual training. The crux of my question/point is that I don't want to have to feed messages that Bayes "misses" but that other tests identify *correctly* as spam to "sa-learn --spam". At one point, I had a script running on my server that looked for messages that were marked as spam with a low Bayes rating (BAYES_00 to BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60 to BAYES_99). I was then able to check the messages and learn them properly. This let me learn from the edge cases that were not being scored properly by Bayes while still making it to the correct folder due to other rules. If you do this, you MUST check the messages yourself prior to learning since there is no other way to know whether they should be learned as ham or spam. Is there value in implementing something like this? 
Or is there some caveat that would make doing so self-defeating? I find that Bayes autolearn works quite well for me, but others have had problems with it. -- Bowie
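For reference, the autolearn behavior Bowie mentions is configured in local.cf. A minimal sketch using SpamAssassin's stock threshold values (adjust to taste):

```
# local.cf -- Bayes autolearn sketch (values shown are SA's defaults)
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 12.0     # learn as spam at/above this score
bayes_auto_learn_threshold_nonspam 0.1   # learn as ham at/below this score
```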
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:05 PM, Bowie Bailey wrote: > On 1/15/2013 3:47 PM, Ben Johnson wrote: >> One final question on this subject (sorry...). >> >> Is there value in training Bayes on messages that SA classified as spam >> *due to other test scores*? In other words, if a message is classified >> as SPAM due to a block-list test, but the message is new enough for >> Bayes to assign a zero score, should that message be kept and fed to >> sa-learn so that Bayes can soak-up all the tokens from a message that is >> almost certainly spam (based on the other tests)? >> >> Am I making any sense? > > It is always worthwhile to train Bayes. In an ideal world, you would > hand-sort and train every email that comes through your system. The > more mail Bayes sees the more accurate it can be. > Thanks, Bowie. Given your response, would it then be prudent to call "sa-learn --spam" on any message that *other tests* (non-Bayes tests) determine to be spam (given some score threshold)? The crux of my question/point is that I don't want to have to feed messages that Bayes "misses" but that other tests identify *correctly* as spam to "sa-learn --spam". Is there value in implementing something like this? Or is there some caveat that would make doing so self-defeating? Thanks a bunch, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 3:47 PM, Ben Johnson wrote: One final question on this subject (sorry...). Is there value in training Bayes on messages that SA classified as spam *due to other test scores*? In other words, if a message is classified as SPAM due to a block-list test, but the message is new enough for Bayes to assign a zero score, should that message be kept and fed to sa-learn so that Bayes can soak-up all the tokens from a message that is almost certainly spam (based on the other tests)? Am I making any sense? It is always worthwhile to train Bayes. In an ideal world, you would hand-sort and train every email that comes through your system. The more mail Bayes sees the more accurate it can be. -- Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
One final question on this subject (sorry...). Is there value in training Bayes on messages that SA classified as spam *due to other test scores*? In other words, if a message is classified as SPAM due to a block-list test, but the message is new enough for Bayes to assign a zero score, should that message be kept and fed to sa-learn so that Bayes can soak-up all the tokens from a message that is almost certainly spam (based on the other tests)? Am I making any sense? Thanks again! -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 1:55 PM, John Hardin wrote: > On Tue, 15 Jan 2013, Ben Johnson wrote: > >> On 1/14/2013 8:16 PM, John Hardin wrote: >>> On Mon, 14 Jan 2013, Ben Johnson wrote: >>> >>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or >>> are they all performed by SA? >> >> In postfix's main.cf: >> >> smtpd_recipient_restrictions = permit_mynetworks, >> permit_sasl_authenticated, check_recipient_access >> mysql:/etc/postfix/mysql-virtual_recipient.cf, >> reject_unauth_destination, reject_rbl_client bl.spamcop.net >> >> Do you recommend something more? > > Unfortunately I have no experience administering Postfix. Perhaps one of > the other listies can help. Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! >>> http://www.greylisting.org/ >> >> Hmm, very interesting. No, I have no greylisting in place as yet, and >> no, my userbase doesn't demand immediate delivery. I will look into >> greylisting further. > > One other thing you might try is publishing an SPF record for your > domain. There is anecdotal evidence that this reduces the raw spam > volume to that domain a bit. We do publish SPF records for the domains within our control. The need to do this arose when senderbase.org, et al., began blacklisting domains without SPF records. So, we're good there. >> Given this information, it concerns me that Bayes scores hardly seem >> to budge when I feed sa-learn nearly identical messages 3+ times. >> We'll get into that below. >> If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. >>> >>> Agreed. 
>>> It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts. >>> >>> This is why you retain your Bayes training corpora: so that if Bayes >>> goes off the rails you can review your corpora for misclassifications, >>> wipe and retrain. Do you have your training corpora? Or do you discard >>> messages once you've trained them? >> >> I had the good sense to retain the corpora. > > Yay! > >>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or >>> do you review their submissions? And if the process is automated, do you >>> retain what they have provided for training so that you can go back >>> later and do a troubleshooting review? >> >> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. >> They do so unsupervised. Why this could be a problem is obvious. And no, >> I don't retain their submissions. I probably should. I wonder if I can >> make a few slight modifications to the shell script that Antispam calls, >> such that it simply sends a copy of the message to an administrator >> rather than calling sa-learn on the message. > > That would be a very good idea if the number of users doing training is > small. At the very least, the messages should be captured to a permanent > corpus mailbox. Good idea! I'll see if I can set this up. > Do your users also train ham? Are the procedures similar enough that > your users could become easily confused? They do. The procedure is implemented via Dovecot's Antispam plug-in. Basically, moving mail from Inbox to Junk trains it as spam, and moving mail from Junk to Inbox trains it as ham. I really like this setup (Antispam + calling SA through Amavis [i.e. 
not using spamd]) because the results are effective immediately, which seems to be crucial for combating this snowshoe spam (performance and scalability aside). I don't find that procedure to be confusing, but people are different, I suppose. >>> Do you have autolearn turned on? My opinion is that autolearn is only >>> appropriate for a large and very diverse userbase where a sufficiently >>> "common" corpus of ham can't be manually collected. but then, I don't >>> admin a Really Large Install, so YMMV. >> >> No, I was sure to disable autolearn after the last Bayes fiasco. :) > > OK. > >>> Do you use per-user or sitewide Bayes? If per-user, then you need to >>> make sure that you're training Bayes as the same user that the MTA is >>> running SA as. >> >> Site-wide. And I have hard-coded the username in the SA configuration to >> prevent confusion in this regard: >> >> bayes_sql_override_username amavis >> >>> What user does your MTA run SA as? What user do you train Bayes as? >> >> The MTA should pass scanning off to "amavis". I train the DB in two >> ways: via Dovecot Antispam and by calling sa-learn on my training >> mailbox. Given that I have hard-coded the username, the output of >> "sa-learn --d
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, Ben Johnson wrote: On 1/14/2013 8:16 PM, John Hardin wrote: On Mon, 14 Jan 2013, Ben Johnson wrote: Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or are they all performed by SA? In postfix's main.cf: smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, reject_unauth_destination, reject_rbl_client bl.spamcop.net Do you recommend something more? Unfortunately I have no experience administering Postfix. Perhaps one of the other listies can help. http://www.greylisting.org/ Hmm, very interesting. No, I have no greylisting in place as yet, and no, my userbase doesn't demand immediate delivery. I will look into greylisting further. One other thing you might try is publishing an SPF record for your domain. There is anecdotal evidence that this reduces the raw spam volume to that domain a bit. Given this information, it concerns me that Bayes scores hardly seem to budge when I feed sa-learn nearly identical messages 3+ times. We'll get into that below. If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. Agreed. It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts. This is why you retain your Bayes training corpora: so that if Bayes goes off the rails you can review your corpora for misclassifications, wipe and retrain. Do you have your training corpora? Or do you discard messages once you've trained them? I had the good sense to retain the corpora. Yay! _Do_ you allow your users to train Bayes? Do they do so unsupervised or do you review their submissions? 
And if the process is automated, do you retain what they have provided for training so that you can go back later and do a troubleshooting review? Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. That would be a very good idea if the number of users doing training is small. At the very least, the messages should be captured to a permanent corpus mailbox. Do your users also train ham? Are the procedures similar enough that your users could become easily confused? Do you have autolearn turned on? My opinion is that autolearn is only appropriate for a large and very diverse userbase where a sufficiently "common" corpus of ham can't be manually collected. but then, I don't admin a Really Large Install, so YMMV. No, I was sure to disable autolearn after the last Bayes fiasco. :) OK. Do you use per-user or sitewide Bayes? If per-user, then you need to make sure that you're training Bayes as the same user that the MTA is running SA as. Site-wide. And I have hard-coded the username in the SA configuration to prevent confusion in this regard: bayes_sql_override_username amavis What user does your MTA run SA as? What user do you train Bayes as? The MTA should pass scanning off to "amavis". I train the DB in two ways: via Dovecot Antispam and by calling sa-learn on my training mailbox. Given that I have hard-coded the username, the output of "sa-learn --dump magic" is the same whether I issue the command under my own account or "su" to the "amavis" user. OK, good. I have ensured that the spam token count increases when I train these messages. 
That said, I do notice that the token count does not *always* change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1 message(s) examined)". Does this mean that all tokens from these messages have already been learned, thereby making it pointless to continue feeding them to sa-learn? No, it means that Message-ID has been learned from before. I see. So, when this happens, it means that one of my users has already dragged the message from Inbox to Junk (which triggers the Antispam plug-in and feeds the message to sa-learn). Very likely. The extremely odd thing is that you say you sometimes train a message as spam, and its Bayes score goes *down*. Are you training a message and then running it through spamc to see if the score changed, or is this about _similar_ messages rather than _that_ message? When this scenario occurs, my efforts in feeding the same message to sa-learn are wasted, right? Bayes doesn't "learn more" from the message the second time, or increase its tokens' "weight", right? It would be nice if I could eliminate this duplicate effo
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 8:16 PM, John Hardin wrote: > On Mon, 14 Jan 2013, Ben Johnson wrote: > >> I understand that snowshoe spam may not hit any net tests. I guess my >> confusion is around what, exactly, classifies spam as "snowshoe". > > http://www.spamhaus.org/faq/section/Glossary > > Basically, a large number of spambots sending the message so that no one > sending IP can be easily tagged as evil. > > Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or > are they all performed by SA? In postfix's main.cf: smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, reject_unauth_destination, reject_rbl_client bl.spamcop.net Do you recommend something more? > Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject > SMTP-time DNS check in your MTA. It is well-respected and very reliable. > One thing it includes is ranges of IP addresses that should not ever be > sending email, so it may help reduce snowshoe spam. > > http://www.spamhaus.org/zen/ This article looks to be pretty thorough: http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/ I'll add Spamhaus ZEN and a few others to the list. > Another tactic that many report good results from is Greylisting. Do you > have greylisting in place? Does your userbase demand no delays in mail > delivery? In addition to blocking spam from spambots that do not retry, > it can delay mail enough for the BLs to get a chance to list new > IPs/domains, which can reduce the leakage if you happen to be at the > leading edge of a new delivery campaign. > > http://www.greylisting.org/ Hmm, very interesting. No, I have no greylisting in place as yet, and no, my userbase doesn't demand immediate delivery. I will look into greylisting further. >> Are most/all of the BL services hash-based? 
> > Generally: > > DNSBL: Blacklist of IP addresses > URIBL: Blacklist of domain and host names appearing in URIs > EMAILBL: (not widely used) Blacklist of email addresses (e.g. > phishing response addresses) > Razor, Pyzor: Blacklist of message content checksums/hashes Perfect; that answers my question. >> In other words, if a known spam message was added yesterday, will it >> be considered "snowshoe" spam if the spammer sends the same message >> today and changes only one character within the body? > > No, the diverse IP addresses are the hallmark of "snowshoe", not so much > the specific message content. If you see identical or generally-similar > (e.g.) pharma spam coming from a wide range of different IP addresses, > that's snowshoe. I see. Given this information, it concerns me that Bayes scores hardly seem to budge when I feed sa-learn nearly identical messages 3+ times. We'll get into that below. >> If so, then I guess the only remedy here is to focus on why Bayes seems >> to perform so miserably. > > Agreed. > >> It must be a configuration issue, because I've sa-learn-ed messages >> that are incredibly similar for two days now and not only do their >> Bayes scores not change significantly, but sometimes they decrease. >> And I have a hard time believing that one of my users is sa-train-ing >> these messages as ham and negating my efforts. > > This is why you retain your Bayes training corpora: so that if Bayes > goes off the rails you can review your corpora for misclassifications, > wipe and retrain. Do you have your training corpora? Or do you discard > messages once you've trained them? I had the good sense to retain the corpora. > _Do_ you allow your users to train Bayes? Do they do so unsupervised or > do you review their submissions? And if the process is automated, do you > retain what they have provided for training so that you can go back > later and do a troubleshooting review? 
Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. > Do you have autolearn turned on? My opinion is that autolearn is only > appropriate for a large and very diverse userbase where a sufficiently > "common" corpus of ham can't be manually collected. but then, I don't > admin a Really Large Install, so YMMV. No, I was sure to disable autolearn after the last Bayes fiasco. :) > Do you use per-user or sitewide Bayes? If per-user, then you need to > make sure that you're training Bayes as the same user that the MTA is > running SA as. Site-wide. And I have hard-coded the username in the SA configuration to prevent confusion in this regard: bayes_sql_override_username amavis > What user does your MTA run SA as? What user do you train Bayes as? The MTA shou
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 7:48 PM, Noel wrote: > On 1/14/2013 2:59 PM, Ben Johnson wrote: > >> I understand that snowshoe spam may not hit any net tests. I guess my >> confusion is around what, exactly, classifies spam as "snowshoe". > > Snowshoe spam - spreading a spam run across a large number of IPs so > no single IP is sending a large volume. Typically also combined > with "natural language" text, RFC compliant mail servers, verified > SPF and DKIM, business-class ISP with FCrDNS, and every other > criteria to look like a legit mail source. This type of spam is > difficult to catch. > > http://www.spamhaus.org/faq/section/Glossary#233 > and countless other links if you ask google. > >> Are most/all of the BL services hash-based? In other words, if a known >> spam message was added yesterday, will it be considered "snowshoe" spam >> if the spammer sends the same message today and changes only one >> character within the body? > > No, most all DNS blacklists are based on IP reputation. Check each > list's website for their listing policy to see how an IP gets on > their list; generally honeypot email addresses or trusted user > reports. Most lists require some number of reports before listing > an IP to prevent false positives; snowshoe spammers take advantage > of this. > >> If so, then I guess the only remedy here is to focus on why Bayes seems >> to perform so miserably. > > Sounds as if your bayes has been improperly trained in the past. > You might do better to just delete the bayes db and start over with > hand-picked spam and ham. > > > > -- Noel Jones > jdow, Noel, and John, I can't thank you enough for your very thorough responses. Your time is valuable and I sincerely appreciate your willingness to help. John, I'll respond to you separately, for the sake of keeping this organized. > Ben, do be aware that sometimes you draw the short straw and sit at the > very start of the spam distribution cycle. 
In those cases the BLs will > generally not have been alerted yet so they may not trigger. For those > situations the rules should be your friends. (I still use my treasured > set of SARE rules and personally hand crafted rules my partner and I > have created that fit OUR needs but may not be good general purpose > rules.) This makes perfect sense and underscores the importance of a finely-tuned rule-set. It's become apparent just how dynamic and capable a monster the spam industry is. No one approach will ever be a panacea, it seems. The advice from your second email is well-received, too. Especially the part about not killing anybody. ;) I do hope fighting spam becomes fun for me, because so far, it's been an uphill battle! Hehe. Noel, thanks for excellent responses to my questions. > Sounds as if your bayes has been improperly trained in the past. > You might do better to just delete the bayes db and start over with > hand-picked spam and ham. I hope not, because this is my second go-round with the Bayes DB. The first time (as Mr. Hardin may remember), auto-learning was enabled out-of-the-box and some misconfiguration or another (seemingly related to DNSWL_* rules) caused a lot of spam to be learned as ham. With John's help, I corrected the issues (I hope), which I'll detail in my reply to John. Thanks again, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Mon, 14 Jan 2013, Ben Johnson wrote: I understand that snowshoe spam may not hit any net tests. I guess my confusion is around what, exactly, classifies spam as "snowshoe". http://www.spamhaus.org/faq/section/Glossary Basically, a large number of spambots sending the message so that no one sending IP can be easily tagged as evil. Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or are they all performed by SA? Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject SMTP-time DNS check in your MTA. It is well-respected and very reliable. One thing it includes is ranges of IP addresses that should not ever be sending email, so it may help reduce snowshoe spam. http://www.spamhaus.org/zen/ Another tactic that many report good results from is Greylisting. Do you have greylisting in place? Does your userbase demand no delays in mail delivery? In addition to blocking spam from spambots that do not retry, it can delay mail enough for the BLs to get a chance to list new IPs/domains, which can reduce the leakage if you happen to be at the leading edge of a new delivery campaign. http://www.greylisting.org/ Are most/all of the BL services hash-based? Generally: DNSBL: Blacklist of IP addresses URIBL: Blacklist of domain and host names appearing in URIs EMAILBL: (not widely used) Blacklist of email addresses (e.g. phishing response addresses) Razor, Pyzor: Blacklist of message content checksums/hashes In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body? No, the diverse IP addresses are the hallmark of "snowshoe", not so much the specific message content. If you see identical or generally-similar (e.g.) pharma spam coming from a wide range of different IP addresses, that's snowshoe. If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. Agreed. 
It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts. This is why you retain your Bayes training corpora: so that if Bayes goes off the rails you can review your corpora for misclassifications, wipe and retrain. Do you have your training corpora? Or do you discard messages once you've trained them? _Do_ you allow your users to train Bayes? Do they do so unsupervised or do you review their submissions? And if the process is automated, do you retain what they have provided for training so that you can go back later and do a troubleshooting review? Do you have autolearn turned on? My opinion is that autolearn is only appropriate for a large and very diverse userbase where a sufficiently "common" corpus of ham can't be manually collected. but then, I don't admin a Really Large Install, so YMMV. Do you use per-user or sitewide Bayes? If per-user, then you need to make sure that you're training Bayes as the same user that the MTA is running SA as. What user does your MTA run SA as? What user do you train Bayes as? One possibility is that the MTA is running SA as a different user than you are training Bayes as, and you have autolearn turned on, and Bayes has been running in its own little world since day one regardless of what you think you're telling it to do. I have ensured that the spam token count increases when I train these messages. That said, I do notice that the token count does not *always* change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1 message(s) examined)". Does this mean that all tokens from these messages have already been learned, thereby making it pointless to continue feeding them to sa-learn? No, it means that Message-ID has been learned from before. 
> Finally, I added the test you supplied to my SA configuration, restarted Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

So this proves DNS lookups are indeed working for all messages.

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
One death is a tragedy; thirty is a media sensation; a million is a statistic. -- Joseph Stalin, modernized
---
3 days until Benjamin Franklin's 307th Birthday
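John's ZEN recommendation can be wired into a Postfix MTA (the one used elsewhere in this thread) with a single restriction. A minimal sketch for main.cf, assuming a stock Postfix; merge the RBL line into whatever restriction list you already have rather than replacing it:

```
# /etc/postfix/main.cf -- sketch only.
# reject_rbl_client queries zen.spamhaus.org for each connecting client
# IP and hard-rejects at SMTP time if the IP is listed, so SA never
# even sees the message.
smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    reject_unauth_destination,
    reject_rbl_client zen.spamhaus.org
```

Order matters: the permit/reject_unauth_destination lines must come first so you never RBL-reject your own users or become an open relay.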
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 2:59 PM, Ben Johnson wrote:
> I understand that snowshoe spam may not hit any net tests. I guess my confusion is around what, exactly, classifies spam as "snowshoe".

Snowshoe spam - spreading a spam run across a large number of IPs so no single IP is sending a large volume. Typically also combined with "natural language" text, RFC-compliant mail servers, verified SPF and DKIM, a business-class ISP with FCrDNS, and every other criterion to look like a legit mail source. This type of spam is difficult to catch.

http://www.spamhaus.org/faq/section/Glossary#233 and countless other links if you ask google.

> Are most/all of the BL services hash-based? In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body?

No, almost all DNS blacklists are based on IP reputation. Check each list's website for their listing policy to see how an IP gets on their list; generally honeypot email addresses or trusted user reports. Most lists require some number of reports before listing an IP to prevent false positives; snowshoe spammers take advantage of this.

> If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably.

Sounds as if your bayes has been improperly trained in the past. You might do better to just delete the bayes db and start over with hand-picked spam and ham.

--
Noel Jones
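Noel's wipe-and-retrain advice, combined with John Hardin's warning about training as the wrong user, can be sketched as a shell session. The `amavis` user and the hand-picked corpus paths are assumptions taken from this thread's Ubuntu setup; substitute your own:

```shell
# Run everything as the same user SA scans under (here: amavis),
# otherwise you may train a different per-user database than the one
# the MTA consults.

# First, see which database you are actually touching:
sudo -u amavis sa-learn --dump magic

# Wipe the (possibly mistrained) Bayes data:
sudo -u amavis sa-learn --clear

# Retrain from corpora you have reviewed yourself:
sudo -u amavis sa-learn --spam /path/to/reviewed-spam/
sudo -u amavis sa-learn --ham  /path/to/reviewed-ham/

# The nspam/nham counts should now match your corpora:
sudo -u amavis sa-learn --dump magic
```

If `--dump magic` shows different token counts when run as root versus as amavis, that is exactly the "two separate Bayes worlds" failure mode John describes.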
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/14 12:59, Ben Johnson wrote:

> On 1/14/2013 2:49 PM, RW wrote:
>> On Mon, 14 Jan 2013 13:24:55 -0500, Ben Johnson wrote:
>>> A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores.
>>>
>>> Ultimately, I need to know whether:
>>>
>>> a.) Network tests are not being run at all for these messages
>>>
>>> b.) Network tests are being run, but are failing in some way
>>>
>>> c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores
>>>
>>> I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages.
>>
>> As I said before, it's not unusual for snowshoe spam to hit no net tests at all. Also, obvious spam isn't any more likely to be in a blocklist than less obvious spam.
>>
>> However, try adding this to your SpamAssassin configuration, and restart the appropriate daemon:
>>
>> header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
>> tflags RCVD_IN_HITALL net
>> score RCVD_IN_HITALL 0.001
>>
>> It should add a dns test that is hit for all mail delivered from an IPv4 address.
>
> Thanks, RW.
>
> I understand that snowshoe spam may not hit any net tests. I guess my confusion is around what, exactly, classifies spam as "snowshoe".
>
> Are most/all of the BL services hash-based? In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body?
>
> If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts.
> I have ensured that the spam token count increases when I train these messages. That said, I do notice that the token count does not *always* change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1 message(s) examined)". Does this mean that all tokens from these messages have already been learned, thereby making it pointless to continue feeding them to sa-learn?
>
> If I receive one more uncaught message about how some mom is angering doctors by doing something crazy to her face, I'm going to hunt-down the er and rip her face OFF.
>
> Finally, I added the test you supplied to my SA configuration, restarted Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

As much as I might applaud that sentiment, I'd like to note two things. First, it might involve a whole lot of nasty paperwork and unpleasant contact with the authorities. Second, the energy wasted doing that might be better spent learning how to create rules and how to recognize the elements of a spam that are likely to be relatively unique, so you can create rules for it. After a while, creating rules to knock down such "stuff" can become fun. (Then, after a longer while, it gets "old", sigh.)

Another thing to learn in the process is that what you consider to be spam is another person's (jerk's?) ham. So crafting rules needs to be done with care if you're filtering for more than one person. Erm, of course, this is what allowing per-user rules is good for.

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/14 10:24, Ben Johnson wrote:

> On 1/11/2013 4:27 PM, Ben Johnson wrote:
>> I enabled Amavis's SA debugging mode on the server in question and was able to extract the debug output for two messages that seem like they should definitely be classified as spam.
>>
>> Message #1: http://pastebin.com/xLMikNJH
>>
>> Message #2: http://pastebin.com/Ug78tPrt
>>
>> A couple points of note and a couple of questions:
>>
>> a.) There seems to be plenty of network activity, but I don't see any "results" (for lack of a better term) for those queries. The final X-Spam-Status header that is generated looks like this:
>>
>> No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled
>>
>> Does the absence of network tests in the resultant header simply mean that none of the network tests contributed to the score? If so, why might that be? Are these messages simply "too new" to appear in any blacklists?
>>
>> b.) The scores for both messages are identical, which, I suppose, is not surprising, given that the same exact tests were performed and produced the same exact results. Is this normal?
>>
>> c.) 45 minutes after receiving Message #2 from above, I received a very similar message. The subjects varied only in dollar amount advertised, and the bodies varied only in the hyperlink URLs and the footer/signature.
>>
>> Here's the debug output: http://pastebin.com/sLMgXrf5
>>
>> The second message was scored at 14.75, which seems much better. Of course, the second score was so much higher because the network/blacklist tests contributed significantly.
>>
>> Is the conclusion to be drawn the same as in a) (these messages are "too new" to appear in blacklists)?
>>
>> One final point of concern on this item: the Bayes score for the first of the two emails was BAYES_50=0.8, and I fed the message through sa-learn as spam shortly after it arrived. Yet, the Bayes score for the second message was BAYES_40=-0.001 -- *lower* than the first. How could this be? Is there some rational explanation?
>> Thanks for all the help here, guys!
>>
>> -Ben
>
> Nobody?
>
> A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores.
>
> Ultimately, I need to know whether:
>
> a.) Network tests are not being run at all for these messages
>
> b.) Network tests are being run, but are failing in some way
>
> c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores
>
> I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages.

Ben, do be aware that sometimes you draw the short straw and sit at the very start of the spam distribution cycle. In those cases the BLs will generally not have been alerted yet, so they may not trigger. For those situations the rules should be your friends. (I still use my treasured set of SARE rules and personally hand-crafted rules my partner and I have created that fit OUR needs but may not be good general-purpose rules.)

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 2:49 PM, RW wrote:
> On Mon, 14 Jan 2013 13:24:55 -0500, Ben Johnson wrote:
>
>> A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores.
>>
>> Ultimately, I need to know whether:
>>
>> a.) Network tests are not being run at all for these messages
>>
>> b.) Network tests are being run, but are failing in some way
>>
>> c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores
>>
>> I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages.
>
> As I said before, it's not unusual for snowshoe spam to hit no net tests at all. Also obvious spam isn't any more likely to be in a blocklist than less obvious spam.
>
> However, try adding this to your SpamAssassin configuration, and restart the appropriate daemon:
>
> header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
> tflags RCVD_IN_HITALL net
> score RCVD_IN_HITALL 0.001
>
> It should add a dns test that is hit for all mail delivered from an IPv4 address.

Thanks, RW.

I understand that snowshoe spam may not hit any net tests. I guess my confusion is around what, exactly, classifies spam as "snowshoe".

Are most/all of the BL services hash-based? In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body?

If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease.
And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts. I have ensured that the spam token count increases when I train these messages. That said, I do notice that the token count does not *always* change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1 message(s) examined)". Does this mean that all tokens from these messages have already been learned, thereby making it pointless to continue feeding them to sa-learn? If I receive one more uncaught message about how some mom is angering doctors by doing something crazy to her face, I'm going to hunt-down the er and rip her face OFF. Finally, I added the test you supplied to my SA configuration, restarted Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001. Thanks for all your help, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Mon, 14 Jan 2013 13:24:55 -0500, Ben Johnson wrote:

> A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores.
>
> Ultimately, I need to know whether:
>
> a.) Network tests are not being run at all for these messages
>
> b.) Network tests are being run, but are failing in some way
>
> c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores
>
> I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages.

As I said before, it's not unusual for snowshoe spam to hit no net tests at all. Also, obvious spam isn't any more likely to be in a blocklist than less obvious spam.

However, try adding this to your SpamAssassin configuration, and restart the appropriate daemon:

header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
tflags RCVD_IN_HITALL net
score RCVD_IN_HITALL 0.001

It should add a dns test that is hit for all mail delivered from an IPv4 address.
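For anyone puzzled by what RW's canary rule actually does under the hood: a check_rbl-style test takes the last-external IP from the Received chain, reverses its octets, prepends them to the list's zone, and treats any DNS answer as a hit. A rough illustration of the query-name construction (this is a sketch, not SpamAssassin's actual code):

```python
def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name an RBL check looks up: the IPv4 octets
    reversed, prepended to the blacklist's zone."""
    octets = ip.split(".")
    assert len(octets) == 4, "IPv4 only in this sketch"
    return ".".join(reversed(octets)) + "." + zone

# For RW's always-hit zone, a client at (example address) 203.0.113.7
# would be checked by resolving:
print(dnsbl_query_name("203.0.113.7", "ipv4.fahq2.com."))
# 7.113.0.203.ipv4.fahq2.com.
```

Because the zone answers for every IPv4 address, the rule fires on every message whose DNS lookups succeed, which is exactly why it works as a "are network tests running at all?" canary.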
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/11/2013 4:27 PM, Ben Johnson wrote:
> I enabled Amavis's SA debugging mode on the server in question and was able to extract the debug output for two messages that seem like they should definitely be classified as spam.
>
> Message #1: http://pastebin.com/xLMikNJH
>
> Message #2: http://pastebin.com/Ug78tPrt
>
> A couple points of note and a couple of questions:
>
> a.) There seems to be plenty of network activity, but I don't see any "results" (for lack of a better term) for those queries. The final X-Spam-Status header that is generated looks like this:
>
> No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled
>
> Does the absence of network tests in the resultant header simply mean that none of the network tests contributed to the score? If so, why might that be? Are these messages simply "too new" to appear in any blacklists?
>
> b.) The scores for both messages are identical, which, I suppose, is not surprising, given that the same exact tests were performed and produced the same exact results. Is this normal?
>
> c.) 45 minutes after receiving Message #2 from above, I received a very similar message. The subjects varied only in dollar amount advertised, and the bodies varied only in the hyperlink URLs and the footer/signature.
>
> Here's the debug output: http://pastebin.com/sLMgXrf5
>
> The second message was scored at 14.75, which seems much better. Of course, the second score was so much higher because the network/blacklist tests contributed significantly.
>
> Is the conclusion to be drawn the same as in a) (these messages are "too new" to appear in blacklists)?
>
> One final point of concern on this item: the Bayes score for the first of the two emails was BAYES_50=0.8, and I fed the message through sa-learn as spam shortly after it arrived. Yet, the Bayes score for the second message was BAYES_40=-0.001 -- *lower* than the first. How could this be?
> Is there some rational explanation?
>
> Thanks for all the help here, guys!
>
> -Ben

Nobody?

A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores.

Ultimately, I need to know whether:

a.) Network tests are not being run at all for these messages

b.) Network tests are being run, but are failing in some way

c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores

I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages.

Thanks for any insight,

-Ben
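One way to distinguish (a), (b), and (c) directly is to run SpamAssassin by hand with DNS debugging enabled, feeding it one of the saved messages. A sketch, assuming a raw saved message and that you run it as the same user amavis scans under:

```shell
# -t: test mode (full report appended regardless of verdict)
# -D dns,async: enable debug output for just the DNS/async facilities
# Running as the amavis user keeps the Bayes/user_prefs view identical
# to what the milter path sees.
sudo -u amavis spamassassin -t -D dns,async < saved-spam.eml 2>&1 \
    | grep -E 'dns|async'
```

Lookups that are issued but never answered (case b) look very different in this output from lookups that return "not listed" (case c); no lookups at all indicates case (a).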
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 3:13 PM, Tom Hendrikx wrote: > On 10-01-13 19:55, Ben Johnson wrote: >> >> >> On 1/10/2013 1:06 PM, RW wrote: >>> On Thu, 10 Jan 2013 12:48:07 -0500 >>> Ben Johnson wrote: pon further consideration, this behavior makes perfect sense if the mailbox user has moved the message from Inbox to Junk between scans; Dovecot's Antispam filter is in use on this server. This action would cause the message tokens to be added to the Bayes database, which explains why the SA score is higher on subsequent scans, even with network tests disabled. >>> >>> Also by turning-off network tests you switch to a different score set so >>> the score for RDNS_NONE rose. >>> >> >> Ahh; I didn't realize that disabling network tests changes the score set >> entirely. Thanks for the clarification there. >> >> So, at this point, I'm struggling to understand how the following happened. >> >> Over the course of 15 minutes, I received the same exact message four >> times. Each time, the message was sent to the same recipient mailbox. >> The "From" and "Return-Path" headers changed slightly each time, but the >> message bodies appear to be identical. 
>> >> Here are the X-Spam-Status headers for each message: >> >> 1:28 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:35 PM >> >> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, >> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled >> >> 1:36 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:41 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> Questions: >> >> 1.) I have a fairly well-trained Bayes DB; why on earth does a message >> with the subject "Cash Quick? Get up to 1500 Now", and an equally >> nefarious body, trigger BAYES_00? > > This will solely depend on the contents of your bayes db. Is this shared > between users, etc etc. No good answer ready without looking at it. Yes, the Bayes DB is shared between users. But it seems that focusing on the "low-hanging fruit" (the network test issues) will be more productive in the short term. >> 2.) Why weren't network tests performed on message 2 of 4? 
>> This seems to be evidence of the fact that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem.

> The fact that not a single network test was triggered is indeed suspicious. The DNSBL tests are of course sender dependent, but if the body is the same, the URIBL stuff should fire. Maybe your DNS queries timed out because your DNS setup is borked? Maybe you should temporarily enable debug logging for dns lookups in spamassassin?

I enabled Amavis's SA debugging mode on the server in question and was able to extract the debug output for two messages that seem like they should definitely be classified as spam.

Message #1: http://pastebin.com/xLMikNJH

Message #2: http://pastebin.com/Ug78tPrt

A couple points of note and a couple of questions:

a.) There seems to be plenty of network activity, but I don't see any "results" (for lack of a better term) for those queries. The final X-Spam-Status header that is generated looks like this:

No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled

Does the absence of network tests in the resultant header simply mean that none of the network tests contributed to the score? If so, why might that be? Are these messages simply "too new" to appear in any blacklists?

b.) The scores for both messages are identical, which, I suppose, is not surprising, given that the same exact tests were performed and produced the same exact results. Is this normal?

c.) 45 minutes after receiving Message #2 from above, I received a very similar message. The subjects varied only in dollar amount advertised, and the bodies varied only in the hyperlink URLs and the footer/signature.

Here's the debug output: http://pastebin.com/sLMgXrf5

The second message was scored at 14.75, which seems much better. Of course, the second score was so much higher because the network/blacklist tests contributed significantly.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 4:12 PM, John Hardin wrote: > On Thu, 10 Jan 2013, Ben Johnson wrote: > >> So, at this point, I'm struggling to understand how the following >> happened. >> >> Over the course of 15 minutes, I received the same exact message four >> times. Each time, the message was sent to the same recipient mailbox. >> The "From" and "Return-Path" headers changed slightly each time, but the >> message bodies appear to be identical. >> >> Here are the X-Spam-Status headers for each message: >> >> 1:28 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:35 PM >> >> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, >> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled >> >> 1:36 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:41 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> Questions: >> >> 1.) I have a fairly well-trained Bayes DB; why on earth does a message >> with the subject "Cash Quick? Get up to 1500 Now", and an equally >> nefarious body, trigger BAYES_00? >> >> 2.) Why weren't network tests performed on message 2 of 4? 
>> This seems to be evidence of the fact that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem.

> How many MTAs do you have? Is it possible the low-scoring one went via a different MTA?

Just one; there should be no possibility of that.

> Have you stopped amavisd, killed all of the amavis processes, and restarted it?

I have now. And I enabled amavis's $sa_debug option, so we should see a lot more in the way of useful SA debugging information now. In fact, I was just able to capture the output that I believe we're after, and I'll paste a link in my response to RW's message (shortly forthcoming).

Thanks,

-Ben
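For reference, the switch Ben toggled lives in the amavisd configuration, which is a Perl fragment (the exact file varies by distro; /etc/amavis/conf.d/50-user is typical on Ubuntu, which this setup appears to use). A sketch:

```
# amavisd configuration (Perl syntax) -- sketch; raise while debugging,
# then revert, as both settings make logs very noisy.
$log_level = 2;   # verbosity of amavis's own logging
$sa_debug  = 1;   # pass debug mode through to the embedded SpamAssassin

1;  # amavisd config files must end by returning a true value
```

Remember to restart amavisd after changing this, as discussed above; amavis does not re-read its configuration on the fly.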
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 13:55:58 -0500, Ben Johnson wrote:

> So, at this point, I'm struggling to understand how the following happened.
>
> Over the course of 15 minutes, I received the same exact message four times. Each time, the message was sent to the same recipient mailbox. The "From" and "Return-Path" headers changed slightly each time, but the message bodies appear to be identical.
>
> 1.) I have a fairly well-trained Bayes DB; why on earth does a message with the subject "Cash Quick? Get up to 1500 Now", and an equally nefarious body, trigger BAYES_00?

From what you wrote before, the database is trained by end users, so you can't really be sure that it is well trained.

> 2.) Why weren't network tests performed on message 2 of 4? This seems to be evidence of the fact that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem.

It may be that there was some local problem, but there is a simpler explanation. Are you sure that message 2 has exactly the same IP and URI as 1, and that it hasn't been delayed with respect to 1? The rest are in RCVD_IN_CSS, which is a snowshoe spam list, so you'd expect that early spams from a given IP address won't hit any URI or IP blocklist at all.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013, Ben Johnson wrote: So, at this point, I'm struggling to understand how the following happened. Over the course of 15 minutes, I received the same exact message four times. Each time, the message was sent to the same recipient mailbox. The "From" and "Return-Path" headers changed slightly each time, but the message bodies appear to be identical. Here are the X-Spam-Status headers for each message: 1:28 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled 1:35 PM No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled 1:36 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled 1:41 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled Questions: 1.) I have a fairly well-trained Bayes DB; why on earth does a message with the subject "Cash Quick? Get up to 1500 Now", and an equally nefarious body, trigger BAYES_00? 2.) Why weren't network tests performed on message 2 of 4? This seems to be evidence of the fact that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem. 
How many MTAs do you have? Is it possible the low-scoring one went via a different MTA?

Have you stopped amavisd, killed all of the amavis processes, and restarted it?

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Maxim I: Pillage, _then_ burn.
---
7 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 10-01-13 19:55, Ben Johnson wrote: > > > On 1/10/2013 1:06 PM, RW wrote: >> On Thu, 10 Jan 2013 12:48:07 -0500 >> Ben Johnson wrote: >>> pon further consideration, this behavior makes perfect sense if the >>> mailbox user has moved the message from Inbox to Junk between scans; >>> Dovecot's Antispam filter is in use on this server. This action would >>> cause the message tokens to be added to the Bayes database, which >>> explains why the SA score is higher on subsequent scans, even with >>> network tests disabled. >> >> Also by turning-off network tests you switch to a different score set so >> the score for RDNS_NONE rose. >> > > Ahh; I didn't realize that disabling network tests changes the score set > entirely. Thanks for the clarification there. > > So, at this point, I'm struggling to understand how the following happened. > > Over the course of 15 minutes, I received the same exact message four > times. Each time, the message was sent to the same recipient mailbox. > The "From" and "Return-Path" headers changed slightly each time, but the > message bodies appear to be identical. 
> > Here are the X-Spam-Status headers for each message: > > 1:28 PM > > Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, > RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, > T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, > URIBL_WS_SURBL=1.608] autolearn=disabled > > 1:35 PM > > No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, > SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled > > 1:36 PM > > Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, > RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, > T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, > URIBL_WS_SURBL=1.608] autolearn=disabled > > 1:41 PM > > Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, > RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, > T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, > URIBL_WS_SURBL=1.608] autolearn=disabled > > Questions: > > 1.) I have a fairly well-trained Bayes DB; why on earth does a message > with the subject "Cash Quick? Get up to 1500 Now", and an equally > nefarious body, trigger BAYES_00? This will solely depend on the contents of your bayes db. Is this shared between users, etc etc. No good answer ready without looking at it. > 2.) Why weren't network tests performed on message 2 of 4? This seems to > be evidence of the fact that network tests are not being performed some > percentage of the time, which could very well be at the root of this > whole problem. The fact that not a single network test was triggered, is indeed suspicious. 
The DNSBL tests are of course sender dependent, but if the body is the same, the URIBL stuff should fire. Maybe your DNS queries timed out because your DNS setup is borked? Maybe you should temporarily enable debug logging for dns lookups in spamassassin?

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 1:06 PM, RW wrote:
> On Thu, 10 Jan 2013 12:48:07 -0500, Ben Johnson wrote:
>> Upon further consideration, this behavior makes perfect sense if the mailbox user has moved the message from Inbox to Junk between scans; Dovecot's Antispam filter is in use on this server. This action would cause the message tokens to be added to the Bayes database, which explains why the SA score is higher on subsequent scans, even with network tests disabled.
>
> Also, by turning off network tests you switch to a different score set, so the score for RDNS_NONE rose.

Ahh; I didn't realize that disabling network tests changes the score set entirely. Thanks for the clarification there.

So, at this point, I'm struggling to understand how the following happened.

Over the course of 15 minutes, I received the same exact message four times. Each time, the message was sent to the same recipient mailbox. The "From" and "Return-Path" headers changed slightly each time, but the message bodies appear to be identical.
Here are the X-Spam-Status headers for each message:

1:28 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled

1:35 PM

No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled

1:36 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled

1:41 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled

Questions:

1.) I have a fairly well-trained Bayes DB; why on earth does a message with the subject "Cash Quick? Get up to 1500 Now", and an equally nefarious body, trigger BAYES_00?

2.) Why weren't network tests performed on message 2 of 4? This seems to be evidence that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem.

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 12:48:07 -0500 Ben Johnson wrote:

> Upon further consideration, this behavior makes perfect sense if the
> mailbox user has moved the message from Inbox to Junk between scans;
> Dovecot's Antispam filter is in use on this server. This action would
> cause the message tokens to be added to the Bayes database, which
> explains why the SA score is higher on subsequent scans, even with
> network tests disabled.

Also, by turning off network tests you switch to a different score set, so the score for RDNS_NONE rose.
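RW's point about score sets comes from how SA config `score` lines work: a rule may carry four scores, and which one applies depends on whether Bayes and network tests are enabled. The rule name and numbers below are illustrative only, not SA's actual scores for any rule:

```
# score RULE_NAME <set0> <set1> <set2> <set3>
#   set0: Bayes off, network off    set1: Bayes off, network on
#   set2: Bayes on,  network off    set3: Bayes on,  network on
score EXAMPLE_RULE 1.2 0.6 1.5 0.8
```

So running `spamassassin -L` (local only) doesn't merely drop the DNSBL/URIBL points; it can also change the weight of every remaining rule, which is why RDNS_NONE scored differently in the two runs.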
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 12:18 PM, Ben Johnson wrote: > > > On 1/10/2013 11:49 AM, RW wrote: >> On Thu, 10 Jan 2013 11:43:44 -0500 >> Ben Johnson wrote: >> >> >>> This observation begs the question: why are network tests being >>> performed for some messages but not others? To my knowledge, no >>> white/gray/black listing has been done on this box. >> >> As has already been said, the score from network tests is commonly a >> lot higher on retesting because of all the reporting that happened >> in-between. >> > > RW, > > I understand that, but that doesn't explain why if I retest a given > message by calling SpamAssassin directly, and I *disable network tests*, > the score is sometimes *higher* than when the message was scanned > initially with AMaViS. > > When this message came through initially, the X-Spam-Status header was: > > No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8, > HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled > > About an hour later, I fed the same message to the spamassassin > executable, while disabling network tests: > > # spamassassin -L -t -D < /tmp/msg.txt > > Content analysis details: (5.0 points, 5.0 required) > > pts rule name description > -- > -- > 3.8 BAYES_99 BODY: Bayes spam probability is 99 to 100% > [score: 1.] > 0.0 HTML_MESSAGE BODY: HTML included in message > 1.2 RDNS_NONE Delivered to internal network by a host with > no rDNS > > To restate the question, if network tests are not outright disabled in > Amavis, why is Amavis returning lower scores than the SA binary does > when called directly with network tests disabled? Shouldn't the SA score > with network tests disabled *always* be lower than or equal to the > Amavis score with network tests enabled (provided that all else is equal)? > > Or am I way off-base here? 
> > Thanks again, > > -Ben > Upon further consideration, this behavior makes perfect sense if the mailbox user has moved the message from Inbox to Junk between scans; Dovecot's Antispam filter is in use on this server. This action would cause the message tokens to be added to the Bayes database, which explains why the SA score is higher on subsequent scans, even with network tests disabled. Sorry... I'm still trying to wrap my head around all of this. Lots of moving parts. -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 11:49 AM, RW wrote: > On Thu, 10 Jan 2013 11:43:44 -0500 > Ben Johnson wrote: > > >> This observation begs the question: why are network tests being >> performed for some messages but not others? To my knowledge, no >> white/gray/black listing has been done on this box. > > As has already been said, the score from network tests is commonly a > lot higher on retesting because of all the reporting that happened > in-between. > RW, I understand that, but that doesn't explain why if I retest a given message by calling SpamAssassin directly, and I *disable network tests*, the score is sometimes *higher* than when the message was scanned initially with AMaViS. When this message came through initially, the X-Spam-Status header was: No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled About an hour later, I fed the same message to the spamassassin executable, while disabling network tests: # spamassassin -L -t -D < /tmp/msg.txt Content analysis details: (5.0 points, 5.0 required) pts rule name description -- -- 3.8 BAYES_99 BODY: Bayes spam probability is 99 to 100% [score: 1.] 0.0 HTML_MESSAGE BODY: HTML included in message 1.2 RDNS_NONE Delivered to internal network by a host with no rDNS To restate the question, if network tests are not outright disabled in Amavis, why is Amavis returning lower scores than the SA binary does when called directly with network tests disabled? Shouldn't the SA score with network tests disabled *always* be lower than or equal to the Amavis score with network tests enabled (provided that all else is equal)? Or am I way off-base here? Thanks again, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 11:43:44 -0500 Ben Johnson wrote: > This observation begs the question: why are network tests being > performed for some messages but not others? To my knowledge, no > white/gray/black listing has been done on this box. As has already been said, the score from network tests is commonly a lot higher on retesting because of all the reporting that happened in-between.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 9:13 PM, John Hardin wrote:
> On Wed, 9 Jan 2013, Ben Johnson wrote:
>
>> On 1/9/2013 7:36 PM, wolfgang wrote:
>>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>>>
>>> I am not familiar with amavis, but I know that it calls spamassassin in
>>> a special way, depending on the amavis config. Wild guess: could it be
>>> that RBL/URIBL queries are disabled in your amavis config?
>>
>> Thanks for the reply.
>>
>> What you say about the RBL/URIBL tests makes sense.
>
> Check your amavis configuration to see whether you have network tests
> disabled. That's the simplest explanation.

Thanks, John. On the surface, network tests appear to be enabled:

# grep -ir sa_local_tests_only /etc/amavis
/etc/amavis/conf.d/20-debian_defaults:$sa_local_tests_only = 0; # only tests which do not require internet access?

Also, some of the incoming messages do contain network test scoring data in the X-Spam-Status header; here are two examples:

Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7] autolearn=disabled

Yes, score=12.266 tagged_above=-999 required=2 tests=[BAYES_50=0.8, DATE_IN_FUTURE_12_24=3.199, DIET_1=0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.7, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25] autolearn=disabled

(Several of those are network tests, right?)
What's strange is that another message was delivered at nearly the same time as the above two, yet it shows no evidence of network tests being performed (right?): No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled It seems as though the SPAM that slips through never shows evidence of network tests, whereas the SPAM that is caught (and usually has a high score -- 10 or higher) always seems to show evidence of network tests. This observation begs the question: why are network tests being performed for some messages but not others? To my knowledge, no white/gray/black listing has been done on this box. Thanks again, -Ben
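One way to sanity-check a suspected missing DNSBL hit is to query the blocklist by hand: reverse the connecting IP's octets and append the list's zone. The IP below is taken from headers elsewhere in this thread purely as an illustration; the name construction is plain shell, and the actual lookup (commented out) requires `dig` and network access.

```shell
# Build the DNSBL query name for an IP (octets reversed, zone appended)
ip="188.165.126.107"
rev=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1}')
echo "${rev}.zen.spamhaus.org"
# Manual lookup (needs network access):
#   dig +short "${rev}.zen.spamhaus.org"
# An answer in 127.0.0.x means the IP is listed; NXDOMAIN means it is not.
```

If a hand query returns a listing but the amavis scan of the same message shows no RCVD_IN_* hit, the problem is on the resolver/timeout side rather than the blocklist side.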
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 10/01/13 00:03, Ben Johnson wrote: On 1/9/2013 5:36 PM, RW wrote: This is not better, it indicates that SA didn't recognise it as an email, not that it recognised it as a spam. Whatever /tmp/msg.txt was it wasn't a properly formatted email. Thanks for the quick replies, Marius and RW. I see; I saved the email message out of Thunderbird (with View -> Headers -> All), as a plain text file. Apparently, that process butchers the original message. Ben, In thunderbird, select the message and then press Ctrl-U (or from the menus: View > Message Source) and select File > Save to save the email including all headers in plain text format. You can then feed it to spamassassin as above. Hope that helps.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 9 Jan 2013, Ben Johnson wrote:

> On 1/9/2013 7:36 PM, wolfgang wrote:
>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>>
>> I am not familiar with amavis, but I know that it calls spamassassin in
>> a special way, depending on the amavis config. Wild guess: could it be
>> that RBL/URIBL queries are disabled in your amavis config?
>
> Thanks for the reply.
>
> What you say about the RBL/URIBL tests makes sense.

Check your amavis configuration to see whether you have network tests disabled. That's the simplest explanation.

--
John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
"They will be slaughtered as result of England's anti-gun laws that
concentrates power to the Government."
-- Shifty Powers (101 abn) observing British subjects training to
repel a German invasion using rakes, hoes and pitchforks
---
8 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 7:36 PM, wolfgang wrote:
> On 2013-01-10 01:03, Ben Johnson wrote:
>
>> I see; I saved the email message out of Thunderbird (with View ->
>> Headers -> All), as a plain text file. Apparently, that process
>> butchers the original message.
>
> In Thunderbird, rather use File > Save as to save the entire message.
>
>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>
> Rules based on RBL/URIBL checks depend on DNS-based blacklist queries.
> And between the time you first receive an email and the time you
> re-scan it, the originating client IP and/or URIs from the mail body
> may have been added to the blacklists after you first received the
> mail. Did you re-scan the mail with amavis, too, or did you post the
> X-Spam header lines from the original amavis scan and re-scan the mail
> with spamassassin significantly later?
>
> I am not familiar with amavis, but I know that it calls spamassassin in
> a special way, depending on the amavis config. Wild guess: could it be
> that RBL/URIBL queries are disabled in your amavis config?
>
> Hope this helps.
>
> Cheers,
>
> wolfgang

Hi, Wolfgang,

Thanks for the reply. What you say about the RBL/URIBL tests makes sense.

I did not rescan the message with amavis; I posted the X-Spam-Status header contents from the original scan. The only reason I did not rescan the message with Amavis is that I don't know how to perform a SpamAssassin scan through Amavis manually, and I can't find instructions for the process.

All of that said, less than eight hours elapsed between the original scan with Amavis and the manual scan with "spamassassin". But that's probably long enough for the IP addresses to be blacklisted.

If nobody knows how to scan messages through Amavis, maybe I need to take this question over to the Amavis list for the time being.

Thanks again,

-Ben
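On the open question of rescanning through amavis: amavisd-new normally accepts mail over SMTP/LMTP on localhost (port 10024 in the stock Debian/Ubuntu content-filter setup), so a saved message can be re-injected the same way Postfix hands it off. A hedged sketch, assuming `swaks` is installed, amavis is listening on 10024, and placeholder addresses; check $inet_socket_port in your amavis config before relying on this:

```
# Re-inject a saved message into amavis as the MTA would (sketch, not verified
# against this system; port and addresses are assumptions):
swaks --server 127.0.0.1 --port 10024 \
      --from sender@example.com --to recipient@example.com \
      --data @/tmp/msg.txt
# Then inspect the amavis log (e.g. /var/log/mail.log) for the resulting
# X-Spam-Status line.
```

This exercises the exact amavis-to-SA code path, so any score difference against a direct `spamassassin` run reflects amavis's configuration rather than a corrupted saved message.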
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013-01-10 01:03, Ben Johnson wrote:

> I see; I saved the email message out of Thunderbird (with View ->
> Headers -> All), as a plain text file. Apparently, that process
> butchers the original message.

In Thunderbird, rather use File > Save as to save the entire message.

> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
> URIBL_JP_SURBL autolearn=disabled version=3.3.1

Rules based on RBL/URIBL checks depend on DNS-based blacklist queries. And between the time you first receive an email and the time you re-scan it, the originating client IP and/or URIs from the mail body may have been added to the blacklists after you first received the mail. Did you re-scan the mail with amavis, too, or did you post the X-Spam header lines from the original amavis scan and re-scan the mail with spamassassin significantly later?

I am not familiar with amavis, but I know that it calls spamassassin in a special way, depending on the amavis config. Wild guess: could it be that RBL/URIBL queries are disabled in your amavis config?

Hope this helps.

Cheers,

wolfgang
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 5:36 PM, RW wrote: > On Wed, 09 Jan 2013 17:14:05 -0500 > Ben Johnson wrote: > >> About five months ago, I experienced a problem that I *thought* I had >> resolved, but I am observing similar behavior after retraining the >> Bayes database. While the symptoms are similar, the root cause seems >> to be different (thankfully). The original problem is documented at >> http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html >> .. >> >> In any case, I am again seeing SA scores that seem way too low for the >> message content in question. My "glue", as it were, is Amavis-New. >> >> In particular, certain messages that are clearly SPAM are scored >> between 0 and 3 when processed via Amavis. However, if I process the >> same messages with the "spamassassin" binary, directly, the scores >> are much higher and much more in-line with what one would expect. >> ... >> When I process the same message with spamassassin, directly >> (spamassassin -t -D < /tmp/msg.txt), the header looks like this: >> >> -- >> X-Spam-Status: Yes, score=7.5 required=5.0 >> tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS >> autolearn=disabled version=3.3.1 > > > This is not better, it indicates that SA didn't recognise it as an > email, not that it recognised it as a spam. Whatever /tmp/msg.txt was > it wasn't a properly formatted email. > Thanks for the quick replies, Marius and RW. I see; I saved the email message out of Thunderbird (with View -> Headers -> All), as a plain text file. Apparently, that process butchers the original message. I'm reviewing SA's behavior using an email client to view the messages, but I also have access to the mailbox on the server. I realize that this question may seem amateurish, but how does one discern the "message ID" from the email client and locate the corresponding file in the user's "Maildir"? I'm using Dovecot 1.x. 
The file names in the user's Maildir look like this:

1357762471.M952293P32429.example.com,S=4300,W=4381:2,

I assume that the first bit is a UNIX timestamp. Is there any means by which to correlate the second bit (M952293P32429) to the message as I see it in my email client (Thunderbird)? I don't see that string anywhere in the headers (maybe that's by design). In other words, when I spot a message that SA seems to be scoring incorrectly in my Inbox, how do I track-down the actual file on the server that should be fed into "spamassassin"? Is there some better method than doing something like

# grep -ir 20B2834E4242 /var/vmail/example.com/user/Maildir

where 20B2834E4242 is the ID in the "Received" header?

In any case, I tracked-down the original message on the server and repeated the process (spamassassin -t < /tmp/msg.txt):

--
X-Spam-Status: Yes, score=9.3 required=5.0 tests=BAYES_50,HTML_MESSAGE,
RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
URIBL_JP_SURBL autolearn=disabled version=3.3.1

[...]

Content analysis details: (9.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.4 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [188.165.126.107 listed in zen.spamhaus.org]
 1.0 RCVD_IN_CSS            RBL: Received via a relay in Spamhaus CSS
 2.7 RCVD_IN_PSBL           RBL: Received via a relay in PSBL
                            [188.165.126.107 listed in psbl.surriel.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: ehylle.info]
 1.4 RCVD_IN_BRBL_LASTEXT   RBL: RCVD_IN_BRBL_LASTEXT
                            [188.165.126.107 listed in bb.barracudacentral.org]
 1.7 URIBL_DBL_SPAM         Contains an URL listed in the DBL blocklist
                            [URIs: ehylle.info]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5428]
--

So, if I've done this correctly, the score discrepancy is even larger.

Thanks, guys!

-Ben
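On locating the file: the Maildir filename components (timestamp, then typically `M` microseconds and `P` delivery PID) never appear in the message headers, so grepping for a header value is the usual approach. Grepping for the Message-ID header is often more reliable than the queue ID, since every mail client exposes it. A self-contained demonstration, using a fabricated message and the queue ID quoted above:

```shell
# Set up a sample Maildir entry (fabricated content, for demonstration only)
mkdir -p /tmp/Maildir/cur
printf 'Received: by mail.example.com (Postfix) id 20B2834E4242\nMessage-ID: <abc123@example.com>\nSubject: test\n\nbody\n' \
  > '/tmp/Maildir/cur/1357762471.M952293P32429.example.com,S=4300,W=4381:2,'
# -l prints matching file names, -r recurses; works for queue ID or Message-ID
grep -lr '20B2834E4242' /tmp/Maildir/cur
grep -lr 'Message-ID: <abc123@example.com>' /tmp/Maildir/cur
```

The matching path can then be fed straight to `spamassassin -t < "$file"` without the lossy round-trip through Thunderbird.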
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 09 Jan 2013 17:14:05 -0500 Ben Johnson wrote: > About five months ago, I experienced a problem that I *thought* I had > resolved, but I am observing similar behavior after retraining the > Bayes database. While the symptoms are similar, the root cause seems > to be different (thankfully). The original problem is documented at > http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html > .. > > In any case, I am again seeing SA scores that seem way too low for the > message content in question. My "glue", as it were, is Amavis-New. > > In particular, certain messages that are clearly SPAM are scored > between 0 and 3 when processed via Amavis. However, if I process the > same messages with the "spamassassin" binary, directly, the scores > are much higher and much more in-line with what one would expect. >... > When I process the same message with spamassassin, directly > (spamassassin -t -D < /tmp/msg.txt), the header looks like this: > > -- > X-Spam-Status: Yes, score=7.5 required=5.0 > tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS > autolearn=disabled version=3.3.1 This is not better, it indicates that SA didn't recognise it as an email, not that it recognised it as a spam. Whatever /tmp/msg.txt was it wasn't a properly formatted email.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, Jan 09, 2013 at 05:14:05PM -0500, Ben Johnson wrote:

> Content analysis details: (7.5 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- ------------------------------------------------
> -0.0 NO_RELAYS              Informational: message was not relayed via SMTP
>  1.2 MISSING_HEADERS        Missing To: header
>  2.0 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>                             [score: 0.5000]
>  1.2 MISSING_MID            Missing Message-Id: header
>  1.3 MISSING_SUBJECT        Missing Subject: header
> -0.0 NO_RECEIVED            Informational: message has no Received headers
>  1.8 MISSING_DATE           Missing Date: header
>  0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
>                             headers

These hits indicate that the mail you're testing (/tmp/msg.txt) is corrupted, as it is missing most email headers.

> In short, my question is, how is the message scoring 0.8 in one
> case and 7.5 in another? That is a massive discrepancy.

In the second case, the mail you are testing is corrupted. Open /tmp/msg.txt in a text editor and check if it looks sane.

--
Marius Gavrilescu

(warnings) Do not dangle the mouse by its cable or throw the mouse at co-workers.
--From a manual for an SGI computer.
Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
About five months ago, I experienced a problem that I *thought* I had resolved, but I am observing similar behavior after retraining the Bayes database. While the symptoms are similar, the root cause seems to be different (thankfully). The original problem is documented at http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html .

In any case, I am again seeing SA scores that seem way too low for the message content in question. My "glue", as it were, is Amavis-New.

In particular, certain messages that are clearly SPAM are scored between 0 and 3 when processed via Amavis. However, if I process the same messages with the "spamassassin" binary directly, the scores are much higher and much more in line with what one would expect.

The X-Spam-Status header, when processed via Amavis, looks like this:

X-Spam-Status: No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled

When I process the same message with spamassassin directly (spamassassin -t -D < /tmp/msg.txt), the header looks like this:

--
X-Spam-Status: Yes, score=7.5 required=5.0
tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
autolearn=disabled version=3.3.1

[...]

Content analysis details: (7.5 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.2 MISSING_HEADERS        Missing To: header
 2.0 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]
 1.2 MISSING_MID            Missing Message-Id: header
 1.3 MISSING_SUBJECT        Missing Subject: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 1.8 MISSING_DATE           Missing Date: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                            headers
--

In short, my question is, how is the message scoring 0.8 in one case and 7.5 in another? That is a massive discrepancy.
From what I can tell, the same tests aren't even being performed in each case. I have to assume that the options passed to SA are wildly different in each case.

It bears mention that the server in question uses ISPConfig 3. ISPConfig allows SA policies to be configured per-domain and per-user, and Amavis leverages MySQL to make that happen. If relevant, I can provide more information about this aspect of my setup.

These are the only directives that I've added to /etc/spamassassin/local.cf:

--
bayes_path /var/lib/amavis/.spamassassin/bayes
use_bayes 1
bayes_auto_expire 0
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn DBI:mysql:sa_bayes:localhost
bayes_sql_username sa_user
bayes_sql_password [scrubbed]
bayes_sql_override_username amavis
--

Given the first directive, SA should always use the same Bayes database (the one I've configured in MySQL), regardless of how SA is called, right?

For those curious about the state of the Bayes database, here's the output from "sa-learn --dump magic" (sorry for the wrapping):

0.000          0          3          0  non-token data: bayes db version
0.000          0       2007          0  non-token data: nspam
0.000          0       6554          0  non-token data: nham
0.000          0     188379          0  non-token data: ntokens
0.000          0 1356345829          0  non-token data: oldest atime
0.000          0 1357769317          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1357727978          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0       3191          0  non-token data: last expire reduction count

Ultimately, it seems that I should be trying to figure out how, exactly, Amavis is calling SpamAssassin in the course of normal operation.

Thanks for any help here, folks!

-Ben
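One quick way to test the "same Bayes database" assumption is to dump the Bayes statistics as each calling user and compare: with bayes_sql_override_username in effect, every caller should report identical counts. The `su` invocation and file paths below are assumptions for illustration, and the comparison step is simulated with captured output so the snippet runs standalone:

```shell
# Dump Bayes magic as root and as the amavis user (real commands, commented):
#   sa-learn --dump magic                             > /tmp/root.txt
#   su -s /bin/sh amavis -c 'sa-learn --dump magic'   > /tmp/amavis.txt
# If both hit the same SQL DB, nspam/nham/ntokens counts match exactly.
# Simulated captured output for the comparison:
printf 'nspam 2007\nnham 6554\nntokens 188379\n' > /tmp/root.txt
cp /tmp/root.txt /tmp/amavis.txt
diff /tmp/root.txt /tmp/amavis.txt && echo "same Bayes DB"
```

If the real counts diverge, one of the callers is silently falling back to a per-user DBM database instead of the configured MySQL store, which would explain scan-to-scan BAYES_* discrepancies.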