Re: Bayes scoring priority
On 6/24/2013 1:29 PM, Amir 'CG' Caspi wrote:
> Has anyone modified their Bayes scoring priority, and if so, what were
> your experiences? What scores did you assign?

This has been discussed at length; perhaps start with this archived topic:

http://spamassassin.1065346.n5.nabble.com/BAYES-99-and-ham-td38832.html

The short answer is that you can, and probably should, increase the BAYES_99 score to 4 or 4.5. Setting it to 5 puts you at risk (albeit very slight) of false positives.

-Ben
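To make the reasoning behind "4 or 4.5, but not 5" concrete, here is a small arithmetic sketch. It assumes a required score of 5.0 and that a message is tagged once its total reaches that value; both are assumptions about your setup, and the function names are purely illustrative.

```python
# Assumption: required_score is 5.0 and a message is tagged when its total
# score reaches that value. Adjust REQUIRED_SCORE to match your own config.
REQUIRED_SCORE = 5.0


def flagged_by_bayes_alone(bayes_99_score, required=REQUIRED_SCORE):
    """True if a BAYES_99 hit by itself reaches the spam threshold."""
    return bayes_99_score >= required


def headroom(bayes_99_score, required=REQUIRED_SCORE):
    """Points that other rules must still contribute before tagging."""
    return max(0.0, required - bayes_99_score)


for candidate in (4.0, 4.5, 5.0):
    print(candidate, flagged_by_bayes_alone(candidate), headroom(candidate))
```

At 4 or 4.5, at least one other rule still has to agree with Bayes before a message is tagged; at 5, a single Bayes misjudgment is enough on its own.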
Re: New rule for HTML spam, using comments?
On 6/18/2013 1:18 PM, Amir 'CG' Caspi wrote:
> At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
>> a.) You are copying/pasting the body of the email, but not the headers.
>
> No, I am copying the headers... however, I am using Eudora (ancient, I
> know) as a mail client, and it's possible the headers are not properly
> formatted. For example, for SpamCop I have to use their "workaround"
> script. I don't know what exactly is mal-formed, though.

For the sake of troubleshooting, can you try accessing the mail by some other means, e.g., opening the file directly from the filesystem? Doesn't mbox store email messages as plaintext files? (Kris already beat me to it regarding this suggestion.)

> I should admit at this point that much of my sa-learn has been on
> Eudora's mboxes, by the way. That is, I would take the Eudora mbox and
> sa-learn on that. Eudora is supposed to use standard mbox format, but
> I'm wondering if maybe it's not so standard after all...

How would anything ever be flagged with a score higher than BAYES_00 if this were to be the problem? Didn't you report a score of BAYES_99 in one of your tests?

> Either way, I am _trying_ to copy the entire message. Not sure what is
> misformatted there. If you take a look at my two pasted examples (links
> below for convenience), those are direct copy/paste from Eudora's "raw
> source" view. Any idea what is malformed? Do I need an extra newline
> between the header and body, or something more complicated?
>
> http://pastebin.com/HD0rNdxU
> http://pastebin.com/Zswg77Ds

How are you feeding the messages to sa-learn? Are you not just passing the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora and paste into a temporary file when you can just point sa-learn straight to the message on disk?

>> b.) You are running Bayes as two different users when you perform your
>
> No, I have been careful for that. You saw that I pasted the maillog
> entries... notice that spamd runs as setuid. I made sure the same
> userid was in the logs, and in my command.

I had missed that detail; looks okay.

>> Have a look at the thread I cited and see if anything jumps-out at you.
>
> Will do, but unfortunately, I don't think the problem is as clear cut as
> (b) ... maybe it's (a) though, in which case I wonder if I have to
> modify my Eudora mboxes before learning on them.

Do you retain your training corpus? This may be one of those instances in which the best way to debug the problem is to wipe and retrain Bayes. Of course, that can be a nightmare if you don't retain the messages that you've trained as ham and spam.

> Thanks.
>
> -- Amir
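One way to test the "maybe the mbox isn't so standard" theory without involving SpamAssassin at all is to see whether an independent parser finds sensible messages in the file. The following is only a rough sketch using Python's standard `mailbox` module; `summarize_mbox`, the sample data and the choice of From/Subject as "basic headers" are my own illustrative assumptions, and a clean result here does not prove that sa-learn sees the file the same way.

```python
import mailbox
import os
import tempfile


def summarize_mbox(path):
    """Return (message_count, messages_missing_basic_headers) for an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        total = 0
        missing = 0
        for msg in box:
            total += 1
            # A message whose header block was lost or mangled will usually
            # have no From: or Subject: header after parsing.
            if msg["From"] is None or msg["Subject"] is None:
                missing += 1
        return total, missing
    finally:
        box.close()


# Hypothetical sample: one well-formed message and one that is only body
# text, roughly what a copy/paste without headers would produce.
SAMPLE = (
    "From alice@example.com Tue Jun 18 05:00:32 2013\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "body one\n"
    "\n"
    "From bob@example.com Tue Jun 18 05:05:45 2013\n"
    "\n"
    "pasted text with no header block\n"
)


def demo():
    """Write SAMPLE to a temporary mbox file and summarize it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "w", newline="\n") as fh:
            fh.write(SAMPLE)
        return summarize_mbox(path)


print(demo())
```

If a real Eudora mbox reports many messages with missing headers, that would point toward problem (a) rather than (b).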
Re: New rule for HTML spam, using comments?
On 6/18/2013 5:31 AM, Amir 'CG' Caspi wrote: > At 4:37 PM -0400 06/14/2013, Alex wrote: >> On Fri, Jun 14, 2013 at 4:18 PM, Amir 'CG' Caspi >> wrote: >> > I wonder if there's some >> > difference between running spamassassin manually on the message versus >> > running spamd. >> >> I think the only difference would be if spamd somehow didn't recognize >> all the locations for your rules. > > OK, I've got some more weirdness here. I just received two FN spams... > one had bayes00, another bayes50. To test what the heck might be going > on, I ran both of the emails through spamc manually... this SHOULD > recreate the same thing that occurs when sendmail delivers the email and > spamc gets run automatically. > > The first email, which was bayes00 originally, hit with bayes99 when I > ran it manually through spamc. There were only a few minutes between > the first and second run (see timestamps below)... nothing very > important happened to the Bayes DB between those two runs. The second > email, bayes50, stayed exactly the same (also bayes50). I looked > through the /var/log/maillog to see if I could figure out some > difference between the two runs, but they look basically identical. > > The only difference I can figure is that the second (manual) run happens > on mail source that I copy/paste from my email program... that is, it's > pure text, copied and pasted. The first (automatic) run is on the mail > as it enters the system, which might be somehow formatted differently. > All of my sa-learn training is done directly on my mbox files, which > perhaps is more akin to copy/paste than anything else... > > Anyone know what the hell is going on here? 
For reference, here is the > maillog entry for the bayes00 message when it went through automatically: > > Jun 18 05:00:32 kismet sendmail[27721]: r5I90WGI027721: > from=, size=16502, class=0, nrcpts=1, > msgid=, > proto=ESMTP, relay=root@localhost > Jun 18 05:00:32 kismet sendmail[27707]: r5I90U4N027657: > to=, delay=00:00:01, xdelay=00:00:00, > mailer=virthostmail, pri=136089, relay=domain.com, dsn=2.0.0, stat=Sent > (r5I90WGI027721 Message accepted for delivery) > Jun 18 05:00:32 kismet spamd[27586]: spamd: connection from > localhost.localdomain [127.0.0.1] at port 53424 > Jun 18 05:00:32 kismet spamd[27586]: spamd: setuid to u...@domain.com > succeeded > Jun 18 05:00:32 kismet spamd[27586]: spamd: processing message > for > u...@domain.com:22001 > Jun 18 05:00:33 kismet spamd[27586]: spf: lookup failed: Can't locate > object method "new_from_string" via package "Mail::SPF::v1::Record" at > /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524. > Jun 18 05:00:37 kismet spamd[27586]: pyzor: [27730] error: TERMINATED, > signal 15 (000f) > Jun 18 05:00:37 kismet spamd[27586]: spamd: clean message (-1.1/5.0) for > u...@domain.com:22001 in 5.0 seconds, 16781 bytes. > Jun 18 05:00:37 kismet spamd[27586]: spamd: result: . 
-1 - > BAYES_00,HTML_EXTRA_CLOSE,HTML_IMAGE_RATIO_08,HTML_MESSAGE,RDNS_NONE > scantime=5.0,size=16781,user=u...@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53424,mid=, > bayes=0.00,autolearn=no > > > And here is when it went through manually: > > Jun 18 05:05:45 kismet spamd[27984]: spamd: connection from > localhost.localdomain [127.0.0.1] at port 53447 > Jun 18 05:05:45 kismet spamd[27984]: spamd: setuid to u...@domain.com > succeeded > Jun 18 05:05:45 kismet spamd[27984]: spamd: processing message > for > u...@domain.com:22001 > Jun 18 05:05:45 kismet spamd[27984]: spf: lookup failed: Can't locate > object method "new_from_string" via package "Mail::SPF::v1::Record" at > /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524. > Jun 18 05:05:47 kismet spamd[27984]: spamd: identified spam (6.0/5.0) > for u...@domain.com:22001 in 2.2 seconds, 16223 bytes. > Jun 18 05:05:47 kismet spamd[27984]: spamd: result: Y 6 - > BAYES_99,MISSING_MIME_HB_SEP,RDNS_NONE,T_MIME_NO_TEXT,URIBL_BLACK > scantime=2.2,size=16223,user=u...@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53447,mid=,bayes=1.00,autolearn=no > > > > So... what the heck is going on? I see basically no difference between > the two maillog entries. The only difference between the two runs, as > far as I can tell, is that pyzor died on the first one (and I don't know > why, but that shouldn't have ANY effect on the Bayes score), and the > manual run was using the copy/paste from my mail program. > > But, as mentioned, the bayes50 spam looked identical for both the > automatic and manual runs. > > Anyone have any idea what the heck is going on, and how I can fix it? > > Is my Bayes DB worthless because I've been training it on MBOX format > (i.e. ASCII), but when it runs the first time around, it's running on > binary (MIME) instead? 
If so, how can I fix this -- do I need to store > my mail in some different format instead of MBOX? (Except that sendmail > de
Re: Massive spamruns
On 6/12/2013 12:22 PM, Alex wrote:
> Hi,
>
>>> # 2013 cars local dealership
>>> http://pastebin.com/3bEMiV3B
>>
>> URI in that sample
>>
>> pohformed.com listed on black.uribl.com
>> pohformed.com listed on jp.surbl.org
>> pohformed.com listed on sc.surbl.org
>> pohformed.com listed on dbl.spamhaus.org
>
> I know I should have mentioned that. Yes, I'm using the above RBLs,
> and they're all correctly tagged here now.
>
> I was hoping for something more preemptive to trigger on these more
> generally because the IPs are only used for a short while, but long
> enough to get 25 spams in from the address. I was hoping to find
> commonalities between the messages that could be used to generate some
> other rules.
>
> Thanks,
> Alex
>

Isn't this the function that Bayes is intended to serve, rather precisely?

-Ben
Re: Large # of Spam getting through all of a sudden.
On 6/10/2013 4:46 PM, David F. Skoll wrote:
> [Lost track of who wrote this]
>
>> 66.96.253.241
>> 64.120.241.228
>> 66.197.142.29
>> 66.197.142.23
>> 66.197.207.152
>> 66.197.177.174
>> 64.191.61.25
>
> Every single one of those IPs is on our "GreylistStumbler" list, meaning
> they've been greylisted, but have not been seen to pass greylisting.
>
> Implementing greylisting might stop most of the problem messages.
>
> Regards,
>
> David.
>

(Brian is the one who wrote it :))

That's an interesting observation, David. As someone who recently implemented greylisting, I can say that its efficacy in this particular type of situation cannot be overstated. Our spam volume dropped from about 35% to less than 2% overnight, thanks to greylisting at the MTA level.

While somewhat of a generic prescription, it's well-prescribed for a reason: sort out your Bayes situation (will probably require wiping and starting over with a hand-sorted corpus that is *retained*) and implement greylisting (provided you can live with its caveats). The DNSBLs can be used to supplement the above.

Good luck, Brian!

--Ben
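For anyone unfamiliar with what "greylisted, but not seen to pass greylisting" means in practice, here is a toy sketch of the usual idea: the first delivery attempt from an unknown (client IP, sender, recipient) triplet is deferred, and only a retry after some minimum delay is accepted. This is a simplified illustration, not how any particular MTA or policy daemon implements it; the class name, the 300-second delay and the in-memory dictionary are all assumptions.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, ip, sender, recipient):
        """Return 'defer' (temporary failure) or 'accept' for this attempt."""
        key = (ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "defer"

    def stumblers(self):
        """Triplets that have been deferred at least once so far."""
        return sorted(self.first_seen)
```

Most spam cannons never come back for the retry, which is why they end up on a list like David's; legitimate MTAs queue the message and try again later, which is also the main caveat (delayed first delivery).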
Re: Large # of Spam getting through all of a sudden.
On 6/10/2013 2:45 PM, Duncan, Brian M. wrote:
> I rarely have seen any SpamAssassin hits on the bodies of these messages.
>
> (cached, score=-0.125,required 6.5, autolearn=not spam,
> RP_MATCHES_RCVD -0.12)

Do you train the Bayes database manually? Or via autolearn only?

I use SA via AMaViS, and the header changes look slightly different from yours, but I see no evidence that Bayes scoring is being used in the above header (if, in fact, that is a sample header with all SA markup appended).

--Ben
Re: .pw / Palau URL domains in spam
On 5/7/2013 11:02 PM, Steve Prior wrote:
> On 5/7/2013 1:44 AM, Benny Pedersen wrote:
>> Chris Santerre skrev den 2013-05-06 17:27:
>>> 10 days and still being abused badly. Recommending for everyone to
>>> just refuse any .pw
>>
>> time for spamhaus ? :=)
>>
>>> for those wanting an SA rule, here:
>>>
>>> header PW_IS_BAD_TLD From =~ /\.pw\b/
>>> describe PW_IS_BAD_TLD PW TLD ABUSE
>>> score PW_IS_BAD_TLD 3
>>
>> here i would like to use -3
>>
>>> Change score to whatever you want. Enjoy.
>>
>> thats the point of opensource imho :)
>>
>> hopefully the good pw domains start using opendkim, and then let the
>> world
>> repute it from there
>>
>
> I blocked everything from TLD pw at the Postfix level so the email gets
> rejected without ever hitting spamassassin.
>
> I created /etc/postfix/sender_access with the contents:
> pw REJECT
>
> ran postmap sender_access
>
> and then added
> check_sender_access hash:/etc/postfix/sender_access
> to smtpd_recipient_restrictions
>
> Problem went away completely, sorry Palau.
>
> Steve
>

Steve, just wanted to thank you for providing an elegant solution to this problem. It seems far more preferable to block this nonsense right at the MTA level (for now).

Your instructions worked for me and I now see the following in my mail log for any .pw sender:

postfix/smtpd[10660]: NOQUEUE: reject: RCPT from unknown[173.213.124.203]: 554 5.7.1 : Sender address rejected: Access denied

Much appreciated!

-Ben
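For anyone who prefers the SA rule over the Postfix block, it's worth checking what the pattern actually matches. The sketch below assumes the rule's regex was meant to be `\.pw\b` (the backslashes look to have been lost somewhere in transit) and uses Python's `re` module as a stand-in for SA's Perl regexes, which behave the same for a pattern this simple. The stricter `PW_STRICT` variant is my own hypothetical suggestion, not something from Chris's post.

```python
import re

# Assumed intent of the rule quoted above: a literal ".pw" followed by a
# word boundary.
PW_RULE = re.compile(r"\.pw\b")

# Hypothetical stricter variant: ".pw" must end the address, optionally
# followed by the closing ">" of a display-name form.
PW_STRICT = re.compile(r"\.pw>?\s*$")

SAMPLES = [
    "Someone <user@spammy.pw>",   # a genuine .pw sender
    "user@example.com",           # unrelated domain
    "user@mail.pwned.example",    # ".pw" followed by more letters
    "user@mail.pw.example.com",   # "pw" is a subdomain label, not the TLD
]

for value in SAMPLES:
    print(value, bool(PW_RULE.search(value)), bool(PW_STRICT.search(value)))
```

The last sample is the interesting one: the looser pattern also fires on a hostname that merely contains a `pw` label, which may or may not matter at a score of 3.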
Re: dns*.registrar-servers.com as a rogue registrar?
I'll top-post, too, just for the sake of consistency. :) I've had pretty good experiences with Namecheap, actually. I'm in no way affiliated; I've just used them for cheap domain registrations (apparently, I'm not the only one) and for cheap SSL certificates in bulk. But, that's neither here nor there. As the company relates to this conversation, I reported a domain that was spamming heavily and registered with Namecheap and the company took swift action: > Hello, > > Thank you for your email. > > While jecon.us domain name is registered with Namecheap it is hosted with > another company. So we cannot check the logs for a domain and confirm if it > is involved in sending unsolicited bulk emails. > > However, as we can see the domain name is blacklisted by trusted > organizations. Thus we opened a case regarding the domain name. Please allow > about 48 hours for our further investigation. > > Thank you for letting us know about the issue. Five days later, the domain was shut-down: > Hello, > > This is to inform you that jecon.us domain was suspended. It is now pointed > to non-resolving nameservers and will be nullrouted once the propagation is > over. The domain is locked for modifications in our system. > > Thank you for letting us know about the issue. So, if you are having problems with domains registered with Namecheap, I suggest that you open a support request for the "Domains -- Legal and Abuse" department. From the sounds of it, you'd be doing us all a big favor! -Ben On 5/7/2013 3:26 PM, Chris Santerre wrote: > The owner is NameCheap, Inc. > > A quick google will bring up historical problems with NameCheap and its > owner and its DBAs. > > I dare not say anything bad about them and let you judge for yourself on > their history. Richard Kirkendall has a tendency to yell "Slander!" when > someone even mentions their name. > > > --Chris > (I top post because I care.) 
> > >> -Original Message- >> From: lcon...@go2france.com [mailto:lcon...@go2france.com] >> Sent: 2013-05-07 14:15 >> To: users@spamassassin.apache.org >> Subject: dns*.registrar-servers.com as a rogue registrar? >> >> >> >> Nearly all of the .pw domains have their authoritative NS at >> dns*.registrar-servers.com. >> >> that registrar and few others are always at the top of my reports for >> NSs of sender domains of spam we reject. >> >> Does anybody score a msg if its sender domain is DNS hosted by >> registrar-servers.com or other? >> >> what would that rule look like? >> >> Len >> >> >
Re: SQL error: Duplicate entry
On 4/25/2013 11:55 AM, Matus UHLAR - fantomas wrote:
>> On Thu, Apr 25, 2013 at 1:47 PM, Matus UHLAR - fantomas
>> wrote:
>>> I don't think so... IIRC the "REPLACE INTO" deletes existing record and
>>> inserts new one, does not update existing. This caused some issues
>>> for me
>>> some ~10 years ago, so i switched to the update or insert.
>
> On 25.04.13 16:36, Matthias Leisi wrote:
>> "REPLACE INTO" is a MySQL-specific extension and not part of standard
>> SQL.
>
> I know, but what does this have in common with what I wrote?
>

It seems that Matthias's point is that SA doesn't use "REPLACE INTO" because "REPLACE INTO" is MySQL-specific, and SA must remain database-agnostic. This leaves one to assume that SA is performing an INSERT or an UPDATE only.

The question then becomes, why is SA attempting to perform an INSERT (and failing with a duplicate key conflict on the PRIMARY KEY, which, in my moderately stale version of SA, is a UNIQUE key across the `id` and `token` columns) when it should be performing an UPDATE? (Possible explanation two paragraphs down.)

Presumably, the `id` column is a foreign key to the `bayes_vars`.`id` column, which indicates that only one record for each SA Bayes user and token combination may exist. Sounds reasonable.

My understanding is that it's better (with respect to performance and atomicity) to attempt the INSERT and have it fail than to check if the ID/token combination already exists and UPDATE it if it does.

In other words, I'm not sure that this warning is a problem (beyond log bloat or similar). It's entirely possible that SA *only* performs INSERT statements for the reasons I mention above. Only a developer or disciple of the SA source code can say for sure. I wish I had time to look myself.

Out of curiosity, how did this SQL error come to your attention in the first place?

-Ben
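The "attempt the INSERT and fall back on failure" pattern described above can be sketched as follows. To be clear, this is not taken from the SA source; it is a generic illustration using SQLite (not MySQL) so that it runs anywhere, with a simplified, assumed `bayes_token` schema and a hypothetical `put_token` helper.

```python
import sqlite3


def make_db():
    """In-memory database with a simplified, assumed bayes_token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token ("
        " id INTEGER NOT NULL,"
        " token BLOB NOT NULL,"
        " spam_count INTEGER NOT NULL DEFAULT 0,"
        " ham_count INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (id, token))"
    )
    return conn


def put_token(conn, user_id, token, spam_inc=0, ham_inc=0):
    """Try INSERT first; on a duplicate-key error, UPDATE the counts instead."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bayes_token (id, token, spam_count, ham_count)"
                " VALUES (?, ?, ?, ?)",
                (user_id, token, spam_inc, ham_inc),
            )
        return "inserted"
    except sqlite3.IntegrityError:
        # This is the point at which a "Duplicate entry ... for key
        # 'PRIMARY'" style error would be logged before falling back.
        with conn:
            conn.execute(
                "UPDATE bayes_token"
                " SET spam_count = spam_count + ?, ham_count = ham_count + ?"
                " WHERE id = ? AND token = ?",
                (spam_inc, ham_inc, user_id, token),
            )
        return "updated"
```

If SA works along these lines, the duplicate-key message would be an expected, harmless part of the first step failing rather than evidence of lost data, provided the UPDATE actually follows.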
Re: SQL error: Duplicate entry
On 4/24/2013 2:42 PM, psychobyte wrote:
> Hi,
>
> I've noticed that SA is getting a lot of "Duplicate entry" errors for
> AWL and bayes plugins. I can verify that the sql schema is up to date
> for SA 3.3.1-r4 and I've tried retraining the bayes db. Any hints on how
> to troubleshoot this?
>
> AWL:
>
> Apr 24 11:31:57 mserv amavisd[24336]: (24336-05) SA dbg: auto-whitelist:
> sql-based add_score/insert amavis|myem...@example.com|14.43|1|-0.699:
> SQL error: Duplicate entry 'amavis-myem...@example.com--14.43' for key
> 'PRIMARY'
>
>
> Bayes:
>
> Apr 24 11:31:57 mserv amavisd[24336]: (24336-05) SA dbg: bayes:
> _put_token: SQL error: Duplicate entry '1-\312\270j' for key 'PRIMARY'
>
>
>
>

I know very little about how Bayes interacts with SQL, but it's clear that SA is trying to insert a record that is identical to one that's already present, and the key(s) that are defined on the table are preventing it.

Makes one wonder if a "REPLACE INTO ..." was replaced with an "INSERT INTO ..." in a recent version of SA.

Of course, the messages that you're seeing tell us nothing about which DB table is causing the problem. Maybe one of the developers will see this and recall making such a change.

Alternatively, you could dig into your tables and attempt to identify where those values actually live. Once we have the offending table, further troubleshooting will be possible.

-Ben
Re: Seminar Spam
On 4/24/2013 12:12 PM, hospice admin wrote:
> Hi,
>
> we're having problems with an outfit called 'Bite Sized Seminars' in the
> UK, who seem to be sending mail out through another company called
> 'Communicado'. A quick google suggests we aren't the only ones.
>
> We have developed a number of rules that identify their mail by looking
> for their phone numbers, common phrases, etc in their mail shots with
> varying success (I'm happy to share these with anyone who may find them
> helpful).
>
> The problem I'm trying to solve is that they seem to register hundreds
> of .co.uk domains, and have access to loads of sending IPs, so I can't
> just write a rule to do the obvious. I've complained about them to
> Nominet, and they aren't interested ... according to them, they are
> doing nothing wrong. I've also complained to various IP providers, some
> of which say they will do something, but rarely do. I've even rung them
> ... again ... no joy.
>
> Here's my question - am I missing a trick here, particularly regarding
> the hundreds of domain names? For example, is it possible to do a
> 'whois' and process the output in some way?
>
> Thanks
>
> Judy.
>

Have you been feeding Bayes samples of these messages? I would think Bayes to be far more effective against this type of spamming (given the dynamic nature of the domains and IP addresses) than writing custom rules.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/20/2013 3:20 PM, Benny Pedersen wrote: > Ben Johnson skrev den 2013-04-20 05:02: > >> Yes, I believe that me and the system always execute SA commands as the >> "amavis" user. When I was using the SQL setup, I had the following in >> local.cf: >> >> bayes_path /var/lib/amavis/.spamassassin/bayes > > is amavis have homedir in /var/lib/ ? The amavis user's home directory is /var/lib/amavis. This seems to be the default on Ubuntu; I didn't change this path. > in gentoo its default as /var/amavis where the .spamassassin dir is > created by amavisd > > use user_prefs to set bayes_path does not make sense if sql is used > Thanks; I did comment-out the "bayes_path" directive. I figured that it wouldn't matter whether it is commented or not, in the presence of SQL-related directives, but it can't hurt to comment-out this line. >> With the DBM setup, I had the following (I have since commented it out, >> while attempting to debug this Bayes issue): >> >> bayes_sql_override_username amavis > > +1 to this one since amavis cant use multiple sa users very easy, but > depending on what amavis it being supported with complicated setups :( I only need one Bayes user, so this setup is adequate. > i changed away from amavisd to clamav-milter, spampd in postfix after > queue, this is working very well for me, and i hope sa 3.4.x will not > break spampd :=) Hmm, I will consider your sound advice in this regard. amavis does seem to be overly memory-hungry (despite setting $max_servers = 1 and $max_requests = 1). If there is a better alternative, I'm all ears (or eyes, as the case may be). >> Is something more required to ensure that my mail system, which runs >> under the "amavis" user, is always reading from and writing to the >> same DB? 
> > nope just remember that amavis also reads .spamassassin/user_prefs > > if you like you can copy that user_prefs to /root/.spamassassin so you > dont have to remember something :) > > user_prefs should ONLY be readble by that user that runs it > Thanks for pointing this out. I will double-check the permissions. I'll respond to your other email momentarily. Thanks, Benny! -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, the problem seems not to be SQL-specific, as it occurs with SQL or flat-file DB.

Upon following Benny Pedersen's advice (to move SA configuration directives from /etc/spamassassin/local.cf to /var/lib/amavis/.spamassassin/user_prefs), I noticed something unusual:

    $ ls -lah /var/lib/amavis/.spamassassin/
    total 7.6M
    drwx------ 2 amavis amavis 4.0K Apr 20 08:54 .
    drwxr-xr-x 7 amavis amavis 4.0K Apr 20 08:56 ..
    -rw------- 1 root   root   8.0K Apr 20 08:33 bayes_journal
    -rw------- 1 root   root   1.3M Apr 20 00:09 bayes_seen
    -rw------- 1 root   root   4.8M Apr 20 08:29 bayes_toks
    -rw-r--r-- 1 root   root    799 Jun 28  2004 gtube.txt
    -rw-r--r-- 1 amavis amavis 2.7K Apr 20 08:55 user_prefs

Welp, that'll do it! How those four files were set to root:root ownership is beyond me, but that was certainly a factor. Maybe this was a result of executing my training script as root (even though I had hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes, and when using SQL, hard-coded bayes_sql_override_username to use amavis)?

I changed ownership to amavis:amavis and now messages are being scored with Bayes (all of them, from what I can tell so far).

Also, I looked into the fact that I was running the cron job that trains ham and spam as root. I did this only because the amavis user lacks access to /var/vmail, which is where mail is stored on this system. (As a corollary, I'm a bit curious as to how amavis is able to scan incoming mail, given this lack of access -- maybe it does so using a pipe or some other method that does not require access to /var/vmail.)

I think the disconnect was in the fact that I placed my custom configuration directives in /etc/spamassassin/local.cf, when I should have placed them in /var/lib/amavis/.spamassassin/user_prefs (for message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam training). (Thanks for pointing out this mistake, Benny P.)

Putting my custom SA configuration directives in both of these files was the only way I was able to train mail and score incoming messages using the same credentials "across-the-board". Once I did this, I was able to use SQL or flat-file DB with exactly the same results.

Is there a better way to achieve this consistency, aside from putting duplicate content into /var/lib/amavis/.spamassassin/user_prefs and /root/.spamassassin/user_prefs?

Feels like I'm out of the woods here! Thanks for all the expert help, guys.

-Ben
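Since root-owned Bayes files crept in silently, a periodic ownership check might have caught this much sooner. Here is a minimal sketch of such a check; `files_not_owned_by` is a hypothetical helper of my own, and on a real system you would point it at /var/lib/amavis/.spamassassin with the amavis user's uid (e.g. from `pwd.getpwnam("amavis").pw_uid`). The demo runs against a throwaway temporary directory so that it works on any Unix-like machine.

```python
import os
import tempfile


def files_not_owned_by(directory, expected_uid):
    """Return [(name, uid), ...] for entries not owned by expected_uid."""
    mismatched = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            uid = entry.stat(follow_symlinks=False).st_uid
            if uid != expected_uid:
                mismatched.append((entry.name, uid))
    return mismatched


def demo():
    """Create two files as the current user and check them both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("bayes_seen", "bayes_toks"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write("placeholder\n")
        me = os.getuid()
        as_owner = files_not_owned_by(tmp, me)
        as_other = files_not_owned_by(tmp, me + 1)  # pretend another uid is expected
        return as_owner, [name for name, _ in as_other]


print(demo())
```

An empty result means everything in the directory belongs to the expected user; anything else is a candidate for chown before Bayes quietly stops being consulted.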
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for the rapid-fire here folks, but I wanted to correct something. I had these backwards: >> Yes, I believe that me and the system always execute SA commands as the >> "amavis" user. When I was using the SQL setup, I had the following in >> local.cf: >> >> bayes_path /var/lib/amavis/.spamassassin/bayes >> >> With the DBM setup, I had the following (I have since commented it out, >> while attempting to debug this Bayes issue): >> >> bayes_sql_override_username amavis I meant to say that I have *always* had bayes_path /var/lib/amavis/.spamassassin/bayes in local.cf, and using the SQL setup, I added bayes_sql_override_username amavis Sorry for the confusion! -Ben On 4/19/2013 11:02 PM, Ben Johnson wrote: > > > On 4/19/2013 1:54 PM, Benny Pedersen wrote: >> Ben Johnson skrev den 2013-04-19 18:02: >> >>> Still stumped here... >> >> for amavisd-new, put spamassassin sql setup into user_prefs file for the >> user amavisd-new runs as might be working better then have insecure sql >> settings in /etc/mail/spamassassin :) >> >> i dont know if this is really that you have another user for amavisd, >> and test spamassassin -t msg with another user that uses another sql user ? >> >> make sure both users is really using same sql user as intended >> > > Benny, thanks for the suggestion regarding moving the SA SQL setup into > user_prefs. I will look into that soon. > > Yes, I believe that me and the system always execute SA commands as the > "amavis" user. When I was using the SQL setup, I had the following in > local.cf: > > bayes_path /var/lib/amavis/.spamassassin/bayes > > With the DBM setup, I had the following (I have since commented it out, > while attempting to debug this Bayes issue): > > bayes_sql_override_username amavis > > Is something more required to ensure that my mail system, which runs > under the "amavis" user, is always reading from and writing to the same DB? > > Best regards, > > -Ben > >
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 1:54 PM, Benny Pedersen wrote: > Ben Johnson skrev den 2013-04-19 18:02: > >> Still stumped here... > > for amavisd-new, put spamassassin sql setup into user_prefs file for the > user amavisd-new runs as might be working better then have insecure sql > settings in /etc/mail/spamassassin :) > > i dont know if this is really that you have another user for amavisd, > and test spamassassin -t msg with another user that uses another sql user ? > > make sure both users is really using same sql user as intended > Benny, thanks for the suggestion regarding moving the SA SQL setup into user_prefs. I will look into that soon. Yes, I believe that me and the system always execute SA commands as the "amavis" user. When I was using the SQL setup, I had the following in local.cf: bayes_path /var/lib/amavis/.spamassassin/bayes With the DBM setup, I had the following (I have since commented it out, while attempting to debug this Bayes issue): bayes_sql_override_username amavis Is something more required to ensure that my mail system, which runs under the "amavis" user, is always reading from and writing to the same DB? Best regards, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
>
>> Still stumped here...
>
> do a bayes sa-learn --backup
>
> switch to file based in SDBM format (which is fast)
>
> do a
>
> sa-learn --restore
>
> feed it a few thousand NEW spams
>
> see what happens
>

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training corpora that I've hand-sorted in favor of starting over? Or do you mean to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will really be stumped. When I hand-classify 12 messages with a subject and body about a retractable garden hose as spam, I expect the 13th message about the same hose to score high on the Bayes tests. Is this an unreasonable expectation?

I commented-out all of the DB-related lines in my SA configuration file (local.cf) and restarted amavis-new. I also cleared the existing DB tokens (with "sa-learn --clear") after amavis had restarted, and then executed my normal training script against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a better term, had been introduced by using utf8 to store the token data (Benny Pedersen mentioned that Unicode is overkill, and he is probably right). Performance aside, could using utf8_bin (instead of ascii) introduce a problem for SA (despite no errors during "sa-learn" training or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But as I've stated previously, almost all "obvious to a human" spam that scores below 4.0 lacks evidence of Bayes testing.
Since switching back to a DBM Bayes setup, the results look pretty much as expected (wrapped), and this is the type of thing I expect to see on every message:

---
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'

dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0), bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
  extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
  get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
  compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5 (0.2%),
  tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%), check_bayes: 26 (0.9%),
  tests_pri_0: 836 (28.6%), dkim_load_modules: 27 (0.9%),
  check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
  check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
  tests_pri_500: 988 (33.8%)
---

I'll wait and see if I receive messages without Bayes results and report back.

Even if using DBM "works", I don't see this as a long-term solution -- only as a troubleshooting step. I would really like to keep my Bayes data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
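Eyeballing this debug output for every suspicious message gets old quickly, so a small parser may help when comparing runs. The sketch below only looks for the three kinds of line that have appeared in this thread (corpus size, score, and the "not enough usable tokens" notice); `summarize_bayes_debug` is a hypothetical helper, and the patterns are assumptions based solely on the samples pasted here, so other SA versions may word these lines differently.

```python
import re

CORPUS_RE = re.compile(r"bayes: corpus size: nspam = (\d+), nham\s*= (\d+)")
SCORE_RE = re.compile(r"bayes: score = ([0-9.eE+-]+)")
SKIPPED_MARKER = "not enough usable tokens"


def summarize_bayes_debug(text):
    """Extract corpus size, Bayes score and skip status from -D output."""
    summary = {"nspam": None, "nham": None, "score": None, "skipped": False}
    corpus = CORPUS_RE.search(text)
    if corpus:
        summary["nspam"] = int(corpus.group(1))
        summary["nham"] = int(corpus.group(2))
    score = SCORE_RE.search(text)
    if score:
        summary["score"] = float(score.group(1))
    if SKIPPED_MARKER in text:
        summary["skipped"] = True
    return summary


SCORED = (
    "dbg: bayes: corpus size: nspam = 6203, nham = 2479\n"
    "dbg: bayes: score = 5.55111512312578e-17\n"
)
UNSCORED = (
    "dbg: bayes: corpus size: nspam = 6155, nham = 2342\n"
    "dbg: bayes: cannot use bayes on this message; "
    "not enough usable tokens found\n"
)

print(summarize_bayes_debug(SCORED))
print(summarize_bayes_debug(UNSCORED))
```

Feeding it the captured stderr from `spamassassin -D -t` for each message that slips through would make it easy to tally how many were actually scored by Bayes versus skipped.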
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 11:42 AM, Alex wrote: > Hi, > >> Is this normal? If so, what is the explanation for this behavior? I have > > marked dozens of nearly-identical messages with the subject > "Garden hose > expands up to three times its length" as SPAM (over the course of > several weeks) as SPAM, and yet SA reports "not enough usable > tokens found". > > > If they are identical, I don't believe it will create new tokens, > per se. > > > > Is SA referring to the number of tokens in the message? Or the > Bayes DB? > > > I should also mention that while training a message, use "--progress", > as such (assuming you're running it on an mbox or message that's in mbox > format): > > # sa-learn --progress --spam --mbox mymboxfile > > It will show you how many tokens have been learned during that run. It > might also be a good idea to add the token summary flag to your config: > > add_header all Tok-Stat _TOKENSUMMARY_ > > If you run spamassassin on a message directly, and add the -t option, it > will show you the number of different types of tokens found in the message: > > X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36. > > Regards, > Alex > Alex, thanks very much for the quick reply. I really appreciate it. One can see from the output in my previous message (two messages back) that the user is amavis (correct for my system) and the corpus size, as well as the token count: dbg: bayes: corpus size: nspam = 6155, nham = 2342 dbg: bayes: tok_get_all: token count: 176 dbg: bayes: cannot use bayes on this message; not enough usable tokens found bayes: not scoring message, returning undef Now that I look at this output again, the "token count: 176" stands-out. That seems like a pretty low value. Is this the token count for the entire Bayes DB??? Or only the tokens that apply to the particular message being fed to SA? The "garden hose" messages are probably not *identical*, but they are very similar, so it seems that each variant should have tokens to offer. 
The concern I expressed around bug 6624 relates to Mark's comment, which seems to imply that while SA will not insert a token twice, it *will* increase the token "count". Here's an excerpt from Mark's comment from that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able to be inserted once, but their counts cannot increase, leading to terrible bayes results if the bug is not noticed. Also the conversion form db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (sorry for the wrapping, this is the best I could do):

id:                  1
username:            amavis
spam_count:          6185
ham_count:           2427
token_count:         120092
last_expire:         1366364379
last_atime_delta:    8380417
last_expire_reduce:  14747
oldest_token_age:    1357985848
newest_token_age:    1366386865

The SQL query:

SELECT count( * ) FROM `bayes_token`

returns 120092 rows, so the above value is accurate (that is, the "token_count" value in the `bayes_vars` table matches the actual row count in the `bayes_token` table).

Also, thanks for the other tips regarding the "token summary flag" directive and the -t switch. I was actually using the -t switch to produce the output that I pasted two messages back. So, it seems that the "X-Spam-Tok-Stat" output is added only when the token count is high enough to be useful.

Still stumped here...

-Ben
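A rough way to look for the symptom Mark describes (counts that cannot increase) is to inspect the per-token counts rather than the row count. The sketch below is only a heuristic under stated assumptions: it assumes pairs of (spam_count, ham_count) have already been fetched from the bayes_token table (for example with SELECT spam_count, ham_count FROM bayes_token WHERE id = 1, assuming the stock column names), and the function name and the "suspicious" flag are illustrative, not part of SpamAssassin.

```python
from typing import Dict, Iterable, Tuple


def summarize_token_counts(rows: Iterable[Tuple[int, int]]) -> Dict[str, object]:
    """Summarize (spam_count, ham_count) pairs taken from bayes_token."""
    tokens = 0
    repeated = 0  # tokens whose combined count is greater than 1
    max_spam = 0
    max_ham = 0
    for spam_count, ham_count in rows:
        tokens += 1
        max_spam = max(max_spam, spam_count)
        max_ham = max(max_ham, ham_count)
        if spam_count + ham_count > 1:
            repeated += 1
    return {
        "tokens": tokens,
        "repeated": repeated,
        "max_spam": max_spam,
        "max_ham": max_ham,
        # Assumption: after thousands of trained messages, a database in
        # which no token has a combined count above 1 looks stuck.
        "suspicious": tokens > 0 and repeated == 0,
    }


if __name__ == "__main__":
    sample = [(1, 0), (0, 1), (1, 0)]  # made-up rows, not real data
    print(summarize_token_counts(sample))
```

If common tokens show counts well above 1, updates are evidently being applied; a "suspicious" result would only be a hint to look closer, not proof of the bug.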
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/18/2013 12:18 PM, Ben Johnson wrote: > > My concern now is that I am on 3.3.1, with little control over upgrades. > I have read all three bug reports in their entirety and Bug 6624 seems > to be a very legitimate concern. To quote Mark in the bug description: > >> The effect of the bug with SpamAssassin is that tokens are only able >> to be inserted once, but their counts cannot increase, leading to >> terrible bayes results if the bug is not noticed. Also the conversion >> form db fails, as reported by Dave. >> >> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to >> provide a workaround for the MySQL server bug, and improved debug logging. > > How can I discern whether or not this bug does, in fact, affect me? Are > my Bayes results being crippled as a result of this bug? > >> It's possible that there's a good reason the default script still uses >> myISAM. If so, the documentation for this fix should at least be easier >> to find. >> > > In any event, I'm a little concerned because while the majority of > messages are now tagged with BAYES_* hits, I am now seeing this debug > output on a significant percentage of messages ("cannot use bayes on > this message; not enough usable tokens found"): > > # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' > > -- > Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new > self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388), > bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL > Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis > Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got > store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778) > Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established > Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3 > Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1 > Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham > = 2342 > Apr 18 09:15:36.787 [21797] dbg: 
bayes: tok_get_all: token count: 176 > Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this > message; not enough usable tokens found > Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef > Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830 > (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%), > poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%), > tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19 > (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%), > tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018 > (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%), > check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91 > (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%) > -- > > I have done some searching-around on the string "cannot use bayes on > this message; not enough usable tokens found" and have not found > anything authoritative regarding what this message might mean and > whether or not it can be ignored or if it is symptomatic of a larger > Bayes problem. > > Thank you, > > -Ben > Might anyone be in a position to offer an authoritative response to these questions? I continue to see messages that are very similar to dozens of messages that have been marked as SPAM slipping through with *no Bayes scoring* (this is *after* fixing the SQL syntax error issue): bayes: cannot use bayes on this message; not enough usable tokens found bayes: not scoring message, returning undef Is this normal? If so, what is the explanation for this behavior? I have marked dozens of nearly-identical messages with the subject "Garden hose expands up to three times its length" as SPAM (over the course of several weeks) as SPAM, and yet SA reports "not enough usable tokens found". Is SA referring to the number of tokens in the message? Or the Bayes DB? Thanks, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/18/2013 12:26 PM, Axb wrote:
> On 04/18/2013 06:18 PM, Ben Johnson wrote:
>> I have done some searching-around on the string "cannot use bayes on
>> this message; not enough usable tokens found" and have not found
>> anything authoritative regarding what this message might mean and
>> whether or not it can be ignored or if it is symptomatic of a larger
>> Bayes problem.
>
> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes setup?

Not yet, but that is the ultimate plan (to share the DB across multiple servers).

Also, I like the idea that the Bayes DB is backed up automatically along with all other databases on the server (we run a cron script that performs the dump). Granted, it would be trivial to schedule a call to "sa-learn --backup", but storing the data in SQL seems more portable and makes it easier to query the data for reporting purposes.

Then again, I retain the corpora, so backing up the DB is only useful for when data needs to be moved from one server or database to another (as moving the corpora seems far less practical).

Are you suggesting that I should scrap SQL and go back to a flat-file DB? Is that the only path to a fix (short of upgrading SA)?

Thanks for your help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 10:15 PM, John Hardin wrote: > On Wed, 17 Apr 2013, Ben Johnson wrote: > >> The first post on that page was the key. In particular, adding the >> following to each MySQL "CREATE TABLE" statement: >> >> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin; > > Please check the SpamAssassin bugzilla to see if this situation is > already mentioned, and if not, add a bug. This seems pretty critical. Mark Martinec opened three reports in relation to this issue (quoted from the archive thread cited in my previous post): [Bug 6624] BayesStore/MySQL.pm fails to update tokens due to MySQL server bug (wrong count of rows affected) https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624 (^^ Fixed in 3.4 ^^) [Bug 6625] Bayes SQL schema treats bayes_token.token as char instead of binary, fails chset checks https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625 (^^ Fixed in 3.4 ^^) [Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626 (^^ Fixed in 3.4 ^^) My concern now is that I am on 3.3.1, with little control over upgrades. I have read all three bug reports in their entirety and Bug 6624 seems to be a very legitimate concern. To quote Mark in the bug description: > The effect of the bug with SpamAssassin is that tokens are only able > to be inserted once, but their counts cannot increase, leading to > terrible bayes results if the bug is not noticed. Also the conversion > form db fails, as reported by Dave. > > Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to > provide a workaround for the MySQL server bug, and improved debug logging. How can I discern whether or not this bug does, in fact, affect me? Are my Bayes results being crippled as a result of this bug? > It's possible that there's a good reason the default script still uses > myISAM. If so, the documentation for this fix should at least be easier > to find. 
If there is a good reason, I have yet to discern what it might be. The third bug from above (Mark's comments, specifically) implies that there is no particular reason for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I have seen no performance hit as a result of so doing. (In fact, performance seems better than with MyISAM in my scripted, once-a-day training setup.)

The perfectly acceptable performance I'm observing could be because a) the InnoDB-related resources allocated to MySQL are more than sufficient, b) the schema that I used has a newly-added INDEX whereas those prior to it did not, or c) I was sure to use the "MySQL" module instead of the "SQL" module with my InnoDB setup:

bayes_store_module Mail::SpamAssassin::BayesStore::MySQL

The bottom line seems to be that for those who have settings like these in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement (otherwise, the MySQL syntax error results and all Bayes SELECT statements fail).
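For anyone applying the CREATE TABLE change to a schema file by script rather than by hand, here is a minimal sketch. It assumes each CREATE TABLE statement ends with a closing parenthesis, optionally followed by a single TYPE=... or ENGINE=... clause, and then a semicolon; the function name is made up for illustration, and statements that already carry further table options are deliberately left untouched.

```python
import re

# Table options quoted in the message above.
TABLE_OPTIONS = "ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin"

# Matches a CREATE TABLE statement up to its final ")" and an optional
# single TYPE=/ENGINE= clause before the terminating semicolon.
_CREATE_TABLE = re.compile(
    r"(CREATE\s+TABLE\b[^;]*\))\s*(?:(?:TYPE|ENGINE)\s*=\s*\w+)?\s*;",
    re.IGNORECASE,
)


def add_table_options(schema_sql: str) -> str:
    """Return schema_sql with TABLE_OPTIONS set on each CREATE TABLE."""
    return _CREATE_TABLE.sub(
        lambda m: f"{m.group(1)} {TABLE_OPTIONS};", schema_sql
    )


if __name__ == "__main__":
    example = "CREATE TABLE demo (\n  id int(11) NOT NULL\n) TYPE=MyISAM;\n"
    print(add_table_options(example))
```

The pattern is naive about semicolons inside string literals, so the output is worth reviewing before it is fed to MySQL.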
In any event, I'm a little concerned because while the majority of messages are now tagged with BAYES_* hits, I am now seeing this debug output on a significant percentage of messages ("cannot use bayes on this message; not enough usable tokens found"): # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' -- Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778) Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3 Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1 Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham = 2342 Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176 Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this message; not enough usable tokens found Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830 (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%), poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%), tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19 (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%), tests_pri_-400: 1
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 5:39 PM, Tom Hendrikx wrote:
> On 17-04-13 21:40, Ben Johnson wrote:
>> Ideally, using the above directives will tell us whether we're
>> experiencing timeouts, or these spam messages are simply not in the
>> Pyzor or Razor2 databases.
>>
>> Off the top of your head, do you happen to know what will happen if one
>> or both of the Pyzor/Razor2 tests timeout? Will some indication that the
>> tests were at least *started* still be added to the SA header?
>
> The razor client (don't know about pyzor) logs its activity to some
> logfile in ~razor. There you can see what (or what not) is happening.
>
> It's also possible to raise logfile verbosity by changing the razor
> config file. See the man page for details.
>
> --
> Tom

Tom, thanks for the excellent tip regarding Razor's own log file. Tailing that log will make this kind of debugging much simpler in the future. Much appreciated.

One reason I also like the idea of using Daniel McDonald's include-scores-in-header rule (for Pyzor and Razor) is that the data is embedded right in the message, which can be useful. For one, this makes the scoring data more "portable" (it stays with the message to which it applies). Secondly, when tailing a log, it can be difficult to determine where the data relevant to one message ends and another begins.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 6:47 PM, Ben Johnson wrote: > > > On 4/17/2013 5:05 PM, Kris Deugau wrote: >> Ben Johnson wrote: >>> Is there anything else that would cause Bayes tests not be performed? I >>> ask because other types of tests are disabled automatically under >>> certain circumstances (e.g., network tests), and I'm wondering if there >>> is some obscure combination of factors that causes Bayes tests not to be >>> performed. >> >> Do you have bayes_sql_override_username set? (This forces use of a >> single Bayes DB for all SA calls that reference this configuration file >> set.) >> >> If not, you may be getting a Bayes DB for each user on your system; >> IIRC this is supported (sort of) and default with Amavis. >> >> -kgd >> > > Thanks for jumping-in here, Kris. > > Yes, I do have the following in my SA local.cf: > > bayes_sql_override_username amavis > > So, all users are sharing the same Bayes DB. I train Bayes daily and the > token count, etc., etc. all look good and correct. > > Just a quick update to my previous post. > > The Pyzor and Razor2 score information is indeed coming through for the > handful of messages that have landed since I made those configuration > changes. So, all seems to be well on the Pyzor / Razor2 front. > > However, I still don't see any evidence that Bayes testing was performed > on the messages that are "slipping through". > > It bears mention that *most* messages do indeed show evidence of Bayes > scoring. > > --- OH, SNAP! I found the root cause. --- > > Well, when I went to confirm the above statement, regarding most > messages showing evidence of Bayes scoring, I realized that *none* show > evidence of it since 3/23! No wonder all of this garbage is slipping > through! > > I recognized the date 3/23 immediately; it was the date on which we > upgraded ISPConfig from 3.0.4.6 to 3.0.5. 
(For those who have no > knowledge of ISPConfig, it is basically a FOSS solution to managing vast > numbers of websites, domains, mailboxes, etc., as the name implies.) > > We also updated OS packages (security only) on that day. > > After diff-ing all of the relevant service configuration files > (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any > discrepancies. > > Then, I tried: > > - > # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' > > Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new > self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508), > bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL > Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis > Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got > store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358) > Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established > Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3 > Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1 > Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham > = 2334 > Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163 > Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal > mix of collations for operation ' IN ' > Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this > message; none of the tokens were found in the database > Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef > Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804 > (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%), > poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%), > tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18 > (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%), > tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804 > (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%), > 
check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211 > (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%) > - > > Check-out the message buried half-way down: > > bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' > IN ' > > I have run into this unsightly message before, but in that case, I could > see the entire query, which enabled me to change the collations accordingly. > > In this case, I have no idea what the original query might have been. > > Further, I have no idea what changed that introduced this problems on 3/23. > >
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 5:05 PM, Kris Deugau wrote: > Ben Johnson wrote: >> Is there anything else that would cause Bayes tests not be performed? I >> ask because other types of tests are disabled automatically under >> certain circumstances (e.g., network tests), and I'm wondering if there >> is some obscure combination of factors that causes Bayes tests not to be >> performed. > > Do you have bayes_sql_override_username set? (This forces use of a > single Bayes DB for all SA calls that reference this configuration file > set.) > > If not, you may be getting a Bayes DB for each user on your system; > IIRC this is supported (sort of) and default with Amavis. > > -kgd > Thanks for jumping-in here, Kris. Yes, I do have the following in my SA local.cf: bayes_sql_override_username amavis So, all users are sharing the same Bayes DB. I train Bayes daily and the token count, etc., etc. all look good and correct. Just a quick update to my previous post. The Pyzor and Razor2 score information is indeed coming through for the handful of messages that have landed since I made those configuration changes. So, all seems to be well on the Pyzor / Razor2 front. However, I still don't see any evidence that Bayes testing was performed on the messages that are "slipping through". It bears mention that *most* messages do indeed show evidence of Bayes scoring. --- OH, SNAP! I found the root cause. --- Well, when I went to confirm the above statement, regarding most messages showing evidence of Bayes scoring, I realized that *none* show evidence of it since 3/23! No wonder all of this garbage is slipping through! I recognized the date 3/23 immediately; it was the date on which we upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no knowledge of ISPConfig, it is basically a FOSS solution to managing vast numbers of websites, domains, mailboxes, etc., as the name implies.) We also updated OS packages (security only) on that day. 
After diff-ing all of the relevant service configuration files (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any discrepancies. Then, I tried: - # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)' Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358) Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3 Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1 Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham = 2334 Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163 Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' IN ' Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this message; none of the tokens were found in the database Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804 (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%), poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%), tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18 (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%), tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804 (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%), check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211 (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%) - Check-out the message buried half-way down: bayes: tok_get_all: SQL error: Illegal mix of collations for operation ' IN ' I have run into this unsightly message before, 
but in that case, I could see the entire query, which enabled me to change the collations accordingly. In this case, I have no idea what the original query might have been. Further, I have no idea what changed that introduced this problem on 3/23. Was it a MySQL upgrade? Was it an ISPConfig change? Has anybody else run into this? Thanks again, -Ben
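Because the SQL error sat unnoticed in the middle of the debug output, it may help to scan that output for such lines automatically. The sketch below assumes the text produced by "spamassassin -D -t < /tmp/msg.txt 2>&1" has been captured into a string and that errors appear in the "dbg: bayes: <function>: SQL error: <message>" form shown above; the function name is hypothetical.

```python
import re
from typing import List

# Matches lines such as:
#   ... dbg: bayes: tok_get_all: SQL error: Illegal mix of collations ...
_SQL_ERROR = re.compile(r"dbg: bayes: (\w+): SQL error: (.*)")


def find_bayes_sql_errors(debug_output: str) -> List[str]:
    """Return 'function: message' strings for each Bayes SQL error line."""
    errors = []
    for line in debug_output.splitlines():
        match = _SQL_ERROR.search(line)
        if match:
            errors.append(f"{match.group(1)}: {match.group(2).strip()}")
    return errors


if __name__ == "__main__":
    import sys

    # Example: spamassassin -D -t < /tmp/msg.txt 2>&1 | python this_script.py
    for error in find_bayes_sql_errors(sys.stdin.read()):
        print(error)
```

Run periodically against a known test message, a non-empty result would have flagged the breakage on the day it started rather than weeks later.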
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Daniel, thanks for the quick reply. I'll reply inline, below. On 4/16/2013 5:01 PM, Daniel McDonald wrote: > > > > On 4/16/13 2:59 PM, "Ben Johnson" wrote: > >> Are there any normal circumstances under which Bayes tests are not run? > Yes, if USE_BAYES = 0 is included in the local.cf file. I checked in /etc/spamassassin/local.cf, and find the following: use_bayes 1 So, that seems not to be the issue. >> >> If not, are there circumstances under which Bayes tests are run but >> their results are not included in the message headers? (I have tag_level >> set to -999, so SA headers are always added.) > > That sounds like an amavisd command, you may want to check in > ~amavisd/.spamassassin/user_prefs as well I checked in the equivalent path on my system (/var/lib/amavis/.spamassassin/user_prefs) and the entire file is commented-out. So, that seems not to be the issue, either. Is there anything else that would cause Bayes tests not be performed? I ask because other types of tests are disabled automatically under certain circumstances (e.g., network tests), and I'm wondering if there is some obscure combination of factors that causes Bayes tests not to be performed. >> >> Likewise, for the vast majority of spam messages that slip-through, I >> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed >> that this observation indicates that the network tests were performed, >> but did not contribute to the SA score. Is this assumption valid? > Yes. Okay, very good. It occurred to me that perhaps the Pyzor and/or Razor2 tests are timing-out (both timeouts are set to 15 seconds) some percentage of the time, which may explain why these tests do not contribute to a given message's score. That's why I asked about forcing the results into the SA header. >> >> Also, is there some means by which to *force* Pyzor and Razor2 scores to >> be added to the SA header, even if they did not contribute to the score? 
> I imagine you would want something like this:
>
> full     RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
> tflags   RAZOR2_CF_RANGE_0_50 net
> reuse    RAZOR2_CF_RANGE_0_50
> describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
> score    RAZOR2_CF_RANGE_0_50 0.01
>
> full     RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
> tflags   RAZOR2_CF_RANGE_E4_0_50 net
> reuse    RAZOR2_CF_RANGE_E4_0_50
> describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level below 50%
> score    RAZOR2_CF_RANGE_E4_0_50 0.01
>
> full     RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
> tflags   RAZOR2_CF_RANGE_E8_0_50 net
> reuse    RAZOR2_CF_RANGE_E8_0_50
> describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level below 50%
> score    RAZOR2_CF_RANGE_E8_0_50 0.01

This seems to work brilliantly. I can't thank you enough; I never would have figured this out.

Ideally, using the above directives will tell us whether we're experiencing timeouts, or these spam messages are simply not in the Pyzor or Razor2 databases.

Off the top of your head, do you happen to know what will happen if one or both of the Pyzor/Razor2 tests time out? Will some indication that the tests were at least *started* still be added to the SA header?

>> To refresh folks' memories, we have verified that Bayes is setup
>> correctly (database was wiped and now training is done manually and is
>> supervised), and that network tests are being performed when messages
>> are scanned.
>>
>> Thanks for sticking with me through all of this, guys!
>>
>> -Ben

Thanks again, Daniel!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for resurrecting the thread, but I never did receive a response to this particular aspect of the problem (asked on Jan 18, 2013, 8:51 AM). This is probably because I replied to my own post before anyone else did, and changed the subject slightly. We are being hammered pretty hard with spam (again), and as I inspect messages whose score is below tag2_level, BAYES_* is conspicuously absent from the headers. To reiterate my question: >> Are there any normal circumstances under which Bayes tests are not run? If not, are there circumstances under which Bayes tests are run but their results are not included in the message headers? (I have tag_level set to -999, so SA headers are always added.) Likewise, for the vast majority of spam messages that slip-through, I see no evidence of Pyzor or Razor2 activity. I have heretofore assumed that this observation indicates that the network tests were performed, but did not contribute to the SA score. Is this assumption valid? Also, is there some means by which to *force* Pyzor and Razor2 scores to be added to the SA header, even if they did not contribute to the score? To refresh folks' memories, we have verified that Bayes is setup correctly (database was wiped and now training is done manually and is supervised), and that network tests are being performed when messages are scanned. Thanks for sticking with me through all of this, guys! -Ben On 1/18/2013 11:51 AM, Ben Johnson wrote: > So, I've been keeping an eye on things again today. > > Overall, things look pretty good, and most spam is being blocked > outright at the MTA and scored appropriately in SA if not. > > I've been inspecting the X-Spam-Status headers for the handful of > messages that do slip through and noticed that most of them lack any > evidence of the BAYES_* tests. 
Here's one such header: > > No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1, > HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392, > SPF_PASS=-0.001] autolearn=disabled > > The messages that were delivered just before and after this one do have > evidence of BAYES_* tests, so, it's not as though something is > completely broken. > > Are there any normal circumstances under which Bayes tests are not run? > Do I need to turn debugging back on and wait until this happens again? > > Thanks for all the help, everyone! > > -Ben >
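Checking headers by eye for a missing BAYES_* entry is tedious, so here is a small sketch that does it for a header value in the format quoted above. It assumes the tests=[NAME=score, ...] layout shown in that X-Spam-Status value; the function names are made up for illustration.

```python
import re
from typing import Dict

# Captures everything between "tests=[" and the closing "]".
_TESTS = re.compile(r"tests=\[([^\]]*)\]")


def parse_status_tests(header_value: str) -> Dict[str, float]:
    """Map each test name in an X-Spam-Status value to its score."""
    match = _TESTS.search(header_value)
    if not match:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        item = item.strip()  # also removes newlines from folded headers
        if not item:
            continue
        name, _, score = item.partition("=")
        tests[name] = float(score) if score else 0.0
    return tests


def has_bayes_hit(header_value: str) -> bool:
    """True if any BAYES_* test appears in the header value."""
    return any(name.startswith("BAYES_") for name in parse_status_tests(header_value))


if __name__ == "__main__":
    header = (
        "No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1, "
        "HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392, "
        "SPF_PASS=-0.001] autolearn=disabled"
    )
    print(has_bayes_hit(header))
```

Counting how many delivered messages lack a Bayes hit per day would show whether the gap is occasional or total.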
Re: Telling BAYES not to learn?
On 2/7/2013 11:13 AM, Marc Perkel wrote:
>
> On 2/7/2013 6:58 AM, RW wrote:
>> On Tue, 05 Feb 2013 07:20:24 -0800
>> Marc Perkel wrote:
>>
>>> is there a way I can put something in a rule that would cause bayes
>>> not to learn - such as a rule that detects bayes poisoning?
>> Why do you think this is a good idea?
>
> Because when a message uses invisible text to poison bayes then I don't
> want to learn that because it will make bayes less effective.

Invisible text is a problem only for humans, not for machines. So, it sounds as though the problem you're describing relates to reviewing messages, manually (with your eyes), and taking some action as a result.

If this is so, why not read the messages in *plaintext*, so you see the "invisible text" and can therefore act accordingly?
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 7:58 PM, John Hardin wrote: > On Sat, 2 Feb 2013, RW wrote: > >> ALLOWING APPENDS >>By appends we mean the case of mail moving when the source folder is >>unknown, e.g. when you move from some other account or with tools >>like offlineimap. You should be careful with allowing APPENDs to >>SPAM folders. The reason for possibly allowing it is to allow >>not-SPAM --> SPAM transitions to work and be trained. However, >>because the plugin cannot know the source of the message (it is >>assumed to be from OTHER folder), multiple bad scenarios can happen: >> >>1. SPAM --> SPAM transitions cannot be recognised and are trained; >>2. TRASH --> SPAM transitions cannot be recognised and are trained; >>3. SPAM --> not-SPAM transitions cannot be recognised therefore >> training good messages will never work with APPENDs. >> >> >> I presume that the plugin works by monitoring COPY commands and so >> can't work properly when a move is done by FETCH-APPEND-DELETE. >> >> For sa-learn the problem would be 3, but I don't see how that is >> affected by allowing appends on the spam folder. > > Yeah, all of that sounds like they're talking about non-vetted training > mailboxes where the users are effectively talking directly to sa-learn. > > I think I may see at least part of what they are driving at. > > If one user trains a message as ham and another user who got a copy of > the same message trains it as spam, who wins? > > Absent some conflict-detection mechanism, the last mailbox trained > (either spam or ham) wins. > > As for the other two: > > spam -> spam transitions don't matter, sa-learn recognises message-IDs > and won't learn from the same message in the same corpus more than once > (i.e. having the same message in the spam corpus multiple times does not > "weight" the tokens learned from that message). So (1) may be a > performance concern but it won't affect the database. > > trash -> spam transition being learned is a problem how? 
> > That latter brings up another concern for the vetted-corpora model: if a > message is *removed* from a training corpora mailbox rather than > reclassified, you'd have to wipe and retrain your database from scratch > to remove that message's effects. > > So, you need *three* vetted corpus mailboxes: spam, ham, and > should-not-have-been-trained (forget). Rather than deleting a message > from the ham or spam corpus mailbox you move it to the forget mailbox > and the in next training pass sa-learn forgets the message and removes > it from the forget mailbox. This would be some special scripting, > because you can't just "sa-learn --forget" a whole mailbox. > > There would also need to be an audit process to detect whether the same > message_id is in both the ham and spam corpus mailboxes, so that the > admin can delete (NOT forget) the incorrect classification, or forget > the message if neither classification is reasonable. > You reveal some crucial information with regard to corpora management here, John. I've taken your good advice and created a third mailbox (well, a third "folder" within the same mailbox), named "Forget". It sounds as though the key here is never to delete messages from either corpus -- unless the same message exists in both corpora, in which case the misclassified message should be deleted. If neither classification is reasonable and the message should instead be forgotten, what's the order of operations? Should a copy of the message be created in the "Forget" corpus and then the message deleted from both the "Ham" and "Spam" corpora? With regard to the specialized scripting required to "forget" messages, this sounds cumbersome > because you can't just "sa-learn --forget" a whole mailbox. Is there a non-obvious reason for this? Would the logic behind a recursive --forget switch not be the same or similar as with the existing --ham and --spam switches? 
Finally, when a user submits a message to be classified as ham or spam, how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.
2.) I disagree with the end-user's classification.
    a.) Because the message was submitted as ham but is really spam (or vice versa)
    b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham folder to the permanent Ham corpus folder? Or should I *move* the message? I'm trying to discern whether or not there's value in retaining end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission folder? Or is there some reason to retain the message (i.e., move it into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora , but it doesn't address these issues specifically.

Thanks again!

-Ben
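The two pieces of scripting discussed above (a per-message forget pass, and an audit for the same Message-ID appearing in both corpora) might look roughly like the sketch below. It assumes Maildir-style corpus folders with one message per file under cur/ and new/; the function names, the folder layout, and the SA_LEARN override are hypothetical and only there for illustration. "sa-learn --forget" itself is a real switch. Folded Message-ID headers and CRLF line endings are not handled.

```shell
#!/bin/sh
# Sketch only. Assumes Maildir-style corpora: one message per file under
# <folder>/cur and <folder>/new. All function names are hypothetical.

# Command used for forgetting; overridable so the loop can be dry-run.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Forget every message in the given "Forget" folder, one file at a time,
# and delete each file only if sa-learn reported success.
forget_pass() {
    for msg in "$1"/cur/* "$1"/new/*; do
        [ -f "$msg" ] || continue            # skip unmatched globs
        if "$SA_LEARN" --forget "$msg" >/dev/null 2>&1; then
            rm -f "$msg"
        fi
    done
}

# Print the Message-ID of each message in a folder, sorted and de-duplicated.
# Only the header block (up to the first blank line) is examined.
list_message_ids() {
    for msg in "$1"/cur/* "$1"/new/*; do
        [ -f "$msg" ] || continue
        sed -n -e '/^$/q' \
               -e 's/^[Mm]essage-[Ii][Dd]:[[:space:]]*//p' "$msg" | head -n 1
    done | sort -u
}

# Print the Message-IDs that appear in both the ham and the spam corpus.
# Arguments: <ham folder> <spam folder>
audit_conflicts() {
    ham_ids=$(mktemp) && spam_ids=$(mktemp) || return 1
    list_message_ids "$1" > "$ham_ids"
    list_message_ids "$2" > "$spam_ids"
    comm -12 "$ham_ids" "$spam_ids"
    rm -f "$ham_ids" "$spam_ids"
}
```

Anything printed by audit_conflicts is only a list for the admin to look at; deciding which copy to delete (or whether to move the message to Forget) remains a manual step.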
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 12:00 PM, John Hardin wrote: > On Fri, 1 Feb 2013, Ben Johnson wrote: > >> John, thanks for pointing-out the problems associated with re-sending >> the messages via sendmail. >> >> I threw a line out to the Dovecot users group and learned how to move >> messages without going through the MTA. Dovecot has a utility >> executable, "deliver", which is well-suited to the task. >> >> For those who may have a similar need, here's the Dovecot Antispam pipe >> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users >> mailing list: >> >> --- >> #!/bin/bash >> >> mode= >> for opt; do >> if test "x$*" == "x--ham"; then >> mode=HAM >> break >> elif test "x$*" == "x--spam"; then >> mode=SPAM >> break >> fi >> done >> >> if test -n "$mode"; then >> # options from http://wiki1.dovecot.org/LDA >> /usr/lib/dovecot/deliver -d u...@example.com -m Training.$mode >> fi >> >> exit 0 >> --- > > That seems a lot better. > >> Regarding the second point, I'm not sure I understand the problem. If >> someone drags a message from Trash to SPAM, shouldn't it be submitted >> for learning as spam? >> >> The last sentence sounds like somewhat of a deal-breaker. Doesn't my >> whole strategy go somewhat limp if ham cannot be submitted for training? >> >> John and RW, do you recommend enabling or disabling the append option, >> given the way I'm reviewing the submissions and sorting them manually? > > I think they're proceeding from the assumption of *un-reviewed* > training, i.e. blind trust in the reliability of the users. > > If it's possible to enable IMAP Append on a per-folder basis then > enabling it only on your training inbox folders shouldn't be an issue - > the messages won't be trained until you've reviewed them. > > Without that level of fine-grain control I still don't see an issue from > this if you can prevent the users from adding content directly to the > folders that sa-learn actually processes. 
If IMAP Append only applies to > "shared" folders then there shouldn't be a problem - configure sa-learn > to learn from folders in *your account*, that nobody else can access > directly. > Thanks, John. If I'm understanding you correctly, your assessment is that enabling IMAP append in the Antispam plug-in configuration (not the default, by the way) shouldn't cause problems for my Bayes training setup, primarily because users cannot train Bayes unsupervised. If that is so, what's the real benefit to enabling this "feature" that is off by default? Users will be able to submit messages for training while "offline" and when they reconnect the plug-in will be triggered and the messages copied to the training mailbox? -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/31/2013 5:50 PM, RW wrote: > On Thu, 31 Jan 2013 12:12:15 -0800 (PST) > John Hardin wrote: > >> On Thu, 31 Jan 2013, Ben Johnson wrote: >> > >>> So, I finally got around to tackling this change. >>> >>> With a couple of simple modifications, I was able to achieve the >>> desired result with the Dovecot Antispam plug-in. >>> >>> Basically, I changed the last two directive values from the switches >>> that are normally passed to the "sa-learn" binary (--spam and >>> --ham) to destination email addresses that are passed to "sendmail" >>> in my revised pipe script. >> >> Passing the messages through sendmail again isn't optimal as that >> will make further changes to the headers. This may have effects on >> the quality of the learning, unless the original message is attached >> as an RFC-822 attachment to the message being sent to the corpus >> mailbox, which of course means you then can't just run sa-learn >> directly against that mailbox - the review process would involve >> moving the attachment as a standalone message to the spam or ham >> learning mailbox. >> >> Ideally you want to just move the messages between mailboxes without >> involving another delivery processing. I don't know enough about >> Dovecot or your topology to say whether that's going to be as easy as >> using sendmail to mail the message to you. > > Actually that's the way that the dovecot plugin works. I think that the > sendmail option is mainly a way to get training done on a remote > machine - it's a standard feature of DSPAM for which the plugin was > originally developed. > > When I looked at the plugin it seemed to have quite a serious flaw. > IIRC it disables IMAP APPENDs on the Spam folder which makes it > incompatible with synchronisation tools like OfflineImap and probably > some IMAP clients that implement offline support in the same way. > John, thanks for pointing-out the problems associated with re-sending the messages via sendmail. 
I threw a line out to the Dovecot users group and learned how to move messages without going through the MTA. Dovecot has a utility executable, "deliver", which is well-suited to the task.

For those who may have a similar need, here's the Dovecot Antispam pipe script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users mailing list:

---
#!/bin/bash

# Map the switch passed by the Antispam plug-in to a training folder.
mode=
for opt; do
    if test "x$opt" == "x--ham"; then
        mode=HAM
        break
    elif test "x$opt" == "x--spam"; then
        mode=SPAM
        break
    fi
done

if test -n "$mode"; then
    # options from http://wiki1.dovecot.org/LDA
    # deliver reads the message from stdin and files it under Training.$mode
    /usr/lib/dovecot/deliver -d u...@example.com -m Training.$mode
fi

exit 0
---

And here are the Antispam plug-in options:

---
# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = --spam
antispam_mail_notspam = --ham
---

RW, thank you for underscoring the issue with IMAP appends. It looks as though a configuration directive exists to control this behavior:

# Whether to allow APPENDing to SPAM folders or not. Must be set to
# "yes" (case insensitive) to be activated. Before activating, please
# read the discussion below.
# antispam_allow_append_to_spam = no

Unfortunately, I don't fully understand the implications of enabling or disabling this option. Here's the "discussion below" that is referenced in the above comment:

---
ALLOWING APPENDS?

You should be careful with allowing APPENDs to SPAM folders. The reason for possibly allowing it is to allow not-SPAM --> SPAM transitions to work with offlineimap. However, because with APPEND the plugin cannot know the source of the message, multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained
2. the same holds for Trash --> SPAM transitions

Additionally, because we cannot recognise SPAM --> not-SPAM transitions, training good messages will never work with APPEND.
--- In consideration of the first point, what is a "SPAM --> SPAM transition"? Is that when the mailbox contains more than one "spam folder", e.g., "JUNK" and "SPAM", and the user drags a message from one to the other? Regarding the second point, I'm not sure I understand the problem. If someone drags a message from Trash to SPAM, shouldn't it be submitted for learning as spam? The last sentence sounds like somewhat of a deal-breaker. Doesn't my whole strategy go somewhat limp if ham cannot be submitted for training? John and RW, do you recommend enabling or disabling the append option, given the way I'm reviewing the submissions and sorting them manually? Sorry for all the questions! And thanks! -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote:
>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.

So, I finally got around to tackling this change.

With a couple of simple modifications, I was able to achieve the desired result with the Dovecot Antispam plug-in.

In dovecot.conf:

-
plugin {
    # [...]

    # For Dovecot < 2.0.
    antispam_spam_pattern_ignorecase = SPAM;JUNK
    antispam_mail_tmpdir = /tmp
    antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
    antispam_mail_spam = proposed-s...@example.com
    antispam_mail_notspam = proposed-...@example.com
}
-

Basically, I changed the last two directive values from the switches that are normally passed to the "sa-learn" binary (--spam and --ham) to destination email addresses that are passed to "sendmail" in my revised pipe script.

Here is the full pipe script, /usr/bin/sa-learn-pipe.sh (apologies for the wrapping; the original commands are commented with two pound symbols [##]):

-
#!/bin/sh

# Add "starting now" string to log.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log

# Copy the message contents to a temporary text file.
cat <&0 >> /tmp/sendmail-msg-$$.txt

CURRENT_USER=$(whoami)

##echo "Calling (as user $CURRENT_USER) '/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log
echo "Calling (as user $CURRENT_USER) 'sendmail $* < /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log

# Execute sa-learn, with the passed ham/spam argument, and the temporary message contents.
# Send the output to the log file while redirecting stderr to stdout (so we capture debug output).
##/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1
sendmail $* < /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1

# Remove the temporary message.
rm -f /tmp/sendmail-msg-$$.txt

# Add "ending now" string to log.
echo "$$-end" >> /tmp/sa-learn-pipe.log

# Exit with "success" status code.
exit 0
-

It seems as though creating a temporary copy of the message is not strictly necessary, as the message contents could be passed to the "sendmail" command via standard input (stdin), but creating the copy could be useful in debugging.

>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.
The steps I've taken above will allow me to review submissions and educate users who engage in this practice. Thanks again for elucidating this scenario. I hope that this approach to user-based SpamAssassin training is useful to others. Best regards, -Ben
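As the message notes, the temporary copy is not strictly necessary. A minimal sketch of a variant that streams the message from stdin straight to sendmail could look like the following, assuming (as in the configuration above) that the plug-in passes the recipient address as the script's argument. The function name forward_for_review and the SENDMAIL and LOG overrides are hypothetical additions for illustration, not part of the original script.

```shell
#!/bin/sh
# Sketch only: same idea as the pipe script, without the temporary file.
# SENDMAIL and LOG are hypothetical overrides added for illustration.
SENDMAIL="${SENDMAIL:-sendmail}"
LOG="${LOG:-/tmp/sa-learn-pipe.log}"

# Stream the message on stdin to sendmail, addressed to the recipient(s)
# given as arguments, and log start/end markers plus any sendmail output.
forward_for_review() {
    echo "$$-start ($*)" >> "$LOG"
    "$SENDMAIL" "$@" >> "$LOG" 2>&1
    status=$?
    echo "$$-end (exit $status)" >> "$LOG"
    return "$status"
}

# When installed as the pipe script, the remaining body would be:
#   forward_for_review "$@"
#   exit 0
```

The trade-off is the one already mentioned: without the copy on disk there is nothing left to inspect if a submission goes missing, so the log line with sendmail's exit status is the only trace.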
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, I've been keeping an eye on things again today.

Overall, things look pretty good, and most spam is being blocked outright at the MTA and scored appropriately in SA if not.

I've been inspecting the X-Spam-Status headers for the handful of messages that do slip through and noticed that most of them lack any evidence of the BAYES_* tests. Here's one such header:

No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
SPF_PASS=-0.001] autolearn=disabled

The messages that were delivered just before and after this one do have evidence of BAYES_* tests, so it's not as though something is completely broken.

Are there any normal circumstances under which Bayes tests are not run?

Do I need to turn debugging back on and wait until this happens again?

Thanks for all the help, everyone!

-Ben
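For anyone wanting to find such messages without eyeballing every header, a small sketch along these lines may help. It assumes one message per file and an X-Spam-Status header in the format shown above; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical.

# Print the X-Spam-Status value of a message file, unfolded onto one line.
# Only the header block (up to the first blank line) is read.
spam_status_of() {
    awk '
        /^$/     { exit }                                 # end of headers
        /^[ \t]/ { if (inhdr) val = val " " $0; next }    # continuation line
                 { inhdr = 0 }
        tolower($0) ~ /^x-spam-status:/ { inhdr = 1; val = $0 }
        END {
            sub(/^[^:]*:[ \t]*/, "", val)                 # drop header name
            gsub(/[ \t]+/, " ", val)                      # collapse whitespace
            print val
        }
    ' "$1"
}

# Succeed if the given X-Spam-Status value lists a BAYES_* rule.
has_bayes_test() {
    case "$1" in
        *tests=*BAYES_[0-9][0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (the path is a placeholder):
#   for f in /path/to/Maildir/cur/*; do
#       has_bayes_test "$(spam_status_of "$f")" || echo "no Bayes result: $f"
#   done
```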
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 2:22 PM, Bowie Bailey wrote: > On 1/16/2013 1:18 PM, Ben Johnson wrote: >> >> On 1/16/2013 11:00 AM, John Hardin wrote: >>> On Wed, 16 Jan 2013, Ben Johnson wrote: >>> >>>> Is it possible that the training I've been doing over the last week or >>>> so wasn't *effective* until recently, say, after restarting some >>>> component of the mail stack? My understanding is that calling SA via >>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data >>>> to be up-to-date on each call to spamassassin. >>> That shouldn't be the case. SA and sa-learn both use a shared-access >>> database; if you're training the database that SA is learning, the >>> results of training should be effective immediately. >>> >> Okay, good. Bowie's response to this question differed (he suggested >> that Amavis would need to be restarted for Bayes to be updated), but I'm >> pretty sure that restarting Amavis is not necessary. It seems unlikely >> that Amavis would copy the entire Bayes DB (which is stored in MySQL on >> this server) into memory every time that the Amavis service is started. >> To do so seems self-defeating: more RAM usage, worse performance, etc. > > Actually, I was making a general observation. > > For cases where you would normally need to restart spamd, you will need > to restart amavis. This includes things like rule and configuration > changes. > > Bayes data is read dynamically from your MySQL database and thus does > not require a restart of amavis/spamd when updated. > My apologies, Bowie. I misinterpreted your response. Thank you very much for the follow-up and for the clear explanation. Best regards, -Ben
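One way to check that training really lands in the database SA reads is to compare the nspam/nham counters before and after a training run. The sketch below parses "sa-learn --dump magic" output; it assumes the usual layout in which the counter value is the third whitespace-separated column of the line ending in "non-token data: nspam" (or nham, ntokens), so verify that against the output of your own version. The function name is hypothetical, and the command needs to run as the same user and with the same configuration that Amavis uses.

```shell
#!/bin/sh
# Sketch only. Reads `sa-learn --dump magic` output on stdin and prints the
# value of one counter (e.g. nspam, nham, ntokens). Assumes the value is in
# the third column; the function name is hypothetical.
magic_counter() {
    awk -v name="$1" '
        /non-token data:/ && $NF == name { print $3; exit }
    '
}

# Usage:
#   before=$(sa-learn --dump magic | magic_counter nspam)
#   ... train some spam ...
#   after=$(sa-learn --dump magic | magic_counter nspam)
#   echo "nspam went from $before to $after"
```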
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 11:00 AM, John Hardin wrote: > On Wed, 16 Jan 2013, Ben Johnson wrote: > >> On 1/15/2013 5:22 PM, John Hardin wrote: >>> On Tue, 15 Jan 2013, Ben Johnson wrote: >>>> >>>> Wow! Adding several more reject_rbl_client entries to the >>>> smtpd_recipient_restrictions directive in the Postfix configuration >>>> seems to be having a tremendous impact. The amount of spam coming >>>> through has dropped by 90% or more. This was a HUGELY helpful >>>> suggestion, John! >>> >>> Which ones are you using now? There are DNSBLs that are good, but not >>> quite good enough to trust as hard-reject SMTP-time filters. That's why >>> SA does scored DNSBL checks. >> >> smtpd_recipient_restrictions = >> reject_rbl_client bl.spamcop.net, >> reject_rbl_client list.dsbl.org, >> reject_rbl_client sbl-xbl.spamhaus.org, >> reject_rbl_client cbl.abuseat.org, >> reject_rbl_client dul.dnsbl.sorbs.net, > > Several of those are combined into ZEN. If you use Zen instead you'll > save some DNS queries. See the Spamhaus link I provided earlier for > details, I don't offhand remember which ones go into ZEN. Per Noel's advice, I have shortened the list (dsbl.org is defunct) and acted upon your mutual suggestion regarding ZEN: reject_rbl_client bl.spamcop.net, reject_rbl_client zen.spamhaus.org, reject_rbl_client dnsbl.sorbs.net, Indeed, block entries for all three lists are being registered in the mail log. Very nice. It seems as though adding these SMTP-time rejects has blocked about 1/2 of the spam that was coming through previously. Awesome. >> These are "hard rejects", right? So if this change has reduced spam, >> said spam would not be accepted for delivery at all; it would be >> rejected outright. Correct? (And if I understand you, this is part of >> your concern.) > > Correct. 
> >> The reason I ask, and a point that I should have clarified in my last >> post, is that the *volume* of spam didn't drop by 90% (although, it may >> have dropped by some measure), but rather the accuracy with which SA >> tagged spam was 90% higher. > > That's odd. That suggests you SA wasn't looking up those DNSBLs, or they > would have contributed to the score. > > Check your trusted networks setting. One difference between SMTP-time > and SA-time DNSBL checks is that SMTP-time checks the IP address of the > client talking to the MTA, while SA-time can go back up the relay chain > if necessary (e.g. to check the client IP submitting to your ISP if your > ISP's MTA is between your MTA and the Internet, rather than always > checking your ISP's MTA IP address). Are you referring to SA's "trusted_networks" directive? If so, it is commented-out (presumably by default). Does this need to be set? I've read the info re: trusted_networks at http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html , but I'm struggling to understand it. If the info is helpful, I have a very simple setup here: a single server with a single public IP address and a single MTA. >> Ultimately, I'm wondering if the observed change was simply a product of >> these message "campaigns" being black-listed after a few days of >> circulation, and not the Postfix configuration change. > > Maybe. > >> At this point, the vast majority of X-Spam-Status headers include Razor2 >> and Pyzor tests that contribute significantly to the score. I should >> have mentioned earlier that I installed Razor2 and Pyzor after making my >> initial post. The only reasons I didn't are that a) they didn't seem to >> be making a significant difference for the first day or so after I >> installed them (this could be for the snowshoe reasons we've already >> discussed), and b) the low Bayes scores seemed to be the real problem >> anyway. 
>> >> That said, the Bayes scores seem to be much more accurate now, too. I >> was hardly ever seeing BAYES_99 before, but now almost all spam messages >> have BAYES_99. > > Odd. SMTP-time hard rejects shouldn't change that. That's what I figured. I wonder if feeding all of the messages that I "auto-learned manually" -- messages that were tagged as spam (but for reasons unrelated to Bayes) -- contributed significantly to this change. I did this late yesterday afternoon and when I took a status check this morning, I was seeing BAYES_99 for almost every message. >> Is it possible that the training I've been doing over the last week or >> so wasn't *effective* until recently, say, after restarting some >
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 2:02 AM, Tom Hendrikx wrote: > On 1/15/13 5:26 PM, Ben Johnson wrote: > >> >> In postfix's main.cf: >> > >> >> Hmm, very interesting. No, I have no greylisting in place as yet, and >> no, my userbase doesn't demand immediate delivery. I will look into >> greylisting further. > > If you're running postfix, consider using postscreen. It's a recent > addition to postfix that also can behave in a greylisting alike way, and > much more. > > Read: http://www.postfix.org/POSTSCREEN_README.html > > -- > Tom > Thanks for the suggestion, Tom! Unfortunately, I'm stuck on Postfix 2.7 for a while yet, and Postscreen is available for versions >= 2.8 only. I will definitely look into it once I'm on 2.8+, however. Cheers, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote: > On Tue, 15 Jan 2013, Ben Johnson wrote: > >> >> >> On 1/15/2013 1:55 PM, John Hardin wrote: >>> On Tue, 15 Jan 2013, Ben Johnson wrote: >>> >>>> On 1/14/2013 8:16 PM, John Hardin wrote: >>>>> On Mon, 14 Jan 2013, Ben Johnson wrote: >>>>> >>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in >>>>> place? Or >>>>> are they all performed by SA? >>>> >>>> In postfix's main.cf: >>>> >>>> smtpd_recipient_restrictions = permit_mynetworks, >>>> permit_sasl_authenticated, check_recipient_access >>>> mysql:/etc/postfix/mysql-virtual_recipient.cf, >>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net >>>> >>>> Do you recommend something more? >>> >>> Unfortunately I have no experience administering Postfix. Perhaps one of >>> the other listies can help. >> >> Wow! Adding several more reject_rbl_client entries to the >> smtpd_recipient_restrictions directive in the Postfix configuration >> seems to be having a tremendous impact. The amount of spam coming >> through has dropped by 90% or more. This was a HUGELY helpful >> suggestion, John! > > Which ones are you using now? There are DNSBLs that are good, but not > quite good enough to trust as hard-reject SMTP-time filters. That's why > SA does scored DNSBL checks. smtpd_recipient_restrictions = reject_rbl_client bl.spamcop.net, reject_rbl_client list.dsbl.org, reject_rbl_client sbl-xbl.spamhaus.org, reject_rbl_client cbl.abuseat.org, reject_rbl_client dul.dnsbl.sorbs.net, I acquired this list from the article that I cited a few responses back. It is quite possible that some of these are obsolete, as the article is from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is obsolete, but now I can't find the source. These are "hard rejects", right? So if this change has reduced spam, said spam would not be accepted for delivery at all; it would be rejected outright. Correct? (And if I understand you, this is part of your concern.) 
The reason I ask, and a point that I should have clarified in my last post, is that the *volume* of spam didn't drop by 90% (although, it may have dropped by some measure), but rather the accuracy with which SA tagged spam was 90% higher. Ultimately, I'm wondering if the observed change was simply a product of these message "campaigns" being black-listed after a few days of circulation, and not the Postfix configuration change. At this point, the vast majority of X-Spam-Status headers include Razor2 and Pyzor tests that contribute significantly to the score. I should have mentioned earlier that I installed Razor2 and Pyzor after making my initial post. The only reasons I didn't are that a) they didn't seem to be making a significant difference for the first day or so after I installed them (this could be for the snowshoe reasons we've already discussed), and b) the low Bayes scores seemed to be the real problem anyway. That said, the Bayes scores seem to be much more accurate now, too. I was hardly ever seeing BAYES_99 before, but now almost all spam messages have BAYES_99. Is it possible that the training I've been doing over the last week or so wasn't *effective* until recently, say, after restarting some component of the mail stack? My understanding is that calling SA via Amavis, which does not need/use the spamd daemon, forces all Bayes data to be up-to-date on each call to spamassassin. It bears mention that I haven't yet dumped the Bayes DB and retrained using my corpus. I'll do that next and see where we land once the DB is repopulated. >>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. >>>> They do so unsupervised. Why this could be a problem is obvious. And >>>> no, >>>> I don't retain their submissions. I probably should. 
I wonder if I can >>>> make a few slight modifications to the shell script that Antispam >>>> calls, >>>> such that it simply sends a copy of the message to an administrator >>>> rather than calling sa-learn on the message. >>> >>> That would be a very good idea if the number of users doing training is >>> small. At the very least, the messages should be captured to a permanent >>> corpus mailbox. >> >> Good idea! I'll see if I can set this up. >> >>> Do your users also train ham? Are the procedures similar enough that >>> your users could become easily confused? >> >> They do. The p
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:39 PM, Bowie Bailey wrote: > On 1/15/2013 4:27 PM, Ben Johnson wrote: >> On 1/15/2013 4:05 PM, Bowie Bailey wrote: >>> On 1/15/2013 3:47 PM, Ben Johnson wrote: >>>> One final question on this subject (sorry...). >>>> >>>> Is there value in training Bayes on messages that SA classified as spam >>>> *due to other test scores*? In other words, if a message is classified >>>> as SPAM due to a block-list test, but the message is new enough for >>>> Bayes to assign a zero score, should that message be kept and fed to >>>> sa-learn so that Bayes can soak-up all the tokens from a message >>>> that is >>>> almost certainly spam (based on the other tests)? >>>> >>>> Am I making any sense? >>> It is always worthwhile to train Bayes. In an ideal world, you would >>> hand-sort and train every email that comes through your system. The >>> more mail Bayes sees the more accurate it can be. >>> >> Thanks, Bowie. Given your response, would it then be prudent to call >> "sa-learn --spam" on any message that *other tests* (non-Bayes tests) >> determine to be spam (given some score threshold)? > > That is exactly what the autolearn setting does. I let my system run > with the default autolearn settings. Some people adjust the thresholds > and some people prefer to turn off autolearn and do purely manual training. > >> The crux of my question/point is that I don't want to have to feed >> messages that Bayes "misses" but that other tests identify *correctly* >> as spam to "sa-learn --spam". > > At one point, I had a script running on my server that looked for > messages that were marked as spam with a low Bayes rating (BAYES_00 to > BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60 > to BAYES_99). I was then able to check the messages and learn them > properly. This let me learn from the edge cases that were not being > scored properly by Bayes while still making it to the correct folder due > to other rules. 
> > If you do this, you MUST check the messages yourself prior to learning > since there is no other way to know whether they should be learned as > ham or spam. > >> Is there value in implementing something like this? Or is there some >> caveat that would make doing so self-defeating? > > I find that Bayes autolearn works quite well for me, but others have had > problems with it. > Ah... I get it. Finally. :) Excellent info here; thanks again! You guys are heroes... seriously. Best regards, -Ben
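The kind of edge-case finder Bowie describes might be sketched as follows. This is not his script, only an illustration under the thresholds he mentions (spam verdict with BAYES_00 to BAYES_40, ham verdict with BAYES_60 to BAYES_99). It takes an X-Spam-Status value as its argument; the function name and the three output words are hypothetical.

```shell
#!/bin/sh
# Sketch only. Prints "review" when the overall verdict and the Bayes bucket
# disagree, "ok" when they agree (or Bayes is neutral), and "no-bayes" when
# no BAYES_* rule is listed. The function name is hypothetical.
bayes_review_flag() {
    verdict=${1%%,*}          # "Yes" or "No": the text before the first comma
    bucket=$(printf '%s\n' "$1" | sed -n 's/.*BAYES_\([0-9][0-9]\).*/\1/p')
    if [ -z "$bucket" ]; then
        echo no-bayes
        return
    fi
    bucket=${bucket#0}        # "05" -> "5", "00" -> "0"
    case "$verdict" in
        Yes) [ "$bucket" -le 40 ] && { echo review; return; } ;;
        No)  [ "$bucket" -ge 60 ] && { echo review; return; } ;;
    esac
    echo ok
}
```

Per Bowie's warning, "review" means exactly that: the message goes in front of a human before it is passed to sa-learn as either ham or spam.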
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:05 PM, Bowie Bailey wrote: > On 1/15/2013 3:47 PM, Ben Johnson wrote: >> One final question on this subject (sorry...). >> >> Is there value in training Bayes on messages that SA classified as spam >> *due to other test scores*? In other words, if a message is classified >> as SPAM due to a block-list test, but the message is new enough for >> Bayes to assign a zero score, should that message be kept and fed to >> sa-learn so that Bayes can soak-up all the tokens from a message that is >> almost certainly spam (based on the other tests)? >> >> Am I making any sense? > > It is always worthwhile to train Bayes. In an ideal world, you would > hand-sort and train every email that comes through your system. The > more mail Bayes sees the more accurate it can be. > Thanks, Bowie. Given your response, would it then be prudent to call "sa-learn --spam" on any message that *other tests* (non-Bayes tests) determine to be spam (given some score threshold)? The crux of my question/point is that I don't want to have to feed messages that Bayes "misses" but that other tests identify *correctly* as spam to "sa-learn --spam". Is there value in implementing something like this? Or is there some caveat that would make doing so self-defeating? Thanks a bunch, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
One final question on this subject (sorry...). Is there value in training Bayes on messages that SA classified as spam *due to other test scores*? In other words, if a message is classified as SPAM due to a block-list test, but the message is new enough for Bayes to assign a zero score, should that message be kept and fed to sa-learn so that Bayes can soak-up all the tokens from a message that is almost certainly spam (based on the other tests)? Am I making any sense? Thanks again! -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 1:55 PM, John Hardin wrote: > On Tue, 15 Jan 2013, Ben Johnson wrote: > >> On 1/14/2013 8:16 PM, John Hardin wrote: >>> On Mon, 14 Jan 2013, Ben Johnson wrote: >>> >>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or >>> are they all performed by SA? >> >> In postfix's main.cf: >> >> smtpd_recipient_restrictions = permit_mynetworks, >> permit_sasl_authenticated, check_recipient_access >> mysql:/etc/postfix/mysql-virtual_recipient.cf, >> reject_unauth_destination, reject_rbl_client bl.spamcop.net >> >> Do you recommend something more? > > Unfortunately I have no experience administering Postfix. Perhaps one of > the other listies can help. Wow! Adding several more reject_rbl_client entries to the smtpd_recipient_restrictions directive in the Postfix configuration seems to be having a tremendous impact. The amount of spam coming through has dropped by 90% or more. This was a HUGELY helpful suggestion, John! >>> http://www.greylisting.org/ >> >> Hmm, very interesting. No, I have no greylisting in place as yet, and >> no, my userbase doesn't demand immediate delivery. I will look into >> greylisting further. > > One other thing you might try is publishing an SPF record for your > domain. There is anecdotal evidence that this reduces the raw spam > volume to that domain a bit. We do publish SPF records for the domains within our control. The need to do this arose when senderbase.org, et. al., began blacklisting domains without SPF records. So, we're good there. >> Given this information, it concerns me that Bayes scores hardly seem >> to budge when I feed sa-learn nearly identical messages 3+ times. >> We'll get into that below. >> >>>> If so, then I guess the only remedy here is to focus on why Bayes seems >>>> to perform so miserably. >>> >>> Agreed. 
>>> >>>> It must be a configuration issue, because I've sa-learn-ed messages >>>> that are incredibly similar for two days now and not only do their >>>> Bayes scores not change significantly, but sometimes they decrease. >>>> And I have a hard time believing that one of my users is sa-train-ing >>>> these messages as ham and negating my efforts. >>> >>> This is why you retain your Bayes training corpora: so that if Bayes >>> goes off the rails you can review your corpora for misclassifications, >>> wipe and retrain. Do you have your training corpora? Or do you discard >>> messages once you've trained them? >> >> I had the good sense to retain the corpora. > > Yay! > >>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or >>> do you review their submissions? And if the process is automated, do you >>> retain what they have provided for training so that you can go back >>> later and do a troubleshooting review? >> >> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. >> They do so unsupervised. Why this could be a problem is obvious. And no, >> I don't retain their submissions. I probably should. I wonder if I can >> make a few slight modifications to the shell script that Antispam calls, >> such that it simply sends a copy of the message to an administrator >> rather than calling sa-learn on the message. > > That would be a very good idea if the number of users doing training is > small. At the very least, the messages should be captured to a permanent > corpus mailbox. Good idea! I'll see if I can set this up. > Do your users also train ham? Are the procedures similar enough that > your users could become easily confused? They do. The procedure is implemented via Dovecot's Antispam plug-in. Basically, moving mail from Inbox to Junk trains it as spam, and moving mail from Junk to Inbox trains it as ham. I really like this setup (Antispam + calling SA through Amavis [i.e. 
not using spamd]) because the results are effective immediately, which seems to be crucial for combating this snowshoe spam (performance and scalability aside). I don't find that procedure to be confusing, but people are different, I suppose. >>> Do you have autolearn turned on? My opinion is that autolearn is only >>> appropriate for a large and very diverse userbase where a sufficiently >>> "common" corpus of ham can't be manually collected. but then, I don't >>> admin a Really Large Install, so YMMV. >> >> No, I was sure to disable autolearn after the last Bayes fiasco. :) > > OK. > >>> Do you use per-user or s
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 8:16 PM, John Hardin wrote: > On Mon, 14 Jan 2013, Ben Johnson wrote: > >> I understand that snowshoe spam may not hit any net tests. I guess my >> confusion is around what, exactly, classifies spam as "snowshoe". > > http://www.spamhaus.org/faq/section/Glossary > > Basically, a large number of spambots sending the message so that no one > sending IP can be easily tagged as evil. > > Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or > are they all performed by SA? In postfix's main.cf: smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf, reject_unauth_destination, reject_rbl_client bl.spamcop.net Do you recommend something more? > Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject > SMTP-time DNS check in your MTA. It is well-respected and very reliable. > One thing it includes is ranges of IP addresses that should not ever be > sending email, so it may help reduce snowshoe spam. > > http://www.spamhaus.org/zen/ This article looks to be pretty thorough: http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/ I'll add Spamhaus ZEN and a few others to the list. > Another tactic that many report good results from is Greylisting. Do you > have greylisting in place? Does your userbase demand no delays in mail > delivery? In addition to blocking spam from spambots that do not retry, > it can delay mail enough for the BLs to get a chance to list new > IPs/domains, which can reduce the leakage if you happen to be at the > leading edge of a new delivery campaign. > > http://www.greylisting.org/ Hmm, very interesting. No, I have no greylisting in place as yet, and no, my userbase doesn't demand immediate delivery. I will look into greylisting further. >> Are most/all of the BL services hash-based? 
> > Generally: > > DNSBL: Blacklist of IP addresses > URIBL: Blacklist of domain and host names appearing in URIs > EMAILBL: (not widely used) Blacklist of email addresses (e.g. > phishing response addresses) > Razor, Pyzor: Blacklist of message content checksums/hashes Perfect; that answers my question. >> In other words, if a known spam message was added yesterday, will it >> be considered "snowshoe" spam if the spammer sends the same message >> today and changes only one character within the body? > > No, the diverse IP addresses are the hallmark of "snowshoe", not so much > the specific message content. If you see identical or generally-similar > (e.g.) pharma spam coming from a wide range of different IP addresses, > that's snowshoe. I see. Given this information, it concerns me that Bayes scores hardly seem to budge when I feed sa-learn nearly identical messages 3+ times. We'll get into that below. >> If so, then I guess the only remedy here is to focus on why Bayes seems >> to perform so miserably. > > Agreed. > >> It must be a configuration issue, because I've sa-learn-ed messages >> that are incredibly similar for two days now and not only do their >> Bayes scores not change significantly, but sometimes they decrease. >> And I have a hard time believing that one of my users is sa-train-ing >> these messages as ham and negating my efforts. > > This is why you retain your Bayes training corpora: so that if Bayes > goes off the rails you can review your corpora for misclassifications, > wipe and retrain. Do you have your training corpora? Or do you discard > messages once you've trained them? I had the good sense to retain the corpora. > _Do_ you allow your users to train Bayes? Do they do so unsupervised or > do you review their submissions? And if the process is automated, do you > retain what they have provided for training so that you can go back > later and do a troubleshooting review? 
Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in. They do so unsupervised. Why this could be a problem is obvious. And no, I don't retain their submissions. I probably should. I wonder if I can make a few slight modifications to the shell script that Antispam calls, such that it simply sends a copy of the message to an administrator rather than calling sa-learn on the message. > Do you have autolearn turned on? My opinion is that autolearn is only > appropriate for a large and very diverse userbase where a sufficiently > "common" corpus of ham can't be manually collected. but then, I don't > admin a Really Large Install, so YMMV. No, I was sure to disable autolearn after the last Bayes fiasco. :) > Do you use per-user or sitewide Bayes? If per-
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 7:48 PM, Noel wrote: > On 1/14/2013 2:59 PM, Ben Johnson wrote: > >> I understand that snowshoe spam may not hit any net tests. I guess my >> confusion is around what, exactly, classifies spam as "snowshoe". > > Snowshoe spam - spreading a spam run across a large number of IPs so > no single IP is sending a large volume. Typically also combined > with "natural language" text, RFC compliant mail servers, verified > SPF and DKIM, business-class ISP with FCrDNS, and every other > criteria to look like a legit mail source. This type of spam is > difficult to catch. > > http://www.spamhaus.org/faq/section/Glossary#233 > and countless other links if you ask google. > >> Are most/all of the BL services hash-based? In other words, if a known >> spam message was added yesterday, will it be considered "snowshoe" spam >> if the spammer sends the same message today and changes only one >> character within the body? > > No, most all DNS blacklists are based on IP reputation. Check each > list's website for their listing policy to see how an IP gets on > their list; generally honeypot email addresses or trusted user > reports. Most lists require some number of reports before listing > an IP to prevent false positives; snowshoe spammers take advantage > of this. > >> If so, then I guess the only remedy here is to focus on why Bayes seems >> to perform so miserably. > > Sounds as if your bayes has been improperly trained in the past. > You might do better to just delete the bayes db and start over with > hand-picked spam and ham. > > > > -- Noel Jones > jdow, Noel, and John, I can't thank you enough for your very thorough responses. Your time is valuable and I sincerely appreciate your willingness to help. John, I'll respond to you separately, for the sake of keeping this organized. > Ben, do be aware that sometimes you draw the short straw and sit at the > very start of the spam distribution cycle. 
In those cases the BLs will > generally not have been alerted yet so they may not trigger. For those > situations the rules should be your friends. (I still use my treasured > set of SARE rules and personally hand crafted rules my partner and I > have created that fit OUR needs but may not be good general purpose > rules.) This makes perfect sense and underscores the importance of a finely-tuned rule-set. It's become apparent just how dynamic and capable a monster the spam industry is. No one approach will ever be a panacea, it seems. The advice from your second email is well-received, too. Especially the part about not killing anybody. ;) I do hope fighting spam becomes fun for me, because so far, it's been an uphill battle! Hehe. Noel, thanks for excellent responses to my questions. > Sounds as if your bayes has been improperly trained in the past. > You might do better to just delete the bayes db and start over with > hand-picked spam and ham. I hope not, because this is my second go-round with the Bayes DB. The first time (as Mr. Hardin may remember), auto-learning was enabled out-of-the-box and some misconfiguration or another (seemingly related to DNSWL_* rules) caused a lot of spam to be learned as ham. With John's help, I corrected the issues (I hope), which I'll detail in my reply to John. Thanks again, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 2:49 PM, RW wrote: > On Mon, 14 Jan 2013 13:24:55 -0500 > Ben Johnson wrote: > > >> A clear pattern has emerged: the X-Spam-Status headers for very >> obviously spammy messages never contain evidence that network tests >> contributed to their SA scores. >> >> Ultimately, I need to know whether: >> >> a.) Network tests are not being run at all for these messages >> >> b.) Network tests are being run, but are failing in some way >> >> c.) Network tests are being run, and are succeeding, but return >> responses that do not contribute to the messages' scores >> >> I've had a look at the log entries to which I link in my previous >> message and I just need a little help interpreting the "dns" and >> "async" messages. > > As I said before, it's not unusual for snowshoe spam to hit no net > tests at all. Also obvious spam isn't any more likely to be in a > blocklist than less obvious spam. > > However, try adding this to your SpamAssassin configuration, and > restart the appropriate daemon: > > header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', > 'ipv4.fahq2.com.') > tflags RCVD_IN_HITALL net > score RCVD_IN_HITALL 0.001 > > > It should add a dns test that is hit for all mail delivered from an > IPv4 address. > Thanks, RW. I understand that snowshoe spam may not hit any net tests. I guess my confusion is around what, exactly, classifies spam as "snowshoe". Are most/all of the BL services hash-based? In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body? If so, then I guess the only remedy here is to focus on why Bayes seems to perform so miserably. It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. 
And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts. I have ensured that the spam token count increases when I train these messages. That said, I do notice that the token count does not *always* change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1 message(s) examined)". Does this mean that all tokens from these messages have already been learned, thereby making it pointless to continue feeding them to sa-learn? If I receive one more uncaught message about how some mom is angering doctors by doing something crazy to her face, I'm going to hunt-down the er and rip her face OFF. Finally, I added the test you supplied to my SA configuration, restarted Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001. Thanks for all your help, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/11/2013 4:27 PM, Ben Johnson wrote: > I enabled Amavis's SA debugging mode on the server in question and was > able to extract the debug output for two messages that seem like they > should definitely be classified as spam. > > Message #1: http://pastebin.com/xLMikNJH > > Message #2: http://pastebin.com/Ug78tPrt > > A couple points of note and a couple of questions: > > a.) There seems to be plenty of network activity, but I don't see any > "results" (for lack of a better term) for those queries. The final > X-Spam-Status header that is generated looks like this: > > No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8, > RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled > > Does the absence of network tests in the resultant header simply mean > that none of the network tests contributed to the score? If so, why > might that be? Are these messages simply "too new" to appear in any > blacklists? > > b.) The scores for both messages are identical, which, I suppose, is not > surprising, given that the same exact tests were performed and produced > the same exact results. Is this normal? > > c.) 45 minutes after receiving Message #2 from above, I received a very > similar message. The subjects varied only in dollar amount advertised, > and the bodies vary only in the hyperlink URLs and the footer/signature. > > Here's the debug output: http://pastebin.com/sLMgXrf5 > > The second message was scored at 14.75, which seems much better. Of > course, the second score was so much higher because the > network/blacklist tests contributed significantly. > > Is the conclusion to be drawn the same as in a) (these messages are "too > new" to appear in blacklists)? > > One final point of concern on this item: the Bayes score for the first > of the two emails was BAYES_50=0.8, and I fed the message through > sa-learn as spam shortly after it arrived. Yet, the Bayes score for the > second message was BAYES_40=-0.001 -- *lower* than the first. How could > this be? 
Is there some rational explanation? > > Thanks for all the help here, guys! > > -Ben Nobody? A clear pattern has emerged: the X-Spam-Status headers for very obviously spammy messages never contain evidence that network tests contributed to their SA scores. Ultimately, I need to know whether: a.) Network tests are not being run at all for these messages b.) Network tests are being run, but are failing in some way c.) Network tests are being run, and are succeeding, but return responses that do not contribute to the messages' scores I've had a look at the log entries to which I link in my previous message and I just need a little help interpreting the "dns" and "async" messages. Thanks for any insight, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 3:13 PM, Tom Hendrikx wrote: > On 10-01-13 19:55, Ben Johnson wrote: >> >> >> On 1/10/2013 1:06 PM, RW wrote: >>> On Thu, 10 Jan 2013 12:48:07 -0500 >>> Ben Johnson wrote: >>>> Upon further consideration, this behavior makes perfect sense if the >>>> mailbox user has moved the message from Inbox to Junk between scans; >>>> Dovecot's Antispam filter is in use on this server. This action would >>>> cause the message tokens to be added to the Bayes database, which >>>> explains why the SA score is higher on subsequent scans, even with >>>> network tests disabled. >>> >>> Also by turning-off network tests you switch to a different score set so >>> the score for RDNS_NONE rose. >>> >> >> Ahh; I didn't realize that disabling network tests changes the score set >> entirely. Thanks for the clarification there. >> >> So, at this point, I'm struggling to understand how the following happened. >> >> Over the course of 15 minutes, I received the same exact message four >> times. Each time, the message was sent to the same recipient mailbox. >> The "From" and "Return-Path" headers changed slightly each time, but the >> message bodies appear to be identical. 
>> >> Here are the X-Spam-Status headers for each message: >> >> 1:28 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:35 PM >> >> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, >> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled >> >> 1:36 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:41 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> Questions: >> >> 1.) I have a fairly well-trained Bayes DB; why on earth does a message >> with the subject "Cash Quick? Get up to 1500 Now", and an equally >> nefarious body, trigger BAYES_00? > > This will solely depend on the contents of your bayes db. Is this shared > between users, etc etc. No good answer ready without looking at it. Yes, the Bayes DB is shared between users. But it seems that focusing on the "low-hanging fruit" (the network test issues) will be more productive in the short term. >> 2.) Why weren't network tests performed on message 2 of 4? 
This seems to >> be evidence of the fact that network tests are not being performed some >> percentage of the time, which could very well be at the root of this >> whole problem. > > The fact that not a single network test was triggered, is indeed > suspicious. The DNSBL tests are of course sender dependent, but > if the body is the same the URIBL stuff should fire. Maybe your DNS > queries timed out because your DNS setup is borked? Maybe you should > temporarily enable debug logging for dns lookups in spamassassin? > I enabled Amavis's SA debugging mode on the server in question and was able to extract the debug output for two messages that seem like they should definitely be classified as spam. Message #1: http://pastebin.com/xLMikNJH Message #2: http://pastebin.com/Ug78tPrt A couple points of note and a couple of questions: a.) There seems to be plenty of network activity, but I don't see any "results" (for lack of a better term) for those queries. The final X-Spam-Status header that is generated looks like this: No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled Does the absence of network tests in the resultant header simply mean that none of the network tests contributed to the score? If so, why might that be? Are these messages simply "too new" to appear in an
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 4:12 PM, John Hardin wrote: > On Thu, 10 Jan 2013, Ben Johnson wrote: > >> So, at this point, I'm struggling to understand how the following >> happened. >> >> Over the course of 15 minutes, I received the same exact message four >> times. Each time, the message was sent to the same recipient mailbox. >> The "From" and "Return-Path" headers changed slightly each time, but the >> message bodies appear to be identical. >> >> Here are the X-Spam-Status headers for each message: >> >> 1:28 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:35 PM >> >> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, >> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled >> >> 1:36 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> 1:41 PM >> >> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, >> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, >> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, >> URIBL_WS_SURBL=1.608] autolearn=disabled >> >> Questions: >> >> 1.) I have a fairly well-trained Bayes DB; why on earth does a message >> with the subject "Cash Quick? Get up to 1500 Now", and an equally >> nefarious body, trigger BAYES_00? >> >> 2.) Why weren't network tests performed on message 2 of 4? 
This seems to >> be evidence of the fact that network tests are not being performed some >> percentage of the time, which could very well be at the root of this >> whole problem. > > How many MTAs do you have? Is it possible the low-scoring one went via a > different MTA? Just one; there should be no possibility of that. > Have you stopped amavisd, killed all of the amavis processes, and > restarted it? > > I have now. And I enabled amavis's $sa_debug option, so we should see a lot more in the way of useful SA debugging information now. In fact, I was just able to capture the output that I believe we're after, and I'll paste a link in my response to RW's message (shortly forthcoming). Thanks, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 1:06 PM, RW wrote: > On Thu, 10 Jan 2013 12:48:07 -0500 > Ben Johnson wrote: >> Upon further consideration, this behavior makes perfect sense if the >> mailbox user has moved the message from Inbox to Junk between scans; >> Dovecot's Antispam filter is in use on this server. This action would >> cause the message tokens to be added to the Bayes database, which >> explains why the SA score is higher on subsequent scans, even with >> network tests disabled. > > Also by turning-off network tests you switch to a different score set so > the score for RDNS_NONE rose. > Ahh; I didn't realize that disabling network tests changes the score set entirely. Thanks for the clarification there. So, at this point, I'm struggling to understand how the following happened. Over the course of 15 minutes, I received the same exact message four times. Each time, the message was sent to the same recipient mailbox. The "From" and "Return-Path" headers changed slightly each time, but the message bodies appear to be identical. 
Here are the X-Spam-Status headers for each message: 1:28 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled 1:35 PM No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled 1:36 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled 1:41 PM Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, URIBL_WS_SURBL=1.608] autolearn=disabled Questions: 1.) I have a fairly well-trained Bayes DB; why on earth does a message with the subject "Cash Quick? Get up to 1500 Now", and an equally nefarious body, trigger BAYES_00? 2.) Why weren't network tests performed on message 2 of 4? This seems to be evidence of the fact that network tests are not being performed some percentage of the time, which could very well be at the root of this whole problem. Thanks, -Ben
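For anyone following the arithmetic in the headers above, here is a minimal sketch (not part of the original exchange) that parses the tests=[...] list from the 1:28 PM and 1:35 PM headers and reports which rules are absent from the second one. The function and variable names are made up for illustration; the header values are copied from the message above.

```python
import re


def parse_tests(status):
    """Return {rule_name: score} from the tests=[...] part of an X-Spam-Status value."""
    match = re.search(r"tests=\[(.*?)\]", status, re.S)
    if match is None:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        name, _, score = item.strip().partition("=")
        tests[name] = float(score)
    return tests


# Header values as posted for the 1:28 PM and 1:35 PM deliveries.
MSG_1_28 = (
    "Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, "
    "RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, "
    "T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, "
    "URIBL_WS_SURBL=1.608] autolearn=disabled"
)
MSG_1_35 = (
    "No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, "
    "SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled"
)

hits_1 = parse_tests(MSG_1_28)
hits_2 = parse_tests(MSG_1_35)

# Rules that fired at 1:28 PM but not at 1:35 PM, and their combined score.
missing = {name: score for name, score in hits_1.items() if name not in hits_2}
gap = round(sum(missing.values()), 3)

print(sorted(missing))
print(gap)
```

The rules missing from the 1:35 PM header are exactly the RCVD_IN_* and URIBL_* hits, and their combined score accounts for the whole difference between 7.008 and -0.374. That is consistent with question 2 above (the blocklist lookups contributed nothing for that one delivery), though it does not say why.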
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 12:18 PM, Ben Johnson wrote: > > > On 1/10/2013 11:49 AM, RW wrote: >> On Thu, 10 Jan 2013 11:43:44 -0500 >> Ben Johnson wrote: >> >> >>> This observation begs the question: why are network tests being >>> performed for some messages but not others? To my knowledge, no >>> white/gray/black listing has been done on this box. >> >> As has already been said, the score from network tests is commonly a >> lot higher on retesting because of all the reporting that happened >> in-between. >> > > RW, > > I understand that, but that doesn't explain why if I retest a given > message by calling SpamAssassin directly, and I *disable network tests*, > the score is sometimes *higher* than when the message was scanned > initially with AMaViS. > > When this message came through initially, the X-Spam-Status header was: > > No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8, > HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled > > About an hour later, I fed the same message to the spamassassin > executable, while disabling network tests: > > # spamassassin -L -t -D < /tmp/msg.txt > > Content analysis details: (5.0 points, 5.0 required) > > pts rule name description > -- > -- > 3.8 BAYES_99 BODY: Bayes spam probability is 99 to 100% > [score: 1.] > 0.0 HTML_MESSAGE BODY: HTML included in message > 1.2 RDNS_NONE Delivered to internal network by a host with > no rDNS > > To restate the question, if network tests are not outright disabled in > Amavis, why is Amavis returning lower scores than the SA binary does > when called directly with network tests disabled? Shouldn't the SA score > with network tests disabled *always* be lower than or equal to the > Amavis score with network tests enabled (provided that all else is equal)? > > Or am I way off-base here? 
> > Thanks again, > > -Ben > Upon further consideration, this behavior makes perfect sense if the mailbox user has moved the message from Inbox to Junk between scans; Dovecot's Antispam filter is in use on this server. This action would cause the message tokens to be added to the Bayes database, which explains why the SA score is higher on subsequent scans, even with network tests disabled. Sorry... I'm still trying to wrap my head around all of this. Lots of moving parts. -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 11:49 AM, RW wrote: > On Thu, 10 Jan 2013 11:43:44 -0500 > Ben Johnson wrote: > > >> This observation begs the question: why are network tests being >> performed for some messages but not others? To my knowledge, no >> white/gray/black listing has been done on this box. > > As has already been said, the score from network tests is commonly a > lot higher on retesting because of all the reporting that happened > in-between. > RW, I understand that, but that doesn't explain why if I retest a given message by calling SpamAssassin directly, and I *disable network tests*, the score is sometimes *higher* than when the message was scanned initially with AMaViS. When this message came through initially, the X-Spam-Status header was: No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled About an hour later, I fed the same message to the spamassassin executable, while disabling network tests: # spamassassin -L -t -D < /tmp/msg.txt Content analysis details: (5.0 points, 5.0 required) pts rule name description -- -- 3.8 BAYES_99 BODY: Bayes spam probability is 99 to 100% [score: 1.] 0.0 HTML_MESSAGE BODY: HTML included in message 1.2 RDNS_NONE Delivered to internal network by a host with no rDNS To restate the question, if network tests are not outright disabled in Amavis, why is Amavis returning lower scores than the SA binary does when called directly with network tests disabled? Shouldn't the SA score with network tests disabled *always* be lower than or equal to the Amavis score with network tests enabled (provided that all else is equal)? Or am I way off-base here? Thanks again, -Ben
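To make the comparison in the message above concrete, here is a small sketch (names are illustrative, not from the thread) that puts the two result sets side by side. The numbers are copied from the Amavis header and from the "spamassassin -L -t" report; the report prints scores rounded to one decimal, so treat the second set as approximate.

```python
# Rule hits from the original Amavis scan (network tests enabled).
amavis_scan = {
    "BAYES_50": 0.8,
    "HTML_MESSAGE": 0.001,
    "RDNS_NONE": 0.793,
    "SPF_PASS": -0.001,
}

# Rule hits from the later "spamassassin -L -t" run (network tests disabled).
local_scan = {
    "BAYES_99": 3.8,
    "HTML_MESSAGE": 0.0,
    "RDNS_NONE": 1.2,
}


def total(hits):
    """Sum the rule scores, rounded the way the headers report them."""
    return round(sum(hits.values()), 3)


def differences(before, after):
    """List (rule, score_before, score_after) for every rule that changed."""
    rows = []
    for rule in sorted(set(before) | set(after)):
        old = before.get(rule)
        new = after.get(rule)
        if old != new:
            rows.append((rule, old, new))
    return rows


print(total(amavis_scan), total(local_scan))
for rule, old, new in differences(amavis_scan, local_scan):
    print(rule, old, new)
```

Two things stand out: the Bayes hit moved from BAYES_50 to BAYES_99, and RDNS_NONE itself scored higher in the second run. Neither change comes from a network test, so "all else is equal" does not hold between the two scans, and the local-only score need not be the lower one.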
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 9:13 PM, John Hardin wrote: > On Wed, 9 Jan 2013, Ben Johnson wrote: > >> On 1/9/2013 7:36 PM, wolfgang wrote: >>> >>>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_S >>>> PAM, URIBL_JP_SURBL autolearn=disabled version=3.3.1 >>> >>> I am not familiar with amavis, but I know that it calls spamassassin in >>> a special way, depending on the amavis config. Wild guess: could it be >>> that RBL/URIBL queries are disabled in your amavis config? >> >> Thanks for the reply. >> >> What you say about the RBL/URIBL tests makes sense. > > Check your amavis configuration to see whether you have network tests > disabled. That's the simplest explanation. > Thanks, John. On the surface, network tests appear to be enabled: # grep -ir sa_local_tests_only /etc/amavis /etc/amavis/conf.d/20-debian_defaults:$sa_local_tests_only = 0;# only tests which do not require internet access? Also, some of the incoming messages do contain network test scoring data in the X-Spam-Status header; here are two examples: Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793, SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7] autolearn=disabled Yes, score=12.266 tagged_above=-999 required=2 tests=[BAYES_50=0.8, DATE_IN_FUTURE_12_24=3.199, DIET_1=0.001, HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.7, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25] autolearn=disabled (Several of those are network tests, right?) 
What's strange is that another message was delivered at nearly the same time as the above two, yet it shows no evidence of network tests being performed (right?): No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled It seems as though the SPAM that slips through never shows evidence of network tests, whereas the SPAM that is caught (and usually has a high score -- 10 or higher) always seems to show evidence of network tests. This observation begs the question: why are network tests being performed for some messages but not others? To my knowledge, no white/gray/black listing has been done on this box. Thanks again, -Ben
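A rough way to check headers like the ones above in bulk is sketched below. It rests on an assumption: rule names beginning with RCVD_IN_ or URIBL_ are treated as DNS blocklist hits purely by naming convention, which is a heuristic and not SpamAssassin's own classification (that is done with "tflags ... net" in the rule definitions). SPF evaluation also depends on DNS, so it is deliberately left out of the heuristic. The function name is made up for illustration.

```python
import re

# Assumed naming convention for DNS blocklist rules; a heuristic only.
BLOCKLIST_RULE = re.compile(r"\b((?:RCVD_IN_|URIBL_)\w+)=")


def blocklist_hits(status):
    """Return the rule names in an X-Spam-Status value that look like blocklist hits."""
    return BLOCKLIST_RULE.findall(status)


# Two of the headers quoted in the message above.
caught = (
    "Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5, "
    "RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793, "
    "SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7] "
    "autolearn=disabled"
)
slipped = (
    "No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8, "
    "HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled"
)

print(blocklist_hits(caught))
print(blocklist_hits(slipped))
```

An empty result only shows that no blocklist rule matched. It cannot distinguish between lookups that were never made, lookups that failed, and lookups that succeeded but found nothing listed.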
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 7:36 PM, wolfgang wrote: > On 2013-01-10 01:03, Ben Johnson wrote: > >> I see; I saved the email message out of Thunderbird (with View -> >> Headers -> All), as a plain text file. Apparently, that process >> butchers the original message. > > In Thunderbird, rather use File > Save as to save the entire message. > >> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_S >> PAM, URIBL_JP_SURBL autolearn=disabled version=3.3.1 > > Rules based on RBL/URIBL checks depend on DNS based blacklist queries. > And between the time you first receive an email and the time you > re-scan it, the originating client IP and/or URIs from the mail body > may have been added to the black lists after you first received the > mail. Did you re-scan the mail with amavis, too, or did you post the > X-Spam header lines from the original amavis scan and re-scan the mail > with spamassassin significantly later? > > I am not familiar with amavis, but I know that it calls spamassassin in > a special way, depending on the amavis config. Wild guess: could it be > that RBL/URIBL queries are disabled in your amavis config? > > Hope this helps. > > Cheers, > > wolfgang > Hi, Wolfgang, Thanks for the reply. What you say about the RBL/URIBL tests makes sense. I did not rescan the message with amavis; I posted the X-Spam-Status header contents from the original scan. The only reason for which I did not rescan the message with Amavis is that I don't know how to perform a SpamAssassin scan through Amavis in a manual capacity. And I can't find instructions regarding the process. All of that said, less than eight hours elapsed between the original scan with Amavis and the manual scan with "spamassassin". But, that's probably long enough for the IP addresses to be blacklisted. If nobody knows how to scan messages through Amavis, maybe I need to take this question over to the Amavis list for the time being. Thanks again, -Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 5:36 PM, RW wrote: > On Wed, 09 Jan 2013 17:14:05 -0500 > Ben Johnson wrote: > >> About five months ago, I experienced a problem that I *thought* I had >> resolved, but I am observing similar behavior after retraining the >> Bayes database. While the symptoms are similar, the root cause seems >> to be different (thankfully). The original problem is documented at >> http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html >> .. >> >> In any case, I am again seeing SA scores that seem way too low for the >> message content in question. My "glue", as it were, is Amavis-New. >> >> In particular, certain messages that are clearly SPAM are scored >> between 0 and 3 when processed via Amavis. However, if I process the >> same messages with the "spamassassin" binary, directly, the scores >> are much higher and much more in-line with what one would expect. >> ... >> When I process the same message with spamassassin, directly >> (spamassassin -t -D < /tmp/msg.txt), the header looks like this: >> >> -- >> X-Spam-Status: Yes, score=7.5 required=5.0 >> tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS >> autolearn=disabled version=3.3.1 > > > This is not better, it indicates that SA didn't recognise it as an > email, not that it recognised it as a spam. Whatever /tmp/msg.txt was > it wasn't a properly formatted email. > Thanks for the quick replies, Marius and RW. I see; I saved the email message out of Thunderbird (with View -> Headers -> All), as a plain text file. Apparently, that process butchers the original message. I'm reviewing SA's behavior using an email client to view the messages, but I also have access to the mailbox on the server. I realize that this question may seem amateurish, but how does one discern the "message ID" from the email client and locate the corresponding file in the user's "Maildir"? I'm using Dovecot 1.x. 
The file names in the user's Maildir look like this:

1357762471.M952293P32429.example.com,S=4300,W=4381:2,

I assume that the first bit is a UNIX timestamp. Is there any means by which to correlate the second bit (M952293P32429) to the message as I see it in my email client (Thunderbird)? I don't see that string anywhere in the headers (maybe that's by design).

In other words, when I spot a message in my Inbox that SA seems to be scoring incorrectly, how do I track down the actual file on the server that should be fed into "spamassassin"? Is there some better method than doing something like

# grep -ir 20B2834E4242 /var/vmail/example.com/user/Maildir

where 20B2834E4242 is the ID in the "Received" header?

In any case, I tracked down the original message on the server and repeated the process (spamassassin -t < /tmp/msg.txt):

--
X-Spam-Status: Yes, score=9.3 required=5.0 tests=BAYES_50,HTML_MESSAGE,
	RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
	URIBL_JP_SURBL autolearn=disabled version=3.3.1

[...]

Content analysis details:   (9.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.4 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [188.165.126.107 listed in zen.spamhaus.org]
 1.0 RCVD_IN_CSS            RBL: Received via a relay in Spamhaus CSS
 2.7 RCVD_IN_PSBL           RBL: Received via a relay in PSBL
                            [188.165.126.107 listed in psbl.surriel.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: ehylle.info]
 1.4 RCVD_IN_BRBL_LASTEXT   RBL: RCVD_IN_BRBL_LASTEXT
                            [188.165.126.107 listed in bb.barracudacentral.org]
 1.7 URIBL_DBL_SPAM         Contains an URL listed in the DBL blocklist
                            [URIs: ehylle.info]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5428]
--

So, if I've done this correctly, the score discrepancy is even larger.

Thanks, guys!

-Ben
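One way to approach the lookup asked about above is to match on the Message-ID header (which the mail client can display) rather than grepping for a queue ID. The following is only a sketch: both function names are invented for illustration, it assumes the standard Maildir layout with messages under "cur" and "new", and it assumes, as the post does, that the leading field of the file name is a UNIX timestamp.

```python
import os
from datetime import datetime, timezone
from email.parser import BytesHeaderParser


def find_by_message_id(maildir, message_id):
    """Return paths of messages under maildir whose Message-ID matches."""
    wanted = message_id.strip().strip("<>")
    parser = BytesHeaderParser()
    matches = []
    for root, _dirs, files in os.walk(maildir):
        # Only "cur" and "new" hold delivered messages; skip "tmp" and
        # Dovecot's index/control files in the folder roots.
        if os.path.basename(root) not in ("cur", "new"):
            continue
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                headers = parser.parse(fh)
            found = str(headers.get("Message-ID", "")).strip().strip("<>")
            if found == wanted:
                matches.append(path)
    return matches


def delivery_time(filename):
    """Interpret the leading field of a Maildir file name as epoch seconds."""
    seconds = int(os.path.basename(filename).split(".", 1)[0])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

The script has to run as a user that can read the mail directories (on the setup described in this thread, that means root or vmail). The path it returns can then be passed straight to "spamassassin -t", which avoids the reformatting that saving from the mail client introduces.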
Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
About five months ago, I experienced a problem that I *thought* I had resolved, but I am observing similar behavior after retraining the Bayes database. While the symptoms are similar, the root cause seems to be different (thankfully). The original problem is documented at http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html .

In any case, I am again seeing SA scores that seem way too low for the message content in question. My "glue", as it were, is Amavis-New.

In particular, certain messages that are clearly SPAM are scored between 0 and 3 when processed via Amavis. However, if I process the same messages with the "spamassassin" binary, directly, the scores are much higher and much more in line with what one would expect.

The X-Spam-Status header, when processed via Amavis, looks like this:

X-Spam-Status: No, score=0.8 tagged_above=-999 required=2
	tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001]
	autolearn=disabled

When I process the same message with spamassassin, directly (spamassassin -t -D < /tmp/msg.txt), the header looks like this:

--
X-Spam-Status: Yes, score=7.5 required=5.0
	tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
	autolearn=disabled version=3.3.1

[...]

Content analysis details:   (7.5 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.2 MISSING_HEADERS        Missing To: header
 2.0 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]
 1.2 MISSING_MID            Missing Message-Id: header
 1.3 MISSING_SUBJECT        Missing Subject: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 1.8 MISSING_DATE           Missing Date: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822 headers
--

In short, my question is, how is the message scoring 0.8 in one case and 7.5 in another? That is a massive discrepancy.
From what I can tell, the same tests aren't even being performed in each case. I have to assume that the options that are passed to SA are wildly different in each case.

It bears mention that the server in question uses ISPConfig 3. ISPConfig allows for SA policies to be configured per-domain and per-user, and Amavis leverages MySQL to make that happen. If relevant, I can provide more information about this aspect of my setup.

These are the only directives that I've added to /etc/spamassassin/local.cf:

--
bayes_path /var/lib/amavis/.spamassassin/bayes

use_bayes 1
bayes_auto_expire 0
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn DBI:mysql:sa_bayes:localhost
bayes_sql_username sa_user
bayes_sql_password [scrubbed]
bayes_sql_override_username amavis
--

Given the first directive, SA should always use the same Bayes database (the one I've configured in MySQL), regardless of how SA is called, right?

For those curious about the state of the Bayes database, here's the output from "sa-learn --dump magic" (sorry for the wrapping):

0.000          0          3          0  non-token data: bayes db version
0.000          0       2007          0  non-token data: nspam
0.000          0       6554          0  non-token data: nham
0.000          0     188379          0  non-token data: ntokens
0.000          0 1356345829          0  non-token data: oldest atime
0.000          0 1357769317          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1357727978          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0       3191          0  non-token data: last expire reduction count

Ultimately, it seems that I should be trying to figure out how, exactly, Amavis is calling SpamAssassin in the course of normal operation.

Thanks for any help here, folks!

-Ben
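Since the question is which tests ran in one scan but not the other, a small script can pull the rule names out of both X-Spam-Status values and diff them. This is a sketch only: the function name is made up, and the parsing is based solely on the two header layouts shown in this post (Amavis writes a bracketed list with per-rule scores, spamassassin a bare comma-separated list).

```python
import re

# Amavis style: tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, ...]
_BRACKETED = re.compile(r"tests=\[([^\]]*)\]")
# spamassassin style: tests=BAYES_50,MISSING_DATE,... (may be folded after commas)
_PLAIN = re.compile(r"tests=(\w+(?:,\s*\w+)*)")


def rule_names(status):
    """Return the set of rule names listed in an X-Spam-Status value."""
    match = _BRACKETED.search(status) or _PLAIN.search(status)
    if not match:
        return set()
    names = set()
    for item in match.group(1).split(","):
        # Drop the "=score" suffix that the Amavis layout attaches.
        name = item.strip().split("=", 1)[0]
        if name:
            names.add(name)
    return names
```

With the two headers above, `rule_names(direct) - rule_names(amavis)` contains only MISSING_* and NO_* rules, all of which describe the input file itself, while `rule_names(amavis) - rule_names(direct)` is HTML_MESSAGE and SPF_PASS. A diff that looks like this suggests the two scans were not given the same message, before any difference in SA options comes into play.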
Re: Try to run sa-learn
On 10/4/2012 2:06 PM, troxlinux wrote: > Hi list , I try to run sa-learn on centos 6.3 but no work > > sa-learn --spam --showdots /dir/dir/domain.com.ni/spam/.spam/cur/ > > Learned tokens from 0 message(s) (1 message(s) examined) > ERROR: the Bayes learn function returned an error, please re-run with > -D for more information at /usr/bin/sa-learn line 493. > > any idea ? , is a bug? , selinux is disabled Well, did you do what the error message suggested (run 'sa-learn' with the -D switch)? What's the relevant output? > my version of spamassassin > spamassassin-3.3.2-4.el6.rfx.x86_64 > > > regardss >
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 10:26 AM, Axb wrote: > On 08/22/2012 04:10 PM, Ben Johnson wrote: >> >> I did end-up overriding the bayes_path, which provided a workaround for >> the permissions issues. Cheers to the suggestion. > > This is not a workaround, it's common practice in many types of setups > and documented, but due to numerous reasons can't be set as a default. > If the install routine would require/create a > /etc/mail/spamassassin/bayes path it could bite "other" systems than > standard Linux distros. > (note to myself: discuss this in dev list) Right; it makes sense that this path cannot have a default value (other than ~/...). That said, it seems that for some users (myself included), setting this path manually is a critical step in creating a maximally functional (that is, Bayes-enabled) SpamAssassin installation. This would be especially true if the SA developers were to change the "bayes_auto_learn" default value to zero, or lower the default value for "bayes_auto_learn_threshold_nonspam" (as a result of my "incident" here). For this reason, it seems prudent for developers/contributors to take one of two actions (or both): 1.) Add the "bayes_path" directive to the default/stock "local.cf" that ships with SpamAssassin, in a commented-out state. I realize that this file may be maintainer/distribution specific, and that there are attendant challenges associated with such a change. This measure would underscore the directive's importance for the administrator who is configuring the software. 2.) Where possible, modify the SpamAssassin installer package to prompt the user for the "bayes_path" during installation. These types of prompts are common among related packages. For example, Postfix asks for all kinds of information during its installation (on Debian-based systems, anyway). 
Again, I realize that the SA developers likely have no control over how the software is packaged and delivered, so if this point seems valid, I am happy to open distro-specific bug reports (or feature requests). Thanks, Axb. -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 9:43 AM, John Hardin wrote: > On Wed, 22 Aug 2012, Bowie Bailey wrote: > >> On 8/21/2012 5:51 PM, Ben Johnson wrote: >>> >>> What good is the --username switch, then? Thanks for the follow-up, John! > See other responses. > >>> Why does this command train the "root" user's database? > > Because you ran the command as root. > > I apologize, I didn't provide sufficient details. When I said "train as > the user who runs SA" I meant "su to that OS user ID before running the > sa-learn command". No apology necessary; I knew what you meant, and did indeed try running the sa-learn command as "root", initially, but the problem then was a lack of access to the mail directories. On Debian/Ubuntu systems, when using Dovecot, all mail directories are vmail:vmail owned, with 700 permissions, which prevents the "amavis" user from having access to them. (This is by design, I'm sure, and makes sense.) > You can either override the default Bayes database files path to > explicitly specify a shared global database as has been suggested, or > run sa-learn as the amavis user via su or a cron job. I did end-up overriding the bayes_path, which provided a workaround for the permissions issues. Cheers to the suggestion. Defining a global > bayes database is probably a better solution overall, but bear in mind > if you have to wipe and retrain you need to check the permissions on the > new database files after you run sa-learn the first time. > This is an important point; thanks for articulating it. All appears to be well in SpamAssassin Town for the time being (don't think you've heard the last of me, though!). Thanks to everyone who shared his or her expertise. -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/22/2012 9:05 AM, Bowie Bailey wrote: > On 8/21/2012 5:51 PM, Ben Johnson wrote: >> >> On 8/21/2012 5:19 PM, John Hardin wrote: >>> On Tue, 21 Aug 2012, Ben Johnson wrote: >>> >>>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O >>>> /var/lib/amavis/.spamassassin/bayes_toks >>>> >>>> ---8<-- >>>> # sa-learn --username=amavis --dump magic >>> Run that with --debug and verify that the filenames match. >>> >> Sure enough, they don't match: >> >> ---8<-- >> [...] >> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks >> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen >> Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3 >> 0.000 0 3 0 non-token data: bayes db version >> 0.000 0 95 0 non-token data: nspam >> 0.000 0307 0 non-token data: nham >> 0.000 0 62301 0 non-token data: ntokens >> 0.000 0 1345469997 0 non-token data: oldest atime >> 0.000 0 1345579297 0 non-token data: newest atime >> 0.000 0 0 0 non-token data: last journal >> sync atime >> 0.000 0 0 0 non-token data: last expiry atime >> 0.000 0 0 0 non-token data: last expire >> atime delta >> 0.000 0 0 0 non-token data: last expire >> reduction count >> ---8<-- >> >> So, I suppose that I didn't actually resolve the problem from yesterday, >> which was that I cannot seem to train under the "amavis" user due to the >> ownership/permissions on the /var/vmail directory. >> >> What good is the --username switch, then? >> >> Why does this command train the "root" user's database? >> >> # sa-learn --username=amavis --spam "/path/to/spam" >> >> And why does this command dump the "root" user's database? >> >> # sa-learn --username=amavis --dump magic >> >> Thanks very much, > > As has already been mentioned, the '--username' option is only useful if > you're using SQL. You should set your bayes_path so there is no confusion. Thank you Axb and Bowie for clarifying this point. 
Perhaps the sa-learn documentation should be updated to eliminate the ambiguity around this switch. In particular, I am referring to this page: http://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html , which states only the following: "If specified this username will override the username taken from the runtime environment. You can use this option to specify users in a virtual user configuration." Maybe adding the "SQL" keyword will make the "virtual user configuration" distinction more evident. > Since you have been training the root database, you may want to copy > that one over. > > $ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/ > > Then fix the permissions and ownership back to what they should be for > the amavis user. I did think to do this, but I approached it a bit differently, and used "sa-learn --backup" (and --restore), under the "amavis" user account, which mitigated the need to modify the permissions on the database. > Then set the bayes path in your local.cf: > > bayes_path /var/lib/amavis/.spamassassin/bayes > > (Don't double the 'bayes' at the end as was suggested previously unless > you want to move the bayes files into a 'bayes' directory) > > Restart amavis and try again... > Again, thanks to Axb and Bowie for making this suggestion. Hard-coding the bayes_path was the missing link for me; this is what allowed me to train under the "amavis" user while having "root" (or "vmail") privileges, which on Debian, are necessary to read mail during training. I think I'm sorted here; thanks again, guys! -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/21/2012 5:19 PM, John Hardin wrote:
> On Tue, 21 Aug 2012, Ben Johnson wrote:
>
>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
>> /var/lib/amavis/.spamassassin/bayes_toks
>>
>> ---8<--
>> # sa-learn --username=amavis --dump magic
>
> Run that with --debug and verify that the filenames match.
>

Sure enough, they don't match:

---8<--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        307          0  non-token data: nham
0.000          0      62301          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345579297          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8<--

So, I suppose that I didn't actually resolve the problem from yesterday, which was that I cannot seem to train under the "amavis" user due to the ownership/permissions on the /var/vmail directory.

What good is the --username switch, then?

Why does this command train the "root" user's database?

# sa-learn --username=amavis --spam "/path/to/spam"

And why does this command dump the "root" user's database?

# sa-learn --username=amavis --dump magic

Thanks very much,

-Ben
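The filename check suggested above can be automated by extracting the "tie-ing to DB file" paths from two debug runs and comparing them. A rough sketch, with invented function names, that relies only on the debug line format quoted in this thread:

```python
import re

# Matches e.g. "dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks"
_TIE = re.compile(r"bayes: tie-ing to DB file \S+ (\S+)")


def bayes_db_paths(debug_output):
    """Return the sorted, de-duplicated DB file paths named in -D output."""
    return sorted(set(_TIE.findall(debug_output)))


def same_database(first_run, second_run):
    """True when both debug captures opened exactly the same Bayes files."""
    return bayes_db_paths(first_run) == bayes_db_paths(second_run)
```

Feeding it the scanner's debug output (captured as the amavis user) and the output of "sa-learn --debug --dump magic" would have flagged the /var/lib/amavis versus /root mismatch shown above without reading the logs by eye.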
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 2:47 PM, Ben Johnson wrote: > I was able to resolve the issue by adding the --username switch to the > 'sa-learn' executable: > > # sa-learn --username=amavis --spam > /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur > > Thanks for all of the hints, folks! So, I've been training SpamAssassin like a mad-man for a couple of days. I don't have over 200 spams and 200 hams, so I don't expect Bayes to be used yet (and it's not), but the following output is puzzling (particularly, "only 0 spam(s) in bayes DB < 200"): ---8<-- # su amavis -c "spamassassin -D -t < /usr/share/doc/spamassassin/examples/sample-spam.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'" Aug 21 13:08:33.717 [23714] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x213613f8), bayes_store_module=Mail::SpamAssassin::BayesStore::DBM Aug 21 13:08:33.728 [23714] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2153b400) Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen Aug 21 13:08:33.730 [23714] dbg: bayes: found bayes db version 3 Aug 21 13:08:33.730 [23714] dbg: bayes: DB journal sync: last sync: 0 Aug 21 13:08:33.730 [23714] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200 Aug 21 13:08:33.730 [23714] dbg: bayes: untie-ing Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen Aug 21 13:08:33.733 [23714] dbg: bayes: found bayes db version 3 Aug 21 13:08:33.733 [23714] dbg: bayes: DB journal sync: last sync: 0 Aug 21 13:08:33.733 [23714] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200 Aug 21 13:08:33.733 [23714] dbg: bayes: untie-ing ---8<-- Restarting Amavis does not change the output 
above. And the output below seems to contradict it (95 spams and 300 hams, rather than 0 spams):

---8<--
# sa-learn --username=amavis --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        300          0  non-token data: nham
0.000          0      59420          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345577900          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8<--

Am I doing something silly?

Thanks for any help,

-Ben
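Reading the counters out of the dump mechanically makes it harder to misread the columns. The sketch below parses "sa-learn --dump magic" output and checks both counters against the 200-message minimum that the debug log mentions. The function names are hypothetical, and the parser assumes the layout shown in this thread, where the value sits in the third column of each "non-token data" line.

```python
def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value."""
    counts = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue  # skip debug lines and anything that is not a magic row
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        # Layout assumed from the thread: prob, spam count, value, atime.
        if len(parts) >= 3:
            counts[label.strip()] = int(parts[2])
    return counts


def bayes_ready(counts, minimum=200):
    """True when both nspam and nham have reached the minimum."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum
```

Running this against the dump from the account the scanner uses, and again against the dump from the account used for training, shows immediately whether the two commands are looking at the same counters.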
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 2:02 PM, Ben Johnson wrote: > > > On 8/20/2012 12:56 PM, Bowie Bailey wrote: >> On 8/20/2012 12:46 PM, Axb wrote: >>> On 08/20/2012 06:42 PM, Ben Johnson wrote: >>>> >>>> On 8/17/2012 11:28 AM, John Hardin wrote: >>>>> On Fri, 17 Aug 2012, Ben Johnson wrote: >>>>> >>>>>> On 8/16/2012 2:00 PM, Ben Johnson wrote: >>>>>> Basically, I need to do something about the spam inundation, as >>>>>> soon as >>>>>> possible. >>>>>> >>>>>> Is there any reason that I should NOT be performing the sa-learn >>>>>> training under the "amavis" user account? >>>>> In general, all training should be done as the user that SA (in your >>>>> case, SA via Amavis) is running as. >>>> I have tried to do this, but to no avail: >>>> >>>> --- >>>> # su amavis -c 'sa-learn --spam >>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' >>>> >>>> archive-iterator: no access to >>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at >>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. >>>> archive-iterator: no access to >>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at >>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. >>>> archive-iterator: unable to open >>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 >>>> --- >>> ~/Maildir/* assumes 1 file=1 mail >>> >>> pls try >>> >>> su amavis -c 'sa-learn --spam --progress --dir >>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/' >>> >>> or wherever the message are stored >> >> But first, you need access to the files. The simplest way is probably >> to add the amavis user account to the group used by the mail directories. >> >> Assuming the group is "vmail", the command should look like this (on >> RedHat/CentOS): >> >> $ usermod -a -G vmail amavis > > Thanks, guys. I did consider adding the "amavis" user to the "vmail" > group, but the default permissions on the directories within "Maildir" > are 700 (with vmail:vmail ownership). 
> > So, I'd have to fiddle with the permissions on the entire directory > tree, for each user, which seems like a bad idea. > > Furthermore, ISPconfig handles the creation (and deletion) of these > directories, so I hesitate to change anything manually and muck-up the > installation. > > While there may be permissions mask that is applied, modifying it seems > risky. > > I wonder what the rest of the Dovecot + Amavis + SA world is doing about > this. Maybe I should ask on the Amavis mailing list. > > If anyone has other suggestions, by all means, please do share. > >> This command will probably need to be run as root. If you are using a >> different distro, you will need to look up the command to add the amavis >> user to the vmail group. >> > > Much thanks, > > -Ben > I was able to resolve the issue by adding the --username switch to the 'sa-learn' executable: # sa-learn --username=amavis --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur Thanks for all of the hints, folks! -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/20/2012 12:56 PM, Bowie Bailey wrote: > On 8/20/2012 12:46 PM, Axb wrote: >> On 08/20/2012 06:42 PM, Ben Johnson wrote: >>> >>> On 8/17/2012 11:28 AM, John Hardin wrote: >>>> On Fri, 17 Aug 2012, Ben Johnson wrote: >>>> >>>>> On 8/16/2012 2:00 PM, Ben Johnson wrote: >>>>> Basically, I need to do something about the spam inundation, as >>>>> soon as >>>>> possible. >>>>> >>>>> Is there any reason that I should NOT be performing the sa-learn >>>>> training under the "amavis" user account? >>>> In general, all training should be done as the user that SA (in your >>>> case, SA via Amavis) is running as. >>> I have tried to do this, but to no avail: >>> >>> --- >>> # su amavis -c 'sa-learn --spam >>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' >>> >>> archive-iterator: no access to >>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at >>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. >>> archive-iterator: no access to >>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at >>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. >>> archive-iterator: unable to open >>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 >>> --- >> ~/Maildir/* assumes 1 file=1 mail >> >> pls try >> >> su amavis -c 'sa-learn --spam --progress --dir >> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/' >> >> or wherever the message are stored > > But first, you need access to the files. The simplest way is probably > to add the amavis user account to the group used by the mail directories. > > Assuming the group is "vmail", the command should look like this (on > RedHat/CentOS): > > $ usermod -a -G vmail amavis Thanks, guys. I did consider adding the "amavis" user to the "vmail" group, but the default permissions on the directories within "Maildir" are 700 (with vmail:vmail ownership). So, I'd have to fiddle with the permissions on the entire directory tree, for each user, which seems like a bad idea. 
Furthermore, ISPConfig handles the creation (and deletion) of these directories, so I hesitate to change anything manually and muck up the installation.

While there may be a permissions mask that is applied, modifying it seems risky.

I wonder what the rest of the Dovecot + Amavis + SA world is doing about this. Maybe I should ask on the Amavis mailing list.

If anyone has other suggestions, by all means, please do share.

> This command will probably need to be run as root. If you are using a
> different distro, you will need to look up the command to add the amavis
> user to the vmail group.
>

Much thanks,

-Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/17/2012 11:28 AM, John Hardin wrote: > On Fri, 17 Aug 2012, Ben Johnson wrote: > >> On 8/16/2012 2:00 PM, Ben Johnson wrote: >> Basically, I need to do something about the spam inundation, as soon as >> possible. >> >> Is there any reason that I should NOT be performing the sa-learn >> training under the "amavis" user account? > > In general, all training should be done as the user that SA (in your > case, SA via Amavis) is running as. I have tried to do this, but to no avail: --- # su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam' archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539. archive-iterator: no access to /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771. archive-iterator: unable to open /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 --- This seems to occur because the virtual mail directory permissions do not provide the "amavis" user with the required access level. The "vmail" user is the only user with any type of access to /var/vmail/example.com/user/Maildir. I suspect that there is a good reason for this and that the ownership/permissions should not be changed. I've done some research on this issue and there isn't much to be found. This archived thread ( http://marc.info/?l=amavis-user&m=116457786312019 ) discusses overriding the Bayes user with "bayes_sql_override_username amavis", but that doesn't solve the problem (obviously). I still see the same permission errors, although the need to use the 'su' wrapper does go away. Is there a conventional means by which to deal with this issue? > If you have your system configured for per-user Bayes databases, then > you'd need to train as the user whose database you want to affect. 
The system in question leverages ISPConfig 3, which implements virtual users/mailboxes, although, I don't know if ISPConfig configures Amavis to utilize individual Bayes databases or if there's an individual database for the "amavis" user. I can check with the developers. > What is your bayes_path config? I don't see this directive anywhere on the system in question; perhaps a default value is being used. The only instance of that string exists in a source file: /usr/share/perl5/Mail/SpamAssassin/Conf.pm:=item bayes_path /path/filename (default: ~/.spamassassin/bayes) So, presumably, "bayes_path" is equating to "~/.spamassassin/bayes", or in my case, "/var/lib/amavis/.spamassassin". >> Would doing so preclude me from creating training folders for >> individual IMAP users in the future? > > They're not related. Per-user ham and spam training folders doesn't > preclude using those messages for training a global Bayes database. Understood. > You actually may want to implement a hybrid folder model: per-user ham > training folders and a global spam training folder. Misclassified ham > could potentially be private messages that the recipient doesn't want > other users to see, but for misclassified spam who cares? Right, that makes sense. >> Or can I train under the "amavis" user for now and then "layer-on" >> training for individual IMAP users in the future without undesirable >> consequences? > > As stated above, if you're not enabling per-user Bayes *databases*, the > question is meaningless. Are you going to configure per-user Bayes > databases? Or (as I suspect is more likely) perform global database > training from individual users whose judgement you trust? > I suppose that I need to determine whether or not ISPConfig implements per-user Bayes database already. I'll report-back for those who may be curious. Thanks again, -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 2:00 PM, Ben Johnson wrote: > In any event, at this point, I'm confused as to which user account I > should be using when executing "sa-learn --spam", for example. > > As a bit of background, I'm using ISPConfig 3, which implements virtual > mailbox users via MySQL. > > I dug through the mailing list archive and found > http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html > , which seems to be relevant. > > Ultimately, I'm wondering if I should be using the "amavis" user to > learn ham/spam, or individual mailbox user accounts. > > If it is possible to use either, are there pros and cons of which one > should be aware before settling on an approach? > > As I mentioned previously, I would like to set-up "LearnHam" and > "LearnSpam" folders for each IMAP user, eventually, so perhaps this > answers my question? > > Thanks again for all the help! John Hardin, sorry to bust you up here... just curious whether or not you saw the rest of my previous note. If you didn't address these questions intentionally, then please ignore me. :) Basically, I need to do something about the spam inundation, as soon as possible. Is there any reason that I should NOT be performing the sa-learn training under the "amavis" user account? Would doing so preclude me from creating training folders for individual IMAP users in the future? Or can I train under the "amavis" user for now and then "layer-on" training for individual IMAP users in the future without undesirable consequences? Thanks again, -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 12:32 PM, John Hardin wrote: > On Thu, 16 Aug 2012, Ben Johnson wrote: > >> On 8/16/2012 11:38 AM, John Hardin wrote: >>> On Thu, 16 Aug 2012, Ben Johnson wrote: >>> >>>> So, after disabling auto-learn (for now) and executing "sa-learn >>>> --clear", and restarting Amavis, I'm still seeing this: >>>> >>>> No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >>>> HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, >>>> URIBL_DBL_SPAM=1.7] autolearn=disabled >>>> >>>> Why BAYES_00 still? Am I running the wrong command to clear the >>>> database? >>> >>> That's correct. Be sure that you're running it as the same user that >>> amavis+SA is running as, otherwise you're clearing the wrong files. >>> >>> You might want to run "sa-learn --dump magic" afterwards to verify the >>> database is cleared. >> >> John, >> >> You were exactly right; I forgot to execute "sa-learn --clear" as the >> "amavis" user. >> >> What is the expected output of "sa-learn --dump magic" once the database >> has been cleared successfully? >> >> # su amavis -c 'sa-learn --dump magic' >> >> ERROR: Bayes dump returned an error, please re-run with -D for more >> information >> >> # su amavis -c 'sa-learn -D --dump magic' >> >> [...] >> dbg: bayes: no dbs present, cannot tie DB R/O: >> /var/lib/amavis/.spamassassin/bayes_toks >> [...] >> >> Is this to be expected? Or did I muck-up the works? > > Heh. I was expecting zeroes, but "no dbs present" is also a good > confirmation that the Bayes database has been reset... :) > > You might need to restart amavis now, too. 
>

So, I preemptively restarted Amavis, per your suggestion (without executing "su amavis -c 'sa-learn -D --dump magic'" first), and when I executed the aforementioned command after the restart, I received the "expected" output:

# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

All looks well. (I'm performing these actions in a test/development environment, by the way.)

So, I went to follow the same procedure in production:

# su amavis -c 'sa-learn --clear'
# service amavis restart
# su amavis -c 'sa-learn -D --dump magic'

Yet this yields that familiar message:

ERROR: Bayes dump returned an error, please re-run with -D for more information

I waited a little while (at least an hour) and tried again. Same thing. I restarted Amavis again, same thing. A few minutes later, I decided to give it one last shot, and sure enough, I received the expected output with all zeros.

It may be academic at this point, but I'm now curious as to what causes the DB file to be recreated, if not restarting Amavis. (It bears mention that plenty of mail came in between using the "--clear" switch and when using the "--dump" switch began to produce valid [all-zero] output. In other words, the DB didn't seem to be recreated when the first message was received after clearing the old DB and restarting Amavis.)

In any event, at this point, I'm confused as to which user account I should be using when executing "sa-learn --spam", for example.

As a bit of background, I'm using ISPConfig 3, which implements virtual mailbox users via MySQL.
I dug through the mailing list archive and found http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html , which seems to be relevant. Ultimately, I'm wondering if I should be using the "amavis" user to learn ham/spam, or individual mailbox user accounts. If it is possible to use either, are there pros and cons of which one should be aware before settling on an approach? As I mentioned previously, I would like to set-up "LearnHam" and "LearnSpam" folders for each IMAP user, eventually, so perhaps this answers my question? Thanks again for all the help! -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 11:38 AM, John Hardin wrote: > On Thu, 16 Aug 2012, Ben Johnson wrote: > >> So, after disabling auto-learn (for now) and executing "sa-learn >> --clear", and restarting Amavis, I'm still seeing this: >> >> No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001, >> URIBL_DBL_SPAM=1.7] autolearn=disabled >> >> Why BAYES_00 still? Am I running the wrong command to clear the database? > > That's correct. Be sure that you're running it as the same user that > amavis+SA is running as, otherwise you're clearing the wrong files. > > You might want to run "sa-learn --dump magic" afterwards to verify the > database is cleared. > John, You were exactly right; I forgot to execute "sa-learn --clear" as the "amavis" user. What is the expected output of "sa-learn --dump magic" once the database has been cleared successfully? # su amavis -c 'sa-learn --dump magic' ERROR: Bayes dump returned an error, please re-run with -D for more information # su amavis -c 'sa-learn -D --dump magic' [...] dbg: bayes: no dbs present, cannot tie DB R/O: /var/lib/amavis/.spamassassin/bayes_toks [...] Is this to be expected? Or did I muck-up the works? Thanks again, -Ben
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/16/2012 10:14 AM, Ben Johnson wrote: > > > On 8/15/2012 4:05 PM, John Hardin wrote: >> On Wed, 15 Aug 2012, Ben Johnson wrote: >> >>> On 8/15/2012 2:24 PM, John Hardin wrote: >>>> On Wed, 15 Aug 2012, Ben Johnson wrote: >>>> >>>>> Some 99% of the spam that I receive, which is grossly spammy (we're >>>>> talking auto loans, cash advances, dink pills, the whole lot) contains >>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header. >>>>> >>>>> Might anyone know why? >>>> >>>> Poor training. >>> >>> John, I can't thank you enough for the thoroughness of your response. >> >> I like to show off. :) >> >>>> Apart from the Bayes score, what kind of scores are those spams getting? >>> >>> Here are a few examples (the first two of which are two of VERY few in >>> which the BAYES_* value is over 00): >>> >>> - >>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, >>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, >>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no >>> >>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, >>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, >>> SPF_PASS=-0.001] autolearn=no >>> >>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, >>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no >>> >>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, >>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, >>> URIBL_RHS_DOB=1.514] autolearn=no >>> - >> >> It might be interesting to see some log entries where autolearn=yes... 
> > Here you go: > > No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham > > No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, > SPF_PASS=-0.001] autolearn=ham > > No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, > URIBL_DBL_SPAM=1.7] autolearn=ham > > No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, > HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, > SPF_PASS=-0.001] autolearn=ham > >>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more >>> of a negative impact (pardon the pun) than BAYES_*. I am already >>> working with the dnswl.org folks (off-list, for privacy reasons) to >>> get to the bottom of that issue. >> >> This might be a major contributing factor. If your system was taught >> from scratch by autolearn, and DNSWL (which is fairly well trusted) has >> been pushing a lot of spams to low scores... > > It looks as though this is exactly what happened. I'll post back once > I've done some more troubleshooting with the folks at dnswl.org. > >> You might want to set: >> bayes_auto_learn_threshold_nonspam -3 > > Done. > >> That won't _fix_ the problem (at least not quickly) or avoid the need to >> wipe and retrain, but it might keep things from getting worse. > > I disabled auto-learn and executed "sa-learn --clear", too. So, I should > be starting with a "clean slate", right? > > I have also disabled the DNSWL rules, until the issue can be resolved, > and will begin manual training immediately. > >> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info. >> >>> Most of the list is probably laughing, but given the complexity of Spam >>> Assassin, this crucial requirement was lost on me, amidst the sea of >>> information and instructions. 
For example, there is no mention of the >>> fact that SA is essentially useless without Bayesian training on >>> http://wiki.apache.org/spamassassin/StartUsing . >> >> That's because that shouldn't be the case. The base ruleset + URIBL >> should be very effective pretty much out-of-the-box. >> >>>> What version of SA is this? >>> >>> # spamassassin --version >>> SpamAssassin version 3.3.1 >>> running on Perl version 5.10.1 >> >> A little st
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 4:05 PM, John Hardin wrote: > On Wed, 15 Aug 2012, Ben Johnson wrote: > >> On 8/15/2012 2:24 PM, John Hardin wrote: >>> On Wed, 15 Aug 2012, Ben Johnson wrote: >>> >>>> Some 99% of the spam that I receive, which is grossly spammy (we're >>>> talking auto loans, cash advances, dink pills, the whole lot) contains >>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header. >>>> >>>> Might anyone know why? >>> >>> Poor training. >> >> John, I can't thank you enough for the thoroughness of your response. > > I like to show off. :) > >>> Apart from the Bayes score, what kind of scores are those spams getting? >> >> Here are a few examples (the first two of which are two of VERY few in >> which the BAYES_* value is over 00): >> >> - >> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, >> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, >> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no >> >> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001, >> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793, >> SPF_PASS=-0.001] autolearn=no >> >> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, >> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no >> >> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, >> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, >> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, >> URIBL_RHS_DOB=1.514] autolearn=no >> - > > It might be interesting to see some log entries where autolearn=yes... 
Here you go: No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=ham No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=ham >> It bears mention that the RCVD_IN_DNSWL_MED test is having even more >> of a negative impact (pardon the pun) than BAYES_*. I am already >> working with the dnswl.org folks (off-list, for privacy reasons) to >> get to the bottom of that issue. > > This might be a major contributing factor. If your system was taught > from scratch by autolearn, and DNSWL (which is fairly well trusted) has > been pushing a lot of spams to low scores... It looks as though this is exactly what happened. I'll post back once I've done some more troubleshooting with the folks at dnswl.org. > You might want to set: > bayes_auto_learn_threshold_nonspam -3 Done. > That won't _fix_ the problem (at least not quickly) or avoid the need to > wipe and retrain, but it might keep things from getting worse. I disabled auto-learn and executed "sa-learn --clear", too. So, I should be starting with a "clean slate", right? I have also disabled the DNSWL rules, until the issue can be resolved, and will begin manual training immediately. > See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info. > >> Most of the list is probably laughing, but given the complexity of Spam >> Assassin, this crucial requirement was lost on me, amidst the sea of >> information and instructions. 
For example, there is no mention of the >> fact that SA is essentially useless without Bayesian training on >> http://wiki.apache.org/spamassassin/StartUsing . > > That's because that shouldn't be the case. The base ruleset + URIBL > should be very effective pretty much out-of-the-box. > >>> What version of SA is this? >> >> # spamassassin --version >> SpamAssassin version 3.3.1 >> running on Perl version 5.10.1 > > A little stale, but not bad. 'Tis the major drawback with using LTS Linux distros and managing software via packages, I suppose. >>> You may also want to set up some mechanism for users to submit >>> misclassified messages for training. Depending on how much you trust >>> their judgement the learning from these can be automatic or can go >>> through you as a reviewer. >> >> That sounds like a good idea. Is there a particular HOW TO or tutorial >> that you recommend? If it depends on the environment/configuration, this >> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin. > > I'm not sure, I don't lurk the Wiki much. About the best I can suggest > is search the SA users mailing list archives for "training dovecot". > Thanks, I'll look into setting-up IMAP folders for individual users in some programmatic way. -Ben
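The status lines quoted in this message can be checked by hand: the reported score is simply the sum of the per-test scores. The following is a minimal sketch of that arithmetic, using one of the lines above; the helper name `parse_tests` is made up for illustration and is not part of SpamAssassin or Amavis.

```python
import re


def parse_tests(status):
    """Return {test_name: score} from the tests=[...] part of a status line."""
    match = re.search(r"tests=\[(.*?)\]", status, re.DOTALL)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        name, sep, value = item.strip().partition("=")
        if sep:  # skip anything that is not NAME=value
            scores[name] = float(value)
    return scores


# One of the status lines quoted in this thread.
STATUS = (
    "No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, "
    "RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7, "
    "URIBL_RHS_DOB=1.514] autolearn=no"
)

tests = parse_tests(STATUS)
total = round(sum(tests.values()), 3)

# The same message with the two misfiring negative tests left out.
ignored = ("BAYES_00", "RCVD_IN_DNSWL_MED")
without_bayes_and_dnswl = round(
    sum(score for name, score in tests.items() if name not in ignored), 3
)

print(total)                    # 1.256, matching score= in the status line
print(without_bayes_and_dnswl)  # 5.456, above the tag2=3 level
```

In other words, BAYES_00 and RCVD_IN_DNSWL_MED together take 4.2 points off this message, which is the difference between it landing below or above the tag2 level of 3.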
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 4:19 PM, Kris Deugau wrote:
> John Hardin wrote:
>> I wasn't aware that autolearning could do a cold-start of Bayes, can
>> anyone confirm whether this is the case?
>
> If you let it run long enough to pass the 200/200 ham/spam thresholds,
> yes; there's no distinction I've ever met about where the learning came
> from.
>
> That said, I wouldn't trust a pure autolearn setup with stock autolearn
> thresholds - all too much spam will get learned scoring under 0.1. :(
>
> -kgd
>

It's a bit disappointing to learn this (pardon the pun), given:

a.) This exchange between John Hardin and me, which occurred previously in this thread:

---8<--
Me:
> Most of the list is probably laughing, but given the complexity of Spam
> Assassin, this crucial requirement was lost on me, amidst the sea of
> information and instructions. For example, there is no mention of the
> fact that SA is essentially useless without Bayesian training on
> http://wiki.apache.org/spamassassin/StartUsing .

John: That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box.
---8<--

b.) The default value for bayes_auto_learn is 1 (on). (At least in my particular distribution.)

Correct me if I'm wrong, but this issue's root cause seems to be that bayes_auto_learn was on, out-of-the-box, yet I was not complementing its efficacy via sa-learn. Is this an accurate summary?

Because if so, it seems prudent to change the default bayes_auto_learn value to zero, and scorn any package maintainer or developer who modifies it, or, alternatively, put a banner, at font-size 100em, on the SpamAssassin homepage that issues an unmistakable warning about Bayesian training's importance.

(John, I'll respond to your most recent message tomorrow most likely; had enough for one day!)

Thank you,

-Ben
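To make the "learned scoring under 0.1" point concrete, here is a rough sketch of the threshold comparison that autolearn performs. It is a deliberate simplification and not SpamAssassin's implementation: it assumes the autolearn score is the sum of the listed test scores with the BAYES_* tests left out, and it ignores the separate non-Bayes score set and the extra header/body point requirements for learning as spam. The function name `autolearn_decision` is hypothetical; 0.1 and 12.0 are the stock values of bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.

```python
DEFAULT_NONSPAM_THRESHOLD = 0.1   # stock bayes_auto_learn_threshold_nonspam
DEFAULT_SPAM_THRESHOLD = 12.0     # stock bayes_auto_learn_threshold_spam


def autolearn_decision(tests,
                       nonspam_threshold=DEFAULT_NONSPAM_THRESHOLD,
                       spam_threshold=DEFAULT_SPAM_THRESHOLD):
    """Return "ham", "spam" or "no" for a {test_name: score} mapping.

    Simplified: BAYES_* tests are excluded from the sum, nothing else
    about the real autolearn logic is modelled.
    """
    learn_score = sum(
        score for name, score in tests.items() if not name.startswith("BAYES_")
    )
    if learn_score < nonspam_threshold:
        return "ham"
    if learn_score > spam_threshold:
        return "spam"
    return "no"


# A message from this thread that was logged with autolearn=ham even though
# it hit URIBL_DBL_SPAM.
message = {
    "BAYES_00": -1.9,
    "HTML_MESSAGE": 0.001,
    "RCVD_IN_DNSWL_MED": -2.3,
    "SPF_PASS": -0.001,
    "URIBL_DBL_SPAM": 1.7,
}

print(autolearn_decision(message))                        # ham
print(autolearn_decision(message, nonspam_threshold=-3))  # no
```

Under these assumptions, the DNSWL hit alone is enough to drag the message below 0.1, so it is learned as ham; with the threshold lowered to -3, as John suggested, the same message is left alone.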
Re: Very spammy messages yield BAYES_00 (-1.9)
On 8/15/2012 2:24 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
>
>> Some 99% of the spam that I receive, which is grossly spammy (we're
>> talking auto loans, cash advances, dink pills, the whole lot) contains
>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>
>> Might anyone know why?
>
> Poor training.

John, I can't thank you enough for the thoroughness of your response.

> Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two of which are two of VERY few in which the BAYES_* value is over 00):

-
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue.

>> While I have not trained the Bayesian filter manually to date,
>
> Is there any provision for any manual training in your environment? Have
> you set up training folders where your users can submit message for
> training? Do you run sa-learn at all?

No, there is no provision. No, I have not set up training folders, and no, I have not run sa-learn manually at all.
Most of the list is probably laughing, but given the complexity of Spam Assassin, this crucial requirement was lost on me, amidst the sea of information and instructions. For example, there is no mention of the fact that SA is essentially useless without Bayesian training on http://wiki.apache.org/spamassassin/StartUsing . >> how is it that the spammiest of the spam is being classified with >> BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply >> that the message is almost certainly not spam? > > BAYES_00 implies that the message in question looks very similar to > messages the Bayes system has been told are not spam. It depends solely > on how it has been trained. > > I wasn't aware that autolearning could do a cold-start of Bayes, can > anyone confirm whether this is the case? > > If it can't then someone somewhere trained bayes up to the default > minimum 200 hams and 200 spams needed for it to start classifying. > > Before we offer suggestions, some more data from you please: > > What version of SA is this? # spamassassin --version SpamAssassin version 3.3.1 running on Perl version 5.10.1 > What does "sa-learn --dump magic" report about your current Bayes database? # sa-learn --dump magic ERROR: Bayes dump returned an error, please re-run with -D for more information # su amavis -c 'sa-learn --dump magic' # su amavis -c 'sa-learn --dump magic' 0.000 0 3 0 non-token data: bayes db version 0.000 0 11499 0 non-token data: nspam 0.000 0 39412 0 non-token data: nham 0.000 0 197769 0 non-token data: ntokens 0.000 0 1344331893 0 non-token data: oldest atime 0.000 0 1345056746 0 non-token data: newest atime 0.000 0 1345053771 0 non-token data: last journal sync atime 0.000 0 1345023550 0 non-token data: last expiry atime 0.000 0 345600 0 non-token data: last expire atime delta 0.000 0 6482 0 non-token data: last expire reduction count > What are all of the bayes_* configuration options in your local config? None are defined there. 
There are a few defaults/examples, but they are commented-out. > > What will probably end up happening is this: > (1) wipe your Bayes database > (2) turn off autolearn > (3) collect several hundred hams and spams for an initial training corpus > (4) train using that corpus > (5) evaluate results > > Depending on your mail volume, once Bayes is working well after manual > training, you may then want to reenable autolearn; I personally suggest > it only where the volume is high enough and/or the character of mail is > varied enough
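For anyone reading the "sa-learn --dump magic" output above, the counts that matter are nspam and nham, which sit in the third column. The sketch below pulls them out and compares them with the 200/200 minimum John mentions. It assumes the column layout shown in this message; the names `parse_dump_magic` and `bayes_is_active` are invented for the example.

```python
import re

MIN_SPAM = 200  # default minimum learned spams, per the thread
MIN_HAM = 200   # default minimum learned hams, per the thread

# Excerpt of the "sa-learn --dump magic" output posted in this message.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
"""

LINE = re.compile(
    r"^\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$",
    re.MULTILINE,
)


def parse_dump_magic(text):
    """Map each "non-token data" label to the integer in the third column."""
    return {label: int(value) for value, label in LINE.findall(text)}


def bayes_is_active(magic):
    """True once both learned-message counts reach the default minimums."""
    return magic.get("nspam", 0) >= MIN_SPAM and magic.get("nham", 0) >= MIN_HAM


magic = parse_dump_magic(DUMP)
print(magic["nspam"], magic["nham"])  # 11499 39412
print(bayes_is_active(magic))         # True
```

With 11499 spams and 39412 hams on record and no manual training, everything in this database came from autolearn, which fits the explanation that spam was being learned as ham.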
Very spammy messages yield BAYES_00 (-1.9)
Hello,

Some 99% of the spam that I receive, which is grossly spammy (we're talking auto loans, cash advances, dink pills, the whole lot) contains "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.

Might anyone know why?

This is a stock installation (Ubuntu package on 10.04). local.cf contains

#   Bayesian classifier auto-learning (default: 1)
#
# bayes_auto_learn 1

and I have not overridden the default elsewhere. So, presumably, auto-learning is enabled (if that's even relevant).

While I have not trained the Bayesian filter manually to date, how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?

Others have run into this same problem, but I see no resolution; here is one such example:

http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/

Outside of the above forum post, search query results for this issue are scant.

Thanks for any help,

-Ben
Re: RCVD_IN_DNSWL_BLOCKED
On 8/14/2012 9:33 AM, Bowie Bailey wrote:
> On 8/14/2012 12:35 AM, JP Kelly wrote:
>> How can I disable the DNSWL rule/plugin or whatever. Not just give it
>> a low/zero score but disable it completely.
>> I am tired of seeing RCVD_IN_DNSWL_BLOCKED in my headers.
>
> If you set the score to zero, the rule will be disabled and you should
> no longer see it show up in the score report.
>
> If you want to disable the DNSWL lookup completely, you should zero out
> the main rules and the sub-rule:
>
>    score RCVD_IN_DNSWL_BLOCKED 0
>    score RCVD_IN_DNSWL_HI 0
>    score RCVD_IN_DNSWL_LOW 0
>    score RCVD_IN_DNSWL_MED 0
>    score RCVD_IN_DNSWL_NONE 0
>    score __RCVD_IN_DNSWL 0
>

Thanks, Bowie. I was wondering how to do this, too.

The majority of the spam that our users receive is a direct result of this one rule; it seems that plenty of spammers are white-listed in this database, and it is a weighty test (it reduces the score by as much as 2 or 3 points in some cases, often putting the message just below the required-for-spam score).

We have no use for it.

-Ben
Re: SpamAssassin scores and 12-letter domains
On 8/6/2012 1:32 PM, Axb wrote: > On 08/06/2012 05:25 PM, Ben Johnson wrote: > >> Given that ASF has no other public support channel, and no way to >> contact anybody to request that the filters be adjusted, what choice do >> I have beyond pushing to have the software modified? > > bare in mind: SpamAssassin is a framework and VERY flexible. > It's aiming to be the global solution for spam filtering. > > The SpamAssassin project delivers a set of rules and scores. > > These may not fit all types fo traffic, globally - with minimal skills > you can modify the ruleset to work for your setup. > Thanks, Axb. Yes, I understand that SpamAssassin is very flexible. The problem I'm describing, however, is not with my SpamAssassin configuration (in which case I would simply adjust it); it is with Apache Software Foundation's configuration (over which I have no control). I raised this issue because ASF's SpamAssassin configuration -- specifically, the 12-letter-domain check -- causes my messages to its various mailing lists to be rejected more often than not. This list is very forgiving in that the required score is 10.0, but other ASF lists require only 5.0. All of that said, it sounds like this issue will be discussed among the developers, so maybe something will be done and not all 12-letter-domain owners will be blacklisted throughout the Internet. Best regards, -Ben
Re: SpamAssassin scores and 12-letter domains
On 8/6/2012 8:01 AM, Benny Pedersen wrote:
> Den 2012-08-05 20:30, Michael Scheidell skrev:
>
>>> X-ASF-Spam-Status: No, hits=4.8 required=10.0
>>> tests=FROM_12LTRDOM,SPF_HELO_PASS,SPF_PASS,URI_HEX
>> default is 5.0, not 10.0
>
> why did ASF change it ?, did thay only change reguired ? :=)
>
>>> as you see there is long way to 10
>> .2 points to go to 5.0
>
> irrelevant on ASF
>
>> score FROM_12LTRDOM 0.099 3.499 0.099 3.499
>> is a HUGE difference, any score over 2.75 points should be suspect.
>
> spamassassin is opensource, scores is not hardcoded
>
> i think what is more needed is just more comiters with ham and spam to
> the public corpus scores is generated from, dont fight rules, change
> scores if one is not comitter
>
> this rule does not hit ham here
>

Thanks for the replies thus far.

Benny, it bears mention that not all of ASF's servers/mailing lists are configured the same way. The Apache HTTP Server mailing list requires 5.0. My best guess is that they cranked up the threshold for the SpamAssassin mailing list because, by nature, the discussion contains a lot of "spammy" content and false-positives were becoming a problem.

The fact that SpamAssassin is open-source is what's irrelevant; I have no control over how ASF configures its servers, and therefore no ability to disable the ridiculous 12-letter-domain check. ASF would have to change its configuration if my messages are to be accepted.

Better still would be to remove this "feature" from SpamAssassin altogether, as it is completely useless. That way, the problem would disappear as soon as ASF updates to a version of SpamAssassin in which the 12-letter-domain check is removed.

The fact that nobody has articulated the rationale behind the 12-letter-domain check speaks for itself. If a rule is deemed to be useless, why is it not removed? It is wasting CPU cycles and affecting genuine ASF mailing list subscribers adversely (by rejecting their messages without basis).
Further, it's not as though ASF's servers are the only ones using FROM_12LTRDOM; this ridiculous issue is affecting my ability to communicate across the Internet at large. Given that ASF has no other public support channel, and no way to contact anybody to request that the filters be adjusted, what choice do I have beyond pushing to have the software modified? Thank you, -Ben
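The four numbers in the quoted "score FROM_12LTRDOM 0.099 3.499 0.099 3.499" line are SpamAssassin's four score sets: Bayes off/network off, Bayes off/network on, Bayes on/network off, and Bayes on/network on, in that order. A small sketch of which value applies follows; `pick_score` is a made-up helper rather than an SA function, and whether ASF's servers use these stock scores is an assumption.

```python
def pick_score(scores, bayes_enabled, network_enabled):
    """Select the active score from a one-value or four-value score line.

    Score set order for four values:
      0: Bayes off, network tests off
      1: Bayes off, network tests on
      2: Bayes on,  network tests off
      3: Bayes on,  network tests on
    """
    if len(scores) == 1:
        return scores[0]  # a single score applies to every score set
    if len(scores) != 4:
        raise ValueError("expected 1 or 4 scores, got %d" % len(scores))
    index = (2 if bayes_enabled else 0) + (1 if network_enabled else 0)
    return scores[index]


# Values from the score line quoted in this thread.
FROM_12LTRDOM = [0.099, 3.499, 0.099, 3.499]

print(pick_score(FROM_12LTRDOM, bayes_enabled=True, network_enabled=True))   # 3.499
print(pick_score(FROM_12LTRDOM, bayes_enabled=True, network_enabled=False))  # 0.099
```

So on any installation with network tests enabled, the rule alone would account for 3.499 of the 4.8 hits shown in the X-ASF-Spam-Status header quoted earlier, assuming the stock scores are in use there.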
SpamAssassin scores and 12-letter domains
Hello,

As an owner of a 12-letter domain, and someone who is unable to post to any of the Apache mailing lists due to messages being rejected as SPAM (I'll be surprised if this one is any different), I have to ask, what is the rationale for the infamous 12-letter-domain-ding?

How many 12-letter domains exist? A few million? I can't think of a less useful metric, nor one that is more likely to yield false-positives.

There is hardly any published information on this subject, so perhaps one of the experts here will weigh in.

Apparently, I'm not the only one who feels this "feature" needs to die:

http://spamassassin.1065346.n5.nabble.com/FROM-12LTRDOM-high-scored-remove-td100710.html .

Thanks for any insight.

-Ben