Re: Bayes scoring priority

2013-06-24 Thread Ben Johnson


On 6/24/2013 1:29 PM, Amir 'CG' Caspi wrote:
> Has anyone modified their Bayes scoring priority, and if so, what were
> your experiences?  What scores did you assign?

This has been discussed at length; perhaps start with this archived topic:

http://spamassassin.1065346.n5.nabble.com/BAYES-99-and-ham-td38832.html

The short answer is that you can, and probably should, increase the
BAYES_99 score value to 4 or 4.5. Setting it to 5 puts you at risk
(albeit very slight) for false-positives.
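
To see why 5 is the riskier value, here is a rough sketch of the arithmetic. It assumes the required_score of 5.0 that appears in the spamd log excerpts elsewhere in this archive, and that a message is tagged once its summed rule scores reach that threshold. The is_spam helper and the SOME_OTHER_RULE name are made up for illustration and are not part of SpamAssassin.

```python
# Illustrative only: the total of all rule scores that hit is compared
# with required_score (assumed to be 5.0 here).
REQUIRED_SCORE = 5.0


def is_spam(rule_scores, required=REQUIRED_SCORE):
    """Return True when the summed rule scores reach the threshold."""
    return sum(rule_scores.values()) >= required


# BAYES_99 alone at 4.5 is not enough to tag a message.
bayes_only_45 = is_spam({"BAYES_99": 4.5})

# At 5.0, a single Bayes misclassification is enough to tag ham as spam.
bayes_only_50 = is_spam({"BAYES_99": 5.0})

# At 4.5, any other rule worth half a point pushes the message over.
bayes_plus_rule = is_spam({"BAYES_99": 4.5, "SOME_OTHER_RULE": 0.5})
```

In other words, 4 or 4.5 still requires at least one corroborating rule, which is where the margin against false positives comes from.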

-Ben


Re: New rule for HTML spam, using comments?

2013-06-18 Thread Ben Johnson


On 6/18/2013 1:18 PM, Amir 'CG' Caspi wrote:
> At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
>> a.) You are copying/pasting the body of the email, but not the headers.
> 
> No, I am copying the headers... however, I am using Eudora (ancient, I
> know) as a mail client, and it's possible the headers are not properly
> formatted.  For example, for SpamCop I have to use their "workaround"
> script.  I don't know what exactly is mal-formed, though.

For the sake of troubleshooting, can you try accessing the mail by some
other means, e.g., opening the file directly from the filesystem?
Doesn't mbox store email messages as plaintext files? (Kris already beat
me to it regarding this suggestion.)

> I should admit at this point that much of my sa-learn has been on
> Eudora's mboxes, by the way.  That is, I would take the Eudora mbox and
> sa-learn on that.  Eudora is supposed to use standard mbox format, but
> I'm wondering if maybe it's not so standard after all...

How would anything ever be flagged with a score higher than BAYES_00 if
this were to be the problem? Didn't you report a score of BAYES_99 in
one of your tests?

> Either way, I am _trying_ to copy the entire message.  Not sure what is
> misformatted there.  If you take a look at my two pasted examples (links
> below for convenience), those are direct copy/paste from Eudora's "raw
> source" view.  Any idea what is malformed?  Do I need an extra newline
> between the header and body, or something more complicated?
> 
> http://pastebin.com/HD0rNdxU
> http://pastebin.com/Zswg77Ds

How are you feeding the messages to sa-learn? Are you not just passing
the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora
and paste into a temporary file when you can just point sa-learn
straight to the message on disk?
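
On the question of whether a pasted copy is missing the blank line between header and body, a quick check can be sketched with Python's standard email parser. This uses Python's parser, not SpamAssassin's, so it only approximates what SA sees; the sample messages are invented.

```python
import email
from email import errors


def missing_header_body_separator(raw_text):
    """Return True if any part has body text with no blank line after its headers."""
    msg = email.message_from_string(raw_text)
    # walk() yields the message itself and, for multipart mail, every sub-part.
    return any(
        isinstance(defect, errors.MissingHeaderBodySeparatorDefect)
        for part in msg.walk()
        for defect in part.defects
    )


# A well-formed message: headers, one blank line, then the body.
well_formed = "From: a@example.com\nSubject: test\n\nBody text here.\n"

# The same message with the blank separator line lost (e.g. by copy/paste).
pasted_badly = "From: a@example.com\nSubject: test\nBody text here.\n"

ok_result = missing_header_body_separator(well_formed)
bad_result = missing_header_body_separator(pasted_badly)
```

Running this against the file on disk and against the pasted copy would show whether the two differ in this respect.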

>> b.) You are running Bayes as two different users when you perform your
> 
> No, I have been careful for that.  You saw that I pasted the maillog
> entries... notice that spamd runs as setuid.  I made sure the same
> userid was in the logs, and in my command.

I had missed that detail; looks okay.

>> Have a look at the thread I cited and see if anything jumps-out at you.
> 
> Will do, but unfortunately, I don't think the problem is as clear cut as
> (b) ... maybe it's (a) though, in which case I wonder if I have to
> modify my Eudora mboxes before learning on them.

Do you retain your training corpus? This may be one of those instances
in which the best way to debug the problem is to wipe and retrain Bayes.
Of course, that can be a nightmare if you don't retain the messages that
you've trained as ham and spam.

> Thanks.
> 
> -- Amir


Re: New rule for HTML spam, using comments?

2013-06-18 Thread Ben Johnson


On 6/18/2013 5:31 AM, Amir 'CG' Caspi wrote:
> At 4:37 PM -0400 06/14/2013, Alex wrote:
>> On Fri, Jun 14, 2013 at 4:18 PM, Amir 'CG' Caspi 
>> wrote:
>>  > I wonder if there's some
>>  > difference between running spamassassin manually on the message versus
>>  > running spamd.
>>
>> I think the only difference would be if spamd somehow didn't recognize
>> all the locations for your rules.
> 
> OK, I've got some more weirdness here.  I just received two FN spams...
> one had bayes00, another bayes50.  To test what the heck might be going
> on, I ran both of the emails through spamc manually... this SHOULD
> recreate the same thing that occurs when sendmail delivers the email and
> spamc gets run automatically.
> 
> The first email, which was bayes00 originally, hit with bayes99 when I
> ran it manually through spamc.  There were only a few minutes between
> the first and second run (see timestamps below)... nothing very
> important happened to the Bayes DB between those two runs.  The second
> email, bayes50, stayed exactly the same (also bayes50).  I looked
> through the /var/log/maillog to see if I could figure out some
> difference between the two runs, but they look basically identical.
> 
> The only difference I can figure is that the second (manual) run happens
> on mail source that I copy/paste from my email program... that is, it's
> pure text, copied and pasted.  The first (automatic) run is on the mail
> as it enters the system, which might be somehow formatted differently. 
> All of my sa-learn training is done directly on my mbox files, which
> perhaps is more akin to copy/paste than anything else...
> 
> Anyone know what the hell is going on here?  For reference, here is the
> maillog entry for the bayes00 message when it went through automatically:
> 
> Jun 18 05:00:32 kismet sendmail[27721]: r5I90WGI027721:
> from=, size=16502, class=0, nrcpts=1,
> msgid=,
> proto=ESMTP, relay=root@localhost
> Jun 18 05:00:32 kismet sendmail[27707]: r5I90U4N027657:
> to=, delay=00:00:01, xdelay=00:00:00,
> mailer=virthostmail, pri=136089, relay=domain.com, dsn=2.0.0, stat=Sent
> (r5I90WGI027721 Message accepted for delivery)
> Jun 18 05:00:32 kismet spamd[27586]: spamd: connection from
> localhost.localdomain [127.0.0.1] at port 53424
> Jun 18 05:00:32 kismet spamd[27586]: spamd: setuid to u...@domain.com
> succeeded
> Jun 18 05:00:32 kismet spamd[27586]: spamd: processing message
>  for
> u...@domain.com:22001
> Jun 18 05:00:33 kismet spamd[27586]: spf: lookup failed: Can't locate
> object method "new_from_string" via package "Mail::SPF::v1::Record" at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
> Jun 18 05:00:37 kismet spamd[27586]: pyzor: [27730] error: TERMINATED,
> signal 15 (000f)
> Jun 18 05:00:37 kismet spamd[27586]: spamd: clean message (-1.1/5.0) for
> u...@domain.com:22001 in 5.0 seconds, 16781 bytes.
> Jun 18 05:00:37 kismet spamd[27586]: spamd: result: . -1 -
> BAYES_00,HTML_EXTRA_CLOSE,HTML_IMAGE_RATIO_08,HTML_MESSAGE,RDNS_NONE
> scantime=5.0,size=16781,user=u...@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53424,mid=,
> bayes=0.00,autolearn=no
> 
> 
> And here is when it went through manually:
> 
> Jun 18 05:05:45 kismet spamd[27984]: spamd: connection from
> localhost.localdomain [127.0.0.1] at port 53447
> Jun 18 05:05:45 kismet spamd[27984]: spamd: setuid to u...@domain.com
> succeeded
> Jun 18 05:05:45 kismet spamd[27984]: spamd: processing message
>  for
> u...@domain.com:22001
> Jun 18 05:05:45 kismet spamd[27984]: spf: lookup failed: Can't locate
> object method "new_from_string" via package "Mail::SPF::v1::Record" at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
> Jun 18 05:05:47 kismet spamd[27984]: spamd: identified spam (6.0/5.0)
> for u...@domain.com:22001 in 2.2 seconds, 16223 bytes.
> Jun 18 05:05:47 kismet spamd[27984]: spamd: result: Y 6 -
> BAYES_99,MISSING_MIME_HB_SEP,RDNS_NONE,T_MIME_NO_TEXT,URIBL_BLACK
> scantime=2.2,size=16223,user=u...@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53447,mid=,bayes=1.00,autolearn=no
> 
> 
> 
> So... what the heck is going on?  I see basically no difference between
> the two maillog entries.  The only difference between the two runs, as
> far as I can tell, is that pyzor died on the first one (and I don't know
> why, but that shouldn't have ANY effect on the Bayes score), and the
> manual run was using the copy/paste from my mail program.
> 
> But, as mentioned, the bayes50 spam looked identical for both the
> automatic and manual runs.
> 
> Anyone have any idea what the heck is going on, and how I can fix it?
> 
> Is my Bayes DB worthless because I've been training it on MBOX format
> (i.e. ASCII), but when it runs the first time around, it's running on
> binary (MIME) instead?  If so, how can I fix this -- do I need to store
> my mail in some different format instead of MBOX?  (Except that sendmail
> de

Re: Massive spamruns

2013-06-12 Thread Ben Johnson


On 6/12/2013 12:22 PM, Alex wrote:
> Hi,
> 
>>> # 2013 cars local dealership
>>> http://pastebin.com/3bEMiV3B
>>
>> URI in that sample
>>
>> pohformed.com listed on black.uribl.com
>> pohformed.com listed on jp.surbl.org
>> pohformed.com listed on sc.surbl.org
>> pohformed.com listed on dbl.spamhaus.org
> 
> I know I should have mentioned that. Yes, I'm using the above RBLs,
> and they're all correctly tagged here now.
> 
> I was hoping for something more preemptive to trigger on these more
> generally because the IPs are only used for a short while, but long
> enough to get 25 spams in from the address. I was hoping to find
> commonalities between the messages that could be used to generate some
> other rules.
> 
> Thanks,
> Alex
> 

Isn't this the function that Bayes is intended to serve, rather precisely?

-Ben


Re: Large # of Spam getting through all of a sudden.

2013-06-10 Thread Ben Johnson


On 6/10/2013 4:46 PM, David F. Skoll wrote:
> [Lost track of who wrote this]
> 
>> 66.96.253.241
>> 64.120.241.228
>> 66.197.142.29
>> 66.197.142.23
>> 66.197.207.152
>> 66.197.177.174
>> 64.191.61.25
> 
> Every single one of those IPs is on our "GreylistStumbler" list, meaning
> they've been greylisted, but have not been seen to pass greylisting.
> 
> Implementing greylisting might stop most of the problem messages.
> 
> Regards,
> 
> David.
> 

(Brian is the one who wrote it :))

That's an interesting observation, David.

As someone who recently implemented greylisting, its efficacy in this
particular type of situation cannot be overstated. Our spam volume
dropped from about 35% to less than 2% overnight, thanks to greylisting
at the MTA level.
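
For anyone unfamiliar with the mechanism, the core idea can be sketched in a few lines. This is a toy model only: the 300-second delay, the triplet key, and the Greylist class are assumptions for illustration, and real implementations add expiry, whitelisting, and persistent storage.

```python
class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a client must wait before a retry is accepted
        self.first_seen = {}        # triplet -> timestamp of the first attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'defer' (temporary failure) or 'accept' for this attempt."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        return "defer"


gl = Greylist(min_delay=300)
first_try = gl.check("192.0.2.10", "a@example.org", "b@example.com", now=0)
too_soon = gl.check("192.0.2.10", "a@example.org", "b@example.com", now=60)
retry = gl.check("192.0.2.10", "a@example.org", "b@example.com", now=600)
```

A legitimate MTA queues the message and retries later, so it eventually gets "accept"; a sender that never retries is never accepted.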

While somewhat of a generic prescription, it's well-prescribed for a
reason: sort-out your Bayes situation (will probably require wiping and
starting over with a hand-sorted corpus that is *retained*) and
implement greylisting (provided you can live with its caveats).

The DNSBLs can be used to supplement the above.

Good luck, Brian!

--Ben


Re: Large # of Spam getting through all of a sudden.

2013-06-10 Thread Ben Johnson


On 6/10/2013 2:45 PM, Duncan, Brian M. wrote:
> I rarely have seen any SpamAssassin hits on the bodies of these messages.
> 
> (cached, score=-0.125,required 6.5, autolearn=not spam, 
> RP_MATCHES_RCVD -0.12)

Do you train the Bayes database manually? Or via autolearn only?

I use SA via AMaViS, and the header changes look slightly different from
yours, but I see no evidence that Bayes scoring is being used in the
above header (if, in fact, that is a sample header with all SA markup
appended).
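
A quick way to check a report line for Bayes evidence is to look for any BAYES_nn rule name in the list of tests. The sketch below is illustrative; the bayes_rules helper is hypothetical, and the two sample strings are taken from messages in this archive.

```python
import re

# Bayes rule names look like BAYES_00, BAYES_50, BAYES_99 and so on.
BAYES_RULE = re.compile(r"\bBAYES_\d{2,3}\b")


def bayes_rules(report_text):
    """Return every BAYES_nn rule name found in a header or log line."""
    return BAYES_RULE.findall(report_text)


# The header quoted above: no Bayes rule fired at all.
header = "(cached, score=-0.125,required 6.5, autolearn=not spam, RP_MATCHES_RCVD -0.12)"
header_hits = bayes_rules(header)

# A spamd result line from another thread: Bayes did run.
spamd_tests = "BAYES_00,HTML_EXTRA_CLOSE,HTML_IMAGE_RATIO_08,HTML_MESSAGE,RDNS_NONE"
spamd_hits = bayes_rules(spamd_tests)
```

An empty result on every message would suggest that Bayes is not being consulted, rather than that it is returning a neutral score.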

--Ben


Re: .pw / Palau URL domains in spam

2013-05-25 Thread Ben Johnson


On 5/7/2013 11:02 PM, Steve Prior wrote:
> On 5/7/2013 1:44 AM, Benny Pedersen wrote:
>> Chris Santerre skrev den 2013-05-06 17:27:
>>> 10 days and still being abused badly. Recommending for everyone to
>>> just refuse any .pw
>>
>> time for spamhaus ? :=)
>>
>>> for those wanting an SA rule, here:
>>>
>>> header PW_IS_BAD_TLD From =~ /\.pw\b/
>>> describe PW_IS_BAD_TLD PW TLD ABUSE
>>> score PW_IS_BAD_TLD 3
>>
>> here i would like to use -3
>>
>>> Change score to whatever you want. Enjoy.
>>
>> thats the point of opensource imho :)
>>
>> hopefully the good pw domains start using opendkim, and then let the
>> world
>> repute it from there
>>
> 
> I blocked everything from TLD pw at the Postfix level so the email gets
> rejected without ever hitting spamassassin.
> 
> I created /etc/postfix/sender_access with the contents:
> pw    REJECT
> 
> ran postmap sender_access
> 
> and then added
> check_sender_access hash:/etc/postfix/sender_access
> to smtpd_recipient_restrictions
> 
> Problem went away completely, sorry Palau.
> 
> Steve
> 

Steve, just wanted to thank you for providing an elegant solution to
this problem. It seems far more preferable to block this nonsense right
at the MTA level (for now). Your instructions worked for me and I now
see the following in my mail log for any .pw sender:

postfix/smtpd[10660]: NOQUEUE: reject: RCPT from
unknown[173.213.124.203]: 554 5.7.1 : Sender
address rejected: Access denied
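
The two approaches in this thread do not match quite the same things. A header regex of the form \.pw\b (presumably what the quoted rule intended) fires on ".pw" at a word boundary anywhere in the From header, whereas a sender access map keys on the sender's domain. The sketch below illustrates the difference with Python's re module; the sender_tld_is_pw helper is hypothetical and only mimics a domain-based check.

```python
import re

# Pattern of the form used in the SpamAssassin rule (an assumption about its intent).
PW_PATTERN = re.compile(r"\.pw\b", re.IGNORECASE)


def sender_tld_is_pw(address):
    """Return True only when the address's domain is 'pw' or ends in '.pw'."""
    domain = address.rsplit("@", 1)[-1].lower().rstrip(".")
    return domain == "pw" or domain.endswith(".pw")


# Both approaches agree on a plain .pw sender.
regex_plain = PW_PATTERN.search("user@example.pw") is not None
domain_plain = sender_tld_is_pw("user@example.pw")

# The regex does not fire on a TLD-like label that merely starts with "pw".
regex_pwc = PW_PATTERN.search("user@example.pwc.com") is not None

# The regex does fire when ".pw" appears as an inner label; the domain check does not.
regex_inner = PW_PATTERN.search("user@mail.pw.example.com") is not None
domain_inner = sender_tld_is_pw("user@mail.pw.example.com")
```

For the spam run discussed here the distinction is unlikely to matter, but it is worth knowing before raising the rule's score.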

Much appreciated!

-Ben


Re: dns*.registrar-servers.com as a rogue registrar?

2013-05-07 Thread Ben Johnson
I'll top-post, too, just for the sake of consistency. :)

I've had pretty good experiences with Namecheap, actually. I'm in no way
affiliated; I've just used them for cheap domain registrations
(apparently, I'm not the only one) and for cheap SSL certificates in bulk.

But, that's neither here nor there. As the company relates to this
conversation, I reported a domain that was spamming heavily and
registered with Namecheap and the company took swift action:

> Hello,
> 
> Thank you for your email.
> 
> While jecon.us domain name is registered with Namecheap it is hosted with 
> another company. So we cannot check the logs for a domain and confirm if it 
> is involved in sending unsolicited bulk emails.
> 
> However, as we can see the domain name is blacklisted by trusted 
> organizations. Thus we opened a case regarding the domain name. Please allow 
> about 48 hours for our further investigation.
> 
> Thank you for letting us know about the issue. 

Five days later, the domain was shut down:

> Hello,
> 
> This is to inform you that jecon.us domain was suspended. It is now pointed 
> to non-resolving nameservers and will be nullrouted once the propagation is 
> over. The domain is locked for modifications in our system.
> 
> Thank you for letting us know about the issue.

So, if you are having problems with domains registered with Namecheap, I
suggest that you open a support request for the "Domains -- Legal and
Abuse" department. From the sounds of it, you'd be doing us all a big favor!

-Ben



On 5/7/2013 3:26 PM, Chris Santerre wrote:
> The owner is NameCheap, Inc.
> 
> A quick google will bring up historical problems with NameCheap and its
> owner and its DBAs.
> 
> I dare not say anything bad about them and let you judge for yourself on
> their history. Richard Kirkendall has a tendency to yell "Slander!" when
> someone even mentions their name.
> 
> 
> --Chris
> (I top post because I care.)
> 
> 
>> -----Original Message-----
>> From: lcon...@go2france.com [mailto:lcon...@go2france.com]
>> Sent: 2013-05-07 14:15
>> To: users@spamassassin.apache.org
>> Subject: dns*.registrar-servers.com as a rogue registrar?
>>
>>
>>
>> Nearly all of the .pw domains have their authoritative NS at
>> dns*.registrar-servers.com.
>>
>> that registrar and few others are always at the top of my reports for
>> NSs of sender domains of spam we reject.
>>
>> Does anybody score a msg if its sender domain is DNS hosted by
>> registrar-servers.com or other?
>>
>> what would that rule look like?
>>
>> Len
>>
>>
> 


Re: SQL error: Duplicate entry

2013-04-25 Thread Ben Johnson


On 4/25/2013 11:55 AM, Matus UHLAR - fantomas wrote:
>> On Thu, Apr 25, 2013 at 1:47 PM, Matus UHLAR - fantomas
>> wrote:
>>> I don't think so... IIRC the "REPLACE INTO" deletes existing record and
>>> inserts new one, does not update existing. This caused some issues
>>> for me
>>> some ~10 years ago, so i switched to the update or insert.
> 
> On 25.04.13 16:36, Matthias Leisi wrote:
>> "REPLACE INTO" is a MySQL-specific extension and not part of standard
>> SQL.
> 
> I know, but what does this have in common with what I wrote?
> 

It seems that Matthias's point is that SA doesn't use "REPLACE INTO"
because "REPLACE INTO" is MySQL-specific, and SA must remain
database-agnostic.

This leaves one to assume that SA is performing an INSERT or an UPDATE only.

The question then becomes, why is SA attempting to perform an INSERT
(and failing with a duplicate key conflict on the PRIMARY KEY, which, in
my moderately stale version of SA, is a UNIQUE key across the `id` and
`token` columns) when it should be performing an UPDATE? (Possible
explanation two paragraphs down.)

Presumably, the `id` column is a foreign key to the `bayes_vars`.`id`
column, which indicates that only one record for each SA Bayes user and
token combination may exist. Sounds reasonable.

My understanding is that it's better (with respect to performance and
atomicity) to attempt the INSERT and have it fail than to check if the
ID/token combination already exists and UPDATE it if it does.

In other words, I'm not sure that this warning is a problem (beyond log
bloat or similar). It's entirely possible that SA *only* performs INSERT
statements for the reasons I mention above.
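
To make the "attempt the INSERT and fall back" idea concrete, here is a minimal sketch. It is not SpamAssassin's code: it uses SQLite in place of MySQL so that it runs anywhere, and the table is a simplified stand-in with only the columns needed for the illustration. The put_token function is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token ("
    " id INTEGER NOT NULL,"
    " token BLOB NOT NULL,"
    " spam_count INTEGER NOT NULL DEFAULT 0,"
    " PRIMARY KEY (id, token))"
)


def put_token(conn, user_id, token, spam_delta):
    """Try the INSERT first; fall back to UPDATE when the key already exists."""
    try:
        conn.execute(
            "INSERT INTO bayes_token (id, token, spam_count) VALUES (?, ?, ?)",
            (user_id, token, spam_delta),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        # Duplicate (id, token): the row exists, so bump its counter instead.
        conn.execute(
            "UPDATE bayes_token SET spam_count = spam_count + ?"
            " WHERE id = ? AND token = ?",
            (spam_delta, user_id, token),
        )
        return "updated"


token = b"\xca\xb8j"  # arbitrary example bytes
first = put_token(conn, 1, token, 1)
second = put_token(conn, 1, token, 1)
count = conn.execute(
    "SELECT spam_count FROM bayes_token WHERE id = 1 AND token = ?", (token,)
).fetchone()[0]
```

If SA follows a pattern like this, the "Duplicate entry" message would be logged on the failed INSERT even though the data ends up correct after the UPDATE.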

Only a developer or disciple of the SA source code can say for sure. I
wish I had time to look myself.

Out of curiosity, how did this SQL error come to your attention in the
first place?

-Ben


Re: SQL error: Duplicate entry

2013-04-24 Thread Ben Johnson


On 4/24/2013 2:42 PM, psychobyte wrote:
> Hi,
> 
> I've noticed that SA is getting a lot of "Duplicate entry" errors for
> AWL and bayes plugins. I can verify that the sql schema is up to date
> for SA 3.3.1-r4 and I've tried retraining the bayes db. Any hints on how
> to troubleshoot this?
> 
> AWL:
> 
> Apr 24 11:31:57 mserv amavisd[24336]: (24336-05) SA dbg: auto-whitelist:
> sql-based add_score/insert amavis|myem...@example.com|14.43|1|-0.699:
> SQL error: Duplicate entry 'amavis-myem...@example.com--14.43' for key
> 'PRIMARY'
> 
> 
> Bayes:
> 
> Apr 24 11:31:57 mserv amavisd[24336]: (24336-05) SA dbg: bayes:
> _put_token: SQL error: Duplicate entry '1-\312\270j' for key 'PRIMARY'
> 
> 
> 
> 

I know very little about how Bayes interacts with SQL, but it's clear
that SA is trying to insert a record that is identical to one that's
already present, and the key(s) that are defined on the table are
preventing it.

Makes one wonder if a "REPLACE INTO ..." was replaced with an "INSERT
INTO ..." in a recent version of SA. Of course, the messages that you're
seeing tell us nothing about which DB table is causing the problem.

Maybe one of the developers will see this and recall making such a change.

Alternatively, you could dig into your tables and attempt to identify
where those values actually live. Once we have the offending table,
further troubleshooting will be possible.

-Ben


Re: Seminar Spam

2013-04-24 Thread Ben Johnson


On 4/24/2013 12:12 PM, hospice admin wrote:
> Hi,
> 
> we're having problems with an outfit called 'Bite Sized Seminars' in the
> UK, who seem to be sending mail out through another company called
> 'Communicado'. A quick google suggests we aren't the only ones.
> 
> We have developed a number of rules that identify their mail by looking
> for their phone numbers, common phrases, etc in their mail shots with
> varying success (I'm happy to share these with anyone who may find them
> helpful).
> 
> The problem I'm trying to solve is that they seem to register hundreds
> of .co.uk domains, and have access to loads of sending IPs, so I can't
> just write a rule to do the obvious. I've complained about them to
> Nominet, and they aren't interested ... according to them, they are
> doing nothing wrong. I've also complained to various IP providers, some
> of which say they will do something, but rarely do. I've even rung them
> ... again ... no joy.
> 
> Here's my question - am I missing a trick here, particularly regarding
> the hundreds of domain names? For example, is it possible to do a
> 'whois' and process the output in some way?
> 
> Thanks
> 
> Judy.
> 

Have you been feeding Bayes samples of these messages? I would think
Bayes to be far more effective against this type of spamming (given the
dynamic nature of the domains and IP addresses) than writing custom rules.

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-22 Thread Ben Johnson


On 4/20/2013 3:20 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-20 05:02:
> 
>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
> 
> is amavis have homedir in /var/lib/ ?

The amavis user's home directory is /var/lib/amavis. This seems to be
the default on Ubuntu; I didn't change this path.

> in gentoo its default as /var/amavis where the .spamassassin dir is
> created by amavisd
> 
> use user_prefs to set bayes_path does not make sense if sql is used
> 

Thanks; I did comment-out the "bayes_path" directive. I figured that it
wouldn't matter whether it is commented or not, in the presence of
SQL-related directives, but it can't hurt to comment-out this line.

>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis
> 
> +1 to this one since amavis cant use multiple sa users very easy, but
> depending on what amavis it being supported with complicated setups :(

I only need one Bayes user, so this setup is adequate.

> i changed away from amavisd to clamav-milter, spampd in postfix after
> queue, this is working very well for me, and i hope sa 3.4.x will not
> break spampd :=)

Hmm, I will consider your sound advice in this regard. amavis does seem
to be overly memory-hungry (despite setting $max_servers = 1 and
$max_requests = 1). If there is a better alternative, I'm all ears (or
eyes, as the case may be).

>> Is something more required to ensure that my mail system, which runs
>> under the "amavis" user, is always reading from and writing to the
>> same DB?
> 
> nope just remember that amavis also reads .spamassassin/user_prefs
> 
> if you like you can copy that user_prefs to /root/.spamassassin so you
> dont have to remember something :)
> 
> user_prefs should ONLY be readble by that user that runs it
> 

Thanks for pointing this out. I will double-check the permissions.

I'll respond to your other email momentarily.

Thanks, Benny!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-20 Thread Ben Johnson
So, the problem seems not to be SQL-specific, as it occurs with SQL or
flat-file DB.

Upon following Benny Pedersen's advice (to move SA configuration
directives from /etc/spamassassin/local.cf to
/var/lib/amavis/.spamassassin/user_prefs), I noticed something unusual:

$ ls -lah /var/lib/amavis/.spamassassin/
total 7.6M
drwx------ 2 amavis amavis 4.0K Apr 20 08:54 .
drwxr-xr-x 7 amavis amavis 4.0K Apr 20 08:56 ..
-rw------- 1 root   root   8.0K Apr 20 08:33 bayes_journal
-rw------- 1 root   root   1.3M Apr 20 00:09 bayes_seen
-rw------- 1 root   root   4.8M Apr 20 08:29 bayes_toks
-rw-r--r-- 1 root   root    799 Jun 28  2004 gtube.txt
-rw-r--r-- 1 amavis amavis 2.7K Apr 20 08:55 user_prefs

Welp, that'll do it! How those four files were set to root:root
ownership is beyond me, but that was certainly a factor. Maybe this was
a result of executing my training script as root (even though I had
hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
and when using SQL, hard-coded bayes_sql_override_username to use amavis)?

I changed ownership to amavis:amavis and now messages are being scored
with Bayes (all of them, from what I can tell so far).
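
A check like the following could catch this kind of ownership drift before it silently disables Bayes again. It is only a sketch: the files_not_owned_by helper is hypothetical, and the demonstration runs against a temporary directory rather than the real Bayes directory.

```python
import os
import tempfile


def files_not_owned_by(directory, expected_uid):
    """Return the sorted names of entries in `directory` owned by another uid."""
    mismatched = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.lstat(path).st_uid != expected_uid:
            mismatched.append(name)
    return sorted(mismatched)


# Demonstration in a throwaway directory owned by the current user.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "bayes_toks"), "w").close()
    own_files = files_not_owned_by(demo_dir, os.geteuid())
    foreign_files = files_not_owned_by(demo_dir, os.geteuid() + 1)

# Hypothetical usage on a system laid out like the one described above:
#   import pwd
#   uid = pwd.getpwnam("amavis").pw_uid
#   print(files_not_owned_by("/var/lib/amavis/.spamassassin", uid))
```

Run from cron after the training job, any non-empty result would flag files that the scanning user may be unable to read.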

Also, I looked into the fact that I was running the cron job that trains
ham and spam as root. I did this only because the amavis user lacks
access to /var/vmail, which is where mail is stored on this system. (As
a corollary, I'm a bit curious as to how amavis is able to scan incoming
mail, given this lack of access -- maybe it does so using a pipe or some
other method that does not require access to /var/vmail.)

I think the disconnect was in the fact that I placed my custom
configuration directives in /etc/spamassassin/local.cf, when I should
have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
training). (Thanks for pointing-out this mistake, Benny P.)

Putting my custom SA configuration directives in both of these files was
the only way I was able to train mail and score incoming messages using
the same credentials "across-the-board".

Once I did this, I was able to use SQL or flat-file DB with the same
exact results.

Is there a better way to achieve this consistency, aside from putting
duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
/root/.spamassassin/user_prefs?

Feels like I'm out of the woods here! Thanks for all the expert help, guys.

-Ben



Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson
Apologies for the rapid-fire here folks, but I wanted to correct something.

I had these backwards:

>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>> 
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>> 
>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>> 
>> bayes_sql_override_username amavis

I meant to say that I have *always* had

bayes_path /var/lib/amavis/.spamassassin/bayes

in local.cf, and using the SQL setup, I added

bayes_sql_override_username amavis

Sorry for the confusion!

-Ben



On 4/19/2013 11:02 PM, Ben Johnson wrote:
> 
> 
> On 4/19/2013 1:54 PM, Benny Pedersen wrote:
>> Ben Johnson skrev den 2013-04-19 18:02:
>>
>>> Still stumped here...
>>
>> for amavisd-new, put spamassassin sql setup into user_prefs file for the
>> user amavisd-new runs as might be working better then have insecure sql
>> settings in /etc/mail/spamassassin :)
>>
>> i dont know if this is really that you have another user for amavisd,
>> and test spamassassin -t msg with another user that uses another sql user ?
>>
>> make sure both users is really using same sql user as intended
>>
> 
> Benny, thanks for the suggestion regarding moving the SA SQL setup into
> user_prefs. I will look into that soon.
> 
> Yes, I believe that me and the system always execute SA commands as the
> "amavis" user. When I was using the SQL setup, I had the following in
> local.cf:
> 
> bayes_path /var/lib/amavis/.spamassassin/bayes
> 
> With the DBM setup, I had the following (I have since commented it out,
> while attempting to debug this Bayes issue):
> 
> bayes_sql_override_username amavis
> 
> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the same DB?
> 
> Best regards,
> 
> -Ben
> 
> 


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 1:54 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-19 18:02:
> 
>> Still stumped here...
> 
> for amavisd-new, put spamassassin sql setup into user_prefs file for the
> user amavisd-new runs as might be working better then have insecure sql
> settings in /etc/mail/spamassassin :)
> 
> i dont know if this is really that you have another user for amavisd,
> and test spamassassin -t msg with another user that uses another sql user ?
> 
> make sure both users is really using same sql user as intended
> 

Benny, thanks for the suggestion regarding moving the SA SQL setup into
user_prefs. I will look into that soon.

Yes, I believe that me and the system always execute SA commands as the
"amavis" user. When I was using the SQL setup, I had the following in
local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

With the DBM setup, I had the following (I have since commented it out,
while attempting to debug this Bayes issue):

bayes_sql_override_username amavis

Is something more required to ensure that my mail system, which runs
under the "amavis" user, is always reading from and writing to the same DB?

Best regards,

-Ben




Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
> 
>> Still stumped here...
> 
> do a bayes sa-learn --backup
> 
> switch to file based in SDBM format (which is fast)
> 
> do a
> 
> sa-learn --restore
> 
> feed it a few thousand NEW spams
> 
> see what happens
> 
> 
> 
> 
> 
> 

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?
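
The expectation can be illustrated with a toy classifier. This is a deliberately naive sketch and not SpamAssassin's algorithm: its tokenizer and the way it combines token probabilities are more elaborate. The ToyBayes class, the clamping values, and the sample messages are all invented for illustration.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace and de-duplicate; real tokenizers do far more."""
    return set(text.lower().split())


class ToyBayes:
    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def score(self, text):
        """Return a spam probability in [0, 1], or None if no token is known."""
        spam_like = ham_like = 1.0
        used = 0
        for tok in tokenize(text):
            s = self.spam_tokens[tok]
            h = self.ham_tokens[tok]
            if s + h == 0:
                continue  # never seen in training: contributes nothing
            spam_ratio = s / self.nspam if self.nspam else 0.0
            ham_ratio = h / self.nham if self.nham else 0.0
            p = spam_ratio / (spam_ratio + ham_ratio)
            p = min(max(p, 0.01), 0.99)  # clamp so one token cannot decide alone
            spam_like *= p
            ham_like *= 1.0 - p
            used += 1
        if used == 0:
            return None
        return spam_like / (spam_like + ham_like)


bayes = ToyBayes()
for i in range(12):
    bayes.learn(f"Garden hose expands up to three times its length offer {i}", True)
for ham in (
    "Meeting notes for the budget review",
    "Lunch on Friday with the team",
    "Your invoice for the server order",
):
    bayes.learn(ham, False)

thirteenth = bayes.score("Amazing garden hose expands to three times its length")
ham_check = bayes.score("budget review meeting")
unrelated = bayes.score("completely unseen words")
```

Even this crude model scores the thirteenth hose message close to 1.0, so the expectation is reasonable. The None result for a message with no known tokens is the toy analogue of a message that Bayes declines to score.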

I commented-out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavis-new.

I also cleared the existing DB tokens (with "sa-learn --clear") after
amavis had restarted, and then executed my normal training script
against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip
through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by using utf8 to store the token data
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using utf8_bin (instead of ascii)
introduce a problem for SA (despite no errors during "sa-learn" training
or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.

Since switching back to a DBM Bayes setup, the results look pretty much
as expected (wrapped), and this is the type of thing I expect to see on
every message:

---
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'
dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5
(0.2%), tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%),
check_bayes: 26 (0.9%), tests_pri_0: 836 (28.6%), dkim_load_modules: 27
(0.9%), check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
tests_pri_500: 988 (33.8%)
---
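
When comparing many messages, it may be easier to pull the relevant numbers out of the debug output with a small script. The sketch below assumes the line formats shown above; the summarize_bayes_debug helper is hypothetical.

```python
import re

CORPUS_RE = re.compile(r"bayes: corpus size: nspam = (\d+), nham = (\d+)")
SCORE_RE = re.compile(r"bayes: score = (\S+)")


def summarize_bayes_debug(debug_text):
    """Extract corpus sizes and the Bayes probability from -D output, if present."""
    summary = {"nspam": None, "nham": None, "score": None}
    corpus = CORPUS_RE.search(debug_text)
    if corpus:
        summary["nspam"] = int(corpus.group(1))
        summary["nham"] = int(corpus.group(2))
    score = SCORE_RE.search(debug_text)
    if score:
        summary["score"] = float(score.group(1))
    return summary


sample = """dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: untie-ing"""
result = summarize_bayes_debug(sample)

# A run in which Bayes declined to score the message has no score line.
no_bayes = summarize_bayes_debug(
    "dbg: bayes: cannot use bayes on this message; not enough usable tokens found"
)
```

A score of None across many messages would point to Bayes not running, as opposed to Bayes running and returning a low probability.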

I'll wait and see if I receive messages without Bayes results and report
back.

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
> 
>> Is this normal? If so, what is the explanation for this behavior? I have
> 
> marked dozens of nearly-identical messages with the subject
> "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks), and yet SA reports "not enough usable
> tokens found".
> 
> 
> If they are identical, I don't believe it will create new tokens,
> per se.
> 
>  
> 
> Is SA referring to the number of tokens in the message? Or the
> Bayes DB?
> 
> 
> I should also mention that while training a message, use "--progress",
> as such (assuming you're running it on an mbox or message that's in mbox
> format):
> 
> # sa-learn --progress --spam --mbox mymboxfile
> 
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
> 
> add_header all Tok-Stat _TOKENSUMMARY_
> 
> If you run spamassassin on a message directly, and add the -t option, it
> will show you the number of different types of tokens found in the message:
> 
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
> 
> Regards,
> Alex
> 

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB??? Or only the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that while SA will not insert a token twice, it *will*
increase the token "count". Here's an excerpt from Mark's comment from
that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able
to be inserted once, but their counts cannot increase, leading to
terrible bayes results if the bug is not noticed. Also the conversion
from db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?
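
One way to look for the symptom Mark describes (counts that never
increase) might be to ask the database for the highest per-token counts.
The following is only a sketch: it assumes the stock bayes_token columns
(id, token, spam_count, ham_count, atime), and the helper name
bayes_max_counts_sql is hypothetical. It prints a query; it does not run
it.

```shell
# Print a query reporting the highest per-token counts for one Bayes
# user. Column names are assumed to follow the stock bayes_token table;
# the numeric user id (1 for "amavis" in this thread) is the argument.
bayes_max_counts_sql() {
    cat <<EOF
SELECT MAX(spam_count) AS max_spam, MAX(ham_count) AS max_ham
FROM bayes_token
WHERE id = $1;
EOF
}
```

If both maxima come back as 1 after thousands of trained messages, that
would be consistent with counts not increasing; larger values would
suggest the bug is not biting. The output can be piped into the mysql
client with whatever credentials and database name apply locally.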

My "bayes_vars" table looks like this (sorry for the wrapping, this is
the best I could do):

id  username  spam_count  ham_count  token_count  last_expire
last_atime_delta  last_expire_reduce  oldest_token_age  newest_token_age
1   amavis    6185        2427       120092       1366364379
8380417           14747               1357985848        1366386865

The SQL query:

SELECT count( * )
FROM `bayes_token`

returns 120092 rows, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/18/2013 12:18 PM, Ben Johnson wrote:
> 
> My concern now is that I am on 3.3.1, with little control over upgrades.
> I have read all three bug reports in their entirety and Bug 6624 seems
> to be a very legitimate concern. To quote Mark in the bug description:
> 
>> The effect of the bug with SpamAssassin is that tokens are only able
>> to be inserted once, but their counts cannot increase, leading to
>> terrible bayes results if the bug is not noticed. Also the conversion
>> from db fails, as reported by Dave.
>>
>> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
>> provide a workaround for the MySQL server bug, and improved debug logging.
> 
> How can I discern whether or not this bug does, in fact, affect me? Are
> my Bayes results being crippled as a result of this bug?
> 
>> It's possible that there's a good reason the default script still uses
>> myISAM. If so, the documentation for this fix should at least be easier
>> to find.
>>
> 
> In any event, I'm a little concerned because while the majority of
> messages are now tagged with BAYES_* hits, I am now seeing this debug
> output on a significant percentage of messages ("cannot use bayes on
> this message; not enough usable tokens found"):
> 
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
> 
> --
> Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
> Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
> Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
> Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
> Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
> Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
> = 2342
> Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
> Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
> message; not enough usable tokens found
> Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
> Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
> (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
> poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
> tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
> (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
> tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
> (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
> check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
> (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
> --
> 
> I have done some searching-around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.
> 
> Thank you,
> 
> -Ben
> 

Might anyone be in a position to offer an authoritative response to
these questions?

I continue to see messages that are very similar to dozens of messages
that have been marked as SPAM slipping through with *no Bayes scoring*
(this is *after* fixing the SQL syntax error issue):

bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Is this normal? If so, what is the explanation for this behavior? I have
marked dozens of nearly-identical messages with the subject "Garden hose
expands up to three times its length" as SPAM (over the course of
several weeks), and yet SA reports "not enough usable tokens found".

Is SA referring to the number of tokens in the message? Or the Bayes DB?

Thanks,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-18 Thread Ben Johnson


On 4/18/2013 12:26 PM, Axb wrote:
> On 04/18/2013 06:18 PM, Ben Johnson wrote:
>> I have done some searching-around on the string "cannot use bayes on
>> this message; not enough usable tokens found" and have not found
>> anything authoritative regarding what this message might mean and
>> whether or not it can be ignored or if it is symptomatic of a larger
>> Bayes problem.
> 
> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes  setup?
> 
> 

Not yet, but that is the ultimate plan (to share the DB across multiple
servers). Also, I like the idea that the Bayes DB is backed-up
automatically along with all other databases on the server (we run a
cron script that performs the dump). Granted, it would be trivial to
schedule a call to "sa-learn --backup", but storing the data in SQL
seems more portable and makes it easier to query the data for reporting
purposes.

Then again, I retain the corpora, so backing-up the DB is only useful
for when data needs to be moved from one server or database to another
(as moving the corpora seems far less practical).

Are you suggesting that I should scrap SQL and go back to a flat-file
DB? Is that the only path to a fix (short of upgrading SA)?

Thanks for your help!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-18 Thread Ben Johnson


On 4/17/2013 10:15 PM, John Hardin wrote:
> On Wed, 17 Apr 2013, Ben Johnson wrote:
> 
>> The first post on that page was the key. In particular, adding the
>> following to each MySQL "CREATE TABLE" statement:
>>
>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
> 
> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

Mark Martinec opened three reports in relation to this issue (quoted
from the archive thread cited in my previous post):

[Bug 6624] BayesStore/MySQL.pm fails to update tokens due to
MySQL server bug (wrong count of rows affected)
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624

(^^ Fixed in 3.4 ^^)

[Bug 6625] Bayes SQL schema treats bayes_token.token as char
instead of binary, fails chset checks
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

(^^ Fixed in 3.4 ^^)

[Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626

(^^ Fixed in 3.4 ^^)

My concern now is that I am on 3.3.1, with little control over upgrades.
I have read all three bug reports in their entirety and Bug 6624 seems
to be a very legitimate concern. To quote Mark in the bug description:

> The effect of the bug with SpamAssassin is that tokens are only able
> to be inserted once, but their counts cannot increase, leading to
> terrible bayes results if the bug is not noticed. Also the conversion
> from db fails, as reported by Dave.
> 
> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
> provide a workaround for the MySQL server bug, and improved debug logging.

How can I discern whether or not this bug does, in fact, affect me? Are
my Bayes results being crippled as a result of this bug?

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.
> 

If there is a good reason, I have yet to discern what it might be. Mark's
comments on the third bug above imply that there is no particular reason
for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I
have seen no performance hit as a result of so doing. (In fact,
performance seems better than with MyISAM in my scripted, once-a-day
training setup.)

The perfectly acceptable performance I'm observing could be because a)
the InnoDB-related resources allocated to MySQL are more than
sufficient, b) the schema that I used has a newly-added INDEX whereas
those prior to it did not, or c) I was sure to use the "MySQL" module
instead of the "SQL" module with my InnoDB setup:

bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL

The bottom line seems to be that for those who have settings like these
in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement (otherwise, the "Illegal mix of
collations" SQL error results and all Bayes SELECT statements fail).
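
For those applying this to a schema file rather than editing it by hand,
a minimal sketch follows. It assumes each CREATE TABLE statement in the
file ends with ") TYPE=MyISAM;" (the syntax bug 6626 refers to) or
") ENGINE=MyISAM;"; the function name convert_bayes_schema is
hypothetical, and the result should be reviewed before loading it.

```shell
# Rewrite the table-option clause of each CREATE TABLE statement read
# from stdin so that it uses InnoDB with a binary UTF-8 collation.
# Lines that do not end a statement with TYPE=MyISAM or ENGINE=MyISAM
# pass through unchanged.
convert_bayes_schema() {
    sed -E 's/\) *(TYPE|ENGINE)=MyISAM *;/) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;/'
}
```

Usage would be something like
"convert_bayes_schema < bayes_mysql.sql > bayes_mysql_innodb.sql",
where both file names are placeholders.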

In any event, I'm a little concerned because while the majority of
messages are now tagged with BAYES_* hits, I am now seeing this debug
output on a significant percentage of messages ("cannot use bayes on
this message; not enough usable tokens found"):

# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

--
Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
= 2342
Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
message; not enough usable tokens found
Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
(39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
(0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
tests_pri_-400: 1

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-17 Thread Ben Johnson


On 4/17/2013 5:39 PM, Tom Hendrikx wrote:
> On 17-04-13 21:40, Ben Johnson wrote:
>> Ideally, using the above directives will tell us whether we're
>> experiencing timeouts, or these spam messages are simply not in the
>> Pyzor or Razor2 databases.
>>
>> Off the top of your head, do you happen to know what will happen if one
>> or both of the Pyzor/Razor2 tests timeout? Will some indication that the
>> tests were at least *started* still be added to the SA header?
> 
> The razor client (don't know about pyzor) logs its activity to some
> logfile in ~razor. There you can see what (or what not) is happening.
> 
> It's also possible to raise logfile verbosity by changing the razor
> config file. See the man page for details.
> 
> --
> Tom
> 

Tom, thanks for the excellent tip regarding Razor's own log file.
Tailing that log will make this kind of debugging much simpler in the
future. Much appreciated.

One of the reasons for which I also like the idea of using Daniel
McDonald's include-scores-in-header rule (for Pyzor and Razor) is that
the data is embedded right in the message, which can be useful. For one,
this makes the scoring data more "portable" (it stays with the message
to which it applies). Secondly, when tailing a log, it can be difficult
to determine where the data relevant to one message ends and another begins.

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-17 Thread Ben Johnson


On 4/17/2013 6:47 PM, Ben Johnson wrote:
> 
> 
> On 4/17/2013 5:05 PM, Kris Deugau wrote:
>> Ben Johnson wrote:
>>> Is there anything else that would cause Bayes tests not to be performed? I
>>> ask because other types of tests are disabled automatically under
>>> certain circumstances (e.g., network tests), and I'm wondering if there
>>> is some obscure combination of factors that causes Bayes tests not to be
>>> performed.
>>
>> Do you have bayes_sql_override_username set?  (This forces use of a
>> single Bayes DB for all SA calls that reference this configuration file
>> set.)
>>
>> If not, you may be getting a Bayes DB for each user on your system;
>> IIRC this is supported (sort of) and default with Amavis.
>>
>> -kgd
>>
> 
> Thanks for jumping-in here, Kris.
> 
> Yes, I do have the following in my SA local.cf:
> 
> bayes_sql_override_username amavis
> 
> So, all users are sharing the same Bayes DB. I train Bayes daily and the
> token count, etc., etc. all look good and correct.
> 
> Just a quick update to my previous post.
> 
> The Pyzor and Razor2 score information is indeed coming through for the
> handful of messages that have landed since I made those configuration
> changes. So, all seems to be well on the Pyzor / Razor2 front.
> 
> However, I still don't see any evidence that Bayes testing was performed
> on the messages that are "slipping through".
> 
> It bears mention that *most* messages do indeed show evidence of Bayes
> scoring.
> 
> --- OH, SNAP! I found the root cause. ---
> 
> Well, when I went to confirm the above statement, regarding most
> messages showing evidence of Bayes scoring, I realized that *none* show
> evidence of it since 3/23! No wonder all of this garbage is slipping
> through!
> 
> I recognized the date 3/23 immediately; it was the date on which we
> upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no
> knowledge of ISPConfig, it is basically a FOSS solution to managing vast
> numbers of websites, domains, mailboxes, etc., as the name implies.)
> 
> We also updated OS packages (security only) on that day.
> 
> After diff-ing all of the relevant service configuration files
> (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any
> discrepancies.
> 
> Then, I tried:
> 
> -
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
> 
> Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis
> Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358)
> Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established
> Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3
> Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1
> Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham
> = 2334
> Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163
> Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal
> mix of collations for operation ' IN '
> Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this
> message; none of the tokens were found in the database
> Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef
> Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804
> (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%),
> poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%),
> tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18
> (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%),
> tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804
> (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%),
> check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211
> (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%)
> -
> 
> Check-out the message buried half-way down:
> 
> bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
> IN '
> 
> I have run into this unsightly message before, but in that case, I could
> see the entire query, which enabled me to change the collations accordingly.
> 
> In this case, I have no idea what the original query might have been.
> 
> Further, I have no idea what changed that introduced this problem on 3/23.
> 
>

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-17 Thread Ben Johnson


On 4/17/2013 5:05 PM, Kris Deugau wrote:
> Ben Johnson wrote:
>> Is there anything else that would cause Bayes tests not to be performed? I
>> ask because other types of tests are disabled automatically under
>> certain circumstances (e.g., network tests), and I'm wondering if there
>> is some obscure combination of factors that causes Bayes tests not to be
>> performed.
> 
> Do you have bayes_sql_override_username set?  (This forces use of a
> single Bayes DB for all SA calls that reference this configuration file
> set.)
> 
> If not, you may be getting a Bayes DB for each user on your system;
> IIRC this is supported (sort of) and default with Amavis.
> 
> -kgd
> 

Thanks for jumping-in here, Kris.

Yes, I do have the following in my SA local.cf:

bayes_sql_override_username amavis

So, all users are sharing the same Bayes DB. I train Bayes daily and the
token count, etc., etc. all look good and correct.

Just a quick update to my previous post.

The Pyzor and Razor2 score information is indeed coming through for the
handful of messages that have landed since I made those configuration
changes. So, all seems to be well on the Pyzor / Razor2 front.

However, I still don't see any evidence that Bayes testing was performed
on the messages that are "slipping through".

It bears mention that *most* messages do indeed show evidence of Bayes
scoring.

--- OH, SNAP! I found the root cause. ---

Well, when I went to confirm the above statement, regarding most
messages showing evidence of Bayes scoring, I realized that *none* show
evidence of it since 3/23! No wonder all of this garbage is slipping
through!

I recognized the date 3/23 immediately; it was the date on which we
upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no
knowledge of ISPConfig, it is basically a FOSS solution to managing vast
numbers of websites, domains, mailboxes, etc., as the name implies.)

We also updated OS packages (security only) on that day.

After diff-ing all of the relevant service configuration files
(amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any
discrepancies.

Then, I tried:

-
# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis
Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358)
Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established
Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3
Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1
Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham
= 2334
Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163
Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal
mix of collations for operation ' IN '
Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this
message; none of the tokens were found in the database
Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef
Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804
(15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%),
poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%),
tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18
(0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%),
tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804
(15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%),
check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211
(4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%)
-

Check-out the message buried half-way down:

bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
IN '

I have run into this unsightly message before, but in that case, I could
see the entire query, which enabled me to change the collations accordingly.

In this case, I have no idea what the original query might have been.

Further, I have no idea what changed that introduced this problem on 3/23.

Was it a MySQL upgrade? Was it an ISPConfig change?

Has anybody else run into this?
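
Since the failing query itself is not shown in the debug output, one
starting point might be to list the collation of every text column in
the Bayes database and compare it with the server and connection
collations. This is only a sketch: the helper name bayes_collation_sql
is hypothetical, and the schema name is passed as an argument because
the real database name is not stated in this thread. It prints a query
against information_schema; it does not run it.

```shell
# Print a query listing the character set and collation of every
# text column in the given schema (first argument). Non-text columns
# have a NULL collation and are filtered out.
bayes_collation_sql() {
    cat <<EOF
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = '$1'
  AND collation_name IS NOT NULL
ORDER BY table_name, column_name;
EOF
}
```

The output can be piped into the mysql client. Running
"SHOW VARIABLES LIKE 'collation%';" in the same session shows the
server and connection collations to compare against.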

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-17 Thread Ben Johnson


Daniel, thanks for the quick reply. I'll reply inline, below.

On 4/16/2013 5:01 PM, Daniel McDonald wrote:
> 
> 
> 
> On 4/16/13 2:59 PM, "Ben Johnson"  wrote:
> 
>> Are there any normal circumstances under which Bayes tests are not run?
> Yes, if "use_bayes 0" is set in the local.cf file.

I checked in /etc/spamassassin/local.cf, and find the following:

use_bayes 1

So, that seems not to be the issue.
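
When the option might be set in more than one file, a small helper can
report which value comes last. The sketch below assumes that a later
use_bayes line overrides an earlier one and makes no attempt to model
SpamAssassin's full configuration loading order; the function name
effective_use_bayes is hypothetical.

```shell
# Print the last use_bayes value found in the configuration text on
# stdin, ignoring comments; print "unset" if the option never appears.
effective_use_bayes() {
    awk '
        { sub(/#.*/, "") }                # drop comments
        $1 == "use_bayes" { value = $2 }  # later lines override earlier ones
        END { print (value == "" ? "unset" : value) }
    '
}
```

For the two files mentioned in this message, usage would be:
cat /etc/spamassassin/local.cf /var/lib/amavis/.spamassassin/user_prefs | effective_use_bayes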

>>
>> If not, are there circumstances under which Bayes tests are run but
>> their results are not included in the message headers? (I have tag_level
>> set to -999, so SA headers are always added.)
> 
> That sounds like an amavisd command, you may want to check in
> ~amavisd/.spamassassin/user_prefs as well

I checked in the equivalent path on my system
(/var/lib/amavis/.spamassassin/user_prefs) and the entire file is
commented-out. So, that seems not to be the issue, either.

Is there anything else that would cause Bayes tests not to be performed? I
ask because other types of tests are disabled automatically under
certain circumstances (e.g., network tests), and I'm wondering if there
is some obscure combination of factors that causes Bayes tests not to be
performed.

>>
>> Likewise, for the vast majority of spam messages that slip-through, I
>> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
>> that this observation indicates that the network tests were performed,
>> but did not contribute to the SA score. Is this assumption valid?
> Yes.

Okay, very good.

It occurred to me that perhaps the Pyzor and/or Razor2 tests are
timing-out (both timeouts are set to 15 seconds) some percentage of the
time, which may explain why these tests do not contribute to a given
message's score.

That's why I asked about forcing the results into the SA header.

>>
>> Also, is there some means by which to *force* Pyzor and Razor2 scores to
>> be added to the SA header, even if they did not contribute to the score?
> 
> I imagine you would want something like this:
> 
> full     RAZOR2_CF_RANGE_0_50  eval:check_razor2_range('','0','50')
> tflags   RAZOR2_CF_RANGE_0_50  net
> reuse    RAZOR2_CF_RANGE_0_50
> describe RAZOR2_CF_RANGE_0_50  Razor2 gives confidence level under 50%
> score    RAZOR2_CF_RANGE_0_50  0.01
> 
> full     RAZOR2_CF_RANGE_E4_0_50  eval:check_razor2_range('4','0','50')
> tflags   RAZOR2_CF_RANGE_E4_0_50  net
> reuse    RAZOR2_CF_RANGE_E4_0_50
> describe RAZOR2_CF_RANGE_E4_0_50  Razor2 gives engine 4 confidence level below 50%
> score    RAZOR2_CF_RANGE_E4_0_50  0.01
> 
> full     RAZOR2_CF_RANGE_E8_0_50  eval:check_razor2_range('8','0','50')
> tflags   RAZOR2_CF_RANGE_E8_0_50  net
> reuse    RAZOR2_CF_RANGE_E8_0_50
> describe RAZOR2_CF_RANGE_E8_0_50  Razor2 gives engine 8 confidence level below 50%
> score    RAZOR2_CF_RANGE_E8_0_50  0.01

This seems to work brilliantly. I can't thank you enough; I never would
have figured this out.

Ideally, using the above directives will tell us whether we're
experiencing timeouts, or these spam messages are simply not in the
Pyzor or Razor2 databases.

Off the top of your head, do you happen to know what will happen if one
or both of the Pyzor/Razor2 tests timeout? Will some indication that the
tests were at least *started* still be added to the SA header?

>>
>> To refresh folks' memories, we have verified that Bayes is set up
>> correctly (database was wiped and now training is done manually and is
>> supervised), and that network tests are being performed when messages
>> are scanned.
>>
>> Thanks for sticking with me through all of this, guys!
>>
>> -Ben
> 

Thanks again, Daniel!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-16 Thread Ben Johnson


Apologies for resurrecting the thread, but I never did receive a
response to this particular aspect of the problem (asked on Jan 18,
2013, 8:51 AM). This is probably because I replied to my own post before
anyone else did, and changed the subject slightly.

We are being hammered pretty hard with spam (again), and as I inspect
messages whose score is below tag2_level, BAYES_* is conspicuously
absent from the headers.

To reiterate my question:

>> Are there any normal circumstances under which Bayes tests are not run?

If not, are there circumstances under which Bayes tests are run but
their results are not included in the message headers? (I have tag_level
set to -999, so SA headers are always added.)

Likewise, for the vast majority of spam messages that slip-through, I
see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
that this observation indicates that the network tests were performed,
but did not contribute to the SA score. Is this assumption valid?

Also, is there some means by which to *force* Pyzor and Razor2 scores to
be added to the SA header, even if they did not contribute to the score?

To refresh folks' memories, we have verified that Bayes is set up
correctly (database was wiped and now training is done manually and is
supervised), and that network tests are being performed when messages
are scanned.

Thanks for sticking with me through all of this, guys!

-Ben



On 1/18/2013 11:51 AM, Ben Johnson wrote:
> So, I've been keeping an eye on things again today.
> 
> Overall, things look pretty good, and most spam is being blocked
> outright at the MTA and scored appropriately in SA if not.
> 
> I've been inspecting the X-Spam-Status headers for the handful of
> messages that do slip through and noticed that most of them lack any
> evidence of the BAYES_* tests. Here's one such header:
> 
> No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
> SPF_PASS=-0.001] autolearn=disabled
> 
> The messages that were delivered just before and after this one do have
> evidence of BAYES_* tests, so, it's not as though something is
> completely broken.
> 
> Are there any normal circumstances under which Bayes tests are not run?
> Do I need to turn debugging back on and wait until this happens again?
> 
> Thanks for all the help, everyone!
> 
> -Ben
> 
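
To spot headers like the one quoted above without reading each one by
eye, a minimal sketch of a shell check follows. It assumes the header
value has the "tests=[...]" form shown in that example and that Bayes
hits are named BAYES_ followed by digits; the function name
has_bayes_test and its output labels are hypothetical.

```shell
# Read an X-Spam-Status header value on stdin and report whether any
# BAYES_* test appears inside its tests=[...] list. Folded header
# lines are joined first so the list can be matched as one string.
has_bayes_test() {
    if tr -d '\n' | grep -Eq 'tests=\[[^]]*BAYES_[0-9]+'; then
        echo "bayes-present"
    else
        echo "bayes-absent"
    fi
}
```

Fed the header value from the January message, this prints
"bayes-absent".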


Re: Telling BAYES not to learn?

2013-02-07 Thread Ben Johnson


On 2/7/2013 11:13 AM, Marc Perkel wrote:
> 
> On 2/7/2013 6:58 AM, RW wrote:
>> On Tue, 05 Feb 2013 07:20:24 -0800
>> Marc Perkel wrote:
>>
>>> is there a way I can put something in a rule that would cause bayes
>>> not to learn - such as a rule that detects bayes poisoning?
>> Why do you think this is a good idea?
>>
>>
> Because when a message uses invisible text to poison bayes then I don't
> want to learn that because it will make bayes less effective.
> 

Invisible text is a problem only for humans, not for machines. So, it
sounds as though the problem you're describing relates to reviewing
messages, manually (with your eyes), and taking some action as a result.

If this is so, why not read the messages in *plaintext*, so you see the
"invisible text" and can therefore act accordingly?


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-02-06 Thread Ben Johnson


On 2/1/2013 7:58 PM, John Hardin wrote:
> On Sat, 2 Feb 2013, RW wrote:
> 
>> ALLOWING APPENDS
>>   By appends we mean the case of mail moving when the source folder is
>>   unknown, e.g. when you move from some other account or with tools
>>   like offlineimap. You should be careful with allowing APPENDs to
>>   SPAM folders. The reason for possibly allowing it is to allow
>>   not-SPAM --> SPAM transitions to work and be trained. However,
>>   because the plugin cannot know the source of the message (it is
>>   assumed to be from OTHER folder), multiple bad scenarios can happen:
>>
>>   1. SPAM --> SPAM transitions cannot be recognised and are trained;
>>   2. TRASH --> SPAM transitions cannot be recognised and are trained;
>>   3. SPAM --> not-SPAM transitions cannot be recognised therefore
>>      training good messages will never work with APPENDs.
>>
>>
>> I presume that the plugin works by monitoring COPY commands and so
>> can't work properly when a move is done by FETCH-APPEND-DELETE.
>>
>> For sa-learn the problem would be 3, but I don't see how that is
>> affected by allowing appends on the spam folder.
> 
> Yeah, all of that sounds like they're talking about non-vetted training
> mailboxes where the users are effectively talking directly to sa-learn.
> 
> I think I may see at least part of what they are driving at.
> 
> If one user trains a message as ham and another user who got a copy of
> the same message trains it as spam, who wins?
> 
> Absent some conflict-detection mechanism, the last mailbox trained
> (either spam or ham) wins.
> 
> As for the other two:
> 
> spam -> spam transitions don't matter, sa-learn recognises message-IDs
> and won't learn from the same message in the same corpus more than once
> (i.e. having the same message in the spam corpus multiple times does not
> "weight" the tokens learned from that message). So (1) may be a
> performance concern but it won't affect the database.
> 
> trash -> spam transition being learned is a problem how?
> 
> That latter brings up another concern for the vetted-corpora model: if a
> message is *removed* from a training corpora mailbox rather than
> reclassified, you'd have to wipe and retrain your database from scratch
> to remove that message's effects.
> 
> So, you need *three* vetted corpus mailboxes: spam, ham, and
> should-not-have-been-trained (forget). Rather than deleting a message
> from the ham or spam corpus mailbox you move it to the forget mailbox
> and in the next training pass sa-learn forgets the message and removes
> it from the forget mailbox. This would be some special scripting,
> because you can't just "sa-learn --forget" a whole mailbox.
> 
> There would also need to be an audit process to detect whether the same
> message_id is in both the ham and spam corpus mailboxes, so that the
> admin can delete (NOT forget) the incorrect classification, or forget
> the message if neither classification is reasonable.
> 

You reveal some crucial information with regard to corpora management
here, John.

I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?

With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome

> because you can't just "sa-learn --forget" a whole mailbox.

Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same or similar as with the
existing --ham and --spam switches?
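
For what it's worth, here is a minimal sketch of the per-message loop I have in mind for the "Forget" folder. It assumes the folder is a directory with one message per file (e.g. the cur/ directory of a Maildir folder); the function name and the SA_LEARN override are my own inventions for illustration, not part of SpamAssassin. It relies only on "sa-learn --forget <file>", and a message is removed only when sa-learn exits successfully.

```shell
#!/bin/bash

# Forget every message file in a directory, one sa-learn call per message.
# SA_LEARN is a hypothetical override so the loop can be dry-run with a stub.
forget_folder() {
    local dir=$1 msg
    for msg in "$dir"/*; do
        # Skip sub-directories and the unmatched-glob case.
        [ -f "$msg" ] || continue
        if "${SA_LEARN:-sa-learn}" --forget "$msg"; then
            # Forgotten successfully, so drop it from the Forget folder.
            rm -f -- "$msg"
        else
            echo "sa-learn --forget failed for $msg; leaving it in place" >&2
        fi
    done
}
```

It would be called from the regular training job, e.g. forget_folder /path/to/Forget/cur, before the usual ham and spam passes.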

Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.

2.) I disagree with the end-user's classification.
a.) Because the message was submitted as ham but is really spam (or
vice versa)
b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (i.e., move it
into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
but it doesn't address these issues, specifically.
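
On the audit step John mentions (the same Message-ID turning up in both the ham and spam corpus), a rough sketch of how I might check for it follows. It assumes each corpus is a directory with one message per file, and it only reads a Message-ID that sits on a single header line, so folded headers would be missed. The function names are made up for this example.

```shell
#!/bin/bash

# Print the Message-ID of each message file in a directory, sorted and unique.
message_ids() {
    local dir=$1 f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Look only at the header block; stop at the first blank line.
        awk 'tolower($0) ~ /^message-id:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }
             /^$/ { exit }' "$f"
    done | sort -u
}

# Print the Message-IDs that appear in both corpus directories.
conflicting_ids() {
    comm -12 <(message_ids "$1") <(message_ids "$2")
}
```

Running conflicting_ids against the Ham and Spam corpus directories should list the IDs that need a manual decision; an empty result means no overlap was found.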

Thanks again!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-02-06 Thread Ben Johnson


On 2/1/2013 12:00 PM, John Hardin wrote:
> On Fri, 1 Feb 2013, Ben Johnson wrote:
> 
>> John, thanks for pointing out the problems associated with re-sending
>> the messages via sendmail.
>>
>> I threw a line out to the Dovecot users group and learned how to move
>> messages without going through the MTA. Dovecot has a utility
>> executable, "deliver", which is well-suited to the task.
>>
>> For those who may have a similar need, here's the Dovecot Antispam pipe
>> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
>> mailing list:
>>
>> ---
>> #!/bin/bash
>>
>> mode=
>> for opt; do
>>     if test "x$opt" = "x--ham"; then
>>         mode=HAM
>>         break
>>     elif test "x$opt" = "x--spam"; then
>>         mode=SPAM
>>         break
>>     fi
>> done
>>
>> if test -n "$mode"; then
>>     # options from http://wiki1.dovecot.org/LDA
>>     /usr/lib/dovecot/deliver -d u...@example.com -m "Training.$mode"
>> fi
>>
>> exit 0
>> ---
> 
> That seems a lot better.
> 
>> Regarding the second point, I'm not sure I understand the problem. If
>> someone drags a message from Trash to SPAM, shouldn't it be submitted
>> for learning as spam?
>>
>> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
>> whole strategy go somewhat limp if ham cannot be submitted for training?
>>
>> John and RW, do you recommend enabling or disabling the append option,
>> given the way I'm reviewing the submissions and sorting them manually?
> 
> I think they're proceeding from the assumption of *un-reviewed*
> training, i.e. blind trust in the reliability of the users.
> 
> If it's possible to enable IMAP Append on a per-folder basis then
> enabling it only on your training inbox folders shouldn't be an issue -
> the messages won't be trained until you've reviewed them.
> 
> Without that level of fine-grain control I still don't see an issue from
> this if you can prevent the users from adding content directly to the
> folders that sa-learn actually processes. If IMAP Append only applies to
> "shared" folders then there shouldn't be a problem - configure sa-learn
> to learn from folders in *your account*, that nobody else can access
> directly.
> 

Thanks, John.

If I'm understanding you correctly, your assessment is that enabling
IMAP append in the Antispam plug-in configuration (not the default, by
the way) shouldn't cause problems for my Bayes training setup, primarily
because users cannot train Bayes unsupervised.

If that is so, what's the real benefit to enabling this "feature" that
is off by default? Users will be able to submit messages for training
while "offline" and when they reconnect the plug-in will be triggered
and the messages copied to the training mailbox?

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-02-01 Thread Ben Johnson


On 1/31/2013 5:50 PM, RW wrote:
> On Thu, 31 Jan 2013 12:12:15 -0800 (PST)
> John Hardin wrote:
> 
>> On Thu, 31 Jan 2013, Ben Johnson wrote:
>>
> 
>>> So, I finally got around to tackling this change.
>>>
>>> With a couple of simple modifications, I was able to achieve the
>>> desired result with the Dovecot Antispam plug-in.
>>>
>>> Basically, I changed the last two directive values from the switches
>>> that are normally passed to the "sa-learn" binary (--spam and
>>> --ham) to destination email addresses that are passed to "sendmail"
>>> in my revised pipe script.
>>
>> Passing the messages through sendmail again isn't optimal as that
>> will make further changes to the headers. This may have effects on
>> the quality of the learning, unless the original message is attached
>> as an RFC-822 attachment to the message being sent to the corpus
>> mailbox, which of course means you then can't just run sa-learn
>> directly against that mailbox - the review process would involve
>> moving the attachment as a standalone message to the spam or ham
>> learning mailbox.
>>
>> Ideally you want to just move the messages between mailboxes without 
>> involving another delivery processing. I don't know enough about
>> Dovecot or your topology to say whether that's going to be as easy as
>> using sendmail to mail the message to you.
> 
> Actually that's the way that the dovecot plugin works. I think that the
> sendmail option is mainly a way to get training done on a remote
> machine - it's a standard feature of DSPAM for which the plugin was
> originally developed.
> 
> When I looked at the plugin it seemed to have quite a serious flaw.
> IIRC it disables IMAP APPENDs on the Spam folder which makes it
> incompatible with synchronisation tools like OfflineImap and probably
> some IMAP clients that implement offline support in the same way.
> 

John, thanks for pointing out the problems associated with re-sending
the messages via sendmail.

I threw a line out to the Dovecot users group and learned how to move
messages without going through the MTA. Dovecot has a utility
executable, "deliver", which is well-suited to the task.

For those who may have a similar need, here's the Dovecot Antispam pipe
script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
mailing list:

---
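#!/bin/bash

# Map the switch passed by the Antispam plug-in to a training folder suffix.
mode=
for opt; do
    if test "x$opt" = "x--ham"; then
        mode=HAM
        break
    elif test "x$opt" = "x--spam"; then
        mode=SPAM
        break
    fi
done

if test -n "$mode"; then
    # options from http://wiki1.dovecot.org/LDA
    # deliver reads the message from stdin and files it in the named folder.
    /usr/lib/dovecot/deliver -d u...@example.com -m "Training.$mode"
fi

exit 0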
---


And here are the Antispam plug-in options:


---
  # For Dovecot < 2.0.
  antispam_spam_pattern_ignorecase = SPAM;JUNK
  antispam_mail_tmpdir = /tmp
  antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
  antispam_mail_spam = --spam
  antispam_mail_notspam = --ham
---

RW, thank you for underscoring the issue with IMAP appends. It looks as
though a configuration directive exists to control this behavior:

# Whether to allow APPENDing to SPAM folders or not. Must be set to
# "yes" (case insensitive) to be activated. Before activating, please
# read the discussion below.
# antispam_allow_append_to_spam = no

Unfortunately, I don't fully understand the implications of enabling or
disabling this option. Here's the "discussion below" that is referenced
in the above comment:

---
ALLOWING APPENDS?

You should be careful with allowing APPENDs to SPAM folders. The reason
for  possibly  allowing it is to allow not-SPAM --> SPAM transitions to
work with offlineimap. However, because with APPEND the  plugin  cannot
know the source of the message, multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained

2. the same holds for Trash --> SPAM transitions

Additionally,   because   we   cannot   recognise   SPAM  -->  not-SPAM
transitions, training good messages will never work with APPEND.
---

In consideration of the first point, what is a "SPAM --> SPAM
transition"? Is that when the mailbox contains more than one "spam
folder", e.g., "JUNK" and "SPAM", and the user drags a message from one
to the other?

Regarding the second point, I'm not sure I understand the problem. If
someone drags a message from Trash to SPAM, shouldn't it be submitted
for learning as spam?

The last sentence sounds like somewhat of a deal-breaker. Doesn't my
whole strategy go somewhat limp if ham cannot be submitted for training?

John and RW, do you recommend enabling or disabling the append option,
given the way I'm reviewing the submissions and sorting them manually?

Sorry for all the questions! And thanks!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-31 Thread Ben Johnson


On 1/15/2013 5:22 PM, John Hardin wrote:
>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.

So, I finally got around to tackling this change.

With a couple of simple modifications, I was able to achieve the desired
result with the Dovecot Antispam plug-in.

In dovecot.conf:

-
plugin {
  # [...]

  # For Dovecot < 2.0.
  antispam_spam_pattern_ignorecase = SPAM;JUNK
  antispam_mail_tmpdir = /tmp
  antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
  antispam_mail_spam = proposed-s...@example.com
  antispam_mail_notspam = proposed-...@example.com
}
-

Basically, I changed the last two directive values from the switches
that are normally passed to the "sa-learn" binary (--spam and --ham) to
destination email addresses that are passed to "sendmail" in my revised
pipe script.

Here is the full pipe script, /usr/bin/sa-learn-pipe.sh (the original
commands are commented out with two pound symbols [##]):

-
#!/bin/sh

# Add "starting now" string to log.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log

# Copy the message contents (read from stdin) to a temporary text file.
cat <&0 >> /tmp/sendmail-msg-$$.txt

CURRENT_USER=$(whoami)

##echo "Calling (as user $CURRENT_USER) '/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log
echo "Calling (as user $CURRENT_USER) 'sendmail $* < /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log

# Originally: execute sa-learn, with the passed ham/spam argument, and the
# temporary message contents.
# Now: hand the temporary message contents to sendmail, addressed to the
# recipient passed as the argument.
# Send the output to the log file while redirecting stderr to stdout (so
# we capture debug output).
##/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1
sendmail $* < /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1

# Remove the temporary message.
rm -f /tmp/sendmail-msg-$$.txt

# Add "ending now" string to log.
echo "$$-end" >> /tmp/sa-learn-pipe.log

# Exit with "success" status code.
exit 0
-

It seems as though creating a temporary copy of the message is not
strictly necessary, as the message contents could be passed to the
"sendmail" command via standard input (stdin), but creating the copy
could be useful in debugging.
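
To illustrate the point about stdin, here is a sketch of the same idea without the temporary file. It is untested against the plug-in itself; the function name and the SENDMAIL and PIPE_LOG overrides are hypothetical knobs I added so the function can be exercised with a stub instead of the real binary.

```shell
#!/bin/bash

# Forward the message on stdin to the recipient(s) given as arguments.
# SENDMAIL and PIPE_LOG are hypothetical overrides, useful for dry runs.
forward_for_review() {
    local log=${PIPE_LOG:-/tmp/sa-learn-pipe.log}

    # Add "starting now" string to log.
    echo "$$-start ($*)" >> "$log"

    # sendmail inherits this function's stdin, so the message is streamed
    # straight through; its output and errors go to the log.
    "${SENDMAIL:-sendmail}" "$@" >> "$log" 2>&1

    # Add "ending now" string to log.
    echo "$$-end" >> "$log"

    # Always report success to the caller, as the original script does.
    return 0
}
```

In a real pipe script the last line would simply be forward_for_review "$@". The trade-off is that nothing is left on disk to inspect if a delivery goes wrong.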

>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
> 
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.

The steps I've taken above will allow me to review submissions and
educate users who engage in this practice. Thanks again for elucidating
this scenario.

I hope that this approach to user-based SpamAssassin training is useful
to others.

Best regards,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-18 Thread Ben Johnson


So, I've been keeping an eye on things again today.

Overall, things look pretty good, and most spam is being blocked
outright at the MTA and scored appropriately in SA if not.

I've been inspecting the X-Spam-Status headers for the handful of
messages that do slip through and noticed that most of them lack any
evidence of the BAYES_* tests. Here's one such header:

No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
SPF_PASS=-0.001] autolearn=disabled

The messages that were delivered just before and after this one do have
evidence of BAYES_* tests, so, it's not as though something is
completely broken.

Are there any normal circumstances under which Bayes tests are not run?
Do I need to turn debugging back on and wait until this happens again?

Thanks for all the help, everyone!

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-16 Thread Ben Johnson


On 1/16/2013 2:22 PM, Bowie Bailey wrote:
> On 1/16/2013 1:18 PM, Ben Johnson wrote:
>>
>> On 1/16/2013 11:00 AM, John Hardin wrote:
>>> On Wed, 16 Jan 2013, Ben Johnson wrote:
>>>
>>>> Is it possible that the training I've been doing over the last week or
>>>> so wasn't *effective* until recently, say, after restarting some
>>>> component of the mail stack? My understanding is that calling SA via
>>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>>>> to be up-to-date on each call to spamassassin.
>>> That shouldn't be the case. SA and sa-learn both use a shared-access
>>> database; if you're training the database that SA is learning, the
>>> results of training should be effective immediately.
>>>
>> Okay, good. Bowie's response to this question differed (he suggested
>> that Amavis would need to be restarted for Bayes to be updated), but I'm
>> pretty sure that restarting Amavis is not necessary. It seems unlikely
>> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
>> this server) into memory every time that the Amavis service is started.
>> To do so seems self-defeating: more RAM usage, worse performance, etc.
> 
> Actually, I was making a general observation.
> 
> For cases where you would normally need to restart spamd, you will need
> to restart amavis.  This includes things like rule and configuration
> changes.
> 
> Bayes data is read dynamically from your MySQL database and thus does
> not require a restart of amavis/spamd when updated.
> 

My apologies, Bowie. I misinterpreted your response. Thank you very much
for the follow-up and for the clear explanation.

Best regards,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-16 Thread Ben Johnson


On 1/16/2013 11:00 AM, John Hardin wrote:
> On Wed, 16 Jan 2013, Ben Johnson wrote:
> 
>> On 1/15/2013 5:22 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>>
>>>> Wow! Adding several more reject_rbl_client entries to the
>>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>>> seems to be having a tremendous impact. The amount of spam coming
>>>> through has dropped by 90% or more. This was a HUGELY helpful
>>>> suggestion, John!
>>>
>>> Which ones are you using now? There are DNSBLs that are good, but not
>>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>>> SA does scored DNSBL checks.
>>
>> smtpd_recipient_restrictions =
>> reject_rbl_client bl.spamcop.net,
>> reject_rbl_client list.dsbl.org,
>> reject_rbl_client sbl-xbl.spamhaus.org,
>> reject_rbl_client cbl.abuseat.org,
>> reject_rbl_client dul.dnsbl.sorbs.net,
> 
> Several of those are combined into ZEN. If you use Zen instead you'll
> save some DNS queries. See the Spamhaus link I provided earlier for
> details, I don't offhand remember which ones go into ZEN.

Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
acted upon your mutual suggestion regarding ZEN:

reject_rbl_client bl.spamcop.net,
reject_rbl_client zen.spamhaus.org,
reject_rbl_client dnsbl.sorbs.net,

Indeed, block entries for all three lists are being registered in the
mail log. Very nice.
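
For anyone who wants to spot-check a list by hand, here is a small sketch of how a DNSBL lookup is formed: the IPv4 octets are reversed and prepended to the list's zone, and an A record in the answer is taken to mean "listed". The function names are my own, it assumes "dig" is installed, and treating any answer as a listing is a simplification.

```shell
#!/bin/bash

# Reverse the octets of an IPv4 address (a.b.c.d becomes d.c.b.a).
reverse_ipv4() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    printf '%s.%s.%s.%s\n' "$d" "$c" "$b" "$a"
}

# Succeed if the DNSBL zone returns any A record for the address.
# Simplification: every non-empty answer is treated as a listing.
dnsbl_listed() {
    local ip=$1 zone=$2 answer
    answer=$(dig +short "$(reverse_ipv4 "$ip").$zone" A)
    [ -n "$answer" ]
}
```

By convention 127.0.0.2 is a test entry that DNSBLs list, so something like dnsbl_listed 127.0.0.2 zen.spamhaus.org is a reasonable way to confirm that lookups work from the mail server's own resolver.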

It seems as though adding these SMTP-time rejects has blocked about 1/2
of the spam that was coming through previously. Awesome.

>> These are "hard rejects", right? So if this change has reduced spam,
>> said spam would not be accepted for delivery at all; it would be
>> rejected outright. Correct? (And if I understand you, this is part of
>> your concern.)
> 
> Correct.
> 
>> The reason I ask, and a point that I should have clarified in my last
>> post, is that the *volume* of spam didn't drop by 90% (although, it may
>> have dropped by some measure), but rather the accuracy with which SA
>> tagged spam was 90% higher.
> 
> That's odd. That suggests your SA wasn't looking up those DNSBLs, or they
> would have contributed to the score.
> 
> Check your trusted networks setting. One difference between SMTP-time
> and SA-time DNSBL checks is that SMTP-time checks the IP address of the
> client talking to the MTA, while SA-time can go back up the relay chain
> if necessary (e.g. to check the client IP submitting to your ISP if your
> ISP's MTA is between your MTA and the Internet, rather than always
> checking your ISP's MTA IP address).

Are you referring to SA's "trusted_networks" directive? If so, it is
commented-out (presumably by default). Does this need to be set? I've
read the info re: trusted_networks at
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html ,
but I'm struggling to understand it.

If the info is helpful, I have a very simple setup here: a single server
with a single public IP address and a single MTA.

>> Ultimately, I'm wondering if the observed change was simply a product of
>> these message "campaigns" being black-listed after a few days of
>> circulation, and not the Postfix configuration change.
> 
> Maybe.
> 
>> At this point, the vast majority of X-Spam-Status headers include Razor2
>> and Pyzor tests that contribute significantly to the score. I should
>> have mentioned earlier that I installed Razor2 and Pyzor after making my
>> initial post. The only reasons I didn't are that a) they didn't seem to
>> be making a significant difference for the first day or so after I
>> installed them (this could be for the snowshoe reasons we've already
>> discussed), and b) the low Bayes scores seemed to be the real problem
>> anyway.
>>
>> That said, the Bayes scores seem to be much more accurate now, too. I
>> was hardly ever seeing BAYES_99 before, but now almost all spam messages
>> have BAYES_99.
> 
> Odd. SMTP-time hard rejects shouldn't change that.

That's what I figured. I wonder if feeding all of the messages that I
"auto-learned manually" -- messages that were tagged as spam (but for
reasons unrelated to Bayes) -- contributed significantly to this change.
I did this late yesterday afternoon and when I took a status check this
morning, I was seeing BAYES_99 for almost every message.

>> Is it possible that the training I've been doing over the last week or
>> so wasn't *effective* until recently, say, after restarting some
>

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-16 Thread Ben Johnson


On 1/16/2013 2:02 AM, Tom Hendrikx wrote:
> On 1/15/13 5:26 PM, Ben Johnson wrote:
> 
>>
>> In postfix's main.cf:
>>
> 
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
> 
> If you're running postfix, consider using postscreen. It's a recent
> addition to postfix that can also behave in a greylisting-like way, and
> much more.
> 
> Read: http://www.postfix.org/POSTSCREEN_README.html
> 
> --
> Tom
> 

Thanks for the suggestion, Tom!

Unfortunately, I'm stuck on Postfix 2.7 for a while yet, and Postscreen
is available for versions >= 2.8 only.

I will definitely look into it once I'm on 2.8+, however.

Cheers,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-16 Thread Ben Johnson


On 1/15/2013 5:22 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
> 
>>
>>
>> On 1/15/2013 1:55 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>>
>>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in
>>>>> place? Or
>>>>> are they all performed by SA?
>>>>
>>>> In postfix's main.cf:
>>>>
>>>> smtpd_recipient_restrictions = permit_mynetworks,
>>>> permit_sasl_authenticated, check_recipient_access
>>>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>>
>>>> Do you recommend something more?
>>>
>>> Unfortunately I have no experience administering Postfix. Perhaps one of
>>> the other listies can help.
>>
>> Wow! Adding several more reject_rbl_client entries to the
>> smtpd_recipient_restrictions directive in the Postfix configuration
>> seems to be having a tremendous impact. The amount of spam coming
>> through has dropped by 90% or more. This was a HUGELY helpful
>> suggestion, John!
> 
> Which ones are you using now? There are DNSBLs that are good, but not
> quite good enough to trust as hard-reject SMTP-time filters. That's why
> SA does scored DNSBL checks.

smtpd_recipient_restrictions =
reject_rbl_client bl.spamcop.net,
reject_rbl_client list.dsbl.org,
reject_rbl_client sbl-xbl.spamhaus.org,
reject_rbl_client cbl.abuseat.org,
reject_rbl_client dul.dnsbl.sorbs.net,

I acquired this list from the article that I cited a few responses back.
It is quite possible that some of these are obsolete, as the article is
from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is
obsolete, but now I can't find the source.

These are "hard rejects", right? So if this change has reduced spam,
said spam would not be accepted for delivery at all; it would be
rejected outright. Correct? (And if I understand you, this is part of
your concern.)

The reason I ask, and a point that I should have clarified in my last
post, is that the *volume* of spam didn't drop by 90% (although, it may
have dropped by some measure), but rather the accuracy with which SA
tagged spam was 90% higher.

Ultimately, I'm wondering if the observed change was simply a product of
these message "campaigns" being black-listed after a few days of
circulation, and not the Postfix configuration change.

At this point, the vast majority of X-Spam-Status headers include Razor2
and Pyzor tests that contribute significantly to the score. I should
have mentioned earlier that I installed Razor2 and Pyzor after making my
initial post. The only reasons I didn't are that a) they didn't seem to
be making a significant difference for the first day or so after I
installed them (this could be for the snowshoe reasons we've already
discussed), and b) the low Bayes scores seemed to be the real problem
anyway.

That said, the Bayes scores seem to be much more accurate now, too. I
was hardly ever seeing BAYES_99 before, but now almost all spam messages
have BAYES_99.

Is it possible that the training I've been doing over the last week or
so wasn't *effective* until recently, say, after restarting some
component of the mail stack? My understanding is that calling SA via
Amavis, which does not need/use the spamd daemon, forces all Bayes data
to be up-to-date on each call to spamassassin.

It bears mention that I haven't yet dumped the Bayes DB and retrained
using my corpus. I'll do that next and see where we land once the DB is
repopulated.

>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.
>>
>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The p

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson


On 1/15/2013 4:39 PM, Bowie Bailey wrote:
> On 1/15/2013 4:27 PM, Ben Johnson wrote:
>> On 1/15/2013 4:05 PM, Bowie Bailey wrote:
>>> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>>>> One final question on this subject (sorry...).
>>>>
>>>> Is there value in training Bayes on messages that SA classified as spam
>>>> *due to other test scores*? In other words, if a message is classified
>>>> as SPAM due to a block-list test, but the message is new enough for
>>>> Bayes to assign a zero score, should that message be kept and fed to
>>>> sa-learn so that Bayes can soak-up all the tokens from a message
>>>> that is
>>>> almost certainly spam (based on the other tests)?
>>>>
>>>> Am I making any sense?
>>> It is always worthwhile to train Bayes.  In an ideal world, you would
>>> hand-sort and train every email that comes through your system.  The
>>> more mail Bayes sees the more accurate it can be.
>>>
>> Thanks, Bowie. Given your response, would it then be prudent to call
>> "sa-learn --spam" on any message that *other tests* (non-Bayes tests)
>> determine to be spam (given some score threshold)?
> 
> That is exactly what the autolearn setting does.  I let my system run
> with the default autolearn settings.  Some people adjust the thresholds
> and some people prefer to turn off autolearn and do purely manual training.
> 
>> The crux of my question/point is that I don't want to have to feed
>> messages that Bayes "misses" but that other tests identify *correctly*
>> as spam to "sa-learn --spam".
> 
> At one point, I had a script running on my server that looked for
> messages that were marked as spam with a low Bayes rating (BAYES_00 to
> BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60
> to BAYES_99).  I was then able to check the messages and learn them
> properly.  This let me learn from the edge cases that were not being
> scored properly by Bayes while still making it to the correct folder due
> to other rules.
> 
> If you do this, you MUST check the messages yourself prior to learning
> since there is no other way to know whether they should be learned as
> ham or spam.
> 
>> Is there value in implementing something like this? Or is there some
>> caveat that would make doing so self-defeating?
> 
> I find that Bayes autolearn works quite well for me, but others have had
> problems with it.
> 

Ah... I get it. Finally. :)

Excellent info here; thanks again!
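
In case it helps anyone else, here is my rough reading of the edge-case check Bowie describes, as a sketch: flag messages whose X-Spam-Status says "Yes" while carrying a low Bayes test (BAYES_00 to BAYES_40), or "No" while carrying a high one (BAYES_60 to BAYES_99). The function names are hypothetical, and it assumes one message per file with the header in the usual "X-Spam-Status: Yes/No, ..." form. Per Bowie's warning, it only picks candidates for manual review and never calls sa-learn itself.

```shell
#!/bin/bash

# Print the X-Spam-Status header of a message file, unfolded onto one line.
spam_status() {
    awk '
        /^$/ { exit }                                   # end of header block
        /^X-Spam-Status:/ { grab = 1; line = $0; next }
        grab && /^[ \t]/ { sub(/^[ \t]+/, " "); line = line $0; next }
        grab { exit }                                   # next header begins
        END { if (line != "") print line }
    ' "$1"
}

# Succeed if the verdict and the Bayes test disagree (worth a manual look).
needs_review() {
    local status
    status=$(spam_status "$1")
    case "$status" in
        "X-Spam-Status: Yes"*)
            printf '%s\n' "$status" | grep -Eq 'BAYES_(00|05|20|40)' ;;
        "X-Spam-Status: No"*)
            printf '%s\n' "$status" | grep -Eq 'BAYES_(60|80|95|99)' ;;
        *)
            return 1 ;;
    esac
}
```

Looping over a folder and printing the files for which needs_review succeeds would give a short list to check by eye before training anything.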

You guys are heroes... seriously.

Best regards,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson


On 1/15/2013 4:05 PM, Bowie Bailey wrote:
> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>> One final question on this subject (sorry...).
>>
>> Is there value in training Bayes on messages that SA classified as spam
>> *due to other test scores*? In other words, if a message is classified
>> as SPAM due to a block-list test, but the message is new enough for
>> Bayes to assign a zero score, should that message be kept and fed to
>> sa-learn so that Bayes can soak-up all the tokens from a message that is
>> almost certainly spam (based on the other tests)?
>>
>> Am I making any sense?
> 
> It is always worthwhile to train Bayes.  In an ideal world, you would
> hand-sort and train every email that comes through your system.  The
> more mail Bayes sees the more accurate it can be.
> 

Thanks, Bowie. Given your response, would it then be prudent to call
"sa-learn --spam" on any message that *other tests* (non-Bayes tests)
determine to be spam (given some score threshold)?

The crux of my question/point is that I don't want to have to feed
messages that Bayes "misses" but that other tests identify *correctly*
as spam to "sa-learn --spam".

Is there value in implementing something like this? Or is there some
caveat that would make doing so self-defeating?

Thanks a bunch,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson
One final question on this subject (sorry...).

Is there value in training Bayes on messages that SA classified as spam
*due to other test scores*? In other words, if a message is classified
as SPAM due to a block-list test, but the message is new enough for
Bayes to assign a zero score, should that message be kept and fed to
sa-learn so that Bayes can soak-up all the tokens from a message that is
almost certainly spam (based on the other tests)?

Am I making any sense?

Thanks again!

-Ben



Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson


On 1/15/2013 1:55 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
> 
>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>
>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>> are they all performed by SA?
>>
>> In postfix's main.cf:
>>
>> smtpd_recipient_restrictions = permit_mynetworks,
>> permit_sasl_authenticated, check_recipient_access
>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>
>> Do you recommend something more?
> 
> Unfortunately I have no experience administering Postfix. Perhaps one of
> the other listies can help.

Wow! Adding several more reject_rbl_client entries to the
smtpd_recipient_restrictions directive in the Postfix configuration
seems to be having a tremendous impact. The amount of spam coming
through has dropped by 90% or more. This was a HUGELY helpful
suggestion, John!

>>>   http://www.greylisting.org/
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
> 
> One other thing you might try is publishing an SPF record for your
> domain. There is anecdotal evidence that this reduces the raw spam
> volume to that domain a bit.

We do publish SPF records for the domains within our control. The need
to do this arose when senderbase.org, et al., began blacklisting
domains without SPF records. So, we're good there.

>> Given this information, it concerns me that Bayes scores hardly seem
>> to budge when I feed sa-learn nearly identical messages 3+ times.
>> We'll get into that below.
>>
>>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>>> to perform so miserably.
>>>
>>> Agreed.
>>>
>>>> It must be a configuration issue, because I've sa-learn-ed messages
>>>> that are incredibly similar for two days now and not only do their
>>>> Bayes scores not change significantly, but sometimes they decrease.
>>>> And I have a hard time believing that one of my users is sa-train-ing
>>>> these messages as ham and negating my efforts.
>>>
>>> This is why you retain your Bayes training corpora: so that if Bayes
>>> goes off the rails you can review your corpora for misclassifications,
>>> wipe and retrain. Do you have your training corpora? Or do you discard
>>> messages once you've trained them?
>>
>> I had the good sense to retain the corpora.
> 
> Yay!
> 
>>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>>> do you review their submissions? And if the process is automated, do you
>>> retain what they have provided for training so that you can go back
>>> later and do a troubleshooting review?
>>
>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>> They do so unsupervised. Why this could be a problem is obvious. And no,
>> I don't retain their submissions. I probably should. I wonder if I can
>> make a few slight modifications to the shell script that Antispam calls,
>> such that it simply sends a copy of the message to an administrator
>> rather than calling sa-learn on the message.
> 
> That would be a very good idea if the number of users doing training is
> small. At the very least, the messages should be captured to a permanent
> corpus mailbox.

Good idea! I'll see if I can set this up.
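
For what it's worth, the "capture to a permanent corpus" step might look something like the sketch below. It is written in Python purely for illustration; the real Antispam hook in this setup is a shell script whose contents are not shown in this thread. The function name archive_for_review, the spam/ham directory layout, and the .eml suffix are all assumptions.

```python
import hashlib
import os


def archive_for_review(raw_message, label, corpus_dir):
    """Store a user-submitted message in a review corpus instead of training on it.

    raw_message -- the complete message as bytes
    label       -- "spam" or "ham", as chosen by the user
    corpus_dir  -- root of the corpus (hypothetical layout: <root>/spam, <root>/ham)
    Returns the path of the stored copy.
    """
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")
    target_dir = os.path.join(corpus_dir, label)
    os.makedirs(target_dir, exist_ok=True)
    # Name the file after its content hash so a resubmitted message
    # maps to the same file rather than piling up duplicates.
    digest = hashlib.sha256(raw_message).hexdigest()
    path = os.path.join(target_dir, digest + ".eml")
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(raw_message)
    return path
```

An administrator could then review the two directories and run sa-learn on them by hand, which also leaves a corpus to retrain from if the database ever has to be wiped.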

> Do your users also train ham? Are the procedures similar enough that
> your users could become easily confused?

They do. The procedure is implemented via Dovecot's Antispam plug-in.
Basically, moving mail from Inbox to Junk trains it as spam, and moving
mail from Junk to Inbox trains it as ham. I really like this setup
(Antispam + calling SA through Amavis [i.e. not using spamd]) because
the results are effective immediately, which seems to be crucial for
combating this snowshoe spam (performance and scalability aside).

I don't find that procedure to be confusing, but people are different, I
suppose.

>>> Do you have autolearn turned on? My opinion is that autolearn is only
>>> appropriate for a large and very diverse userbase where a sufficiently
>>> "common" corpus of ham can't be manually collected. But then, I don't
>>> admin a Really Large Install, so YMMV.
>>
>> No, I was sure to disable autolearn after the last Bayes fiasco. :)
> 
> OK.
> 
>>> Do you use per-user or s

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson


On 1/14/2013 8:16 PM, John Hardin wrote:
> On Mon, 14 Jan 2013, Ben Johnson wrote:
> 
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
> 
>   http://www.spamhaus.org/faq/section/Glossary
> 
> Basically, a large number of spambots sending the message so that no one
> sending IP can be easily tagged as evil.
> 
> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
> are they all performed by SA?

In postfix's main.cf:

smtpd_recipient_restrictions = permit_mynetworks,
permit_sasl_authenticated, check_recipient_access
mysql:/etc/postfix/mysql-virtual_recipient.cf,
reject_unauth_destination, reject_rbl_client bl.spamcop.net

Do you recommend something more?

> Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
> SMTP-time DNS check in your MTA. It is well-respected and very reliable.
> One thing it includes is ranges of IP addresses that should not ever be
> sending email, so it may help reduce snowshoe spam.
> 
>   http://www.spamhaus.org/zen/

This article looks to be pretty thorough:

http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/

I'll add Spamhaus ZEN and a few others to the list.

> Another tactic that many report good results from is Greylisting. Do you
> have greylisting in place? Does your userbase demand no delays in mail
> delivery? In addition to blocking spam from spambots that do not retry,
> it can delay mail enough for the BLs to get a chance to list new
> IPs/domains, which can reduce the leakage if you happen to be at the
> leading edge of a new delivery campaign.
> 
>   http://www.greylisting.org/

Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.
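
To make the greylisting idea concrete, here is a minimal Python sketch of the decision John describes: defer the first delivery attempt from an unknown sender, and accept once the same sender retries after a delay. The class name Greylist, the (client IP, sender, recipient) key, and the 300-second default are assumptions made for illustration; real greylisting daemons keep this state on disk and expire old entries.

```python
class Greylist:
    """Toy in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.first_seen = {}        # key -> time of the first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return "defer" (temporary rejection) or "accept" for one attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        # Record the first attempt; later attempts keep the original timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "defer"
```

A spambot that never retries stays deferred forever, while a legitimate MTA that retries after the delay gets through. The delay is also what gives the blacklists time to list a new source.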

>> Are most/all of the BL services hash-based?
> 
> Generally:
> 
> DNSBL: Blacklist of IP addresses
> URIBL: Blacklist of domain and host names appearing in URIs
> EMAILBL: (not widely used) Blacklist of email addresses (e.g.
> phishing response addresses)
> Razor, Pyzor: Blacklist of message content checksums/hashes

Perfect; that answers my question.
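
Since IP-based DNSBLs came up: a lookup is an ordinary DNS query for the client's IPv4 address with its octets reversed, prepended to the list's zone. A listed address returns an A record (conventionally in 127.0.0.0/8) and an unlisted one returns NXDOMAIN. The Python sketch below only builds the query name and performs no network lookup; the function name dnsbl_query_name is made up for illustration.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a DNSBL client would query for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    # Strip any leading/trailing dot from the zone before joining.
    return f"{reversed_octets}.{zone.strip('.')}"
```

For example, dnsbl_query_name("127.0.0.2", "zen.spamhaus.org") gives "2.0.0.127.zen.spamhaus.org". Because the key is the sending IP and nothing about the message body, changing one character in the body makes no difference to these lists, whereas it can matter to checksum-based services such as Razor and Pyzor.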

>> In other words, if a known spam message was added yesterday, will it
>> be considered "snowshoe" spam if the spammer sends the same message
>> today and changes only one character within the body?
> 
> No, the diverse IP addresses are the hallmark of "snowshoe", not so much
> the specific message content. If you see identical or generally-similar
> (e.g.) pharma spam coming from a wide range of different IP addresses,
> that's snowshoe.

I see. Given this information, it concerns me that Bayes scores hardly
seem to budge when I feed sa-learn nearly identical messages 3+ times.
We'll get into that below.

>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
> 
> Agreed.
> 
>> It must be a configuration issue, because I've sa-learn-ed messages
>> that are incredibly similar for two days now and not only do their
>> Bayes scores not change significantly, but sometimes they decrease.
>> And I have a hard time believing that one of my users is sa-train-ing
>> these messages as ham and negating my efforts.
> 
> This is why you retain your Bayes training corpora: so that if Bayes
> goes off the rails you can review your corpora for misclassifications,
> wipe and retrain. Do you have your training corpora? Or do you discard
> messages once you've trained them?

I had the good sense to retain the corpora.

> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
> do you review their submissions? And if the process is automated, do you
> retain what they have provided for training so that you can go back
> later and do a troubleshooting review?

Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.

> Do you have autolearn turned on? My opinion is that autolearn is only
> appropriate for a large and very diverse userbase where a sufficiently
> "common" corpus of ham can't be manually collected. But then, I don't
> admin a Really Large Install, so YMMV.

No, I was sure to disable autolearn after the last Bayes fiasco. :)

> Do you use per-user or sitewide Bayes? If per-

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-15 Thread Ben Johnson


On 1/14/2013 7:48 PM, Noel wrote:
> On 1/14/2013 2:59 PM, Ben Johnson wrote:
> 
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
> 
> Snowshoe spam - spreading a spam run across a large number of IPs so
> no single IP is sending a large volume.  Typically also combined
> with "natural language" text, RFC compliant mail servers, verified
> SPF and DKIM, business-class ISP with FCrDNS, and every other
> criteria to look like a legit mail source.  This type of spam is
> difficult to catch.
> 
> http://www.spamhaus.org/faq/section/Glossary#233
> and countless other links if you ask google.
> 
>> Are most/all of the BL services hash-based? In other words, if a known
>> spam message was added yesterday, will it be considered "snowshoe" spam
>> if the spammer sends the same message today and changes only one
>> character within the body?
> 
> No, most all DNS blacklists are based on IP reputation.  Check each
> list's website for their listing policy to see how an IP gets on
> their list; generally honypot email addresses or trusted user
> reports.  Most lists require some number of reports before listing
> an IP to prevent false positives; snowshoe spammers take advantage
> of this.
> 
>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
> 
> Sounds as if your bayes has been improperly trained in the past. 
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.
> 
> 
> 
>   -- Noel Jones
> 

jdow, Noel, and John, I can't thank you enough for your very thorough
responses. Your time is valuable and I sincerely appreciate your
willingness to help.

John, I'll respond to you separately, for the sake of keeping this
organized.

> Ben, do be aware that sometimes you draw the short straw and sit at the
> very start of the spam distribution cycle. In those cases the BLs will
> generally not have been alerted yet so they may not trigger. For those
> situations the rules should be your friends. (I still use my treasured
> set of SARE rules and personally hand crafted rules my partner and I
> have created that fit OUR needs but may not be good general purpose
> rules.)

This makes perfect sense and underscores the importance of a
finely-tuned rule-set. It's become apparent just how dynamic and capable
a monster the spam industry is. No one approach will ever be a panacea,
it seems.

The advice from your second email is well-received, too. Especially the
part about not killing anybody. ;) I do hope fighting spam becomes fun
for me, because so far, it's been an uphill battle! Hehe.

Noel, thanks for excellent responses to my questions.

> Sounds as if your bayes has been improperly trained in the past.
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.

I hope not, because this is my second go-round with the Bayes DB. The
first time (as Mr. Hardin may remember), auto-learning was enabled
out-of-the-box and some misconfiguration or another (seemingly related
to DNSWL_* rules) caused a lot of spam to be learned as ham. With John's
help, I corrected the issues (I hope), which I'll detail in my reply to
John.

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-14 Thread Ben Johnson


On 1/14/2013 2:49 PM, RW wrote:
> On Mon, 14 Jan 2013 13:24:55 -0500
> Ben Johnson wrote:
> 
> 
>> A clear pattern has emerged: the X-Spam-Status headers for very
>> obviously spammy messages never contain evidence that network tests
>> contributed to their SA scores.
>>
>> Ultimately, I need to know whether:
>>
>> a.) Network tests are not being run at all for these messages
>>
>> b.) Network tests are being run, but are failing in some way
>>
>> c.) Network tests are being run, and are succeeding, but return
>> responses that do not contribute to the messages' scores
>>
>> I've had a look at the log entries to which I link in my previous
>> message and I just need a little help interpreting the "dns" and
>> "async" messages.
> 
> As I said before, it's not unusual for snowshoe spam to hit no net
> tests at all. Also obvious spam isn't any more likely to be in a
> blocklist than less obvious spam.
> 
> However,  try adding this to your SpamAssassin configuration, and
> restart the appropriate daemon:
> 
> header   RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
> tflags   RCVD_IN_HITALL net
> score    RCVD_IN_HITALL 0.001
> 
> 
> It should add a dns test that is hit for all mail delivered from an
> IPv4 address.  
> 

Thanks, RW.

I understand that snowshoe spam may not hit any net tests. I guess my
confusion is around what, exactly, classifies spam as "snowshoe".

Are most/all of the BL services hash-based? In other words, if a known
spam message was added yesterday, will it be considered "snowshoe" spam
if the spammer sends the same message today and changes only one
character within the body?

If so, then I guess the only remedy here is to focus on why Bayes seems
to perform so miserably. It must be a configuration issue, because I've
sa-learn-ed messages that are incredibly similar for two days now and
not only do their Bayes scores not change significantly, but sometimes
they decrease. And I have a hard time believing that one of my users is
sa-train-ing these messages as ham and negating my efforts.

I have ensured that the spam token count increases when I train these
messages. That said, I do notice that the token count does not *always*
change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
message(s) examined)". Does this mean that all tokens from these
messages have already been learned, thereby making it pointless to
continue feeding them to sa-learn?

If I receive one more uncaught message about how some mom is angering
doctors by doing something crazy to her face, I'm going to hunt-down the
er and rip her face OFF.

Finally, I added the test you supplied to my SA configuration, restarted
Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

Thanks for all your help,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-14 Thread Ben Johnson


On 1/11/2013 4:27 PM, Ben Johnson wrote:
> I enabled Amavis's SA debugging mode on the server in question and was
> able to extract the debug output for two messages that seem like they
> should definitely be classified as spam.
> 
> Message #1: http://pastebin.com/xLMikNJH
> 
> Message #2: http://pastebin.com/Ug78tPrt
> 
> A couple points of note and a couple of questions:
> 
> a.) There seems to be plenty of network activity, but I don't see any
> "results" (for lack of a better term) for those queries. The final
> X-Spam-Status header that is generated looks like this:
> 
> No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
> RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled
> 
> Does the absence of network tests in the resultant header simply mean
> that none of the network tests contributed to the score? If so, why
> might that be? Are these messages simply "too new" to appear in any
> blacklists?
> 
> b.) The scores for both messages are identical, which, I suppose, is not
> surprising, given that the same exact tests were performed and produced
> the same exact results. Is this normal?
> 
> c.) 45 minutes after receiving Message #2 from above, I received a very
> similar message. The subjects varied only in dollar amount advertised,
> and the bodies varied only in the hyperlink URLs and the footer/signature.
> 
> Here's the debug output: http://pastebin.com/sLMgXrf5
> 
> The second message was scored at 14.75, which seems much better. Of
> course, the second score was so much higher because the
> network/blacklist tests contributed significantly.
> 
> Is the conclusion to be drawn the same as in a) (these messages are "too
> new" to appear in blacklists)?
> 
> One final point of concern on this item: the Bayes score for the first
> of the two emails was BAYES_50=0.8, and I fed the message through
> sa-learn as spam shortly after it arrived. Yet, the Bayes score for the
> second message was BAYES_40=-0.001 -- *lower* than the first. How could
> this be? Is there some rational explanation?
> 
> Thanks for all the help here, guys!
> 
> -Ben

Nobody?

A clear pattern has emerged: the X-Spam-Status headers for very
obviously spammy messages never contain evidence that network tests
contributed to their SA scores.

Ultimately, I need to know whether:

a.) Network tests are not being run at all for these messages

b.) Network tests are being run, but are failing in some way

c.) Network tests are being run, and are succeeding, but return
responses that do not contribute to the messages' scores

I've had a look at the log entries to which I link in my previous
message and I just need a little help interpreting the "dns" and "async"
messages.

Thanks for any insight,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-11 Thread Ben Johnson


On 1/10/2013 3:13 PM, Tom Hendrikx wrote:
> On 10-01-13 19:55, Ben Johnson wrote:
>>
>>
>> On 1/10/2013 1:06 PM, RW wrote:
>>> On Thu, 10 Jan 2013 12:48:07 -0500
>>> Ben Johnson wrote:
>>>> Upon further consideration, this behavior makes perfect sense if the
>>>> mailbox user has moved the message from Inbox to Junk between scans;
>>>> Dovecot's Antispam filter is in use on this server. This action would
>>>> cause the message tokens to be added to the Bayes database, which
>>>> explains why the SA score is higher on subsequent scans, even with
>>>> network tests disabled.
>>>
>>> Also by turning-off network tests you switch to a different score set so
>>> the score for RDNS_NONE rose.
>>>
>>
>> Ahh; I didn't realize that disabling network tests changes the score set
>> entirely. Thanks for the clarification there.
>>
>> So, at this point, I'm struggling to understand how the following happened.
>>
>> Over the course of 15 minutes, I received the same exact message four
>> times. Each time, the message was sent to the same recipient mailbox.
>> The "From" and "Return-Path" headers changed slightly each time, but the
>> message bodies appear to be identical.
>>
>> Here are the X-Spam-Status headers for each message:
>>
>> 1:28 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> 1:35 PM
>>
>> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
>> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled
>>
>> 1:36 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> 1:41 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> Questions:
>>
>> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
>> with the subject "Cash Quick? Get up to 1500 Now", and an equally
>> nefarious body, trigger BAYES_00?
> 
> This will solely depend on the contents of your bayes db. Is this shared
> between users, etc etc. No good answer ready without looking at it.

Yes, the Bayes DB is shared between users. But it seems that focusing on
the "low-hanging fruit" (the network test issues) will be more
productive in the short term.

>> 2.) Why weren't network tests performed on message 2 of 4? This seems to
>> be evidence of the fact that network tests are not being performed some
>> percentage of the time, which could very well be at the root of this
>> whole problem.
> 
> The fact that not a single network test was triggered, is indeed
> suspicious. The DNSBL tests are of course sender-dependent, but
> if the body is the same the URIBL stuff should fire. Maybe your DNS
> queries timed out because your DNS setup is borked? Maybe you should
> temporarily enable debug logging for dns lookups in spamassassin?
> 

I enabled Amavis's SA debugging mode on the server in question and was
able to extract the debug output for two messages that seem like they
should definitely be classified as spam.

Message #1: http://pastebin.com/xLMikNJH

Message #2: http://pastebin.com/Ug78tPrt

A couple points of note and a couple of questions:

a.) There seems to be plenty of network activity, but I don't see any
"results" (for lack of a better term) for those queries. The final
X-Spam-Status header that is generated looks like this:

No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled

Does the absence of network tests in the resultant header simply mean
that none of the network tests contributed to the score? If so, why
might that be? Are these messages simply "too new" to appear in an

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-11 Thread Ben Johnson


On 1/10/2013 4:12 PM, John Hardin wrote:
> On Thu, 10 Jan 2013, Ben Johnson wrote:
> 
>> So, at this point, I'm struggling to understand how the following
>> happened.
>>
>> Over the course of 15 minutes, I received the same exact message four
>> times. Each time, the message was sent to the same recipient mailbox.
>> The "From" and "Return-Path" headers changed slightly each time, but the
>> message bodies appear to be identical.
>>
>> Here are the X-Spam-Status headers for each message:
>>
>> 1:28 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> 1:35 PM
>>
>> No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
>> SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled
>>
>> 1:36 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> 1:41 PM
>>
>> Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
>> RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
>> URIBL_WS_SURBL=1.608] autolearn=disabled
>>
>> Questions:
>>
>> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
>> with the subject "Cash Quick? Get up to 1500 Now", and an equally
>> nefarious body, trigger BAYES_00?
>>
>> 2.) Why weren't network tests performed on message 2 of 4? This seems to
>> be evidence of the fact that network tests are not being performed some
>> percentage of the time, which could very well be at the root of this
>> whole problem.
> 
> How many MTAs do you have? Is it possible the low-scoring one went via a
> different MTA?

Just one; there should be no possibility of that.

> Have you stopped amavisd, killed all of the amavis processes, and
> restarted it?
> 
> 

I have now. And I enabled amavis's $sa_debug option, so we should see a
lot more in the way of useful SA debugging information now.

In fact, I was just able to capture the output that I believe we're after,
and I'll paste a link in my response to RW's message (shortly forthcoming).

Thanks,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-10 Thread Ben Johnson


On 1/10/2013 1:06 PM, RW wrote:
> On Thu, 10 Jan 2013 12:48:07 -0500
> Ben Johnson wrote:
>> Upon further consideration, this behavior makes perfect sense if the
>> mailbox user has moved the message from Inbox to Junk between scans;
>> Dovecot's Antispam filter is in use on this server. This action would
>> cause the message tokens to be added to the Bayes database, which
>> explains why the SA score is higher on subsequent scans, even with
>> network tests disabled.
> 
> Also by turning-off network tests you switch to a different score set so
> the score for RDNS_NONE rose.
> 

Ahh; I didn't realize that disabling network tests changes the score set
entirely. Thanks for the clarification there.
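
For reference, SpamAssassin's "score" directive accepts up to four values per rule, and which one applies depends on whether Bayes and network tests are in use (set 0: neither, set 1: network only, set 2: Bayes only, set 3: both). A small Python sketch of that selection follows. The helper names are hypothetical, and the RDNS_NONE tuple contains only the two values visible in this thread (0.793 with network tests, and 1.2 as displayed by "spamassassin -L"); the other two slots are left as None because they are not shown here.

```python
def score_set_index(bayes_enabled, net_enabled):
    """Return the score set (0-3) SpamAssassin uses for a given configuration."""
    return (2 if bayes_enabled else 0) + (1 if net_enabled else 0)


def effective_score(scores, bayes_enabled, net_enabled):
    """Pick the applicable value from a four-element score tuple."""
    return scores[score_set_index(bayes_enabled, net_enabled)]


# Only sets 2 and 3 are known from this thread; sets 0 and 1 are placeholders.
RDNS_NONE = (None, None, 1.2, 0.793)

with_net = effective_score(RDNS_NONE, bayes_enabled=True, net_enabled=True)
local_only = effective_score(RDNS_NONE, bayes_enabled=True, net_enabled=False)
```

This is why a run with network tests disabled is not simply "the same score minus the network hits": the remaining rules can each carry a different weight.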

So, at this point, I'm struggling to understand how the following happened.

Over the course of 15 minutes, I received the same exact message four
times. Each time, the message was sent to the same recipient mailbox.
The "From" and "Return-Path" headers changed slightly each time, but the
message bodies appear to be identical.

Here are the X-Spam-Status headers for each message:

1:28 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

1:35 PM

No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled

1:36 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

1:41 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

Questions:

1.) I have a fairly well-trained Bayes DB; why on earth does a message
with the subject "Cash Quick? Get up to 1500 Now", and an equally
nefarious body, trigger BAYES_00?

2.) Why weren't network tests performed on message 2 of 4? This seems to
be evidence of the fact that network tests are not being performed some
percentage of the time, which could very well be at the root of this
whole problem.
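
To double-check that the 1:35 PM message differs from the others only in its missing network hits, the headers can be compared mechanically. The following Python sketch parses the tests=[...] list from two of the X-Spam-Status values above and reports which rules are absent; parse_tests is a hypothetical helper, and the header strings are copied from this message with the line folding removed.

```python
import re

# One "RULE_NAME=score" pair inside the tests=[...] list.
TEST_RE = re.compile(r"([A-Z0-9_]+)=(-?\d+(?:\.\d+)?)")


def parse_tests(status):
    """Return {rule name: score} from the tests=[...] part of X-Spam-Status."""
    start = status.index("tests=[") + len("tests=[")
    end = status.index("]", start)
    return {name: float(score) for name, score in TEST_RE.findall(status[start:end])}


HEADER_1_28 = (
    "Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449, "
    "RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001, "
    "T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25, "
    "URIBL_WS_SURBL=1.608] autolearn=disabled"
)
HEADER_1_35 = (
    "No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793, "
    "SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled"
)

with_net = parse_tests(HEADER_1_28)
without_net = parse_tests(HEADER_1_35)

# Rules that fired at 1:28 PM but not at 1:35 PM.
missing = sorted(set(with_net) - set(without_net))
# Total score contributed by those missing rules.
missing_total = sum(with_net[name] for name in missing)
```

The per-rule scores sum to the reported totals (7.008 and -0.374), and the six missing rules, all of them RCVD_IN_* or URIBL_* lookups, account for the entire 7.382-point gap. The non-network rules scored identically in both cases.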

Thanks,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-10 Thread Ben Johnson


On 1/10/2013 12:18 PM, Ben Johnson wrote:
> 
> 
> On 1/10/2013 11:49 AM, RW wrote:
>> On Thu, 10 Jan 2013 11:43:44 -0500
>> Ben Johnson wrote:
>>  
>>
>>> This observation begs the question: why are network tests being
>>> performed for some messages but not others? To my knowledge, no
>>> white/gray/black listing has been done on this box.
>>
>> As has already been said, the score from network tests is commonly a
>> lot higher on retesting because of all the reporting that happened
>> in-between. 
>>
> 
> RW,
> 
> I understand that, but that doesn't explain why if I retest a given
> message by calling SpamAssassin directly, and I *disable network tests*,
> the score is sometimes *higher* than when the message was scanned
> initially with AMaViS.
> 
> When this message came through initially, the X-Spam-Status header was:
> 
> No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
> HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled
> 
> About an hour later, I fed the same message to the spamassassin
> executable, while disabling network tests:
> 
> # spamassassin -L -t -D < /tmp/msg.txt
> 
> Content analysis details:   (5.0 points, 5.0 required)
> 
>  pts rule name      description
> ---- -------------- --------------------------------------------
>  3.8 BAYES_99       BODY: Bayes spam probability is 99 to 100%
>                     [score: 1.]
>  0.0 HTML_MESSAGE   BODY: HTML included in message
>  1.2 RDNS_NONE      Delivered to internal network by a host with
>                     no rDNS
> 
> To restate the question, if network tests are not outright disabled in
> Amavis, why is Amavis returning lower scores than the SA binary does
> when called directly with network tests disabled? Shouldn't the SA score
> with network tests disabled *always* be lower than or equal to the
> Amavis score with network tests enabled (provided that all else is equal)?
> 
> Or am I way off-base here?
> 
> Thanks again,
> 
> -Ben
> 

Upon further consideration, this behavior makes perfect sense if the
mailbox user has moved the message from Inbox to Junk between scans;
Dovecot's Antispam filter is in use on this server. This action would
cause the message tokens to be added to the Bayes database, which
explains why the SA score is higher on subsequent scans, even with
network tests disabled.
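
To see at a glance which rules account for a score change between two scans, the per-rule scores can be compared directly. This is only a sketch: the two dictionaries are copied by hand from the scans quoted in this message, and the function name is made up for illustration. Small per-rule differences (for example RDNS_NONE 0.793 versus 1.2) are to be expected, since the report table rounds to one decimal and SpamAssassin is generally understood to use a different score set when network tests are off.

```python
def rule_changes(first_scan, second_scan):
    """Return rules that disappeared, appeared, or changed score between scans."""
    gone = {r: s for r, s in first_scan.items() if r not in second_scan}
    new = {r: s for r, s in second_scan.items() if r not in first_scan}
    changed = {
        r: (first_scan[r], second_scan[r])
        for r in first_scan.keys() & second_scan.keys()
        if first_scan[r] != second_scan[r]
    }
    return gone, new, changed


# Scores copied from the two scans quoted in this message.
amavis_scan = {"BAYES_50": 0.8, "HTML_MESSAGE": 0.001,
               "RDNS_NONE": 0.793, "SPF_PASS": -0.001}
local_rescan = {"BAYES_99": 3.8, "HTML_MESSAGE": 0.0, "RDNS_NONE": 1.2}

gone, new, changed = rule_changes(amavis_scan, local_rescan)
print("only in first scan: ", sorted(gone))
print("only in second scan:", sorted(new))
print("score changed:      ", sorted(changed))
```

The move from BAYES_50 to BAYES_99 is the only large change, which fits the explanation that the message was learned as spam between the two scans.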

Sorry... I'm still trying to wrap my head around all of this. Lots of
moving parts.

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-10 Thread Ben Johnson


On 1/10/2013 11:49 AM, RW wrote:
> On Thu, 10 Jan 2013 11:43:44 -0500
> Ben Johnson wrote:
>  
> 
>> This observation begs the question: why are network tests being
>> performed for some messages but not others? To my knowledge, no
>> white/gray/black listing has been done on this box.
> 
> As has already been said, the score from network tests is commonly a
> lot higher on retesting because of all the reporting that happened
> in-between. 
> 

RW,

I understand that, but that doesn't explain why if I retest a given
message by calling SpamAssassin directly, and I *disable network tests*,
the score is sometimes *higher* than when the message was scanned
initially with AMaViS.

When this message came through initially, the X-Spam-Status header was:

No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled

About an hour later, I fed the same message to the spamassassin
executable, while disabling network tests:

# spamassassin -L -t -D < /tmp/msg.txt

Content analysis details:   (5.0 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.8 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 1.]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.2 RDNS_NONE              Delivered to internal network by a host with
                            no rDNS

To restate the question, if network tests are not outright disabled in
Amavis, why is Amavis returning lower scores than the SA binary does
when called directly with network tests disabled? Shouldn't the SA score
with network tests disabled *always* be lower than or equal to the
Amavis score with network tests enabled (provided that all else is equal)?

Or am I way off-base here?

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-10 Thread Ben Johnson


On 1/9/2013 9:13 PM, John Hardin wrote:
> On Wed, 9 Jan 2013, Ben Johnson wrote:
> 
>> On 1/9/2013 7:36 PM, wolfgang wrote:
>>>
>>>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,
>>>> URIBL_DBL_SPAM,URIBL_JP_SURBL autolearn=disabled version=3.3.1
>>>
>>> I am not familiar with amavis, but I know that it calls spamassassin in
>>> a special way, depending on the amavis config. Wild guess: could it be
>>> that RBL/URIBL queries are disabled in your amavis config?
>>
>> Thanks for the reply.
>>
>> What you say about the RBL/URIBL tests makes sense.
> 
> Check your amavis configuration to see whether you have network tests
> disabled. That's the simplest explanation.
> 

Thanks, John.

On the surface, network tests appear to be enabled:

# grep -ir sa_local_tests_only /etc/amavis
/etc/amavis/conf.d/20-debian_defaults:$sa_local_tests_only = 0; # only tests which do not require internet access?
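
The same check can be scripted so that commented-out lines and later overrides are not missed. A minimal sketch, assuming amavis reads the conf.d files in name order so that the last assignment wins; the regular expression and the function name are illustrative only:

```python
import re

# Matches an uncommented Perl assignment such as: $sa_local_tests_only = 0;
SETTING_RE = re.compile(r"^\s*\$sa_local_tests_only\s*=\s*(\d+)\s*;", re.MULTILINE)


def network_tests_enabled(config_text):
    """Return True/False from the last assignment found, or None if absent."""
    values = SETTING_RE.findall(config_text)
    if not values:
        return None
    # 0 means "not local-only", i.e. network tests are allowed.
    return int(values[-1]) == 0


sample = "$sa_local_tests_only = 0; # only tests which do not require internet access?\n"
print(network_tests_enabled(sample))
```

To use it on a real system, the text of every file under /etc/amavis/conf.d would be concatenated in sorted order and passed in as config_text.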

Also, some of the incoming messages do contain network test scoring data
in the X-Spam-Status header; here are two examples:

Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5,
RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793,
SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7]
autolearn=disabled

Yes, score=12.266 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
DATE_IN_FUTURE_12_24=3.199, DIET_1=0.001, HTML_MESSAGE=0.001,
RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.7, RCVD_IN_XBL=0.375,
RDNS_NONE=0.793, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25] autolearn=disabled

(Several of those are network tests, right?)

What's strange is that another message was delivered at nearly the same
time as the above two, yet it shows no evidence of network tests being
performed (right?):

No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled

It seems as though the SPAM that slips through never shows evidence of
network tests, whereas the SPAM that is caught (and usually has a high
score -- 10 or higher) always seems to show evidence of network tests.

This observation begs the question: why are network tests being
performed for some messages but not others? To my knowledge, no
white/gray/black listing has been done on this box.
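
One rough way to test the "no network tests" impression is to pull the rule names out of each header and flag the ones that normally need a lookup. This is a sketch under a loud assumption: the prefix list below is a heuristic chosen for illustration, not an official SpamAssassin classification.

```python
# Assumption: rule-name prefixes that usually indicate a DNS or network lookup.
NETWORK_PREFIXES = ("RCVD_IN_", "URIBL_", "SPF_", "DKIM_", "RAZOR", "PYZOR", "DCC_")


def parse_amavis_tests(status):
    """Extract {rule: score} from an amavis-style 'tests=[A=1.0, B=-0.1]' string."""
    start = status.index("tests=[") + len("tests=[")
    inside = status[start:status.index("]")]
    pairs = (item.strip().split("=") for item in inside.split(",") if item.strip())
    return {name: float(score) for name, score in pairs}


def network_rules(status):
    """Return the sorted rule names that match one of the assumed prefixes."""
    return sorted(r for r in parse_amavis_tests(status) if r.startswith(NETWORK_PREFIXES))


caught = ("Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5, "
          "RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793, "
          "SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7] "
          "autolearn=disabled")
missed = ("No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8, "
          "HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled")

print(network_rules(caught))
print(network_rules(missed))
```

SPF rules normally depend on DNS, so an SPF_PASS hit on the message that slipped through may mean that network tests did run and the blocklists simply had not listed the sender yet. Treat that as a hypothesis to verify, not a conclusion.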

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-09 Thread Ben Johnson


On 1/9/2013 7:36 PM, wolfgang wrote:
> On 2013-01-10 01:03, Ben Johnson wrote:
> 
>> I see; I saved the email message out of Thunderbird (with View ->
>> Headers -> All), as a plain text file. Apparently, that process
>> butchers the original message.
> 
> In Thunderbird, rather use File > Save as to save the entire message.
> 
>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,
>> URIBL_DBL_SPAM,URIBL_JP_SURBL autolearn=disabled version=3.3.1
> 
> Rules based on RBL/URIBL checks depend on DNS based blacklist queries. 
> And between the time you first receive an email and the time you 
> re-scan it, the originating client IP and/or URIs from the mail body 
> may have been added to the blacklists after you first received the 
> mail. Did you re-scan the mail with amavis, too, or did you post the 
> X-Spam header lines from the original amavis scan and re-scan the mail 
> with spamassassin significantly later?
> 
> I am not familiar with amavis, but I know that it calls spamassassin in 
> a special way, depending on the amavis config. Wild guess: could it be 
> that RBL/URIBL queries are disabled in your amavis config?
> 
> Hope this helps.
> 
> Cheers,
> 
> wolfgang
> 

Hi, Wolfgang,

Thanks for the reply.

What you say about the RBL/URIBL tests makes sense. I did not rescan the
message with amavis; I posted the X-Spam-Status header contents from the
original scan. The only reason for which I did not rescan the message
with Amavis is that I don't know how to perform a SpamAssassin scan
through Amavis in a manual capacity. And I can't find instructions
regarding the process.

All of that said, less than eight hours elapsed between the original
scan with Amavis and the manual scan with "spamassassin". But, that's
probably long enough for the IP addresses to be blacklisted.
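
For context on why the answer can change over a few hours: a DNSBL check is an ordinary DNS lookup of the reversed IP address under the list's zone, so the result reflects whatever the list contains at the moment of the query. A small sketch of how the query name is formed; the IP address and zone are the ones that appear in the report elsewhere in this thread, and no lookup is actually performed here:

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a DNSBL client looks up for an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(dnsbl_query_name("188.165.126.107", "zen.spamhaus.org"))
```

Resolving that name with a tool such as dig or host at two different times is a simple way to see whether a listing appeared in between.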

If nobody knows how to scan messages through Amavis, maybe I need to
take this question over to the Amavis list for the time being.

Thanks again,

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-09 Thread Ben Johnson


On 1/9/2013 5:36 PM, RW wrote:
> On Wed, 09 Jan 2013 17:14:05 -0500
> Ben Johnson wrote:
> 
>> About five months ago, I experienced a problem that I *thought* I had
>> resolved, but I am observing similar behavior after retraining the
>> Bayes database. While the symptoms are similar, the root cause seems
>> to be different (thankfully). The original problem is documented at
>> http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html
>> .
>>
>> In any case, I am again seeing SA scores that seem way too low for the
>> message content in question. My "glue", as it were, is Amavis-New.
>>
>> In particular, certain messages that are clearly SPAM are scored
>> between 0 and 3 when processed via Amavis. However, if I process the
>> same messages with the "spamassassin" binary, directly, the scores
>> are much higher and much more in-line with what one would expect.
>> ...
>> When I process the same message with spamassassin, directly
>> (spamassassin -t -D < /tmp/msg.txt), the header looks like this:
>>
>> --
>> X-Spam-Status: Yes, score=7.5 required=5.0
>> tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
>> autolearn=disabled version=3.3.1
> 
> 
> This is not better, it indicates that SA didn't recognise it as an
> email, not that it recognised it as a spam. Whatever /tmp/msg.txt was
> it wasn't a properly formatted email.
> 

Thanks for the quick replies, Marius and RW.

I see; I saved the email message out of Thunderbird (with View ->
Headers -> All), as a plain text file. Apparently, that process butchers
the original message.
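
A saved file can be sanity-checked before it is fed to spamassassin by confirming that the headers behind the MISSING_* and NO_RECEIVED hits are actually present. A minimal sketch using Python's standard email parser; the header list and both sample messages are illustrative assumptions, not something SpamAssassin itself does:

```python
from email import message_from_string

# Headers whose absence lines up with the MISSING_* / NO_RECEIVED hits.
EXPECTED = ("Date", "Message-ID", "Subject", "To", "Received")


def missing_headers(raw_text):
    """Return the expected headers that are absent from a raw message."""
    msg = message_from_string(raw_text)
    return [h for h in EXPECTED if msg[h] is None]


good = ("Received: from mail.example.com by mx.example.com; "
        "Wed, 9 Jan 2013 17:00:00 -0500\n"
        "Date: Wed, 9 Jan 2013 17:00:00 -0500\n"
        "Message-ID: <1@example.com>\n"
        "Subject: test\n"
        "To: user@example.com\n"
        "\n"
        "body\n")
bad = "Get cash now\n\nClick here\n"

print(missing_headers(good))
print(missing_headers(bad))
```

If the list is not empty for a file exported from a mail client, the export probably dropped or reformatted the header block, and the copy on the server is the one to test.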

I'm reviewing SA's behavior using an email client to view the messages,
but I also have access to the mailbox on the server. I realize that this
question may seem amateurish, but how does one discern the "message ID"
from the email client and locate the corresponding file in the user's
"Maildir"? I'm using Dovecot 1.x.

The file names in the user's Maildir look like this:

1357762471.M952293P32429.example.com,S=4300,W=4381:2,

I assume that the first bit is a UNIX timestamp. Is there any means by
which to correlate the second bit (M952293P32429) to the message as I
see it in my email client (Thunderbird)? I don't see that string
anywhere in the headers (maybe that's by design).
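
The filename can at least be taken apart mechanically. The sketch below assumes the usual Maildir conventions as commonly documented: seconds since the epoch, then a unique delivery part (M for microseconds, P for process ID), then the host name, followed by Dovecot's S= and W= size fields and the flags after ":2,". Since the unique part is generated at delivery time, it is not expected to appear in the message headers.

```python
from datetime import datetime, timezone


def parse_maildir_name(name):
    """Split a Maildir filename into delivery time, unique id, host, fields, flags."""
    base, _, info = name.partition(":")
    unique, *fields = base.split(",")
    seconds, delivery_id, host = unique.split(".", 2)
    return {
        "delivered": datetime.fromtimestamp(int(seconds), tz=timezone.utc),
        "delivery_id": delivery_id,
        "host": host,
        "fields": dict(f.split("=", 1) for f in fields),
        "flags": info.partition(",")[2],
    }


parsed = parse_maildir_name("1357762471.M952293P32429.example.com,S=4300,W=4381:2,")
print(parsed["delivered"].isoformat())
print(parsed["fields"])
```

The delivery time is useful for narrowing a search: it should be close to the timestamp in the topmost Received header of the message in question.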

In other words, when I spot a message that SA seems to be scoring
incorrectly in my Inbox, how do I track-down the actual file on the
server that should be fed into "spamassassin"?

Is there some better method than doing something like

# grep -ir 20B2834E4242 /var/vmail/example.com/user/Maildir

where 20B2834E4242 is the ID in the "Received" header?
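
An alternative to grepping for the queue ID is to match on the Message-ID header, which most mail clients can display in their message-source view. A rough sketch, assuming a standard Maildir layout with cur/ and new/ subdirectories; the function name is invented for illustration, and the demo builds a throwaway directory with a hypothetical message so that it runs anywhere:

```python
import os
import tempfile
from email import message_from_binary_file


def find_by_message_id(maildir, message_id):
    """Return paths under cur/ and new/ whose Message-ID header matches."""
    matches = []
    for sub in ("cur", "new"):
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                msg = message_from_binary_file(handle)
            if str(msg["Message-ID"] or "").strip() == message_id:
                matches.append(path)
    return matches


# Self-contained demo using a temporary Maildir and a made-up message.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cur"))
    sample = os.path.join(root, "cur", "1357762471.M1P1.example.com")
    with open(sample, "wb") as handle:
        handle.write(b"Message-ID: <abc@example.com>\nSubject: hi\n\nbody\n")
    found = find_by_message_id(root, "<abc@example.com>")
    print([os.path.basename(p) for p in found])
```

On the server this would need to run as a user that can read the Maildir, and the matching path is the file to pass to spamassassin.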

In any case, I tracked-down the original message on the server and
repeated the process (spamassassin -t < /tmp/msg.txt):

--
X-Spam-Status: Yes, score=9.3 required=5.0 tests=BAYES_50,HTML_MESSAGE,
RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
URIBL_JP_SURBL autolearn=disabled version=3.3.1

[...]

Content analysis details:   (9.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.4 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [188.165.126.107 listed in zen.spamhaus.org]
 1.0 RCVD_IN_CSS            RBL: Received via a relay in Spamhaus CSS
 2.7 RCVD_IN_PSBL           RBL: Received via a relay in PSBL
                            [188.165.126.107 listed in psbl.surriel.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: ehylle.info]
 1.4 RCVD_IN_BRBL_LASTEXT   RBL: RCVD_IN_BRBL_LASTEXT
                            [188.165.126.107 listed in bb.barracudacentral.org]
 1.7 URIBL_DBL_SPAM         Contains an URL listed in the DBL blocklist
                            [URIs: ehylle.info]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5428]
--

So, if I've done this correctly, the score discrepancy is even larger.

Thanks, guys!

-Ben


Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-01-09 Thread Ben Johnson
About five months ago, I experienced a problem that I *thought* I had
resolved, but I am observing similar behavior after retraining the Bayes
database. While the symptoms are similar, the root cause seems to be
different (thankfully). The original problem is documented at
http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html
.

In any case, I am again seeing SA scores that seem way too low for the
message content in question. My "glue", as it were, is Amavis-New.

In particular, certain messages that are clearly SPAM are scored between
0 and 3 when processed via Amavis. However, if I process the same
messages with the "spamassassin" binary, directly, the scores are much
higher and much more in-line with what one would expect.

The X-Spam-Status header, when processed via Amavis, looks like this:

X-Spam-Status: No, score=0.8 tagged_above=-999 required=2
tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled

When I process the same message with spamassassin, directly
(spamassassin -t -D < /tmp/msg.txt), the header looks like this:

--
X-Spam-Status: Yes, score=7.5 required=5.0
tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
autolearn=disabled version=3.3.1

[...]

Content analysis details:   (7.5 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.2 MISSING_HEADERS        Missing To: header
 2.0 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]
 1.2 MISSING_MID            Missing Message-Id: header
 1.3 MISSING_SUBJECT        Missing Subject: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 1.8 MISSING_DATE           Missing Date: header
 0.0 NO_HEADERS_MESSAGE     Message appears to be missing most RFC-822
                            headers
--

In short, my question is: how is the message scoring 0.8 in one case
and 7.5 in another? That is a massive discrepancy.

From what I can tell, the same tests aren't even being performed in each
case.

I have to assume that the options that are passed to SA are wildly
different in each case.

It bears mention that the server in question uses ISPConfig 3. ISPConfig
allows for SA policies to be configured per-domain and per-user, and
Amavis leverages MySQL to make that happen. If relevant, I can provide
more information about this aspect of my setup.

These are the only directives that I've added to /etc/spamassassin/local.cf:

--
bayes_path /var/lib/amavis/.spamassassin/bayes

use_bayes 1
bayes_auto_expire 0
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn   DBI:mysql:sa_bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  [scrubbed]
bayes_sql_override_username amavis
--

Given the first directive, SA should always use the same Bayes database
(the one I've configured in MySQL), regardless of how SA is called, right?

For those curious about the state of the Bayes database, here's the
output from "sa-learn --dump magic":

0.000          0          3          0  non-token data: bayes db version
0.000          0       2007          0  non-token data: nspam
0.000          0       6554          0  non-token data: nham
0.000          0     188379          0  non-token data: ntokens
0.000          0 1356345829          0  non-token data: oldest atime
0.000          0 1357769317          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1357727978          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0       3191          0  non-token data: last expire reduction count
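
The atime values are Unix timestamps, so the dump is easier to read once converted. A small sketch that parses a few of the lines and prints dates; it assumes the third column holds the value of interest, and the parsing approach is just one convenient way to read this output:

```python
from datetime import datetime, timezone

DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       2007          0  non-token data: nspam
0.000          0       6554          0  non-token data: nham
0.000          0     188379          0  non-token data: ntokens
0.000          0 1356345829          0  non-token data: oldest atime
0.000          0 1357769317          0  non-token data: newest atime
0.000          0    1382400          0  non-token data: last expire atime delta
"""


def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        columns, _, label = line.partition("non-token data:")
        if label:
            values[label.strip()] = int(columns.split()[2])
    return values


magic = parse_magic(DUMP)
for key in ("oldest atime", "newest atime"):
    stamp = datetime.fromtimestamp(magic[key], tz=timezone.utc)
    print(key, "=", stamp.date().isoformat())
print("expire delta in days:", magic["last expire atime delta"] / 86400)
```

Read this way, the database spans roughly late December 2012 to the day of this post, with well over 200 spam and 200 ham messages learned.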

Ultimately, it seems that I should be trying to figure out how, exactly,
Amavis is calling SpamAssassin in the course of normal operation.

Thanks for any help here, folks!

-Ben


Re: Try to run sa-learn

2012-10-04 Thread Ben Johnson


On 10/4/2012 2:06 PM, troxlinux wrote:
> Hi list , I try to run sa-learn on centos 6.3 but no work
> 
>  sa-learn --spam --showdots /dir/dir/domain.com.ni/spam/.spam/cur/
> 
> Learned tokens from 0 message(s) (1 message(s) examined)
> ERROR: the Bayes learn function returned an error, please re-run with
> -D for more information at /usr/bin/sa-learn line 493.
> 
> any idea ? , is a bug? , selinux is disabled

Well, did you do what the error message suggested (run 'sa-learn' with
the -D switch)?

What's the relevant output?

> my version of spamassassin
> spamassassin-3.3.2-4.el6.rfx.x86_64
> 
> 
> regards
> 


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson


On 8/22/2012 10:26 AM, Axb wrote:
> On 08/22/2012 04:10 PM, Ben Johnson wrote:
>>
>> I did end-up overriding the bayes_path, which provided a workaround for
>> the permissions issues. Cheers to the suggestion.
> 
> This is not a workaround, it's common practice in many types of setups
> and documented, but due to numerous reasons can't be set as a default.
> If the install routine would require/create a
> /etc/mail/spamassassin/bayes path it could bite "other" systems than
> standard Linux distros.
> (note to myself: discuss this in dev list)

Right; it makes sense that this path cannot have a default value (other
than ~/...).

That said, it seems that for some users (myself included), setting this
path manually is a critical step in creating a maximally functional
(that is, Bayes-enabled) SpamAssassin installation. This would be
especially true if the SA developers were to change the
"bayes_auto_learn" default value to zero, or lower the default value for
"bayes_auto_learn_threshold_nonspam" (as a result of my "incident" here).

For this reason, it seems prudent for developers/contributors to take
one of two actions (or both):

1.)

Add the "bayes_path" directive to the default/stock "local.cf" that
ships with SpamAssassin, in a commented-out state. I realize that this
file may be maintainer/distribution specific, and that there are
attendant challenges associated with such a change.

This measure would underscore the directive's importance for the
administrator who is configuring the software.

2.)

Where possible, modify the SpamAssassin installer package to prompt the
user for the "bayes_path" during installation. These types of prompts
are common among related packages. For example, Postfix asks for all
kinds of information during its installation (on Debian-based systems,
anyway).

Again, I realize that the SA developers likely have no control over how
the software is packaged and delivered, so if this point seems valid, I
am happy to open distro-specific bug reports (or feature requests).

Thanks, Axb.

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson


On 8/22/2012 9:43 AM, John Hardin wrote:
> On Wed, 22 Aug 2012, Bowie Bailey wrote:
> 
>> On 8/21/2012 5:51 PM, Ben Johnson wrote:
>>>
>>>  What good is the --username switch, then?

Thanks for the follow-up, John!

> See other responses.
> 
>>>  Why does this command train the "root" user's database?
> 
> Because you ran the command as root.
> 
> I apologize, I didn't provide sufficient details. When I said "train as
> the user who runs SA" I meant "su to that OS user ID before running the
> sa-learn command".

No apology necessary; I knew what you meant, and did indeed try running
the sa-learn command as "amavis", initially, but the problem then was a
lack of access to the mail directories. On Debian/Ubuntu systems, when
using Dovecot, all mail directories are vmail:vmail owned, with 700
permissions, which prevents the "amavis" user from having access to
them. (This is by design, I'm sure, and makes sense.)

> You can either override the default Bayes database files path to
> explicitly specify a shared global database as has been suggested, or
> run sa-learn as the amavis user via su or a cron job.

I did end-up overriding the bayes_path, which provided a workaround for
the permissions issues. Cheers to the suggestion.

> Defining a global
> bayes database is probably a better solution overall, but bear in mind
> if you have to wipe and retrain you need to check the permissions on the
> new database files after you run sa-learn the first time.
> 

This is an important point; thanks for articulating it.

All appears to be well in SpamAssassin Town for the time being (don't
think you've heard the last of me, though!). Thanks to everyone who
shared his or her expertise.

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-22 Thread Ben Johnson


On 8/22/2012 9:05 AM, Bowie Bailey wrote:
> On 8/21/2012 5:51 PM, Ben Johnson wrote:
>>
>> On 8/21/2012 5:19 PM, John Hardin wrote:
>>> On Tue, 21 Aug 2012, Ben Johnson wrote:
>>>
>>>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
>>>> /var/lib/amavis/.spamassassin/bayes_toks
>>>>
>>>> ---8<--
>>>> # sa-learn --username=amavis --dump magic
>>> Run that with --debug and verify that the filenames match.
>>>
>> Sure enough, they don't match:
>>
>> ---8<--
>> [...]
>> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
>> dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
>> Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0         95          0  non-token data: nspam
>> 0.000          0        307          0  non-token data: nham
>> 0.000          0      62301          0  non-token data: ntokens
>> 0.000          0 1345469997          0  non-token data: oldest atime
>> 0.000          0 1345579297          0  non-token data: newest atime
>> 0.000          0          0          0  non-token data: last journal sync atime
>> 0.000          0          0          0  non-token data: last expiry atime
>> 0.000          0          0          0  non-token data: last expire atime delta
>> 0.000          0          0          0  non-token data: last expire reduction count
>> ---8<--
>>
>> So, I suppose that I didn't actually resolve the problem from yesterday,
>> which was that I cannot seem to train under the "amavis" user due to the
>> ownership/permissions on the /var/vmail directory.
>>
>> What good is the --username switch, then?
>>
>> Why does this command train the "root" user's database?
>>
>> # sa-learn --username=amavis --spam "/path/to/spam"
>>
>> And why does this command dump the "root" user's database?
>>
>> # sa-learn --username=amavis --dump magic
>>
>> Thanks very much,
> 
> As has already been mentioned, the '--username' option is only useful if
> you're using SQL.  You should set your bayes_path so there is no confusion.

Thank you Axb and Bowie for clarifying this point. Perhaps the sa-learn
documentation should be updated to eliminate the ambiguity around this
switch. In particular, I am referring to this page:
http://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html , which
states only the following:

"If specified this username will override the username taken from the
runtime environment. You can use this option to specify users in a
virtual user configuration."

Maybe adding the "SQL" keyword will make the "virtual user
configuration" distinction more evident.

> Since you have been training the root database, you may want to copy
> that one over.
> 
> $ cp /root/.spamassassin/bayes* /var/lib/amavis/.spamassassin/
> 
> Then fix the permissions and ownership back to what they should be for
> the amavis user.

I did think to do this, but I approached it a bit differently, and used
"sa-learn --backup" (and --restore), under the "amavis" user account,
which mitigated the need to modify the permissions on the database.

> Then set the bayes path in your local.cf:
> 
> bayes_path /var/lib/amavis/.spamassassin/bayes
> 
> (Don't double the 'bayes' at the end as was suggested previously unless
> you want to move the bayes files into a 'bayes' directory)
> 
> Restart amavis and try again...
> 

Again, thanks to Axb and Bowie for making this suggestion. Hard-coding
the bayes_path was the missing link for me; this is what allowed me to
train under the "amavis" user while having "root" (or "vmail")
privileges, which on Debian, are necessary to read mail during training.
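
Because bayes_path is a filename prefix and not a directory, it helps to spell out which files a given value produces. A tiny sketch; the two suffixes are the ones visible in the debug output earlier in this thread, and the function name is made up for illustration:

```python
def bayes_db_files(bayes_path):
    """List the DBM files implied by a bayes_path prefix (suffixes seen in the debug log)."""
    return [bayes_path + suffix for suffix in ("_toks", "_seen")]


print(bayes_db_files("/var/lib/amavis/.spamassassin/bayes"))
print(bayes_db_files("/var/lib/amavis/.spamassassin/bayes/bayes"))
```

The second call shows the effect of doubling "bayes" at the end: the files would then live inside a bayes/ subdirectory, which has to exist and be writable by the scanning user.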

I think I'm sorted here; thanks again, guys!

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Ben Johnson


On 8/21/2012 5:19 PM, John Hardin wrote:
> On Tue, 21 Aug 2012, Ben Johnson wrote:
> 
>> Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
>> /var/lib/amavis/.spamassassin/bayes_toks
>>
>> ---8<--
>> # sa-learn --username=amavis --dump magic
> 
> Run that with --debug and verify that the filenames match.
> 

Sure enough, they don't match:

---8<--
[...]
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /root/.spamassassin/bayes_seen
Aug 21 14:41:13.112 [32170] dbg: bayes: found bayes db version 3
0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        307          0  non-token data: nham
0.000          0      62301          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345579297          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8<--

So, I suppose that I didn't actually resolve the problem from yesterday,
which was that I cannot seem to train under the "amavis" user due to the
ownership/permissions on the /var/vmail directory.

What good is the --username switch, then?

Why does this command train the "root" user's database?

# sa-learn --username=amavis --spam "/path/to/spam"

And why does this command dump the "root" user's database?

# sa-learn --username=amavis --dump magic

Thanks very much,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-21 Thread Ben Johnson
On 8/20/2012 2:47 PM, Ben Johnson wrote:
> I was able to resolve the issue by adding the --username switch to the
> 'sa-learn' executable:
> 
> # sa-learn --username=amavis --spam
> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur
> 
> Thanks for all of the hints, folks!

So, I've been training SpamAssassin like a mad-man for a couple of days.

I don't have over 200 spams and 200 hams, so I don't expect Bayes to be
used yet (and it's not), but the following output is puzzling
(particularly, "only 0 spam(s) in bayes DB < 200"):
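
The availability check behind that debug line can be restated in a few lines. This is a sketch of the rule as the log message describes it, not SpamAssassin's actual code: the 200/200 defaults correspond to the bayes_min_spam_num and bayes_min_ham_num settings, and the order in which spam and ham are tested is an assumption.

```python
def bayes_available(nspam, nham, min_spam=200, min_ham=200):
    """Return (ok, reason) mirroring the 'only N spam(s) in bayes DB < 200' check."""
    if nspam < min_spam:
        return False, f"only {nspam} spam(s) in bayes DB < {min_spam}"
    if nham < min_ham:
        return False, f"only {nham} ham(s) in bayes DB < {min_ham}"
    return True, "enough spam and ham learned"


print(bayes_available(0, 0))
print(bayes_available(95, 300))
```

A database that had really received the training would be expected to report 95 spam, not 0, so a count of 0 suggests the scanner and sa-learn are looking at different databases.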

---8<--
# su amavis -c "spamassassin -D -t <
/usr/share/doc/spamassassin/examples/sample-spam.txt 2>&1 | egrep
'(bayes:|whitelist:|AWL)'"

Aug 21 13:08:33.717 [23714] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x213613f8),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
Aug 21 13:08:33.728 [23714] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2153b400)
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.729 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.730 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.730 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.730 [23714] dbg: bayes: not available for scanning, only
0 spam(s) in bayes DB < 200
Aug 21 13:08:33.730 [23714] dbg: bayes: untie-ing
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_toks
Aug 21 13:08:33.732 [23714] dbg: bayes: tie-ing to DB file R/O
/var/lib/amavis/.spamassassin/bayes_seen
Aug 21 13:08:33.733 [23714] dbg: bayes: found bayes db version 3
Aug 21 13:08:33.733 [23714] dbg: bayes: DB journal sync: last sync: 0
Aug 21 13:08:33.733 [23714] dbg: bayes: not available for scanning, only
0 spam(s) in bayes DB < 200
Aug 21 13:08:33.733 [23714] dbg: bayes: untie-ing
---8<--

Restarting Amavis does not change the output above.

And the output below seems to contradict the above (95 spams and 300 hams):

---8<--
# sa-learn --username=amavis --dump magic

0.000          0          3          0  non-token data: bayes db version
0.000          0         95          0  non-token data: nspam
0.000          0        300          0  non-token data: nham
0.000          0      59420          0  non-token data: ntokens
0.000          0 1345469997          0  non-token data: oldest atime
0.000          0 1345577900          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
---8<--

Am I doing something silly?

Thanks for any help,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson


On 8/20/2012 2:02 PM, Ben Johnson wrote:
> 
> 
> On 8/20/2012 12:56 PM, Bowie Bailey wrote:
>> On 8/20/2012 12:46 PM, Axb wrote:
>>> On 08/20/2012 06:42 PM, Ben Johnson wrote:
>>>>
>>>> On 8/17/2012 11:28 AM, John Hardin wrote:
>>>>> On Fri, 17 Aug 2012, Ben Johnson wrote:
>>>>>
>>>>>> On 8/16/2012 2:00 PM, Ben Johnson wrote:
>>>>>> Basically, I need to do something about the spam inundation, as
>>>>>> soon as possible.
>>>>>>
>>>>>> Is there any reason that I should NOT be performing the sa-learn
>>>>>> training under the "amavis" user account?
>>>>> In general, all training should be done as the user that SA (in your
>>>>> case, SA via Amavis) is running as.
>>>> I have tried to do this, but to no avail:
>>>>
>>>> ---
>>>> # su amavis -c 'sa-learn --spam
>>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'
>>>>
>>>> archive-iterator: no access to
>>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
>>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
>>>> archive-iterator: no access to
>>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
>>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
>>>> archive-iterator: unable to open
>>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
>>>> ---
>>> ~/Maildir/* assumes 1 file=1 mail
>>>
>>> pls try
>>>
>>> su amavis -c 'sa-learn --spam --progress --dir
>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'
>>>
>>> or wherever the messages are stored
>>
>> But first, you need access to the files.  The simplest way is probably
>> to add the amavis user account to the group used by the mail directories.
>>
>> Assuming the group is "vmail", the command should look like this (on
>> RedHat/CentOS):
>>
>> $ usermod -a -G vmail amavis
> 
> Thanks, guys. I did consider adding the "amavis" user to the "vmail"
> group, but the default permissions on the directories within "Maildir"
> are 700 (with vmail:vmail ownership).
> 
> So, I'd have to fiddle with the permissions on the entire directory
> tree, for each user, which seems like a bad idea.
> 
> Furthermore, ISPconfig handles the creation (and deletion) of these
> directories, so I hesitate to change anything manually and muck-up the
> installation.
> 
> While there may be a permissions mask that is applied, modifying it seems
> risky.
> 
> I wonder what the rest of the Dovecot + Amavis + SA world is doing about
> this. Maybe I should ask on the Amavis mailing list.
> 
> If anyone has other suggestions, by all means, please do share.
> 
>> This command will probably need to be run as root.  If you are using a
>> different distro, you will need to look up the command to add the amavis
>> user to the vmail group.
>>
> 
> Much thanks,
> 
> -Ben
> 

I was able to resolve the issue by adding the --username switch to the
'sa-learn' executable:

# sa-learn --username=amavis --spam
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur

Thanks for all of the hints, folks!

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson


On 8/20/2012 12:56 PM, Bowie Bailey wrote:
> On 8/20/2012 12:46 PM, Axb wrote:
>> On 08/20/2012 06:42 PM, Ben Johnson wrote:
>>>
>>> On 8/17/2012 11:28 AM, John Hardin wrote:
>>>> On Fri, 17 Aug 2012, Ben Johnson wrote:
>>>>
>>>>> On 8/16/2012 2:00 PM, Ben Johnson wrote:
>>>>> Basically, I need to do something about the spam inundation, as
>>>>> soon as possible.
>>>>>
>>>>> Is there any reason that I should NOT be performing the sa-learn
>>>>> training under the "amavis" user account?
>>>> In general, all training should be done as the user that SA (in your
>>>> case, SA via Amavis) is running as.
>>> I have tried to do this, but to no avail:
>>>
>>> ---
>>> # su amavis -c 'sa-learn --spam
>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'
>>>
>>> archive-iterator: no access to
>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
>>> archive-iterator: no access to
>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
>>> /usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
>>> archive-iterator: unable to open
>>> /var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
>>> ---
>> ~/Maildir/* assumes 1 file=1 mail
>>
>> pls try
>>
>> su amavis -c 'sa-learn --spam --progress --dir /var/vmail/example.com/trainer/Maildir/.INBOX.Spam/cur/'
>>
>> or wherever the messages are stored
> 
> But first, you need access to the files.  The simplest way is probably
> to add the amavis user account to the group used by the mail directories.
> 
> Assuming the group is "vmail", the command should look like this (on
> RedHat/CentOS):
> 
> $ usermod -a -G vmail amavis

Thanks, guys. I did consider adding the "amavis" user to the "vmail"
group, but the default permissions on the directories within "Maildir"
are 700 (with vmail:vmail ownership).

So, I'd have to fiddle with the permissions on the entire directory
tree, for each user, which seems like a bad idea.

Furthermore, ISPconfig handles the creation (and deletion) of these
directories, so I hesitate to change anything manually and muck-up the
installation.

While there may be a permissions mask that is applied, modifying it seems
risky.

I wonder what the rest of the Dovecot + Amavis + SA world is doing about
this. Maybe I should ask on the Amavis mailing list.

If anyone has other suggestions, by all means, please do share.

> This command will probably need to be run as root.  If you are using a
> different distro, you will need to look up the command to add the amavis
> user to the vmail group.
> 

Much thanks,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-20 Thread Ben Johnson


On 8/17/2012 11:28 AM, John Hardin wrote:
> On Fri, 17 Aug 2012, Ben Johnson wrote:
> 
>> On 8/16/2012 2:00 PM, Ben Johnson wrote:
>> Basically, I need to do something about the spam inundation, as soon as
>> possible.
>>
>> Is there any reason that I should NOT be performing the sa-learn
>> training under the "amavis" user account?
> 
> In general, all training should be done as the user that SA (in your
> case, SA via Amavis) is running as.

I have tried to do this, but to no avail:

---
# su amavis -c 'sa-learn --spam /var/vmail/example.com/trainer/Maildir/.INBOX.Spam'

archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 539.
archive-iterator: no access to
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13 at
/usr/share/perl5/Mail/SpamAssassin/ArchiveIterator.pm line 771.
archive-iterator: unable to open
/var/vmail/example.com/trainer/Maildir/.INBOX.Spam: 13
---

This seems to occur because the virtual mail directory permissions do
not provide the "amavis" user with the required access level. The
"vmail" user is the only user with any type of access to
/var/vmail/example.com/user/Maildir. I suspect that there is a good
reason for this and that the ownership/permissions should not be changed.

I've done some research on this issue and there isn't much to be found.
This archived thread ( http://marc.info/?l=amavis-user&m=116457786312019
) discusses overriding the Bayes user with "bayes_sql_override_username
amavis", but that doesn't solve the problem (obviously). I still see the
same permission errors, although the need to use the 'su' wrapper does
go away.

Is there a conventional means by which to deal with this issue?

> If you have your system configured for per-user Bayes databases, then
> you'd need to train as the user whose database you want to affect.

The system in question leverages ISPConfig 3, which implements virtual
users/mailboxes, although, I don't know if ISPConfig configures Amavis
to utilize individual Bayes databases or if there's an individual
database for the "amavis" user. I can check with the developers.

> What is your bayes_path config?

I don't see this directive anywhere on the system in question; perhaps a
default value is being used. The only instance of that string exists in
a source file:

/usr/share/perl5/Mail/SpamAssassin/Conf.pm:=item bayes_path /path/filename  (default: ~/.spamassassin/bayes)

So, presumably, "bayes_path" defaults to "~/.spamassassin/bayes", or
in my case, "/var/lib/amavis/.spamassassin/bayes".
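
Because "bayes_path" is a filename prefix rather than a directory, the
default DBM backend keeps its data in files such as "bayes_toks" and
"bayes_seen" next to that prefix. A small sketch for checking whether they
exist; "bayes_files" is a hypothetical helper, and the home directory is
passed in rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: report which default Bayes DBM files exist under
# <home>/.spamassassin/ (the default bayes_path prefix ends in "/bayes").
bayes_files() {
    prefix="${1%/}/.spamassassin/bayes"
    for suffix in toks seen; do
        if [ -e "${prefix}_${suffix}" ]; then
            echo "present ${prefix}_${suffix}"
        else
            echo "missing ${prefix}_${suffix}"
        fi
    done
}

# Example (run as a user that can read the directory):
#   bayes_files /var/lib/amavis
```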

>> Would doing so preclude me from creating training folders for
>> individual IMAP users in the future?
> 
> They're not related. Per-user ham and spam training folders don't
> preclude using those messages for training a global Bayes database.

Understood.

> You actually may want to implement a hybrid folder model: per-user ham
> training folders and a global spam training folder. Misclassified ham
> could potentially be private messages that the recipient doesn't want
> other users to see, but for misclassified spam who cares?

Right, that makes sense.

>> Or can I train under the "amavis" user for now and then "layer-on"
>> training for individual IMAP users in the future without undesirable
>> consequences?
> 
> As stated above, if you're not enabling per-user Bayes *databases*, the
> question is meaningless. Are you going to configure per-user Bayes
> databases? Or (as I suspect is more likely) perform global database
> training from individual users whose judgement you trust?
> 

I suppose that I need to determine whether or not ISPConfig implements
per-user Bayes databases already. I'll report back for those who may be
curious.

Thanks again,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-17 Thread Ben Johnson
On 8/16/2012 2:00 PM, Ben Johnson wrote:
> In any event, at this point, I'm confused as to which user account I
> should be using when executing "sa-learn --spam", for example.
> 
> As a bit of background, I'm using ISPConfig 3, which implements virtual
> mailbox users via MySQL.
> 
> I dug through the mailing list archive and found
> http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html
> , which seems to be relevant.
> 
> Ultimately, I'm wondering if I should be using the "amavis" user to
> learn ham/spam, or individual mailbox user accounts.
> 
> If it is possible to use either, are there pros and cons of which one
> should be aware before settling on an approach?
> 
> As I mentioned previously, I would like to set-up "LearnHam" and
> "LearnSpam" folders for each IMAP user, eventually, so perhaps this
> answers my question?
> 
> Thanks again for all the help!

John Hardin, sorry to bust you up here... just curious whether or not
you saw the rest of my previous note. If you didn't address these
questions intentionally, then please ignore me. :)

Basically, I need to do something about the spam inundation, as soon as
possible.

Is there any reason that I should NOT be performing the sa-learn
training under the "amavis" user account? Would doing so preclude me
from creating training folders for individual IMAP users in the future?
Or can I train under the "amavis" user for now and then "layer-on"
training for individual IMAP users in the future without undesirable
consequences?

Thanks again,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson


On 8/16/2012 12:32 PM, John Hardin wrote:
> On Thu, 16 Aug 2012, Ben Johnson wrote:
> 
>> On 8/16/2012 11:38 AM, John Hardin wrote:
>>> On Thu, 16 Aug 2012, Ben Johnson wrote:
>>>
>>>> So, after disabling auto-learn (for now) and executing "sa-learn
>>>> --clear", and restarting Amavis, I'm still seeing this:
>>>>
>>>> No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>>> HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
>>>> URIBL_DBL_SPAM=1.7] autolearn=disabled
>>>>
>>>> Why BAYES_00 still? Am I running the wrong command to clear the
>>>> database?
>>>
>>> That's correct. Be sure that you're running it as the same user that
>>> amavis+SA is running as, otherwise you're clearing the wrong files.
>>>
>>> You might want to run "sa-learn --dump magic" afterwards to verify the
>>> database is cleared.
>>
>> John,
>>
>> You were exactly right; I forgot to execute "sa-learn --clear" as the
>> "amavis" user.
>>
>> What is the expected output of "sa-learn --dump magic" once the database
>> has been cleared successfully?
>>
>> # su amavis -c 'sa-learn --dump magic'
>>
>> ERROR: Bayes dump returned an error, please re-run with -D for more
>> information
>>
>> # su amavis -c 'sa-learn -D --dump magic'
>>
>> [...]
>> dbg: bayes: no dbs present, cannot tie DB R/O:
>> /var/lib/amavis/.spamassassin/bayes_toks
>> [...]
>>
>> Is this to be expected? Or did I muck-up the works?
> 
> Heh. I was expecting zeroes, but "no dbs present" is also a good
> confirmation that the Bayes database has been reset... :)
> 
> You might need to restart amavis now, too.
> 

So, I preemptively restarted Amavis, per your suggestion (without
executing "su amavis -c 'sa-learn -D --dump magic'" first), and when I
executed the aforementioned command after the restart, I received the
"expected" output:

# su amavis -c 'sa-learn --dump magic'
0.000  0  3  0  non-token data: bayes db version
0.000  0  0  0  non-token data: nspam
0.000  0  0  0  non-token data: nham
0.000  0  0  0  non-token data: ntokens
0.000  0  0  0  non-token data: oldest atime
0.000  0  0  0  non-token data: newest atime
0.000  0  0  0  non-token data: last journal sync atime
0.000  0  0  0  non-token data: last expiry atime
0.000  0  0  0  non-token data: last expire atime delta
0.000  0  0  0  non-token data: last expire reduction count

All looks well. (I'm performing these actions in a test/development
environment, by the way.)

So, I went to follow the same procedure in production:

# su amavis -c 'sa-learn --clear'

# service amavis restart

# su amavis -c 'sa-learn -D --dump magic'

Yet this yields that familiar message:

ERROR: Bayes dump returned an error, please re-run with -D for more
information

I waited a little while (at least an hour) and tried again. Same thing.
I restarted Amavis again, same thing.

A few minutes later, I decided to give it one last shot, and sure
enough, I received the expected output with all zeros.

It may be academic at this point, but I'm now curious as to what causes
the DB file to be recreated, if not restarting Amavis. (It bears mention
that plenty of mail came in between using the "--clear" switch and when
using the "--dump" switch began to produce valid [all-zero] output. In
other words, the DB didn't seem to be recreated when the first message
was received after clearing the old DB and restarting Amavis.)

In any event, at this point, I'm confused as to which user account I
should be using when executing "sa-learn --spam", for example.

As a bit of background, I'm using ISPConfig 3, which implements virtual
mailbox users via MySQL.

I dug through the mailing list archive and found
http://spamassassin.1065346.n5.nabble.com/Problem-with-sa-learn-and-virtual-user-td44666.html
, which seems to be relevant.

Ultimately, I'm wondering if I should be using the "amavis" user to
learn ham/spam, or individual mailbox user accounts.

If it is possible to use either, are there pros and cons of which one
should be aware before settling on an approach?

As I mentioned previously, I would like to set-up "LearnHam" and
"LearnSpam" folders for each IMAP user, eventually, so perhaps this
answers my question?

Thanks again for all the help!

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson


On 8/16/2012 11:38 AM, John Hardin wrote:
> On Thu, 16 Aug 2012, Ben Johnson wrote:
> 
>> So, after disabling auto-learn (for now) and executing "sa-learn
>> --clear", and restarting Amavis, I'm still seeing this:
>>
>> No, score=0.593 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001,
>> URIBL_DBL_SPAM=1.7] autolearn=disabled
>>
>> Why BAYES_00 still? Am I running the wrong command to clear the database?
> 
> That's correct. Be sure that you're running it as the same user that
> amavis+SA is running as, otherwise you're clearing the wrong files.
> 
> You might want to run "sa-learn --dump magic" afterwards to verify the
> database is cleared.
> 

John,

You were exactly right; I forgot to execute "sa-learn --clear" as the
"amavis" user.

What is the expected output of "sa-learn --dump magic" once the database
has been cleared successfully?

# su amavis -c 'sa-learn --dump magic'

ERROR: Bayes dump returned an error, please re-run with -D for more
information

# su amavis -c 'sa-learn -D --dump magic'

[...]
dbg: bayes: no dbs present, cannot tie DB R/O:
/var/lib/amavis/.spamassassin/bayes_toks
[...]

Is this to be expected? Or did I muck-up the works?

Thanks again,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson


On 8/16/2012 10:14 AM, Ben Johnson wrote:
> 
> 
> On 8/15/2012 4:05 PM, John Hardin wrote:
>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>
>>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>>
>>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>>
>>>>> Might anyone know why?
>>>>
>>>> Poor training.
>>>
>>> John, I can't thank you enough for the thoroughness of your response.
>>
>> I like to show off. :)
>>
>>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>>
>>> Here are a few examples (the first two of which are two of VERY few in
>>> which the BAYES_* value is over 00):
>>>
>>> -
>>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>>
>>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>>> SPF_PASS=-0.001] autolearn=no
>>>
>>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>>
>>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>>> URIBL_RHS_DOB=1.514] autolearn=no
>>> -
>>
>> It might be interesting to see some log entries where autolearn=yes...
> 
> Here you go:
> 
> No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham
> 
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
> 
> No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
> URIBL_DBL_SPAM=1.7] autolearn=ham
> 
> No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
> SPF_PASS=-0.001] autolearn=ham
> 
>>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>>> of a negative impact (pardon the pun) than BAYES_*. I am already
>>> working with the dnswl.org folks (off-list, for privacy reasons) to
>>> get to the bottom of that issue.
>>
>> This might be a major contributing factor. If your system was taught
>> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
>> been pushing a lot of spams to low scores...
> 
> It looks as though this is exactly what happened. I'll post back once
> I've done some more troubleshooting with the folks at dnswl.org.
> 
>> You might want to set:
>> bayes_auto_learn_threshold_nonspam -3
> 
> Done.
> 
>> That won't _fix_ the problem (at least not quickly) or avoid the need to
>> wipe and retrain, but it might keep things from getting worse.
> 
> I disabled auto-learn and executed "sa-learn --clear", too. So, I should
> be starting with a "clean slate", right?
> 
> I have also disabled the DNSWL rules, until the issue can be resolved,
> and will begin manual training immediately.
> 
>> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
>>
>>> Most of the list is probably laughing, but given the complexity of Spam
>>> Assassin, this crucial requirement was lost on me, amidst the sea of
>>> information and instructions. For example, there is no mention of the
>>> fact that SA is essentially useless without Bayesian training on
>>> http://wiki.apache.org/spamassassin/StartUsing .
>>
>> That's because that shouldn't be the case. The base ruleset + URIBL
>> should be very effective pretty much out-of-the-box.
>>
>>>> What version of SA is this?
>>>
>>> # spamassassin --version
>>> SpamAssassin version 3.3.1
>>>  running on Perl version 5.10.1
>>
>> A little st

Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-16 Thread Ben Johnson


On 8/15/2012 4:05 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
> 
>> On 8/15/2012 2:24 PM, John Hardin wrote:
>>> On Wed, 15 Aug 2012, Ben Johnson wrote:
>>>
>>>> Some 99% of the spam that I receive, which is grossly spammy (we're
>>>> talking auto loans, cash advances, dink pills, the whole lot) contains
>>>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>>>
>>>> Might anyone know why?
>>>
>>> Poor training.
>>
>> John, I can't thank you enough for the thoroughness of your response.
> 
> I like to show off. :)
> 
>>> Apart from the Bayes score, what kind of scores are those spams getting?
>>
>> Here are a few examples (the first two of which are two of VERY few in
>> which the BAYES_* value is over 00):
>>
>> -
>> No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
>> SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no
>>
>> No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
>> SPF_PASS=-0.001] autolearn=no
>>
>> No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no
>>
>> No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
>> HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
>> RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
>> URIBL_RHS_DOB=1.514] autolearn=no
>> -
> 
> It might be interesting to see some log entries where autolearn=yes...

Here you go:

No, score=-4.2 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

No, score=-2.5 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7] autolearn=ham

No, score=-3.407 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=ham

>> It bears mention that the RCVD_IN_DNSWL_MED test is having even more
>> of a negative impact (pardon the pun) than BAYES_*. I am already
>> working with the dnswl.org folks (off-list, for privacy reasons) to
>> get to the bottom of that issue.
> 
> This might be a major contributing factor. If your system was taught
> from scratch by autolearn, and DNSWL (which is fairly well trusted) has
> been pushing a lot of spams to low scores...

It looks as though this is exactly what happened. I'll post back once
I've done some more troubleshooting with the folks at dnswl.org.

> You might want to set:
> bayes_auto_learn_threshold_nonspam -3

Done.
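
A rough sketch of why that threshold matters, using the third autolearn=ham
entry above. It assumes the autolearn decision leaves out the BAYES_* hits
(SpamAssassin computes the autolearn score separately from the final score,
so treat this arithmetic as an approximation). "autolearn_score" and
"would_learn_ham" are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: total of the listed test scores, leaving out BAYES_*.
autolearn_score() {
    printf '%s\n' "$1" |
        sed -e 's/.*tests=\[//' -e 's/].*//' |
        tr ',' '\n' |
        awk -F= '
            { gsub(/[ \t]/, "", $1) }
            $1 != "" && $1 !~ /^BAYES_/ { total += $2 }
            END { printf "%.3f\n", total }'
}

# Succeed if the message scores below the given nonspam threshold.
would_learn_ham() {
    awk -v s="$(autolearn_score "$1")" -v t="$2" 'BEGIN { exit !(s < t) }'
}

# Example:
#   line='No, score=-2.5 ... tests=[BAYES_00=-1.9, ...] autolearn=ham'
#   would_learn_ham "$line" 0.1 && echo "learned as ham"
```

For that entry the non-Bayes total is -0.600: below the stock 0.1 threshold,
so it was learned as ham, but not below -3.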

> That won't _fix_ the problem (at least not quickly) or avoid the need to
> wipe and retrain, but it might keep things from getting worse.

I disabled auto-learn and executed "sa-learn --clear", too. So, I should
be starting with a "clean slate", right?

I have also disabled the DNSWL rules, until the issue can be resolved,
and will begin manual training immediately.
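
The settings described so far could be collected in one SpamAssassin config
fragment. The sketch below only writes the file; the path and filename are
assumptions, the DNSWL lines are the ones Bowie Bailey posted in the
RCVD_IN_DNSWL_BLOCKED thread, and the fragment would need to be placed
wherever the local Amavis/SA installation reads its site configuration.

```shell
#!/bin/sh
# Sketch: write a config fragment that pauses autolearning and DNSWL scoring.
# CONF_FILE is an assumed location; adjust it for your installation.
CONF_FILE="${CONF_FILE:-./bayes-retrain.cf}"

cat > "$CONF_FILE" <<'EOF'
# Stop autolearning while the Bayes database is retrained by hand.
bayes_auto_learn 0
# If autolearning is re-enabled later, require a much lower score for ham.
bayes_auto_learn_threshold_nonspam -3
# Disable the DNSWL rules until the whitelisting issue is resolved.
score RCVD_IN_DNSWL_BLOCKED 0
score RCVD_IN_DNSWL_HI 0
score RCVD_IN_DNSWL_LOW 0
score RCVD_IN_DNSWL_MED 0
score RCVD_IN_DNSWL_NONE 0
score __RCVD_IN_DNSWL 0
EOF
```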

> See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
> 
>> Most of the list is probably laughing, but given the complexity of Spam
>> Assassin, this crucial requirement was lost on me, amidst the sea of
>> information and instructions. For example, there is no mention of the
>> fact that SA is essentially useless without Bayesian training on
>> http://wiki.apache.org/spamassassin/StartUsing .
> 
> That's because that shouldn't be the case. The base ruleset + URIBL
> should be very effective pretty much out-of-the-box.
> 
>>> What version of SA is this?
>>
>> # spamassassin --version
>> SpamAssassin version 3.3.1
>>  running on Perl version 5.10.1
> 
> A little stale, but not bad.

'Tis the major drawback with using LTS Linux distros and managing
software via packages, I suppose.

>>> You may also want to set up some mechanism for users to submit
>>> misclassified messages for training. Depending on how much you trust
>>> their judgement the learning from these can be automatic or can go
>>> through you as a reviewer.
>>
>> That sounds like a good idea. Is there a particular HOW TO or tutorial
>> that you recommend? If it depends on the environment/configuration, this
>> server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and Spam Assassin.
> 
> I'm not sure, I don't lurk the Wiki much. About the best I can suggest
> is search the SA users mailing list archives for "training dovecot".
> 

Thanks, I'll look into setting-up IMAP folders for individual users in
some programmatic way.

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Ben Johnson
On 8/15/2012 4:19 PM, Kris Deugau wrote:
> John Hardin wrote:
>> I wasn't aware that autolearning could do a cold-start of Bayes, can
>> anyone confirm whether this is the case?
> 
> If you let it run long enough to pass the 200/200 ham/spam thresholds,
> yes;  there's no distinction I've ever met about where the learning came
> from.
> 
> That said, I wouldn't trust a pure autolearn setup with stock autolearn
> thresholds - all too much spam will get learned scoring under 0.1.  :(
> 
> -kgd
> 

It's a bit disappointing to learn this (pardon the pun), given:

a.) This exchange between John Hardin and me, which occurred previously
in this thread:

---8<--

Me:

> Most of the list is probably laughing, but given the complexity of Spam
> Assassin, this crucial requirement was lost on me, amidst the sea of
> information and instructions. For example, there is no mention of the
> fact that SA is essentially useless without Bayesian training on
> http://wiki.apache.org/spamassassin/StartUsing .

John:

That's because that shouldn't be the case. The base ruleset + URIBL
should be very effective pretty much out-of-the-box.

---8<--

b.) The default value for bayes_auto_learn is 1 (on). (At least in my
particular distribution.)

Correct me if I'm wrong, but this issue's root cause seems to be that
bayes_auto_learn was on, out-of-the-box, yet I was not complementing its
efficacy via sa-learn.

Is this an accurate summary? Because if so, it seems prudent to change
the default bayes_auto_learn value to zero, and scorn any package
maintainer or developer who modifies it, or, alternatively, put a
banner, at font-size 100em, on the SpamAssassin homepage that issues an
unmistakable warning about Bayesian training's importance.

(John, I'll respond to your most recent message tomorrow most likely;
had enough for one day!)

Thank you,

-Ben


Re: Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Ben Johnson
On 8/15/2012 2:24 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
> 
>> Some 99% of the spam that I receive, which is grossly spammy (we're
>> talking auto loans, cash advances, dink pills, the whole lot) contains
>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>
>> Might anyone know why?
> 
> Poor training.

John, I can't thank you enough for the thoroughness of your response.

> Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two of which are two of VERY few in
which the BAYES_* value is over 00):

-
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of
a negative impact (pardon the pun) than BAYES_*. I am already working
with the dnswl.org folks (off-list, for privacy reasons) to get to the
bottom of that issue.
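
The arithmetic behind those status lines can be checked with a short shell
sketch. "sum_tests" is a hypothetical helper that adds up the per-test scores
in a "tests=[...]" list, optionally leaving one test out. For the first entry
above it reproduces the reported 0.192, and gives 2.492 once
RCVD_IN_DNSWL_MED is left out.

```shell
#!/bin/sh
# Hypothetical helper: sum the scores in a "tests=[NAME=score, ...]" list.
# Optional second argument: one test name to exclude from the total.
sum_tests() {
    printf '%s\n' "$1" |
        sed -e 's/.*tests=\[//' -e 's/].*//' |
        tr ',' '\n' |
        awk -F= -v skip="$2" '
            { gsub(/[ \t]/, "", $1) }
            $1 != "" && $1 != skip { total += $2 }
            END { printf "%.3f\n", total }'
}

# Example:
#   sum_tests "$status_line"                     # full total
#   sum_tests "$status_line" RCVD_IN_DNSWL_MED   # total without the whitelist hit
```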

>> While I have not trained the Bayesian filter manually to date,
> 
> Is there any provision for any manual training in your environment? Have
> you set up training folders where your users can submit messages for
> training? Do you run sa-learn at all?

No, there is no provision. No, I have not set up training folders, and
no, I have not run sa-learn manually at all.

Most of the list is probably laughing, but given the complexity of Spam
Assassin, this crucial requirement was lost on me, amidst the sea of
information and instructions. For example, there is no mention of the
fact that SA is essentially useless without Bayesian training on
http://wiki.apache.org/spamassassin/StartUsing .

>> how is it that the spammiest of the spam is being classified with
>> BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply
>> that the message is almost certainly not spam?
> 
> BAYES_00 implies that the message in question looks very similar to
> messages the Bayes system has been told are not spam. It depends solely
> on how it has been trained.
> 
> I wasn't aware that autolearning could do a cold-start of Bayes, can
> anyone confirm whether this is the case?
> 
> If it can't then someone somewhere trained bayes up to the default
> minimum 200 hams and 200 spams needed for it to start classifying.
> 
> Before we offer suggestions, some more data from you please:
> 
> What version of SA is this?

# spamassassin --version
SpamAssassin version 3.3.1
  running on Perl version 5.10.1

> What does "sa-learn --dump magic" report about your current Bayes database?

# sa-learn --dump magic
ERROR: Bayes dump returned an error, please re-run with -D for more
information

# su amavis -c 'sa-learn --dump magic'
0.000  0  3  0  non-token data: bayes db version
0.000  0  11499  0  non-token data: nspam
0.000  0  39412  0  non-token data: nham
0.000  0 197769  0  non-token data: ntokens
0.000  0 1344331893  0  non-token data: oldest atime
0.000  0 1345056746  0  non-token data: newest atime
0.000  0 1345053771  0  non-token data: last journal sync atime
0.000  0 1345023550  0  non-token data: last expiry atime
0.000  0 345600  0  non-token data: last expire atime delta
0.000  0   6482  0  non-token data: last expire reduction count
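
For reference, the two counters that matter most here can be pulled out of
that output with a few lines of awk. This is only a sketch: "bayes_counts"
and "bayes_ready" are hypothetical helper names, and the default of 200 is
the minimum ham/spam count John mentions above.

```shell
#!/bin/sh
# Print the learned spam/ham counts from "sa-learn --dump magic" on stdin.
bayes_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham  = $3 }
         END { printf "nspam=%d nham=%d\n", spam, ham }'
}

# Succeed only if both counts reach the given minimum (default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { spam = $3 }
        $NF == "nham"  { ham  = $3 }
        END { exit !(spam >= min && ham >= min) }'
}

# Example:
#   su amavis -c 'sa-learn --dump magic' | bayes_counts
```

For the dump above this prints "nspam=11499 nham=39412", so the database
holds far more learned ham than spam.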

> What are all of the bayes_* configuration options in your local config?

None are defined there. There are a few defaults/examples, but they are
commented-out.

> 
> What will probably end up happening is this:
> (1) wipe your Bayes database
> (2) turn off autolearn
> (3) collect several hundred hams and spams for an initial training corpus
> (4) train using that corpus
> (5) evaluate results
> 
> Depending on your mail volume, once Bayes is working well after manual
> training, you may then want to reenable autolearn; I personally suggest
> it only where the volume is high enough and/or the character of mail is
> varied enough 

Very spammy messages yield BAYES_00 (-1.9)

2012-08-15 Thread Ben Johnson
Hello,

Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot) contains
"BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.

Might anyone know why? This is a stock installation (Ubuntu package on
10.04).

local.cf contains

#   Bayesian classifier auto-learning (default: 1)
#
# bayes_auto_learn 1

and I have not overridden the default elsewhere. So, presumably,
auto-learning is enabled (if that's even relevant).

While I have not trained the Bayesian filter manually to date, how is it
that the spammiest of the spam is being classified with BAYES_00
(thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the
message is almost certainly not spam?

Others have run into this same problem, but I see no resolution; here is
one such example:

http://forums.eukhost.com/f38/problems-spamassassin-bayes-filter-16948/

Outside of the above forum post, search query results for this issue are
scant.

Thanks for any help,

-Ben


Re: RCVD_IN_DNSWL_BLOCKED

2012-08-14 Thread Ben Johnson
On 8/14/2012 9:33 AM, Bowie Bailey wrote:
> On 8/14/2012 12:35 AM, JP Kelly wrote:
>> How can I disable the DNSWL rule/plugin or whatever. Not just give it
>> a low/zero score but disable it completely.
>> I am tired of seeing RCVD_IN_DNSWL_BLOCKED in my headers.
> 
> If you set the score to zero, the rule will be disabled and you should
> no longer see it show up in the score report.
> 
> If you want to disable the DNSWL lookup completely, you should zero out
> the main rules and the sub-rule:
> 
>     score RCVD_IN_DNSWL_BLOCKED 0
>     score RCVD_IN_DNSWL_HI 0
>     score RCVD_IN_DNSWL_LOW 0
>     score RCVD_IN_DNSWL_MED 0
>     score RCVD_IN_DNSWL_NONE 0
>     score __RCVD_IN_DNSWL 0
> 

Thanks, Bowie. I was wondering how to do this, too.

The majority of the spam that our users receive is a direct result of
this one rule; it seems that plenty of spammers are white-listed in this
database, and it is a weighty test (it reduces the score by as much as 2
or 3 points in some cases, often putting the message just below the
required-for-spam score). We have no use for it.

-Ben


Re: SpamAssassin scores and 12-letter domains

2012-08-06 Thread Ben Johnson


On 8/6/2012 1:32 PM, Axb wrote:
> On 08/06/2012 05:25 PM, Ben Johnson wrote:
> 
>> Given that ASF has no other public support channel, and no way to
>> contact anybody to request that the filters be adjusted, what choice do
>> I have beyond pushing to have the software modified?
> 
> bear in mind: SpamAssassin is a framework and VERY flexible.
> It's aiming to be the global solution for spam filtering.
> 
> The SpamAssassin project delivers a set of rules and scores.
> 
> These may not fit all types of traffic, globally - with minimal skills
> you can modify the ruleset to work for your setup.
> 

Thanks, Axb.

Yes, I understand that SpamAssassin is very flexible. The problem I'm
describing, however, is not with my SpamAssassin configuration (in which
case I would simply adjust it); it is with Apache Software Foundation's
configuration (over which I have no control).

I raised this issue because ASF's SpamAssassin configuration --
specifically, the 12-letter-domain check -- causes my messages to its
various mailing lists to be rejected more often than not. This list is
very forgiving in that the required score is 10.0, but other ASF lists
require only 5.0.

All of that said, it sounds like this issue will be discussed among the
developers, so maybe something will be done and not all 12-letter-domain
owners will be blacklisted throughout the Internet.

Best regards,

-Ben


Re: SpamAssassin scores and 12-letter domains

2012-08-06 Thread Ben Johnson


On 8/6/2012 8:01 AM, Benny Pedersen wrote:
> Den 2012-08-05 20:30, Michael Scheidell skrev:
> 
>>> X-ASF-Spam-Status: No, hits=4.8 required=10.0
>>> tests=FROM_12LTRDOM,SPF_HELO_PASS,SPF_PASS,URI_HEX
>> default is 5.0, not 10.0
> 
> why did ASF change it? did they only change required? :=)
> 
>>> as you see there is long way to 10
>> .2 points to go to 5.0
> 
> irrelevant on ASF
> 
>> score FROM_12LTRDOM 0.099 3.499 0.099 3.499
>> is a HUGE difference, any score over 2.75 points should be suspect.
> 
> spamassassin is opensource, scores are not hardcoded
> 
> i think what is more needed is just more committers with ham and spam to
> the public corpus scores are generated from, dont fight rules, change
> scores if one is not a committer
> 
> this rule does not hit ham here
> 

Thanks for the replies thus far.

Benny, it bears mention that not all of ASF's servers/mailing lists are
configured the same way.

The Apache HTTP Server mailing list requires 5.0. My best guess is that
they cranked up the threshold for the SpamAssassin mailing list because,
by nature, the discussion contains a lot of "spammy" content and
false-positives were becoming a problem.
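
For anyone who wants to check the margins being argued about, here is a
rough sketch that parses the X-ASF-Spam-Status value quoted above and
compares the hit total against both thresholds mentioned in this thread
(10.0 and 5.0). The parse_status helper is a hypothetical name of my own,
not part of SpamAssassin.

```python
import re

# Header value as quoted earlier in this thread.
status = ("No, hits=4.8 required=10.0 "
          "tests=FROM_12LTRDOM,SPF_HELO_PASS,SPF_PASS,URI_HEX")


def parse_status(value):
    """Return (hits, required, tests) from a Spam-Status style value."""
    hits = float(re.search(r"hits=(-?[\d.]+)", value).group(1))
    required = float(re.search(r"required=(-?[\d.]+)", value).group(1))
    tests = re.search(r"tests=(\S+)", value).group(1).split(",")
    return hits, required, tests


hits, required, tests = parse_status(status)

# Compare against this list's threshold and the 5.0 used on other lists.
for threshold in (required, 5.0):
    margin = round(threshold - hits, 1)
    print(f"threshold {threshold}: {margin} points to spare")
```

At a threshold of 5.0 the same message is left with only a fraction of a
point of headroom, so any one additional rule hit would tip it over.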

The fact that SpamAssassin is open-source is what's irrelevant; I have
no control over how ASF configures its servers, and therefore no ability
to disable the ridiculous 12-letter-domain check. ASF would have to
change its configuration if my messages are to be accepted.

Better still would be to remove this "feature" from SpamAssassin
altogether, as it is completely useless. That way, the problem would
disappear as soon as ASF updates to a version of SpamAssassin in which
the 12-letter-domain check is removed.

The fact that nobody has articulated the rationale behind the
12-letter-domain check speaks for itself. If a rule is deemed to be
useless, why is it not removed? It is wasting CPU cycles and affecting
genuine ASF mailing list subscribers adversely (by rejecting their
messages without basis). Further, it's not as though ASF's servers are
the only ones using FROM_12LTRDOM; this ridiculous issue is affecting my
ability to communicate across the Internet at large.
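
For context on the "score FROM_12LTRDOM 0.099 3.499 0.099 3.499" line
quoted above: as I understand the Mail::SpamAssassin::Conf documentation,
four values on a score line correspond to four score sets (0: no Bayes, no
network tests; 1: network tests only; 2: Bayes only; 3: both). The sketch
below picks the applicable value under that assumption; pick_score is a
hypothetical helper, not a SpamAssassin function.

```python
def pick_score(score_line, bayes, net):
    """Return (rule_name, score) for the given Bayes/network settings.

    Assumes the score-set order: [no Bayes/no net, net only,
    Bayes only, Bayes + net]. A single value applies to all four sets.
    """
    parts = score_line.split()
    name = parts[1]
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4
    index = (2 if bayes else 0) + (1 if net else 0)
    return name, values[index]


line = "score FROM_12LTRDOM 0.099 3.499 0.099 3.499"

for bayes in (False, True):
    for net in (False, True):
        name, score = pick_score(line, bayes, net)
        print(f"{name} bayes={bayes} net={net}: {score}")
```

If that reading is right, the 3.499 value applies whenever network tests
are enabled, regardless of whether Bayes is in use.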

Given that ASF has no other public support channel, and no way to
contact anybody to request that the filters be adjusted, what choice do
I have beyond pushing to have the software modified?

Thank you,

-Ben


SpamAssassin scores and 12-letter domains

2012-08-05 Thread Ben Johnson


Hello,

As an owner of a 12-letter domain, and someone who is unable to post to
any of the Apache mailing lists due to messages being rejected as SPAM
(I'll be surprised if this one is any different), I have to ask, what is
the rationale for the infamous 12-letter-domain-ding?

How many 12-letter domains exist? A few million? I can't think of a less
useful metric, nor one that is more likely to yield false-positives.

There is hardly any published information on this subject, so perhaps
one of the experts here will weigh in. Apparently, I'm not the only one
who feels this "feature" needs to die:
http://spamassassin.1065346.n5.nabble.com/FROM-12LTRDOM-high-scored-remove-td100710.html
.

Thanks for any insight.

-Ben