Re: Are the RBL scores high enough?

2005-06-03 Thread Kevin Sullivan

On Jun 2, 2005, at 8:27 PM, Matt Kettler wrote:

If one's wrong, they are ALL wrong.

SA's rule scores are evolved based on a real-world test of a 
hand-sorted corpus of fresh spam and ham. The whole scoreset is 
evolved simultaneously to optimize the placement pattern.


Of course, one thing that can affect accuracy is spam accidentally 
misplaced into the ham pile, which can cause heavy score biasing. A 
little of this is unavoidable, as human mistakes happen, but a lot of 
it will cause deflated scores and a lot of FNs.


The rule scores are optimized for the spam which was being sent when 
that version of SA was released (actually, when the rule scoreset was 
calculated).  Since then, the static SA rules have become less useful 
because spammers now write their messages to avoid them.  The only 
rules which spammers cannot easily avoid are the dynamic ones: Bayes 
and network checks (RBLs, URIBLs, Razor, etc.).


On my systems, I raise the scores for the dynamic tests since they are 
the only ones which hit a lot of today's spam.


 -Kevin



Re: interesting problem with SQL backend

2005-03-24 Thread Kevin Sullivan
--On 03/25/05 01:43:49 +0900 alan premselaar wrote:
I got notified by one of my users that they were unable to send mail
suddenly.  after checking the logs I determined that MIMEDefang was
timing out and returning errors.  the cause for this was very unclear
(which is why i'm sharing my findings with all of you)...
After digging around (and some assistance from David Skoll on the
MIMEDefang list) I was able to determine that the problem was caused by
SpamAssassin not being able to connect to the database server where the
bayes database is stored. (using MySQL on a remote host)

I've modified my mimedefang-filter file so that mail from internal IP 
addresses and mail sent via SMTP-AUTH (trusted mail) is not sent through 
SpamAssassin.

In general, SMTP is very resistant to delays (it just stores the message 
and tries again later).  The sole exception is the connection from the MUA 
to the initial MTA; the MUA can't hang around forever retrying so the user 
gets an error message.  The more subsystems we add to our mail servers (spam 
and virus checking, for example) the more random delays we'll get.  If 
you bypass the checking for mail sent from trusted machines then you 
reduce the chance of delays.
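
The bypass decision itself is small. The following is a minimal sketch in Python rather than the Perl that a real mimedefang-filter uses; the function name, the `smtp_authenticated` flag and the example networks are assumptions made for illustration, not MIMEDefang's API.

```python
import ipaddress

# Hypothetical internal ranges; substitute the networks you actually trust.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.30.0.0/16"),
]


def should_scan(client_ip: str, smtp_authenticated: bool) -> bool:
    """Return False for mail trusted enough to skip SpamAssassin."""
    if smtp_authenticated:
        # Mail submitted via SMTP-AUTH is treated as trusted.
        return False
    addr = ipaddress.ip_address(client_ip)
    # Scan only when the client is outside every internal network.
    return not any(addr in net for net in INTERNAL_NETWORKS)
```

With these example networks, `should_scan("10.30.1.5", False)` skips the scan while `should_scan("192.0.2.10", False)` does not.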

Now, in my case, I can trust my users not to send spam and viruses.  If you 
can't, then you can set up a different machine for outgoing mail than for 
incoming mail.  Have the outgoing mailer *always* accept mail from internal 
or authenticated hosts, queue it, then scan the queue.  This way mail is 
still always scanned but if you have SpamAssassin-caused delays your users 
probably won't notice.

Of course, the ideal solution is for SpamAssassin to never cause delays. 
Unfortunately this isn't realistic, so the next best solution is to have 
mail continue to work even when SpamAssassin isn't working.
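
One way to keep mail flowing is to "fail open": give the scanner a time budget and accept the message unscanned if it errors out or runs over. This is a rough Python sketch of that idea only; `scanner` is a hypothetical callable standing in for whatever invokes SpamAssassin, and the 30 second default is an arbitrary assumption.

```python
import concurrent.futures

# Small shared worker pool; the size is an arbitrary choice for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def scan_or_accept(message: bytes, scanner, timeout_seconds: float = 30.0) -> str:
    """Return the scanner's verdict, or "accept-unscanned" on failure.

    `scanner` is a hypothetical callable taking the raw message and
    returning a verdict string.
    """
    future = _POOL.submit(scanner, message)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:
        # Timeout, database outage, crash: deliver the mail anyway.
        return "accept-unscanned"
```

Note that a timed-out scan keeps running in its worker thread; a production setup would also want to cap or kill stuck scans.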

	-Kevin




Re: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Kevin Sullivan
--On 02/04/05 09:17:55 -0400 Peter Marshall wrote:
My question is the same as Henrik's.  I have a bunch of email that is spam
(either tagged by SpamAssassin or not tagged at all).  I forwarded it as
an attachment to a spam mailbox.  What do I have to do now before I
can get bayes to learn the message ... I read you have to remove the
headers.  Could anyone give me a little more detail?

There's no 100% good way to do this; it depends on how the message was 
mangled by the client (and possibly server).  The only guaranteed way is 
(as I described) to save a copy at the same point as it is inspected by 
SpamAssassin so you can use it later.

That being said, forwarding a message as an attachment will usually 
preserve the headers pretty well.  The Perl MailTools and MIME-tools 
modules have procedures to pull out attachments and save them in the Unix 
format which sa-learn wants.
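
As an illustration of the extraction step, here is a sketch that uses Python's standard `email` and `mailbox` modules instead of the Perl modules named above. It assumes the forwarded spam arrives as top-level `message/rfc822` attachments and that an mbox file is the wanted output; the function names are made up for this example.

```python
import email
import mailbox
from email import policy


def extract_forwarded(raw_bytes: bytes) -> list:
    """Return the messages attached as top-level message/rfc822 parts."""
    wrapper = email.message_from_bytes(raw_bytes, policy=policy.default)
    originals = []
    for part in wrapper.iter_parts():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part holds exactly one embedded message.
            originals.append(part.get_payload(0))
    return originals


def save_to_mbox(messages, mbox_path: str) -> None:
    """Append the given messages to an mbox file, creating it if needed."""
    box = mailbox.mbox(mbox_path)
    try:
        box.lock()
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()
```

The resulting file could then be handed to sa-learn; how much of the original message survives still depends on what the forwarding client did to it.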

Sorry I don't have any ready-made scripts for this; my users dump messages 
into shared IMAP mailboxes which don't need any preprocessing before being 
fed to sa-learn.

	-Kevin




RE: Manually training SpamAssassin by forwarding mail

2005-02-04 Thread Kevin Sullivan
--On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
Basically, I've got two options. All mail that is received is backed up on
the mailserver before adding any headers. I could match those with mail
received in the spam-learn and ham-learn accounts. However, mail is
backed up only for a limited amount of time before being moved, after
which the mail-server hasn't got any access to it. So unless people
report mail that found its way through the filters on a very regular
basis it won't be a foolproof solution.

You don't really need a 100% solution; something which works 80% of the 
time would probably be fine.  But you may not want to do the programming 
needed to automate this.

The other option sounds more viable, I would only need to strip off the
X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
setup for bayes anyhow), BUT I have no guarantee that the message is in
its original format. Some MIME boundary rewriting may be done by the
mailserver (where necessary), as is converting 8bit to 7bit where
possible. And I think that there are many client-side mail filtering
engines, spam scanners and virus scanners out there that may do some
rewriting as well.

You'll probably find that the various changes don't affect bayes that much. 
When a rewritten message is learned you may make bayes miss email which 
(in an ideal world) it would have caught, but I think it will tend to 
classify such messages around 50% ("I don't know if this is ham or spam") 
rather than classifying them incorrectly.  And there should be enough 
unchanged tokens in the messages to let bayes work anyway.

So I say strip off what you can but don't obsess about the rest.  Feed it 
into bayes and see how it works, and only try to fix it if you see bayes 
misclassifying email.
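
For the "strip off what you can" step, a minimal Python sketch could look like the one below. It assumes the headers to drop are the ones named in this thread (X-Scanned-By, X-Sanitized, and anything starting with X-Spam-); the function name is made up, and everything else in the message is left alone.

```python
import email
from email import policy

# Header names taken from the thread; adjust to match your own scanners.
EXACT_NAMES = {"x-scanned-by", "x-sanitized"}
NAME_PREFIXES = ("x-spam-",)


def strip_scanner_headers(raw_bytes: bytes) -> bytes:
    """Remove scanner-added headers and return the remaining message."""
    # compat32 keeps the other headers close to how they arrived.
    msg = email.message_from_bytes(raw_bytes, policy=policy.compat32)
    for name in list(msg.keys()):
        lowered = name.lower()
        if lowered in EXACT_NAMES or lowered.startswith(NAME_PREFIXES):
            # Deletes every occurrence of this header name.
            del msg[name]
    return msg.as_bytes()
```

The output can be saved and fed to sa-learn like any other message.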

-Kevin





ALL_TRUSTED problems

2004-11-24 Thread Kevin Sullivan
I've set the trusted networks manually:

clear_trusted_networks
trusted_networks 127/8
trusted_networks 205.201.9.33/32
trusted_networks 10.30/16
clear_internal_networks
internal_networks 127/8 205.201.9.33/32 10.30/16

But I still get *lots* of mail incorrectly triggering ALL_TRUSTED.  I'm 
running spamassassin from a milter.  It looks like the milter runs before 
sendmail adds its own Received: line, so much mail comes in with no 
Received lines.  And it looks like mail with no Received lines is 
automatically tagged as trusted.
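
The suspicion can be illustrated with a toy model. This is not SpamAssassin's code, only a hypothetical Python sketch of how a check phrased as "no relay is untrusted" would come out true when no Received lines were parsed, next to a guarded variant that requires at least one relay.

```python
def all_relays_trusted(relay_ips, trusted):
    """Hypothetical model: True when no relay is untrusted."""
    # With an empty relay list, all() is vacuously True.
    return all(ip in trusted for ip in relay_ips)


def all_relays_trusted_guarded(relay_ips, trusted):
    """Same test, but only when at least one relay was seen."""
    return bool(relay_ips) and all(ip in trusted for ip in relay_ips)
```

In this model a message with no relays passes the first check and fails the second.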

So, does this seem plausible?  And can it be fixed?

It seems like there have been many problems with the ALL_TRUSTED system 
in 3.0.  Is there a way to disable the whole thing?  I know that I can 
set ALL_TRUSTED to 0 points; will that also stop the side effects of 
ALL_TRUSTED?

	-Kevin

