Re: dealing with SPF and external authenticated users

2006-01-10 Thread hamann . w
  
 What would be the correct way of dealing with this situation? As a
  workaround I have used whitelist_from_rcvd [EMAIL PROTECTED], which seems
  to be a great workaround, because I have rules in Postfix that do not allow
  external users that do NOT authenticate to send messages with my own
  domain, not even to my local users.
 
 There's nothing wrong with that solution since you have Postfix set up to
 refuse mail to local addresses from un-auth'd users.
 
 
I implemented a similar setup a while ago, and it turned out that some
legitimate (although suspicious-looking) mails from eBay were blocked.
I had to whitelist eBay there.
That particular user is no longer there, so I don't know whether eBay has
revised these mails since.

Wolfgang Hamann






Another URL obfuscation

2006-01-10 Thread Rosenbaum, Larry M.
I found this obfuscated URL in a drug spam:

A href=3Dhttp://gozifo .upze5otbbutzanbb655k685ys5nn%2Eridgykh=
comFONT SIZE=3D2/FONT


Larry R.



Re: Another URL obfuscation

2006-01-10 Thread Jeff Chan
On Tuesday, January 10, 2006, 6:17:38 AM, Larry Rosenbaum wrote:
 I found this obfuscated URL in a drug spam:

 A href=3Dhttp://gozifo .upze5otbbutzanbb655k685ys5nn%2Eridgykh=
comFONT SIZE=3D2/FONT

Good grief, does any mail client actually parse that as a
functional URI?

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/



RE: rules better than bayes?

2006-01-10 Thread Chris Santerre

 -Original Message-
 From: jo3 [mailto:[EMAIL PROTECTED]]
 Sent: Monday, January 09, 2006 2:28 PM
 To: users@spamassassin.apache.org
 Subject: rules better than bayes?
 
 
 Hi,
 
 This is an observation, please take it in the spirit in which it is 
 intended, it is not meant to be flame bait.
 
 After using spamassassin for six solid months, it seems to me that the
 bayes process (sa-learn [--spam | --ham]) has only very limited success
 in learning about new spam. Regardless of how many spams and hams are
 submitted, the effectiveness never goes above the default level which,
 in our case here, is somewhere around 2 out of 3 spams correctly
 identified. By the same token, after adding the third party rule,
 airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
 identified.


I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated than training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs.

I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;) I get about 98% of spam caught, and few FPs.

This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need. 

Chris Santerre
SysAdmin and SARE/URIBL ninja
http://www.uribl.com
http://www.rulesemporium.com





Re: Another URL obfuscation

2006-01-10 Thread Chris Lear
* Jeff Chan wrote (10/01/2006 15:42):
 On Tuesday, January 10, 2006, 6:17:38 AM, Larry Rosenbaum wrote:
 I found this obfuscated URL in a drug spam:
 
 A href=3Dhttp://gozifo .upze5otbbutzanbb655k685ys5nn%2Eridgykh=
 comFONT SIZE=3D2/FONT
 
 Good grief, does any mail client actually parse that as a
 functional URI?

Yes. In your e-mail, my Thunderbird created a clickable link to
http://gozifo
My IE gives a DNS error when it tries that address.
My FireFox redirects to
http://www.google.com/search?btnI=I%27m+Feeling+Lucky&ie=UTF-8&oe=UTF-8&q=gozifo
which in turn redirects to http://www.vojir.com/other/basic-myebol.html
which gives a 404 error. It's probably possible to turn this
(mis)feature off in FireFox, but there it is by default.

I have no idea whether this is the original intention of the
obfuscation. I would guess not - and if it's viewed as html to start
with that might make a difference.

Chris


RE: rules better than bayes?

2006-01-10 Thread Matt Kettler

At 10:50 AM 1/10/2006, Chris Santerre wrote:


I have long said that IMHO, I do not think bayes is worth it. Left 
unattended, it isn't as good. A simple rule can take out a lot of spam. 
Some may say rule writing is more complicated then training bayes. Maybe. 
Not so much the rule writing, but the figuring out what to look for and 
testing for FPs.



Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.


I rarely train manually, except at initial setup where I feed it a good 
base learning. (the autolearner can sometimes go awry if you don't train 
some mail manually before letting it go.)


On a day to day basis I mostly feed automatically with a cronjob that 
collects mail via spamtraps and hamtraps. I have that coupled with 
autolearning that's set a bit differently than the defaults. (IMNSHO, 
having a ham learning threshold that's positive is suicide, but I also have 
a large number of small negative-score rules so I can keep my threshold at 
-0.01 and actually autolearn some ham).


This setup is near zero maintenance, and highly effective. I can't see why 
it wouldn't be worth it. It's almost as good as turning on URIBLs and not 
much more work. It's certainly much less work than rule writing. The last 
time I bothered to tinker with my bayes was before Christmas. 



Re: SA 3.10 skipping some emails or errors in log??

2006-01-10 Thread George R . Kasica
On Mon, 09 Jan 2006 21:45:11 -0500, you wrote:

On 09/01/2006 7:36 PM, George R. Kasica wrote:

 Jan  9 15:31:07 eagle spamd[8420]: spamd: processing message [EMAIL PROTECTED] for mail:561
 Jan  9 15:34:55 eagle spamd[8715]: __alarm__
 Jan  9 15:35:01 eagle spamd[8715]: __alarm__
 Jan  9 15:35:01 eagle spamd[8311]: prefork: child states: BBBIB
 Jan  9 15:35:02 eagle spamd[8719]: spamd: processing message [EMAIL PROTECTED] for mail:561
 Jan  9 15:35:12 eagle spamd[8311]: tcp timeout at /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
 Jan  9 15:35:12 eagle spamd[8311]: tcp timeout at /usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/SpamdForkScaling.pm line 195.
 Jan  9 15:35:12 eagle spamd[8311]: prefork: select returned undef! recovering
 Jan  9 15:35:48 eagle spamd[8712]: spamd: clean message (0.0/5.0) for mail:561 in 186.2 seconds, 14503 bytes.
 Jan  9 15:35:48 eagle spamd[8712]: spamd: result: .  0 - HTML_MESSAGE scantime=186.2,size=14503,user=mail,uid=561,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=54421,mid=[EMAIL PROTECTED],autolearn=disabled

Please see, and comment on, bug 4696:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4696


Daryl

Daryl:

Just curious: is there an estimate for how long it will be until the
problem is corrected? Right now, the way SA 3.1 is operating here,
it is almost worthless: it is catching and scanning only about 20% of the
spam, which I'm assuming is due to this bug?

George
===[George R. Kasica]===+1 262 677 0766
President   +1 206 374 6482 FAX 
Netwrx Consulting Inc.  Jackson, WI USA 
http://www.netwrx1.com
[EMAIL PROTECTED]
ICQ #12862186


RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
Hi Matt, I'm interested in how your setup compares to mine.  I also find
Bayes very useful, but I haven't gotten it to work as well as what
you've described.

 
 Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
 hits. (And has 98.91% of URIBL's total hits) I find it completely
 indispensable.
 

Are you using a single site-wide database, or is this a per-user setup?

 I rarely train manually, except at initial setup where I feed it a good
 base learning. (the autolearner can sometimes go awry if you don't train
 some mail manually before letting it go.)
 

The trouble I had with the autolearner was that some spammers would send
innocuous mail through to raise their scores until Bayes decided they
were ok, then start spamming.  That was a couple of versions back, does
that sort of thing no longer work?

 On a day to day basis I mostly feed automatically with a cronjob that
 collects mail via spamtraps and hamtraps. I have that coupled with
 autolearning that's set a bit differently than the defaults. (IMNSHO,
 having a ham learning threshold that's positive is suicide, but I also have
 a large number of small negative-score rules so I can keep my threshold at
 -0.01 and actually autolearn some ham).
 

I'd love to make my Bayesian database more effective, is there a doc
somewhere that describes how you tuned it to your environment?


Re: rules better than bayes?

2006-01-10 Thread Matt Kettler
Aaron Grewell wrote:
 Hi Matt, I'm interested in how your setup compares to mine.  I also find
 Bayes very useful, but I haven't gotten it to work as well as what
 you've described.
 
 
Interesting.. For me, BAYES_99 is right between SURBL and 
URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.

 
 
 Are you using a single site-wide database, or is this a per-user setup?

Single site-wide.. I use mailscanner which does not support per-user, but I'm
not really looking for it.
 
 
I rarely train manually, except at initial setup where I feed 
it a good 
base learning. (the autolearner can sometimes go awry if you 
don't train 
some mail manually before letting it go.)

 
 
 The trouble I had with the autolearner was that some spammers would send
 innocuous mail through to raise their scores until Bayes decided they
 were ok, then start spamming.  That was a couple of versions back, does
 that sort of thing no longer work?


Erm, that really shouldn't affect the bayes autolearner.. perhaps you are
thinking of the AWL? I don't run the AWL for this very reason.

  On a day to day basis I mostly feed automatically with a cronjob that 
collects mail via spamtraps and hamtraps. I have that coupled with 
autolearning that's set a bit differently than the defaults. (IMNSHO, 
having a ham learning threshold that's positive is suicide, 
but I also have 
a large number of small negative-score rules so I can keep my 
threshold at 
-0.01 and actually autolearn some ham).

 
 
 I'd love to make my Bayesian database more effective, is there a doc
 somewhere that describes how you tuned it to your environment?

Not really.. but it's not hard.

Spamtraps and hamtraps:
---
1) create a secret hamtrap email account. Subscribe this account to
newsletters and news feeds that your users typically subscribe to. Do not post
this address around, and don't use hamtrap as the account name, it's too 
obvious.

2) create a spamtrap account, or several of them. Carefully seed this out in
the body of some Usenet and mailing list postings.

3) create a cron-job that auto-feeds the above mail to sa-learn.

Simple example fragment of the script I use (it keeps a rotating archive of the
past 5 learning sessions):

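#!/bin/sh
# Feed the spamtrap mailbox to sa-learn and archive it afterwards.
cd /var/spool/training/ || exit 1

if [ -f /var/spool/mail/spamtrap ]; then
 echo "learning spam mailbox - spamtrap"
 # Move the mailbox out of the spool so new mail starts a fresh file.
 mv /var/spool/mail/spamtrap .
 /usr/bin/sa-learn --spam --mbox spamtrap
 # Rotate the archive: drop the oldest session, shift the rest up by one.
 rm -f spam/spamtrap.alearn5.gz
 mv spam/spamtrap.alearn4.gz spam/spamtrap.alearn5.gz
 mv spam/spamtrap.alearn3.gz spam/spamtrap.alearn4.gz
 mv spam/spamtrap.alearn2.gz spam/spamtrap.alearn3.gz
 gzip spam/spamtrap.alearn1
 mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz
 # The mailbox just learned becomes the newest archive entry.
 mv spamtrap spam/spamtrap.alearn1
fi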

4) Carefully monitor the data being fed for a while (two weeks or so) to make
sure there's no pollution. After it's established you can monitor it less often.
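
The rotation in the script above can also be written as a loop, which makes
the archive depth a parameter. This is only a sketch: the function name
rotate_archive and its arguments are made up for illustration, and it assumes
the same spam/NAME.alearnN(.gz) naming that the script uses.

```shell
#!/bin/sh
# rotate_archive DIR NAME KEEP  (hypothetical helper, not part of SpamAssassin)
# Mirrors the manual mv chain: NAME.alearn1 is the newest, uncompressed
# session; NAME.alearn2.gz .. NAME.alearnKEEP.gz are older, gzipped ones.
# KEEP is assumed to be at least 2.
rotate_archive() {
    dir=$1
    name=$2
    keep=$3

    # Drop the oldest session (no error if it does not exist yet).
    rm -f "$dir/$name.alearn$keep.gz"

    # Shift KEEP-1 -> KEEP, ..., 2 -> 3, skipping gaps on early runs.
    i=$keep
    while [ "$i" -gt 2 ]; do
        prev=$((i - 1))
        if [ -f "$dir/$name.alearn$prev.gz" ]; then
            mv "$dir/$name.alearn$prev.gz" "$dir/$name.alearn$i.gz"
        fi
        i=$prev
    done

    # Compress the previous session and move it into slot 2.
    if [ -f "$dir/$name.alearn1" ]; then
        gzip -f "$dir/$name.alearn1"
        mv "$dir/$name.alearn1.gz" "$dir/$name.alearn2.gz"
    fi
}
```

In the script, calling rotate_archive spam spamtrap 5 and then running
mv spamtrap spam/spamtrap.alearn1 would stand in for the rm/mv/gzip lines.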


Autolearn adjustment:
---
1) add  bayes_auto_learn_threshold_nonspam -0.01 to your local.cf

2) create a bayes_hamlearning.cf file. Create several simple body text rules
with catch phrases from your normal nonspam. Assign these rules very small
negative scores (-0.01 to -0.1). This is generally easier in a corporate
environment, but it can be done in academic too.

body LOCAL_THESIS   /\bThesis\b/i
score LOCAL_THESIS  -0.01

You have to keep the scores small, as you don't want to use these to whitelist
spam mail. You merely want to make mail that would otherwise score 0 earn a
small negative score if it's got some of these phrases in it. It's not perfect,
but it's better than blindly learning everything under 0.5. I feel learning as
ham should be earned, not a default for not hitting any rules at all.
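
To make the effect of the threshold concrete, here is a rough sketch of the
decision being described. It is a simplification, not SpamAssassin's actual
code: the helper name autolearn_decision is invented, the spam threshold of
12 in the usage note is just a placeholder value, and the sketch assumes that
ham is learned strictly below the nonspam threshold and spam at or above the
spam threshold. The other autolearn conditions are left out.

```shell
#!/bin/sh
# autolearn_decision SCORE NONSPAM_THRESHOLD SPAM_THRESHOLD
# Hypothetical helper that prints "ham", "spam" or "none".
# Assumption: ham is learned strictly below the nonspam threshold and
# spam at or above the spam threshold; every other autolearn condition
# is deliberately ignored here.
autolearn_decision() {
    awk -v s="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (s + 0 < ham + 0)        print "ham"
        else if (s + 0 >= spam + 0) print "spam"
        else                        print "none"
    }'
}
```

Under those assumptions, a message that hits no rules at all (score 0) is
learned as ham with a positive threshold such as 0.1
(autolearn_decision 0 0.1 12), but not with -0.01
(autolearn_decision 0 -0.01 12). It only gets learned once a few of the small
negative rules pull it down, for example to -0.05.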

The problem is that this requires some customization. It can't be a default setup
of SA, as the catch phrases vary from place to place, and if there were a
default set of them spammers would be sure to always include them, making them
pointless. You'd effectively have the same thing as the current default: by
avoiding spam rules and existing bayes tokens, spammers can get a message learned.










Re: rules better than bayes?

2006-01-10 Thread Jim Maul

Aaron Grewell wrote:

Hi Matt, I'm interested in how your setup compares to mine.  I also find
Bayes very useful, but I haven't gotten it to work as well as what
you've described.

Interesting.. For me, BAYES_99 is right between SURBL and 
URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.




Are you using a single site-wide database, or is this a per-user setup?



I'm not Matt, but I'm running a very similar setup that works very well, so I
thought I would comment too.  I'm running a single site-wide database.
All mail is processed under my spamd user.



I rarely train manually, except at initial setup where I feed 
it a good 
base learning. (the autolearner can sometimes go awry if you 
don't train 
some mail manually before letting it go.)




The trouble I had with the autolearner was that some spammers would send
innocuous mail through to raise their scores until Bayes decided they
were ok, then start spamming.  That was a couple of versions back, does
that sort of thing no longer work?



I rarely train manually as well.  The only ones I train (and it's only
because there is nothing else to train) are spams that are correctly
identified as such but have autolearn=no because they did not meet the
autolearn criteria.  These almost always have BAYES_99 and a score of 20
or so, but most likely did not have enough header points to be autolearned.


I didn't even start by training my database manually.  I started from
scratch and let the autolearner do its thing.  I have never had to
correct what it did because it was always right.  The poison that
spammers like to include in messages doesn't appear to have any effect on
the overall outcome of the bayes score.  I don't really know why this is;
it just works.


NOTE: to operate in this fashion I believe it is imperative that you
change the autolearn thresholds.  The defaults are dangerous! (At least
in 2.64, which I still run.)  I have mine set as such:


bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0

To this date (I've been running it for over 2 years) I have yet to see the
autolearner misclassify.  Most bayes hits are at the far extremes (BAYES_99
and BAYES_00), with only a few in the 80-90 range.



On a day to day basis I mostly feed automatically with a cronjob that 
collects mail via spamtraps and hamtraps. I have that coupled with 
autolearning that's set a bit differently than the defaults. (IMNSHO, 
having a ham learning threshold that's positive is suicide, 
but I also have 
a large number of small negative-score rules so I can keep my 
threshold at 
-0.01 and actually autolearn some ham).




I'd love to make my Bayesian database more effective, is there a doc
somewhere that describes how you tuned it to your environment?



I doubt there is anything that specific, and if there were, it most likely
wouldn't help you in your situation.  There are general tuning notes on
the SA website and such, but you really just have to try and see what
works and what doesn't in your setup.  What works well for one person may
not work at all for someone else.


-Jim


Re: rules better than bayes?

2006-01-10 Thread Kelson Vibber

Aaron Grewell wrote:

The trouble I had with the autolearner was that some spammers would send
innocuous mail through to raise their scores until Bayes decided they
were ok, then start spamming.  That was a couple of versions back, does
that sort of thing no longer work?


Are you sure this is Bayes-related?  Bayes looks at the entire message, 
not just the sender.  All I'd expect this tactic to do would be to make 
future innocuous mail look more innocuous -- it shouldn't have any 
significant impact on spammy mail from the same source since the content 
will be different.


--
Kelson Vibber
SpeedGate Communications, www.speed.net


RE: Another URL obfuscation

2006-01-10 Thread Matthew.van.Eerde
Chris Lear wrote:
 http://gozifo
 My IE gives a DNS error when it tries that address.
 My FireFox redirects to
 http://www.google.com/search?btnI=I%27m+Feeling+Lucky&ie=UTF-8&oe=UTF-8&q=gozifo
 which in turn redirects to
 http://www.vojir.com/other/basic-myebol.html which gives a 404 error.
 It's probably possible to turn this (mis)feature off in FireFox, but
 there it is by default. 

about:config
Set the keyword.enabled pref to false
Or change the keyword.URL pref to another URL of your choosing

-- 
Matthew.van.Eerde (at) hbinc.com   805.964.4554 x902
Hispanic Business Inc./HireDiversity.com   Software Engineer


OT: Using ldap_routing in sendmail to verify GroupWise Recipients before SA

2006-01-10 Thread Joe Zitnik


I'm trying to configure sendmail to perform recipient verification by using ldap_routing in order to reduce the number of messages that need to be scanned by Guinevere and SpamAssassin.

Our configuration is similar to the setup discussed in comp.mail.sendmail, readable here:

http://www.issociate.de/board/post/266566/check_users_and_forward_to_an_other_mail_server.html 

I've basically been successful in setting this up in a test environment as follows:

FEATURE(`ldap_routing',`null',`ldap -1 -TTMPF -v mail -k mail=%0',`bounce')dnl
LDAPROUTE_DOMAIN(hfcc.edu)dnl
LDAPROUTE_DOMAIN(hfcc.net)dnl
LDAPROUTE_DOMAIN(henryford.cc.mi.us)dnl
LDAPROUTE_DOMAIN(mail.henryford.cc.mi.us)dnl
define(`confLDAP_DEFAULT_SPEC', `-h hostname -b o=org -s sub')dnl

There's only one problem. We have multiple domains (as you can see above) and yet each user only has one domain in their mail attribute.

I don't need to route, just verify existence and drop non-matches. I can't find any documentation on the parameters for ldap_routing except that -v and -k are required fields and a couple of examples here and there.

So here's my question: It's apparent that %0 is the recipient's email address. If there was an easy way to only check the lhs of the address, I could compare it against a different attribute and it would match all possible domains, and that would be good enough. I don't know enough about sendmail rule hacking to do this, but I'm sure it can be done.


RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
 
 Erm, that really shouldn't affect the bayes autolearner.. 
 perhaps you are
 thinking of the AWL? I don't run the AWL for this very reason.


Oh yeah.  I was thinking of the AWL.  NM.
 

 The problem is this requires some customization. This can't be a default setup
 of SA as the catch phrases vary from place to place, and if there was a
 default set of them spammers would be sure to always include them, making them
 pointless. You'd effectively have the same thing as the current default, by
 avoiding spam rules and existing bayes tokens they can get a message learned.
 

That all makes sense.  I'll give it a shot.  Thanks!

-Aaron


RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
 

 Im not matt, but running a very similar setup which works very well so i
 thought i would comment also.  Im running a single sitewide database.
 All mail is processed under my spamd user.

OK, that's basically what I'm doing too.

 
 I rarely train manually as well.  

 NOTE: to operate in this fashion i believe it is imperative that you
 change the autolearn thresholds.  The defaults are dangerous! (atleast
 in 2.64 which i still run).  I have mine set as such:
 
 bayes_auto_learn_threshold_nonspam -0.1
 bayes_auto_learn_threshold_spam 10.0
 

OK, Matt said something similar about the thresholds.  Mine are default
so that may be part of the issue.  Thanks for the feedback!

-Aaron


Re: rules better than bayes?

2006-01-10 Thread Marc Perkel
Bayes wouldn't be much good without the rules to provide a basic compass
as to what is spam and what is not. The rules are in large part what makes
bayes work.


Getting Exim to read SA MySQL AWL Database to reduce load

2006-01-10 Thread Marc Perkel
Has anyone tried to get Exim to read the MySQL database of SA? The 
reason I'm asking is that I'm thinking that under load conditions Exim 
could read the AWL database and bypass SA on matches with very high 
scores (just rejecting them) and messages with very low scores (just 
accepting them bypassing SA). The idea is to give overloaded servers 
better performance during rush hour and let the learner work when 
things are slower.


Anyone done this?
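
For illustration only, the decision described above might look like the
sketch below once the sender's AWL entry has been fetched. Everything here is
an assumption: the helper name awl_action is invented, the count and total
score are assumed to come from the sender's AWL row, and the two cut-offs are
site-chosen numbers rather than SpamAssassin settings. It does not show how
Exim would run the MySQL lookup.

```shell
#!/bin/sh
# awl_action COUNT TOTSCORE LOW HIGH  (hypothetical helper)
# Prints "accept", "reject" or "scan" from the sender's mean AWL score.
# COUNT and TOTSCORE are assumed to come from the sender's AWL row;
# LOW and HIGH are site-chosen cut-offs, not SpamAssassin settings.
awl_action() {
    awk -v n="$1" -v tot="$2" -v low="$3" -v high="$4" 'BEGIN {
        if (n + 0 <= 0) { print "scan"; exit }   # unknown sender
        mean = tot / n
        if (mean >= high + 0)     print "reject"
        else if (mean <= low + 0) print "accept"
        else                      print "scan"
    }'
}
```

Anything between the two cut-offs, and any sender with no history, still goes
through SA as usual.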

--
Marc Perkel - [EMAIL PROTECTED]

Spam Filter: http://www.junkemailfilter.com
   My Blog: http://marc.perkel.com



Re: rules better than bayes?

2006-01-10 Thread Justin Mason

Matt Kettler writes:
 At 10:50 AM 1/10/2006, Chris Santerre wrote:
 
 I have long said that IMHO, I do not think bayes is worth it. Left 
 unattended, it isn't as good. A simple rule can take out a lot of spam. 
 Some may say rule writing is more complicated then training bayes. Maybe. 
 Not so much the rule writing, but the figuring out what to look for and 
 testing for FPs.
 
 Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
 hits. (And has 98.91% of URIBL's total hits) I find it completely 
 indispensable.

The thing is, Bayes is a tool for personalization -- and as such, its
effectiveness varies widely depending on what *you* do with it.

For what it's worth, I've *never* trained my current Bayes DB, and have
been running with it for about 6 months I think.  I get BAYES_00 on most
ham, and BAYES_99 on most spam.

But the 4 letters that matter with Bayes are:

YMMV

;)

--j.



Re: rules better than bayes?

2006-01-10 Thread William Stearns

Good evening, Justin, all,

On Tue, 10 Jan 2006, Justin Mason wrote:


-(Modified PGP heading)-
Hash: SHA1

Matt Kettler writes:

At 10:50 AM 1/10/2006, Chris Santerre wrote:


I have long said that IMHO, I do not think bayes is worth it. Left
unattended, it isn't as good. A simple rule can take out a lot of spam.
Some may say rule writing is more complicated then training bayes. Maybe.
Not so much the rule writing, but the figuring out what to look for and
testing for FPs.


Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
hits. (And has 98.91% of URIBL's total hits) I find it completely
indispensable.


The thing is, Bayes is a tool for personalization -- and as such, its
effectiveness varies widely depending on what *you* do with it.

For what it's worth, I've *never* trained my current Bayes DB, and have
been running with it for about 6 months I think.  I get BAYES_00 on most
ham, and BAYES_99 on most spam.

But the 4 letters that matter with Bayes are:

   YMMV


	Isn't that an OTCBB Ticker symbol?  I heard they're about to go 
through the _roof_!!

/me ducks...
Cheers,
- Bill

---
We don't want an election without a paper trail...all three
owners of the companies who make these machines are donors to the Bush
administration.  Is this not corruption?
-- Gore Vidal
(Courtesy of http://www.laweekly.com/ink/03/52/features-cooper.php)
--
William Stearns ([EMAIL PROTECTED]).  Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at:   http://www.stearns.org
--


Re: Another URL obfuscation

2006-01-10 Thread Loren Wilton
A href=3Dhttp://gozifo .upze5otbbutzanbb655k685ys5nn%2Eridgykh=
comFONT SIZE=3D2/FONT

Ooooh, cute!  Breaks a lot of regex scanners that are looking for the end of
the href record!
First time I've seen those in html; I've been seeing them in plain text for
a week or two.
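
One way a scanner could cope is to undo the encodings before looking for the
end of the href. The sketch below is an illustration, not how SpamAssassin
does it: the function name normalize_href is made up, and it only handles the
three tricks visible in the sample (a quoted-printable soft line break, =3D
and %2E), not quoted-printable or percent-encoding in general.

```shell
#!/bin/sh
# normalize_href (hypothetical name): read raw message text on stdin and
# undo the encodings used in the sample before any URI matching.
normalize_href() {
    # 1) Join quoted-printable soft line breaks (a trailing "=").
    # 2) Decode "=3D" to "=" and "%2E" to ".".
    awk '{ if (sub(/=$/, "")) printf "%s", $0; else print }' |
        sed -e 's/=3D/=/g' -e 's/%2[Ee]/./g'
}
```

Fed the two lines of the sample, this yields a single line in which the href
value is no longer split across a line break.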

Loren



Re: rules better than bayes? Certainly better than mine.

2006-01-10 Thread Jim Maul

Andrew Donkin wrote:

Jim Maul [EMAIL PROTECTED] writes:


NOTE: to operate in this fashion i believe it is imperative that you
change the autolearn thresholds.  The defaults are dangerous! (atleast
in 2.64 which i still run).  I have mine set as such:

bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0


Matt agreed.  Aaron was going to change to something similar.

Before reading this thread, I did the opposite.  I changed my nonspam
threshold from -0.2 to the default 0.1 because I thought
(mistakenly perhaps) that the Bayes database's spam:ham ratio was far
too high.  Incoming mail is about 3:1, but the Bayes database was more
like 20:1.  See:

 3 bayes db version
   1491805 nspam
 75795 nham
   1081029 ntokens
1136779207 oldest atime
1136925099 newest atime
1136925026 last journal sync atime
1136838312 last expiry atime
 43200 last expire atime delta
 25087 last expire reduction count



I started autolearning with the defaults and then quickly changed my
thresholds as mentioned before.  Our server here doesn't see a lot of
spam (hell, it doesn't even see a lot of mail in total), so our ratios are
obviously going to be different.  Mine shows:


 2  0  non-token data: bayes db version
 26378  0  non-token data: nspam
 54313  0  non-token data: nham
147479  0  non-token data: ntokens
1134172970  0  non-token data: oldest atime
1136925620  0  non-token data: newest atime
1136925554  0  non-token data: last journal sync atime
1136232703  0  non-token data: last expiry atime
   2060396  0  non-token data: last expire atime delta
 34608  0  non-token data: last expire reduction count
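
The spam:ham ratios being compared can be read straight off the two dumps. A
small sketch, assuming the layout as it appears in this thread (count in the
first column, label in the last); the function name bayes_ratio is made up,
and a dump with a different column layout would need the field number
changed.

```shell
#!/bin/sh
# bayes_ratio (hypothetical helper): read dump lines on stdin and print
# nspam divided by nham, rounded to one decimal place.
# Assumes the count is the first field and the label is the last field,
# as in the dumps quoted in this thread.
bayes_ratio() {
    awk '$NF == "nspam" { spam = $1 }
         $NF == "nham"  { ham = $1 }
         END { if (ham > 0) printf "%.1f\n", spam / ham; else print "n/a" }'
}
```

For the first dump (1491805 nspam, 75795 nham) this gives 19.7, which matches
the "more like 20:1" above; for the second (26378 nspam, 54313 nham) it gives
0.5.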





In particular, a message from James Keating of this list received this
report from Bayes:

X-Spam-Bayes-ham: 0.011-8--5h-0s--19d--SpamAssassin, 
	0.026-3--2h-0s--19d--autolearn, 0.029-203--156h-39s--19d--5.0, 
	0.031-7--5h-1s--19d--spamassassin, 0.050-4162--3796h-1707s--0d--i'm
X-Spam-Bayes-spam: 1.000-149--0h-6920s--1d--HX-Accept-Language:en-us, 
	1.000-27--0h-1229s--18d--H*UA:Thunderbird, 
	1.000-24--0h-1083s--18d--H*u:Thunderbird, 
	1.000-16--0h-718s--0d--H*RU:sk:cpe-24-, 
	1.000-13--0h-594s--11d--H*r:sk:cpe-24-


...implying that User-agent: Thunderbird was in a thousand spams but
no hams.  And that Accept-Language:en-us was in 6900 spams and no
hams.  !

So, I'm thinking that my Bayes is hosed again.  Will a hamtrap help me
here?



I'm not sure; I've never seen this report before, and I certainly don't
have the same message to compare against what it scored on my system here.  Have
you noticed Bayes misclassifying messages because of this, or are you
speaking theoretically?  A huge ratio alone does not imply a problem;
it's the results that matter.



I'm CCing you, Jim, because my last two posts to the list vanished
without a trace.



Not a problem.  I'm just not sure how much help I am in this situation...

-Jim


Re: SA 3.10 skipping some emails or errors in log??

2006-01-10 Thread George R . Kasica
On Tue, 10 Jan 2006 18:58:37 -0500, you wrote:

On 10/01/2006 11:29 AM, George R. Kasica wrote:
On Mon, 09 Jan 2006 21:45:11 -0500, you wrote:

Please see, and comment on, bug 4696:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4696


 Just curious as to the estimate for how long it will be until the
 problem is corrected? Right now with the way SA 3.1 is operating here
 it is almost worthless, catching and scanning about 20% of the spam
 due to the bug causing difficulties I'm assuming?

If you can get a strace -ftp PID of the parent spamd process while 
this happens (along with a matching debug log) and *attach* it to the 
bug, I'm sure Justin would take a look at it.

I haven't been able to reproduce it myself, so I haven't looked at it 
further.

Daryl:

Not a programmer here, but with a little direction I think I can get
the info.

I'm assuming the following here:

strace -ftp PID where PID is the PID of the parent spamd process
correct?

As to debug log, how would I go about that? Is it the info I provided
earlier just doing it over again to match with strace output?

George
George, MR. Tibbs, Nazarene, Ginger/The Beast Kasica(8/1/88-3/19/01, 1/17/02-)
Jackson, WI USA
[EMAIL PROTECTED]
http://www.netwrx1.com/georgek
ICQ #12862186

(`-''-/).___..--''`-._
`6_ 6  )   `-.  ( ).`-.__.`)
(_Y_.)'  ._   )  `._ `. ``-..-'
_..`--'_..-_/  /--'_.' ,'
(il),-''  (li),'  ((!.-'


Re: SA 3.10 skipping some emails or errors in log??

2006-01-10 Thread Daryl C. W. O'Shea

On 10/01/2006 8:17 PM, George R. Kasica wrote:

On Tue, 10 Jan 2006 18:58:37 -0500, you wrote:


If you can get a strace -ftp PID of the parent spamd process while 
this happens (along with a matching debug log) and *attach* it to the 
bug, I'm sure Justin would take a look at it.


I haven't been able to reproduce it myself, so I haven't looked at it 
further.



Daryl:

Not a programmer here, but with a little direction I think I can get
the info.

I'm assuming the following here:

strace -ftp PID where PID is the PID of the parent spamd process
correct?


Yeah PID is the process ID of the parent spamd process.  Also, you can 
redirect the output to a file with normal redirection, or just specify 
an output file with the -o option, ala:


strace -ftp PID -o /path/to/output/file



As to debug log, how would I go about that? Is it the info I provided
earlier just doing it over again to match with strace output?


Yeah.  You might want to add -Dprefork as one of the options to your 
spamd call though.



Daryl



Re: rules better than bayes?

2006-01-10 Thread mouss
Chris Santerre a écrit :

 
 I have long said that IMHO, I do not think bayes is worth it. Left
 unattended, it isn't as good. A simple rule can take out a lot of spam. Some
 may say rule writing is more complicated then training bayes. Maybe. Not so
 much the rule writing, but the figuring out what to look for and testing for
 FPs. 
 
 I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
 SARE rules ;)  I get about 98% of spam caught, and little FPs. 
 
 This is going to sound like tooting our own horn, but so be it. Before SARE,
 Bayes was cool. After SARE, I see no need. 

I think SARE and bayes are complementary:

- sare will detect new spam once ninjas have found the corresponding rules.

- bayes will detect new spam if it resembles previous spam.

That said, I don't use SA/Bayes (I use dspam on a per-user basis, while
SA is site-wide).


Re: rules better than bayes?

2006-01-10 Thread jdow

From: Chris Santerre [EMAIL PROTECTED]

-Original Message-
From: jo3 [mailto:[EMAIL PROTECTED]

Hi,

This is an observation, please take it in the spirit in which it is 
intended, it is not meant to be flame bait.


After using spamassassin for six solid months, it seems to me that the
bayes process (sa-learn [--spam | --ham]) has only very limited success
in learning about new spam.  Regardless of how many spams and hams are
submitted, the effectiveness never goes above the default level which,
in our case here, is somewhere around 2 out of 3 spams correctly
identified.  By the same token, after adding the third party rule,
airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
identified.


I have long said that IMHO, I do not think bayes is worth it. Left
unattended, it isn't as good. A simple rule can take out a lot of spam. Some
may say rule writing is more complicated than training bayes. Maybe. Not so
much the rule writing, but the figuring out what to look for and testing for
FPs. 


I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
SARE rules ;)  I get about 98% of spam caught, and few FPs. 


This is going to sound like tooting our own horn, but so be it. Before SARE,
Bayes was cool. After SARE, I see no need. 


Autolearning Bayes is not really very good, based on what people here
seem to say. I do note that I raised my BAYES_99 score to 5. If BAYES_99
hits, the odds that the message is spam are so high that it's silly to
give BAYES_99 a low score, theoretical nonsense notwithstanding.

If you apply the wrong statistical theory with the wrong conceptual
criteria, the math or theory may be good but the results are trash. For
an existing spam database the rule setup that exists is probably quite
good. If 99 hits then other rules probably hit as well, which leads to
artificially lowering the 99 score. Then when a new technique comes along
that Bayes can recognize but nothing else does, the message floats
on through. At least on this system 99 misses once in 2000 to 1
times. Most of those times other very light whitelisting rules let the
messages come through. Probably the right score for more general use
would be 4.95 or something, such that if any other rule hits it's dinged
as spam. It depends on your spam tolerance compared to your tolerance
for sorting spam by score and looking at the few that are marginal.
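
To make the arithmetic above concrete, here is a minimal Python sketch of
how a BAYES_99 score of 4.95 or 5 plays against the spam threshold. It is
an illustration only, not SpamAssassin code. It assumes the stock
required_score of 5.0 and a "total >= threshold" comparison; OTHER_RULE and
LIGHT_WHITELIST are made-up rule names with made-up scores.

```python
# Sketch only: SpamAssassin sums the scores of every rule that hit and
# compares the total with required_score (5.0 is assumed here).
REQUIRED_SCORE = 5.0


def total_score(hits, scores):
    """Sum the configured score of each rule name in `hits`."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores, required=REQUIRED_SCORE):
    """Treat the message as spam once the total reaches the threshold."""
    return total_score(hits, scores) >= required


def make_scores(bayes_99):
    """Build a score table; only the BAYES_99 value comes from the thread.

    OTHER_RULE and LIGHT_WHITELIST are hypothetical rules for illustration.
    """
    return {"BAYES_99": bayes_99, "OTHER_RULE": 0.1, "LIGHT_WHITELIST": -0.5}


if __name__ == "__main__":
    cases = (
        ["BAYES_99"],
        ["BAYES_99", "OTHER_RULE"],
        ["BAYES_99", "LIGHT_WHITELIST"],
    )
    for bayes_99 in (4.95, 5.0):
        scores = make_scores(bayes_99)
        for hits in cases:
            print(bayes_99, hits, is_spam(hits, scores))
```

With 4.95, BAYES_99 alone stays just under the threshold and any other
positive hit tips it over. With 5, BAYES_99 alone is enough unless a
negative-scoring rule pulls the total back down.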

Anyway, making that ONE change made the already good results I was getting
with SARE and BAYES combined quite a bit better. Missed spam went down
by almost a factor of 10 and tagged ham went up by about 1 in 10,000 or
less. (I can't remember the last time I got a ham marked as spam on
the sole basis of BAYES_99 with a score of 5 that I had to fetch out of
the spam folder.) I take this as a proof of concept that penalizing a
rule for being too good is ridiculous on its face, statistical theories
notwithstanding. I maintain this is a positive indication that either
the criteria, the chosen statistical approach, or both are wrong.

It might be entertaining to set up stock BAYES on your system, Chris,
with all BAYES scores being very very low, 0.01 or something. Then run
the SARE version of sa_stats.pl to see what the goodness of each
BAYES level really is. From that you can guesstimate some scores that
might improve your system. I'd be really interested to see how
autolearn BAYES really performs when it's used in your sort
of environment. I know for my environment it's silly to use it due to
the automated mis-learning on marginal messages. (Either it learns
wrong or not at all on the most critical portion of the email load,
the marginal messages.)
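
The kind of per-rule "goodness" figure meant here can be sketched in a few
lines of Python. This is not the SARE sa_stats.pl script and does not read
real logs. It assumes you already have, for each message, a spam/ham
verdict you trust plus the list of rules that hit. The input format and the
rule name URIBL_X are hypothetical.

```python
from collections import Counter


def rule_hit_rates(messages):
    """Return {rule: (percent_of_spam_hit, percent_of_ham_hit)}.

    `messages` is an iterable of (is_spam, rule_names) pairs, where
    is_spam is a bool and rule_names lists the rules that fired.
    """
    spam_total = ham_total = 0
    spam_hits, ham_hits = Counter(), Counter()
    for is_spam, rules in messages:
        unique = set(rules)  # count each rule once per message
        if is_spam:
            spam_total += 1
            spam_hits.update(unique)
        else:
            ham_total += 1
            ham_hits.update(unique)

    rates = {}
    for rule in set(spam_hits) | set(ham_hits):
        pct_spam = 100.0 * spam_hits[rule] / spam_total if spam_total else 0.0
        pct_ham = 100.0 * ham_hits[rule] / ham_total if ham_total else 0.0
        rates[rule] = (pct_spam, pct_ham)
    return rates


if __name__ == "__main__":
    sample = [
        (True, ["BAYES_99", "URIBL_X"]),
        (True, ["BAYES_99"]),
        (True, []),
        (True, ["BAYES_99"]),
        (False, ["BAYES_00"]),
        (False, ["BAYES_99"]),
    ]
    for rule, (pct_spam, pct_ham) in sorted(rule_hit_rates(sample).items()):
        print(f"{rule:10s} {pct_spam:6.2f}% of spam {pct_ham:6.2f}% of ham")
```

A rule that hits a large share of spam and close to no ham is a candidate
for a higher score. The sample data is invented purely to exercise the
function.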

{^_^}   Joanne steps down off her soapbox yet again.



Re[2]: OT Humor: was rules better than bayes?

2006-01-10 Thread Robert Menschel
Hello William,

Tuesday, January 10, 2006, 11:37:35 AM, you wrote:

 But the 4 letters that matter with Bayes are:
YMMV

WS Isn't that an OTCBB Ticker symbol?  I heard they're about to go 
WS through the _roof_!!

Your Mileage May Vary, Inc.  I hear they're cornering the market on
automotive fuel saving devices and technologies.

Today's stock prices ranged from 2.159 to 2.259, or 2.359 for
preferred stock.








Re: rules better than bayes?

2006-01-10 Thread jdow

From: Jim Maul [EMAIL PROTECTED]


Chris Santerre wrote:


  -Original Message-
  From: jo3 [mailto:[EMAIL PROTECTED]
  Sent: Monday, January 09, 2006 2:28 PM
  To: users@spamassassin.apache.org
  Subject: rules better than bayes?
 
 
  Hi,
 
  This is an observation, please take it in the spirit in which it is
  intended, it is not meant to be flame bait.
 
  After using spamassassin for six solid months, it seems to me that the
  bayes process (sa-learn [--spam | --ham]) has only very limited success
  in learning about new spam.  Regardless of how many spams and hams are
  submitted, the effectiveness never goes above the default level which,
  in our case here, is somewhere around 2 out of 3 spams correctly
  identified.  By the same token, after adding the third party rule,
  airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
  identified.

I have long said that IMHO, I do not think bayes is worth it. Left unattended, 
it isn't as good. A simple rule can take out a lot of spam. Some may say rule 
writing is more complicated than training bayes. Maybe. Not so much the rule 
writing, but the figuring out what to look for and testing for FPs.


I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE 
rules ;)  I get about 98% of spam caught, and few FPs.


This is going to sound like tooting our own horn, but so be it. Before SARE, 
Bayes was cool. After SARE, I see no need.





I always feel I have to point out the flip side to this just to offer 
another opinion.  While I certainly don't have a NEED for bayes at our 
facility, I do run it, complete with autolearn.  We have very low volume 
(5k msgs/day) but it works so well I rarely ever have to think about it. 
For us, 96% of the time bayes alone is enough to say whether a 
message is ham/spam.  Add all the other tests on top of this (uribl, 
razor, a few sare) and there's easily a 20 point difference between ham 
and spam.


Jim, can you back that up with a run of the SARE version of sa_stats.pl?
I'd love to see your record with that setup for the highest and lowest
ranking BAYES scores.

{^_^}



Re: rules better than bayes?

2006-01-10 Thread jdow

From: Matt Kettler [EMAIL PROTECTED]


At 10:50 AM 1/10/2006, Chris Santerre wrote:


I have long said that IMHO, I do not think bayes is worth it. Left 
unattended, it isn't as good. A simple rule can take out a lot of spam. 
Some may say rule writing is more complicated than training bayes. Maybe. 
Not so much the rule writing, but the figuring out what to look for and 
testing for FPs.



Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.


It's number 1 here for scoring spam: 83.22% for 0.05% of ham, and I can't
remember the last ham scoring on 99 that hit the spam folder. 99 has
a score of 5 here because it does, all alone, tag spam that no other
rule hits. XBL is the best BL here at the moment, 55.50% for 0.04% of
hits on ham.

I rarely train manually, except at initial setup where I feed it a good 
base learning. (the autolearner can sometimes go awry if you don't train 
some mail manually before letting it go.)


I manually learn, particularly on spam not marked as spam that has a
low BAYES score and some meat in it. (I don't bother with content-free
spam. Those very quickly score higher due to BL hits that pop up
like magic.)

{^_^}



Re: SA 3.10 skipping some emails or errors in log??

2006-01-10 Thread George R . Kasica
On Tue, 10 Jan 2006 20:56:48 -0500, you wrote:

On 10/01/2006 8:17 PM, George R. Kasica wrote:
On Tue, 10 Jan 2006 18:58:37 -0500, you wrote:

If you can get a strace -ftp PID of the parent spamd process while 
this happens (along with a matching debug log) and *attach* it to the 
bug, I'm sure Justin would take a look at it.

I haven't been able to reproduce it myself, so I haven't looked at it 
further.
 
 Daryl:
 
 Not a programmer here, but with a little direction I think I can get
 the info.
 
 I'm assuming the following here:
 
 strace -ftp PID where PID is the PID of the parent spamd process
 correct?

Yeah PID is the process ID of the parent spamd process.  Also, you can 
redirect the output to a file with normal redirection, or just specify 
an output file with the -o option, ala:

strace -ftp PID -o /path/to/output/file


 As to debug log, how would I go about that? Is it the info I provided
 earlier just doing it over again to match with strace output?

Yeah.  You might want to add -Dprefork as one of the options to your 
spamd call though.

It's running now. I will hopefully have some items to upload soon.

George
George, MR. Tibbs, Nazarene, Ginger/The Beast Kasica(8/1/88-3/19/01, 1/17/02-)
Jackson, WI USA
[EMAIL PROTECTED]
http://www.netwrx1.com/georgek
ICQ #12862186

(`-''-/).___..--''`-._
`6_ 6  )   `-.  ( ).`-.__.`)
(_Y_.)'  ._   )  `._ `. ``-..-'
_..`--'_..-_/  /--'_.' ,'
(il),-''  (li),'  ((!.-'