[SAtalk] Philosophical SA questions

Darren Madams Mon, 22 Sep 2003 17:11:01 -0700


OK, I need some help, and sorry in advance for the long email.  I tried SA about a
year ago and wasn't overly impressed.  I ended up going with SpamBouncer, which worked
reasonably well but quickly got out of date and had no facility for easy updates
(other than from the author, who appears to be a single, very busy person).  I
switched back to SA on the 11th of this month when we migrated our mail servers over
to Debian on Sparc hardware.  I've been relatively impressed, but the results haven't
been what I would consider great:




256 Ham, 1040 Probably Spam (>5 points), 256 Almost Certainly Spam (>15 points), and 
269 false negatives, 0 false positives.  Bayes was trained with 16680 Spam, 4092 Ham, 
125776 tokens.  I have auto-learning enabled, and feed all the false negatives back 
into sa-learn the same day...
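
For reference, the false-negative rate implied by these counts can be worked out as
below.  This is only a sketch: the figures above don't say whether the 256 messages
over 15 points are included in the 1040 over 5 points, so both readings are computed,
and the function name is purely illustrative.

```python
def false_negative_rate(caught_spam: int, missed_spam: int) -> float:
    """Fraction of all spam that was delivered without being tagged."""
    return missed_spam / (caught_spam + missed_spam)

# Counts quoted above.
probably_spam = 1040     # scored > 5 points
certainly_spam = 256     # scored > 15 points
false_negatives = 269

# Reading A (assumption): the >15 bucket is a subset of the >5 bucket.
rate_subset = false_negative_rate(probably_spam, false_negatives)

# Reading B (assumption): the two buckets are disjoint.
rate_disjoint = false_negative_rate(probably_spam + certainly_spam, false_negatives)

print(f"subset reading:   {rate_subset:.1%}")
print(f"disjoint reading: {rate_disjoint:.1%}")
```

Under either reading, roughly one spam message in five or six is getting through.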



Philosophical question #1:  Am I expecting too much if I'm disappointed by this many
false negatives?  I'm [obviously] nowhere near the numbers you guys are quoting.  A
lot of my ham doesn't have an X-Spam-Status header at all, for some unknown reason.
Should every non-spam message have one?  I initially thought I had a configuration
problem, but other mail was being processed (and tagged good or bad), and the problem
seems to have died down with Bayes training.



Philosophical question #1.5:  Are the network tests (RAZOR, etc.) essentially
required?  I haven't installed them yet (I was worried about the processor and network
impact), but could do so if they would make my results much better.



Philosophical question #2:  I feel I could do much better by raising the scores of
some rules that most of my spam hits but my ham never does (I've already made
MIME_HTML_ONLY 3 points).  Should I start there, or just lower my overall spam
threshold?  Has anyone already put together a "more aggressive" prefs file, especially
one that is tough on HTML mail, so that I don't have to start from scratch?
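
To make the trade-off in this question concrete, here is a minimal sketch of the two
options.  It assumes a message's score is simply the sum of the scores of the rules it
hits and that a message is tagged at or above the threshold.  Apart from
MIME_HTML_ONLY and its 3-point override, every rule name and score here is made up for
illustration.

```python
# Rule scores are assumptions for illustration, except MIME_HTML_ONLY = 3,
# which is the override mentioned above. The other rule names are invented.
DEFAULT_SCORES = {
    "MIME_HTML_ONLY": 1.0,        # hypothetical starting score
    "HYPOTHETICAL_RULE_A": 2.5,
    "HYPOTHETICAL_RULE_B": 1.2,
}

def total_score(hits, scores):
    """Sum the scores of every rule a message hit."""
    return sum(scores[name] for name in hits)

def is_spam(hits, scores, threshold=5.0):
    """Tag as spam at or above the threshold (comparison is an assumption)."""
    return total_score(hits, scores) >= threshold

spam_hits = ["MIME_HTML_ONLY", "HYPOTHETICAL_RULE_A"]       # HTML-only spam
ham_hits = ["HYPOTHETICAL_RULE_A", "HYPOTHETICAL_RULE_B"]   # plain-text ham

# Option 1: raise one rule that only the spam hits.
tuned = dict(DEFAULT_SCORES, MIME_HTML_ONLY=3.0)
print(is_spam(spam_hits, tuned), is_spam(ham_hits, tuned))

# Option 2: keep the scores and lower the global threshold instead.
print(is_spam(spam_hits, DEFAULT_SCORES, threshold=3.5),
      is_spam(ham_hits, DEFAULT_SCORES, threshold=3.5))
```

With these invented numbers, raising the one rule catches the spam and leaves the ham
alone, while lowering the threshold catches both.  Whether real mail behaves that way
depends entirely on which rules the ham actually hits.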



Philosophical question #2.5:  How are the default scores chosen?  I thought I read
that they are determined mathematically, based on how often each rule hits in the test
spam corpus.  Is that true?  If so, why is my corpus so different?



Philosophical question #3:  One of the things I liked about SpamBouncer was that you
could feed it your legitimate email addresses and mailing list addresses, and it would
factor whether a message was sent TO one of them (present or missing) into the overall
score.  I don't think SA offers anything like that... it's not whitelisting (since
that's based on From:), and it fails on BCCs (hence the need for positive weighting of
other factors)... but wouldn't it be nice to have?  Has anyone written a rule like
that?  Any suggestions?  I'm not sure how highly to score it.



Philosophical question #4:  Should I convert purely to Bayes-type filters?  I can't
believe it's worth throwing out some of the basic SA heuristics, but the Bayes scores
coming from SA have been pretty accurate.  To start with, has anybody already written
a prefs file that weights Bayes more heavily than the default?  Alternatively, can
somebody explain the differences between the values in the DEFAULT SCORES (local, net,
with bayes, with bayes+net) column on the tests page?



Philosophical question #5:  Should I try to bring my Bayes ham-to-spam ratio closer to
even, as many suggest?  If so, why exactly?  It seems a waste to throw out spam, since
more of it can only further establish the frequency of spammy tokens and the absence
of hammy ones... maybe I'm missing the math behind it?
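
As a point of reference for the math in question, here is a minimal sketch of a
simplified, Graham-style per-token estimate.  It is not SpamAssassin's actual
implementation, and it is not meant to settle the question.  The corpus sizes are the
ones quoted earlier; the token counts (40 sightings in spam, 10 in ham) are made up
for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token estimate: compare how often the token appears
    per spam message with how often it appears per ham message.
    Assumes the token was seen at least once and both corpora are non-empty."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)

# Corpus sizes from the training figures quoted earlier in this message.
N_SPAM, N_HAM = 16680, 4092

# Hypothetical token: seen in 40 spam messages and 10 ham messages.
token_in_spam, token_in_ham = 40, 10

raw = token_in_spam / (token_in_spam + token_in_ham)
normalized = token_spam_probability(token_in_spam, token_in_ham, N_SPAM, N_HAM)

print(f"raw share of sightings in spam: {raw:.2f}")
print(f"normalized estimate:            {normalized:.2f}")
```

In this simplified model each count is divided by its own corpus size, so a token that
looks spammy on raw counts can come out close to neutral once the roughly 4:1 corpus
ratio is taken into account.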



Philosophical question #6:  Why autolearn only on the messages that are certainly
spam?  Most of them already score high on Bayes, so why not train on the borderline
ones, where Bayes could push them over the edge?  I get a lot of 3.9s and 4.2s with
little or no contribution to the score from Bayes.
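
The behaviour being asked about can be sketched as a pair of thresholds with a gap in
between.  This is a minimal illustration, not SpamAssassin's code: the function name is
invented, the 15-point spam threshold is borrowed from the "almost certainly spam"
figure above, and the 0-point ham threshold is a placeholder.

```python
from typing import Optional

def autolearn_decision(score: float,
                       ham_threshold: float = 0.0,    # hypothetical value
                       spam_threshold: float = 15.0   # assumed from the ">15" bucket
                       ) -> Optional[str]:
    """Return 'spam' or 'ham' when the score is decisive, else None.
    The inclusive comparisons are an assumption of this sketch."""
    if score >= spam_threshold:
        return "spam"
    if score <= ham_threshold:
        return "ham"
    return None  # borderline: nothing is learned automatically

# 3.9 and 4.2 are the borderline scores mentioned above; the rest are examples.
for score in (-1.5, 3.9, 4.2, 18.0):
    print(score, autolearn_decision(score))
```

With these thresholds the 3.9s and 4.2s land in the gap and are never fed to the
learner, which is exactly the behaviour the question is about.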



Thanks in advance!  And I in no way mean this as a negative statement on the work
everyone has done on SA so far.  I have nothing but respect for the code that's there!
I just want to make it work the best way possible for me.



  --Darren



_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk