Hi Darren,  

> 256 Ham, 1040 Probably Spam (>5 points), 256 Almost Certainly 
> Spam (>15 points), and 269 false negatives, 0 false 
> positives.  Bayes was trained with 16680 Spam, 4092 Ham, 
> 125776 tokens.  I have auto-learning enabled, and feed all 
> the false negatives back into sa-learn the same day...

What version of SA are you using?
I can't imagine any reason for this, other than your Bayesian database
being tainted.  Did you hand-confirm those 21,000 emails?  Honestly,
you'd do much, much better than that with just 200 of each.  There's
something very wrong with those numbers that can't be accounted for in
normal operation.  Spot-check the headers of the false negatives and
see whether the BAYES_xx result is wrong.  It should hardly ever be
wrong.

> 
> 
> Philosophical question #1:  Am I expecting too much to be 
> disappointed with so many false negatives? 

Personally, I'd rip it out and rebuild with numbers like that.

> Philosophical question #1.5:  Are the network tests (RAZOR, 
> etc.) essentially required? 

No.  They're nice to have, but once I was satisfied that Bayes was
trained, I disabled them with very slight impact.  If you disable
them, you'll be using another score set that compensates.
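
If it helps, the local.cf knobs for turning the network tests off look
roughly like this (option names can vary a bit between SA versions, so
check the Mail::SpamAssassin::Conf man page for yours):

    # disable the network-based tests
    skip_rbl_checks 1
    use_razor2      0
    use_pyzor       0
    use_dcc         0

Running spamassassin with -L (local tests only) gets you much the same
effect in one shot.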

> 
> 
> Philosophical question #2:  I feel I could do much better 
> tweaking some of the rules (already made MIME_HTML_ONLY 3 
> points) that most of my spam hits that never are in my ham, 
> but should I start there or just lower my overall spam 
> threshold?  Has anyone already done a "more aggressive" prefs 
> file, especially anti-HTML mail so that I don't have to start 
> from scratch?

You may want to check out the rules sites:
http://www.merchantsoverseas.com/wwwroot/gorilla/sa_rules.htm
http://www.exit0.us/
Personally, I try not to touch the rules - I like to rely on Bayes if I
can.  However, I *really* like the ROT13, etc. rules.  And when I see a
domain repeatedly spamming me, I throw a blacklist_from *domain.com into
my .cf file just in case they learn how to sneak through.
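
For example, the sort of thing I mean in local.cf (the scores are just
what works for me, tune to taste):

    # bump a rule that hits almost nothing but spam here
    score MIME_HTML_ONLY 3.0
    # repeat offenders; matches any From: address ending in domain.com
    blacklist_from *domain.com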

> 
> Philosophical question #2.5:  How are the default scores 
> chosen?  I thought I read they were determined mathematically 
> based on their frequency in the test spam corpus?  Is that 
> true?  If so, why is my corpus so different?

It's my understanding that the rules are run against a large corpus of
ham/spam, and the default scores are derived from how effective each
rule proves to be in that run.
 
> 
> Philosophical question #3:  One of the things I liked about 
> SpamBouncer was feeding it your legitimate email addresses 
> and mailing list addresses and then it would consider items 
> sent TO those (missing or specifically there) in the overall 
> scoring.  I don't think SA offers anything like that... it's 
> not whitelisting (since that's From:), and it fails on BCCs 
> (hence the need for positive weighting of other factors)... 
> would be nice to have?  Anyone written a rule like that?  Any 
> suggestions?  I'm not sure how highly to score it.
> 
There are various levels of this kind of recipient whitelisting that
you can add to your prefs file.
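
If your version has them, the recipient-based options are the closest
thing to what SpamBouncer does, in increasing order of strength (the
addresses here are just placeholders):

    # mild score bonus when this address shows up in To:/Cc:
    whitelist_to  you@example.com
    # stronger bonus, e.g. for list traffic
    more_spam_to  some-list@example.com
    # effectively never tag mail to this address as spam
    all_spam_to   postmaster@example.com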

> 
> 
> Philosophical question #4:  Should I convert purely to 
> bayes-type filters?  I can't believe it's worth throwing out 
> some of the basic SA heuristics, but the Bayes scores coming 
> from SA have been pretty accurate.  To start with, has 
> anybody already written a prefs file favoring bayes heavier 
> than default?  Alternatively, can somebody explain to me the 
> differences in the DEFAULT SCORES (local, net, with bayes, 
> with bayes+net) column on the tests page?
> 
I've considered it, but I like the ability to help it along with
alternative heuristics.  Spammers are becoming very interested in Bayes
poisoning lately.
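
If you do want to lean harder on Bayes without going all the way, you
can simply re-score the BAYES_xx rules in your prefs.  Something like
this (BAYES_99 exists across versions; do the same for whatever other
buckets your install defines):

    # weight the top Bayes bucket more heavily than the default
    score BAYES_99 5.0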


> Philosophical question #5:  Should I try to get my bayes ham 
> vs. spam ratio closer as many suggest?  If so, why exactly?  
> It seems a waste to throw out spam since it can only further 
> prove the frequency of spam tokens and lack of hammy ones... 
> maybe I'm missing the math behind it?
> 
I'm interested in a definitive answer to this question also.  Experience
tells me no, but lack of analysis says I could very well be wrong for
the 1 billionth time this month.  

> 
> Philosophical question #6:  Why autolearn only on the 
> certainly spam?  Most of them already score high on Bayes, 
> why not train on the borderlines where bayes could push it 
> over the edge? I get a lot of 3.9s and 4.2s with no (or 
> little) affecting score from bayes.

To guard against mistakes, which would be a big problem, and to give
you a chance to manually train the borderline stuff.  I'm rather
certain your Bayes is trashed, Darren.
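
That said, if you decide you trust it, the auto-learn cutoffs are
configurable.  These are the 3.x names and defaults (older releases
spell them slightly differently):

    bayes_auto_learn 1
    # only learn as spam well above the tagging threshold...
    bayes_auto_learn_threshold_spam    12.0
    # ...and only learn as ham when the score is essentially zero
    bayes_auto_learn_threshold_nonspam 0.1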


-tom

