Re: Bayes Autolearn Threshold - different scoring?
Kris, thanks for your help and insight.

From what I can see, the settings are in PerMsgStatus.pm, line 308/309
(my version of course).

my $required_body_points = 3;
my $required_head_points = 3;

I'll try changing those around, and update my status to this list in a
while.

Again, thanks!

-g

> [EMAIL PROTECTED] wrote:
>> I'm sure that's the problem. Here's a different sample spam, minus
>> the bayes score (which isn't counted on the autolearn body tests,
>> correct?)
>
> Correct. But keep in mind that the autolearn process actually uses
> different scores.
>
>> 2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but
>> should
>
> From scoreset 3 (2.178); autolearn will use set 1 (score: 0.618)
>
>> 3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received:
>> date
>
> Set 1 score is 2.329.
>
>> 1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
>> HELO
>
> Set 1 score is 1.531.
>
>> 2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match
>> 'Received' headers
>
> Set 1 score is 2.174.
>
> All together, that's well over the minimum 3 points from headers... but
> no body score.
>
>> No body hits there... So basically, I'm getting what I want from the
>> headers, and from what bayes already knows. How do I tweak the
>> thresholds that the autolearner uses, for example, either setting the
>> body threshold to 0 or eliminating that check entirely?
>
> Hack the code. There's no option I've heard of, and nothing noted in
> the man page IIRC to allow that.
>
>> I realize this might produce
>> unwanted results, so I'd probably give it a week or so initial
>> experiment.
>
> I don't know how the current setup was decided on, but I'd imagine that
> other methods have been tried - for general use, the 3+3 minimum in the
> distributed SA is probably ideal. For some specific mail streams
> (yours, perhaps?) this may not be optimal and may need to be tweaked.
>
> -kgd
> --
> Get your mouse off of there! You don't know where that email has been!
Re: Bayes Autolearn Threshold - different scoring?
[EMAIL PROTECTED] wrote:
> I'm sure that's the problem. Here's a different sample spam, minus
> the bayes score (which isn't counted on the autolearn body tests,
> correct?)

Correct. But keep in mind that the autolearn process actually uses
different scores.

> 2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but
> should

From scoreset 3 (2.178); autolearn will use set 1 (score: 0.618)

> 3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received:
> date

Set 1 score is 2.329.

> 1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
> HELO

Set 1 score is 1.531.

> 2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match
> 'Received' headers

Set 1 score is 2.174.

All together, that's well over the minimum 3 points from headers... but
no body score.

> No body hits there... So basically, I'm getting what I want from the
> headers, and from what bayes already knows. How do I tweak the
> thresholds that the autolearner uses, for example, either setting the
> body threshold to 0 or eliminating that check entirely?

Hack the code. There's no option I've heard of, and nothing noted in
the man page IIRC to allow that.

> I realize this might produce
> unwanted results, so I'd probably give it a week or so initial
> experiment.

I don't know how the current setup was decided on, but I'd imagine that
other methods have been tried - for general use, the 3+3 minimum in the
distributed SA is probably ideal. For some specific mail streams
(yours, perhaps?) this may not be optimal and may need to be tweaked.

-kgd
--
Get your mouse off of there! You don't know where that email has been!
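The arithmetic in the reply above can be checked with a small Python sketch. It is only an illustration of the rule as described in this thread (at least 3 header points and at least 3 body points, on top of reaching the autolearn spam threshold); it is not SpamAssassin's code. The function name, the dictionaries, and the order of the checks are assumptions made for the example. The scores are the set 1 values quoted above, and the threshold of 1 is the value shown in the poster's debug output.

```python
# Hypothetical sketch of the spam-side autolearn check described in the thread.
# This is NOT SpamAssassin code; names and structure are illustrative only.

REQUIRED_BODY_POINTS = 3  # minimum reported for PerMsgStatus.pm in the thread
REQUIRED_HEAD_POINTS = 3

# Set 1 scores quoted in the thread; all four hits are header rules.
HEAD_HITS = {
    "RCVD_HELO_IP_MISMATCH": 0.618,
    "DATE_IN_FUTURE_12_24": 2.329,
    "RCVD_NUMERIC_HELO": 1.531,
    "FORGED_YAHOO_RCVD": 2.174,
}
BODY_HITS = {}  # the sample message hit no body rules


def autolearn_as_spam(head_hits, body_hits, spam_threshold):
    """Return (decision, reason) for the simplified 3+3 rule."""
    head_points = sum(head_hits.values())
    body_points = sum(body_hits.values())
    if head_points + body_points < spam_threshold:
        return False, "score below autolearn spam threshold"
    if body_points < REQUIRED_BODY_POINTS:
        return False, "too few body points (%g < %g)" % (
            body_points, REQUIRED_BODY_POINTS)
    if head_points < REQUIRED_HEAD_POINTS:
        return False, "too few head points (%g < %g)" % (
            head_points, REQUIRED_HEAD_POINTS)
    return True, "autolearn as spam"


if __name__ == "__main__":
    # Header total is about 6.652, well over 3, but the body total is 0.
    print(round(sum(HEAD_HITS.values()), 3))
    print(autolearn_as_spam(HEAD_HITS, BODY_HITS, spam_threshold=1))
```

With these numbers the header total clears the minimum easily, yet the message is still refused because the body total is 0, which matches the "too few body points (0 < 3)" debug line.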
Re: Bayes Autolearn Threshold - different scoring?
> As your only email access?

Pretty much, yes.

> Try several thousand, as a number of customers have reported to
> me...

Oh, I've been there - I'm just trying to avoid going there again. :)

> Mmm. Dangerous - I've seen FPs get autolearned as spam once or twice.
> :(

I realize that. With my system on my spam the way it is now, my spam
threshold is set to one. I have not seen a FP >=3.0 in several months.
So, I know there's a risk.

> What I do on my accounts is set up a "big-spam" folder, and rely on the
> X-Spam-Level header to move mail there. Anything scoring 15 or higher
> gets 15 or more stars in X-Spam-Level, and I have this:
>
> :0:
> * ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> /home/kdeugau/mail/bigspam
>
> before the check that files spam in my "main" spam folder.
>
> With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
> ends up in the "big-spam" folder.

If I did that with a threshold of 3.0 on my system, 84% of the total
'spams' I've gotten in the last week would have ended up in the big-spam
folder, with no FPs.

> [snip]
>> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
>> learned-points=1.886
>> debug: auto-learn? no: scored as spam but too few body points (0 < 3)
>
> These two entries are the critical ones; note the body-points and
> head-points. To be autolearned as spam, a message must hit tests worth
> a total of 3 points or more on header tests, and a total of 3 points or
> more on body tests.

I'm sure that's the problem. Here's a different sample spam, minus the
bayes score (which isn't counted on the autolearn body tests, correct?)

2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but should
3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received: date
1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for HELO
2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match 'Received' headers

No body hits there... So basically, I'm getting what I want from the
headers, and from what bayes already knows. How do I tweak the
thresholds that the autolearner uses, for example, either setting the
body threshold to 0 or eliminating that check entirely? I realize this
might produce unwanted results, so I'd probably give it a week or so
initial experiment.

> I notice you're still using the default autolearn-as-ham setting; this
> is dangerous as very low-scoring spam can get autolearned incorrectly.
> I've dropped it to -0.01 on my systems to prevent this.

That's a good tip, I'll implement that. Thanks!
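To make the star-count idea from the quoted recipe concrete, here is a rough Python equivalent of its condition. It is a sketch for trying out different cutoffs (15 stars as in the recipe, or 3 as suggested in the reply above), not a replacement for the procmail rule; the function name and the `min_stars` parameter are made up for the example.

```python
import re


def is_big_spam(header_line, min_stars=15):
    """Mimic the recipe's condition: 'X-Spam-Level:', any one character,
    then at least min_stars asterisks. Like the recipe, the pattern is not
    anchored at the end, so extra stars still match."""
    pattern = r"X-Spam-Level:.\*{%d}" % min_stars
    # procmail conditions are case-insensitive by default, hence IGNORECASE.
    return re.match(pattern, header_line, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(is_big_spam("X-Spam-Level: " + "*" * 15))         # 15 stars
    print(is_big_spam("X-Spam-Level: " + "*" * 14))         # 14 stars
    print(is_big_spam("X-Spam-Level: ***", min_stars=3))    # lower cutoff
```

Lowering `min_stars` corresponds to shortening the run of `\*` in the recipe; each star stands for one whole point of score.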
Re: Bayes Autolearn Threshold - different scoring?
[EMAIL PROTECTED] wrote:
> My problem is this: I'm using squirrelmail,

As your only email access?

> and to keep an eye on false negatives (I define those as real mails
> that get shuttled to spam, just to keep things clear) I have a 'spam'
> folder. As anyone that uses sqmail knows, it gets very slow when any
> folder contains more than a few hundred messages.

Try several thousand, as a number of customers have reported to me...
Actually, it's only spewed out error messages in a very few cases.

> But, since my
> filter is trained very well, I'd like to send autolearned spams to
> /mail/Trash (ultimately to /dev/null) so I don't have to deal with
> those.

Mmm. Dangerous - I've seen FPs get autolearned as spam once or twice.
:(

What I do on my accounts is set up a "big-spam" folder, and rely on the
X-Spam-Level header to move mail there. Anything scoring 15 or higher
gets 15 or more stars in X-Spam-Level, and I have this:

:0:
* ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
/home/kdeugau/mail/bigspam

before the check that files spam in my "main" spam folder.

With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
ends up in the "big-spam" folder.

> I figured just setting bayes_auto_learn_threshold_spam 6 would
> work great. It really does not do much of anything. I've decreased
> it to 3, and to 1, but it really doesn't make a difference. I found
> these relevant lines in a debug:
[snip]
> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
> learned-points=1.886
> debug: auto-learn? no: scored as spam but too few body points (0 < 3)

These two entries are the critical ones; note the body-points and
head-points. To be autolearned as spam, a message must hit tests worth
a total of 3 points or more on header tests, and a total of 3 points or
more on body tests.

I notice you're still using the default autolearn-as-ham setting; this
is dangerous as very low-scoring spam can get autolearned incorrectly.
I've dropped it to -0.01 on my systems to prevent this.

> What, exactly, is going on here? The head points I can explain (this
> is a spam I saved that had already come to me) but the body points -
> I don't understand. It also wasn't clear to me until this debug that
> the autolearn had its own scoring system.

Not entirely; to decide whether to autolearn a message, one of the
"no-Bayes" score sets is used to calculate the scores, depending on
whether you've got network tests disabled or not.

-kgd
--
Get your mouse off of there! You don't know where that email has been!
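For anyone reading such debug output in bulk, the first of the two lines can be pulled apart with a few lines of Python. This is a sketch only: the regular expression and helper names are assumptions, and the field layout is taken from the single debug line quoted above, so other SpamAssassin versions may print it differently. The minimum of 3 is the value discussed in this thread.

```python
import re

# The debug line quoted in the thread, joined back onto one line.
DEBUG_LINE = ("debug: auto-learn? ham=0.1, spam=1, body-points=0, "
              "head-points=-2.82, learned-points=1.886")

MIN_POINTS = 3  # the 3+3 minimum for header and body points


def parse_autolearn_debug(line):
    """Extract the name=value pairs from the debug line as floats."""
    pairs = re.findall(r"([a-z-]+)=(-?\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


def points_short(fields):
    """List which of body-points / head-points are under the minimum."""
    return [name for name in ("body-points", "head-points")
            if fields[name] < MIN_POINTS]


if __name__ == "__main__":
    fields = parse_autolearn_debug(DEBUG_LINE)
    print(fields)
    print(points_short(fields))
```

For the quoted message both totals fall short (0 body points and -2.82 head points); the debug output names only the body points, which suggests that check is reported first.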