On Tue, 2009-08-25 at 22:13 -0400, Alex wrote:
> > If you're using autolearning, what are your learning thresholds?
>
> What do you recommend for thresholds? I'm considering using
> autolearning, but very concerned about corrupting the database. I
> think I would use something like +15 for spam.

I generally recommend the defaults, unless you *do* know you need
something else. That's why they are defaults.

The defaults are <= 0.1 for ham and >= 12.0 for spam. Keep in mind these
scores are calculated using a non-Bayes score set, so they generally
differ from the overall score of the message. Also, the auto-learn score
leaves out the scores of certain rules, such as Bayes and AWL. Plus some
more esoteric constraints. See the docs. [1]

> There are FNs on occasion in the 2.x range with low bayes numbers (or
> BAYES_50) that I wouldn't want to be tagged as ham. Should that be a
> concern?

No. Bayes auto-learning is *not* self-feeding. Any overall score of
about 2 (with Bayes) is *very* unlikely to cross either threshold when
using the respective non-Bayes score set.

Moreover, your concern is with a Bayes probability <= 50%, and thus a
negative score for the BAYES hit. This hit is not considered for
auto-learning, though, so as a first rule of thumb, subtract that
negative score again -- which yields a slightly higher score. That is
still nowhere near either threshold.

> Even mail that has been whitelisted could also contain spam, so would
> a ham threshold of like -100 work, or present the same problem?

60_whitelist.cf:
  tflags USER_IN_WHITELIST userconf nice noautolearn

Again, as per the docs [1], whitelisting is not considered when deciding
whether to auto-learn.

  guenther

[1] http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
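
P.S. For reference, a minimal local.cf sketch that spells out the default
thresholds quoted above. The option names are those documented for the
AutoLearnThreshold plugin [1]. Putting them in local.cf is an assumption
about the setup, and since these are the defaults, the lines only make
the behaviour explicit.

```
# Enable Bayes auto-learning (already on by default).
bayes_auto_learn 1

# Auto-learn score threshold for learning a message as ham (default).
bayes_auto_learn_threshold_nonspam 0.1

# Auto-learn score threshold for learning a message as spam (default).
# A stricter value such as 15.0 would match the "+15" idea from the
# question, at the cost of learning from fewer messages.
bayes_auto_learn_threshold_spam 12.0
```

Running "spamassassin --lint" after editing should catch typos in the
option names.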