On Thu, 2014-01-09 at 20:14 -0700, Amir 'CG' Caspi wrote:
> On Thu, January 9, 2014 6:20 pm, Karsten Bräckelmann wrote:
> > Even the most effective results I have ever seen on a non-personal
> > attack is merely getting the Bayes classification to a neutral. And that
> > was not a "regular" text token, but includes mail headers. And a biased
> > Bayes database towards some specific mail headers that spam run happened
> > to use...
> 
> So, I unfortunately still see the occasional FN slipping through my
> filters with bayes_00... which means either these spams are magically

Wait. I do see *occasional* FNs with Bayes below 0.5, too. Those are not
related to any attempt to circumvent Bayes, though, but can generally be
described as "seriously low on text, offering $funds to charity and
recipient". At least here. With the total amount of text being less than
this paragraph.

In other words, "dead husband, suffer cancer, donate millions to you".
The shorter the text, the more likely it is to sneak through.
Unfortunately for the scumbags, though, the shorter it gets, the less
likely it is to be understood. Or fallen for. Or even recognized as
actual language.

> hitting some very hammy tokens, or I've got some major problems with my
> bayes DB.  I've been training my DB both with autolearn and with manual
> sa-learn spam classification (the latter run every week or two on my spam
> folder, which holds the last 30 days of spam), but I admit that autolearn
> has been running for probably years before I actually started to
> "properly" set up and train SA, so that may be one issue, that it
> autolearned spam as ham.

Rather unlikely, because the auto-learn thresholds include quite a few
additional constraints: minimum header and body scores, the score-set
computed without Bayes, and of course Bayes not feeding itself.
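The relevant knobs, with what should be the stock defaults (double-check
against M::SA::Conf for your version):

  bayes_auto_learn 1
  bayes_auto_learn_threshold_nonspam 0.1
  bayes_auto_learn_threshold_spam    12.0

Only mail scoring at or below the nonspam threshold is even considered
for learning as ham, and only mail at or above the spam threshold as
spam -- on top of the extra constraints mentioned above.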

>                          On the other hand, other users on my system who
> have ALSO been autolearning for years don't seem to get bayes_00 FN hits,
> just bayes_50ish (sometimes as low as 20 but that's rare), so I'm not sure
> autolearn is the problem (unless I was mistakenly autolearning a helluva
> lot more spam than they have over that time, for some reason).
> 
> I'd prefer not to dump my entire bayes DB and start over, though I can do
> that if I have to... but I'd like to try to diagnose the issue before
> burning down the house.
> 
> What's the way that I can inject the bayes-identified tokens (hammy or
> spammy) into my SA headers, so that I can try to debug what's causing this
> problem?  I'd want to do this for all emails, not just ones identified as

See the M::SA::Conf docs, section Template Tags, and the add_header conf
option. In this case, pay special attention to the (h|sp)ammy tokens
sub-section for the detailed info.

  http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html

For a first debugging insight, that might also be worth a shot as an
ad-hoc spamassassin --cf option, re-scanning a previously processed
mail.
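Off the top of my head, untested, something along these lines in your
local.cf (the header names are arbitrary, the template tags are the
documented ones):

  add_header all Spammy-Tokens _SPAMMYTOKENS(5,short)_
  add_header all Hammy-Tokens  _HAMMYTOKENS(5,short)_

Or ad-hoc against a saved message, without touching the config:

  spamassassin --cf='add_header all Spammy-Tokens _SPAMMYTOKENS(5,short)_' < message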

> ham or spam.  I've seen some people posting real-language bayes hits here
> so I'm wondering how to do that.  (I imagine there's no way to get the
> actual real-language words out of the existing bayes DB since they're
> stored as hashes, right?  That is, the actual words aren't stored, their
> hashes are?  Or is that not right?)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
