SpamAssassin false positive bayes with attachments
I have been seeing some issues with bayes detection from base64 strings within attachments causing false positives.

Example:

Oct 6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
Oct 6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
Oct 6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
Oct 6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748

Is there a solution to prevent triggering bayes from the base64 data in an attachment? It was my impression that attachments should not trigger bayes data, but it seems that it is parsing it as text rather than an attachment.

This is with SpamAssassin v3.3.

Thanks
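For anyone wanting to see at a glance which tokens push a message toward spam, the debug lines above can be pulled apart with a short script. This is only a sketch: the regular expression is inferred from the four log lines in this post, and the names `parse_bayes_tokens` and `SAMPLE_LOG` are made up for illustration, not part of SpamAssassin.

```python
import re

# Pattern inferred from the debug lines quoted in this thread, e.g.
#   Oct 6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
# Assumption: tokens do not themselves contain a single quote.
TOKEN_RE = re.compile(r"dbg: bayes: token '(?P<token>[^']*)' = (?P<prob>[0-9.]+)")

SAMPLE_LOG = """\
Oct 6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
Oct 6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
Oct 6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
Oct 6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748
"""


def parse_bayes_tokens(log_text):
    """Return (token, probability) pairs found in the debug output."""
    return [
        (m.group("token"), float(m.group("prob")))
        for m in TOKEN_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    # Print the tokens with the highest spam probability first.
    for token, prob in sorted(parse_bayes_tokens(SAMPLE_LOG), key=lambda p: -p[1]):
        print(f"{token!r:8} {prob:.4f}")
```

Short three-character tokens with mixed case and digits, like the ones here, are a hint that encoded data rather than readable text is being tokenized.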
Re: SpamAssassin false positive bayes with attachments
On October 6, 2014 3:03:30 PM jdime abuse jdimeab...@gmail.com wrote:

> I have been seeing some issues with bayes detection from base64 strings within attachments causing false positives.

Train on more data, then. Bayes needs more data to prevent this.

> Example:
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
> Oct 6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748

The above is pretty normal for how Bayes works.

> Is there a solution to prevent triggering bayes from the base64 data in an attachment? It was my impression that attachments should not trigger bayes data, but it seems that it is parsing it as text rather than an attachment.

Documentation is in:

perldoc Mail::SpamAssassin::Conf
perldoc Mail::SpamAssassin::Plugin::Bayes

If it is not documented, it is not supported.

> This is with SpamAssassin v3.3.

Meanwhile, 3.4 is now stable.
Re: SpamAssassin false positive bayes with attachments
On Mon, 2014-10-06 at 09:03 -0400, jdime abuse wrote:

> I have been seeing some issues with bayes detection from base64 strings within attachments causing false positives.
>
> Example:
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
> Oct 6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
> Oct 6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748
>
> Is there a solution to prevent triggering bayes from the base64 data in an attachment? It was my impression that attachments should not trigger bayes data, but it seems that it is parsing it as text rather than an attachment.

Bayes tokens are basically taken from rendered, textual body parts (and mail headers). Attachments are not tokenized.

Unless the message's MIME-structure is severely broken, these tokens appear somewhere other than a base64 encoded attachment. Can you provide a sample uploaded to a pastebin?

--
char *t=\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: SpamAssassin false positive bayes with attachments
On Mon, 06 Oct 2014 21:28:02 +0200 Karsten Bräckelmann guent...@rudersport.de wrote:

> Unless the message's MIME-structure is severely broken, these tokens appear somewhere other than a base64 encoded attachment.

Agreed, and a Qmail bounce message is a prime example of a message whose MIME structure is severely broken. I wonder if that's what the OP is seeing? Qmail's bounce message starts with:

Hi. This is the

and then (sometimes) includes the entire raw MIME message as a giant glob of text.

http://cr.yp.to/proto/qsbmf.txt

We have custom code specifically to detect such messages and avoid tokenizing them. :(

Regards,

David.
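To make the idea of detecting such bounces concrete, here is a rough sketch in Python. It is not the custom code David mentions, which is not shown in this thread. It only checks the one property stated above: a non-multipart body that starts with "Hi. This is the". The function name `looks_like_qsbmf_bounce` and the sample addresses are hypothetical.

```python
from email import message_from_string

# Opening words of a qmail bounce body, as quoted in the message above.
QSBMF_PREFIX = "Hi. This is the"


def looks_like_qsbmf_bounce(raw_message):
    """Heuristic: single-part message whose body opens with the qmail greeting.

    Assumption: leading blank lines before the greeting are tolerated.
    """
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        # These bounces are plain single-part messages; the original mail
        # is pasted into the body as raw text rather than attached.
        return False
    body = msg.get_payload()
    return isinstance(body, str) and body.lstrip().startswith(QSBMF_PREFIX)


if __name__ == "__main__":
    bounce = (
        "From: MAILER-DAEMON@example.org\n"
        "Subject: failure notice\n"
        "\n"
        "Hi. This is the qmail-send program at example.org.\n"
        "I'm afraid I wasn't able to deliver your message.\n"
    )
    print(looks_like_qsbmf_bounce(bounce))
```

A filter could use a check like this to skip Bayes tokenization of the pasted original message, though a production check would likely want to look at more than the first line.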
Re: SpamAssassin false positive bayes with attachments
After reading your reply, I re-examined the message and found the cause was an incorrect Content-Type:

~~~
Content-Type: text/plain; charset=windows-1250; name=pdfname.pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=pdfname.pdf
~~~

So the PDF attachment was declared as text/plain, and SpamAssassin was scanning the attachment data as text and tokenizing it.
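For anyone who wants to spot messages like this before they reach Bayes, the following is a minimal sketch using Python's standard `email` package. It flags parts that are declared as `text/*` but carry a filename that does not look like text, or whose decoded content starts with the PDF magic bytes. The extension list, the function name `mislabeled_text_parts`, and the sample message are illustrative assumptions, and this is not how SpamAssassin itself decides what to tokenize.

```python
from email import message_from_string

# Assumption: extensions we are willing to accept on a text/* part.
TEXT_EXTENSIONS = (".txt", ".csv", ".log", ".htm", ".html")

SAMPLE = (
    "From: sender@example.com\n"
    "Subject: invoice\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "See attached.\n"
    "--b1\n"
    "Content-Type: text/plain; charset=windows-1250; name=pdfname.pdf\n"
    "Content-Transfer-Encoding: base64\n"
    "Content-Disposition: attachment; filename=pdfname.pdf\n"
    "\n"
    "JVBERi0xLjQK\n"  # base64 of b"%PDF-1.4\n"
    "--b1--\n"
)


def mislabeled_text_parts(raw_message):
    """Return (filename, declared type) for text/* parts that look binary."""
    suspects = []
    for part in message_from_string(raw_message).walk():
        if part.get_content_maintype() != "text":
            continue
        filename = part.get_filename()  # from Content-Disposition or name=
        if not filename:
            continue  # ordinary inline body text
        payload = part.get_payload(decode=True) or b""  # undo base64/QP
        looks_pdf = payload.startswith(b"%PDF-")
        odd_extension = not filename.lower().endswith(TEXT_EXTENSIONS)
        if looks_pdf or odd_extension:
            suspects.append((filename, part.get_content_type()))
    return suspects


if __name__ == "__main__":
    print(mislabeled_text_parts(SAMPLE))
```

The real fix belongs with whatever software generated the message, since it is the sender's Content-Type that is wrong.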