SpamAssassin false positive bayes with attachments

2014-10-06 Thread jdime abuse
I have been seeing some issues with bayes detection from base64 strings
within attachments causing false positives.

Example:
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748

Is there a solution to prevent triggering bayes from the base64 data in an
attachment? It was my impression that attachments should not trigger bayes
data, but it seems that it is parsing it as text rather than an attachment.

This is with SpamAssassin v3.3.

Thanks


Re: SpamAssassin false positive bayes with attachments

2014-10-06 Thread Benny Pedersen

On October 6, 2014 3:03:30 PM jdime abuse jdimeab...@gmail.com wrote:


> I have been seeing some issues with bayes detection from base64 strings
> within attachments causing false positives.


Train on more data, then. Bayes needs more data to prevent this.


> Example:
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
> Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748


The above is pretty normal for how Bayes works.


> Is there a solution to prevent triggering bayes from the base64 data in an
> attachment? It was my impression that attachments should not trigger bayes
> data, but it seems that it is parsing it as text rather than an attachment.


Documentation is in:

perldoc Mail::SpamAssassin::Conf
perldoc Mail::SpamAssassin::Plugin::Bayes

If it is not documented, it is not supported.


> This is with SpamAssassin v3.3.


Meanwhile, 3.4 is now stable.


Re: SpamAssassin false positive bayes with attachments

2014-10-06 Thread Karsten Bräckelmann
On Mon, 2014-10-06 at 09:03 -0400, jdime abuse wrote:
> I have been seeing some issues with bayes detection from base64
> strings within attachments causing false positives.
>
> Example:
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' = 0.71186828264
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' = 0.68644662127
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' = 0.68502147581
> Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' = 0.66604823748
>
> Is there a solution to prevent triggering bayes from the base64 data
> in an attachment? It was my impression that attachments should not
> trigger bayes data, but it seems that it is parsing it as text rather
> than an attachment.

Bayes tokens are basically taken from rendered, textual body parts (and
mail headers). Attachments are not tokenized.

Unless the message's MIME-structure is severely broken, these tokens
appear somewhere other than a base64 encoded attachment. Can you provide
a sample uploaded to a pastebin?
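
To check which parts of a message are declared as textual body parts and which as attachments, the MIME tree can be listed. The following is a minimal Python sketch using the standard `email` package; it is not SpamAssassin's own parser, and the function name `describe_parts` and the sample message are illustrative assumptions.

```python
from email import message_from_bytes, policy


def describe_parts(raw: bytes):
    """Return (content type, disposition, transfer encoding) per leaf MIME part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        cte = part.get("Content-Transfer-Encoding")
        rows.append((
            part.get_content_type(),
            part.get_content_disposition(),  # None, "inline" or "attachment"
            str(cte).lower() if cte is not None else "7bit",  # RFC 2045 default
        ))
    return rows


# Hypothetical well-formed message: one text body, one PDF attachment.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"Hello\r\n"
    b"--b\r\n"
    b'Content-Type: application/pdf; name="a.pdf"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="a.pdf"\r\n'
    b"\r\n"
    b"JVBERi0xLjQK\r\n"
    b"--b--\r\n"
)

if __name__ == "__main__":
    for row in describe_parts(SAMPLE):
        print(row)
```

In a well-formed message only the `text/*` parts would be expected to feed the tokenizer; a `text/*` row that also shows `attachment` and `base64` is worth a closer look.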


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: SpamAssassin false positive bayes with attachments

2014-10-06 Thread David F. Skoll
On Mon, 06 Oct 2014 21:28:02 +0200
Karsten Bräckelmann guent...@rudersport.de wrote:

> Unless the message's MIME-structure is severely broken, these tokens
> appear somewhere other than a base64 encoded attachment.

Agreed, and a Qmail bounce message is a prime example of a message
whose MIME structure is severely broken.  I wonder if that's what
the OP is seeing?

Qmail's bounce message starts with:

Hi. This is the

and then (sometimes) includes the entire raw MIME message as a giant
glob of text.

http://cr.yp.to/proto/qsbmf.txt

We have custom code specifically to detect such messages and avoid
tokenizing them. :(
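
A check of this kind might look like the Python sketch below. It is only an illustration and not the custom code mentioned above: it assumes that matching the fixed opening words quoted earlier is enough, and the names `QSBMF_PREFIX` and `looks_like_qsbmf_bounce` are hypothetical.

```python
# Opening words of a qmail-style bounce body, as quoted in this thread.
QSBMF_PREFIX = "Hi. This is the"


def looks_like_qsbmf_bounce(body: str) -> bool:
    """True if the decoded message body starts with the qmail bounce prefix."""
    # Ignore leading blank lines, then compare the very start of the body.
    return body.lstrip("\r\n").startswith(QSBMF_PREFIX)


if __name__ == "__main__":
    print(looks_like_qsbmf_bounce("Hi. This is the qmail-send program.\n"))
    print(looks_like_qsbmf_bounce("Hello,\nplease see the attachment.\n"))
```

A filter could call such a function before tokenizing and skip Bayes learning and scoring for messages that match.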

Regards,

David.


Re: SpamAssassin false positive bayes with attachments

2014-10-06 Thread Joe Albertson
After reading your reply, I re-examined the message and found the cause was
an incorrect Content-Type:
~~~
Content-Type: text/plain; charset=windows-1250;
 name=pdfname.pdf
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename=pdfname.pdf
~~~

So it was scanning the base64 as text and tokenizing it.
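
Parts mislabeled like this can be spotted before they reach the tokenizer. The Python sketch below uses the standard `email` package and is not part of SpamAssassin; the function name `mislabeled_text_parts`, the `%PDF-` and `.pdf` heuristics, and the sample payload are assumptions made for illustration.

```python
from email import message_from_bytes, policy


def mislabeled_text_parts(raw: bytes):
    """List (content type, filename) for text/* parts that look like PDFs."""
    msg = message_from_bytes(raw, policy=policy.default)
    suspects = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""  # undo base64 / QP
        filename = part.get_filename() or ""
        # Heuristic (assumption): PDF magic bytes or a .pdf file name.
        if payload.startswith(b"%PDF-") or filename.lower().endswith(".pdf"):
            suspects.append((part.get_content_type(), filename))
    return suspects


# Headers as reported in this thread, with a tiny stand-in payload
# ("JVBERi0xLjQK" is base64 for "%PDF-1.4\n").
SAMPLE_PDF_AS_TEXT = (
    b"Content-Type: text/plain; charset=windows-1250;\r\n"
    b" name=pdfname.pdf\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Disposition: attachment;\r\n"
    b" filename=pdfname.pdf\r\n"
    b"\r\n"
    b"JVBERi0xLjQK\r\n"
)

if __name__ == "__main__":
    print(mislabeled_text_parts(SAMPLE_PDF_AS_TEXT))
```

The check only flags candidates; what to do with them (skip Bayes, fix the sender, or add a local rule) is a separate decision.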
