http://bugzilla.spamassassin.org/show_bug.cgi?id=3131

           Summary: Simple body test for signature fails
           Product: Spamassassin
           Version: 2.63
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: spamassassin
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


I have a class of spams that appear sometimes in html and sometimes in plain 
text.  A characteristic of all of these spams is a signature in the form:

Regards,
Alpha Foo (or other 2-word random name)

In html:

<P>Regards,
<P>Alpha Foo
<P>

Or:

<P>Regards,<BR>
Alpha Foo<BR>

Seemingly a simple body test should catch this:

body BOGUS_SIG  /\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b/

This in fact works on the 20% or so of these spams that are plain text.  It 
does NOT work on any of the HTML spams, because "body" breaks into multiple 
hunks at <P> marks or <BR> marks!  Thus, "body" in html is approximately as 
useless as rawbody when trying to find a specific sequence of words that will 
match and might span a line break.
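The failure described above can be reproduced with a small simulation. This is only a sketch: render_hunks() is a hypothetical stand-in for the HTML rendering step, assuming (as reported above) that the text is split at <P> and <BR> marks and that each rule is tried against one hunk at a time. The sample messages are invented for illustration.

```python
import re

# The BOGUS_SIG pattern from the report, in Python regex syntax.
BOGUS_SIG = re.compile(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b")

def render_hunks(html):
    """Hypothetical model of the renderer: split at <P>/<BR>, drop any
    remaining tags, and collapse whitespace inside each hunk."""
    pieces = re.split(r"<(?:p|br)\s*/?>", html, flags=re.IGNORECASE)
    hunks = []
    for piece in pieces:
        text = re.sub(r"<[^>]+>", "", piece)
        text = " ".join(text.split())
        if text:
            hunks.append(text)
    return hunks

def body_rule_hits(hunks):
    """A body rule hits only if a single hunk matches on its own."""
    return any(BOGUS_SIG.search(h) for h in hunks)

# Invented sample messages carrying the same signature.
plain = "Buy now.\n\nRegards,\nAlpha Foo\n"
html = "<P>Buy now.<P>Regards,<BR>\nAlpha Foo<BR>\n"

# Plain text: paragraphs are separated by blank lines, newlines become spaces.
plain_hunks = [" ".join(p.split()) for p in plain.split("\n\n") if p.strip()]
html_hunks = render_hunks(html)

print(plain_hunks, body_rule_hits(plain_hunks))
print(html_hunks, body_rule_hits(html_hunks))
# Joining the HTML hunks back together would let the same pattern match.
print(bool(BOGUS_SIG.search(" ".join(html_hunks))))
```

In this model the signature ends up in two separate hunks for the HTML message, so no single hunk can satisfy the pattern, even though the rendered text as a whole does.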

Now, it will be argued (erroneously) that this can be handled by a meta, and 
thus splitting the body isn't a problem:

body __REGARDS /\bRegards,/
body __SIG /\b[A-Z][a-z]+ [A-Z][a-z]+\b/
meta BOGUS_SIG (__REGARDS && __SIG)

I leave it to the reader to figure out why that one won't work.
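For readers who want a hint, here is one likely reason, sketched under the same simplified assumption that each sub-rule is tried against every hunk independently. The meta only records that both sub-rules hit somewhere in the message; it knows nothing about where or in what order. The function name and the sample message are invented for illustration.

```python
import re

# The two sub-rules from the proposed meta, in Python regex syntax.
REGARDS = re.compile(r"\bRegards,")
SIG = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def meta_bogus_sig(hunks):
    """Model of: meta BOGUS_SIG (__REGARDS && __SIG).
    Each sub-rule may hit in any hunk; adjacency is never checked."""
    regards_hit = any(REGARDS.search(h) for h in hunks)
    sig_hit = any(SIG.search(h) for h in hunks)
    return regards_hit and sig_hit

# An invented legitimate message: a polite reply signed with one lowercase name.
ham = [
    "Hello John Smith, thanks for the report.",
    "I will look into it on Monday.",
    "Regards,",
    "bob",
]

# The spam signature as it arrives after the HTML split.
spam = ["Buy now.", "Regards,", "Alpha Foo"]

print(meta_bogus_sig(spam))
print(meta_bogus_sig(ham))
```

Both messages trigger the meta. In the legitimate one, __SIG is satisfied by "Hello John" in the greeting, which has nothing to do with the signature, so the rule cannot tell the spam signature apart from any mail that says "Regards," and contains two capitalized words in a row.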

Thus, the ONLY current way to detect this particular spam signature is to use 
FULL.  Which means that the regex now has to parse html and line breaks.  And 
will fail if the body is encoded in quoted-printable or base64.  Or if the text 
appears in a header.  Or probably any number of other possible obfuscations or 
erroneous hits.
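The encoding problem can be illustrated with a short sketch. FULL_SIG below is a hypothetical pattern of the kind a "full" rule would need (it already has to skip tags and line breaks), and the raw messages are invented. The only assumption is that a full rule sees the undecoded message text.

```python
import base64
import re

# Hypothetical "full"-style pattern: tolerate tags and whitespace between
# "Regards," and the two-word name.
FULL_SIG = re.compile(r"Regards,(?:\s|<[^>]+>)*[A-Z][a-z]+ [A-Z][a-z]+\b")

html = "<P>Buy now.<P>Regards,<BR>\nAlpha Foo<BR>\n"
headers = "Subject: offer\nMIME-Version: 1.0\nContent-Type: text/html\n"

# The same body sent unencoded and sent as base64.
raw_8bit = headers + "Content-Transfer-Encoding: 8bit\n\n" + html
payload_b64 = base64.encodebytes(html.encode("ascii")).decode("ascii")
raw_b64 = headers + "Content-Transfer-Encoding: base64\n\n" + payload_b64

print(bool(FULL_SIG.search(raw_8bit)))  # pattern sees the literal text
print(bool(FULL_SIG.search(raw_b64)))   # pattern sees only base64 characters

# Only after decoding does the signature become visible again.
decoded = base64.decodebytes(payload_b64.encode("ascii")).decode("ascii")
print(bool(FULL_SIG.search(decoded)))
```

The base64 alphabet contains no comma, so the literal "Regards," cannot appear in the encoded message at all; a full rule would have to decode the body itself before it could match.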

It is argued that 'body' needs to be separate pieces to reduce regex overhead.  
I argue that forcing simple body tests onto full (which is by definition a 
larger hunk of text than the combined body of the message) INCREASES overhead, 
both due to searching a larger text string, and because the tests themselves 
become very convoluted to attempt to un-encode all of the various ways that a 
body can be encoded.  Thus the body decoding has to be done multiple times.

Another argument would of course be that we don't need to detect obvious spam 
signatures, since being able to look for them would perhaps increase SA 
overhead.  The rejoinder to that is, what the heck is the purpose of SA if not 
to detect spam by its characteristics?  Do we expect the spammers to purposely 
code their spams to make them easy for SA to detect?



