http://bugzilla.spamassassin.org/show_bug.cgi?id=3131
Summary: Simple body test for signature fails
Product: Spamassassin
Version: 2.63
Platform: All
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P3
Component: spamassassin
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]
I have a class of spams that appear sometimes in html and sometimes in plain
text. A characteristic of all of these spams is a signature in the form:
Regards,
Alpha Foo (or other 2-word random name)
In html:
<P>Regards,
<P>Alpha Foo
<P>
Or:
<P>Regards,<BR>
Alpha Foo<BR>
Seemingly a simple body test should catch this:
body BOGUS_SIG /\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b/
This in fact works on the 20% or so of these spams that are plain text. It
does NOT work on any of the HTML spams, because "body" breaks into multiple
hunks at <P> marks or <BR> marks! Thus, "body" in html is approximately as
useless as rawbody when trying to find a specific sequence of words that will
match and might span a line break.
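To illustrate the splitting problem, here is a minimal Python sketch. It is not SpamAssassin code: split_into_hunks is a hypothetical stand-in that assumes "body" text is cut at <P> and <BR> and that whitespace inside each hunk is collapsed, as described above. The pattern is the BOGUS_SIG regex carried over unchanged.

```python
import re

# Same pattern as the BOGUS_SIG rule above, in Python syntax.
BOGUS_SIG = re.compile(r"\bRegards, [A-Z][a-z]+ [A-Z][a-z]+\b")


def split_into_hunks(html):
    """Hypothetical model of the splitting: cut at <P>/<BR>, collapse whitespace."""
    parts = re.split(r"<p>|<br>", html, flags=re.IGNORECASE)
    return [" ".join(p.split()) for p in parts if p.strip()]


def matches_any_hunk(hunks):
    """A body rule is tried against each hunk separately, never across hunks."""
    return any(BOGUS_SIG.search(h) is not None for h in hunks)


html_sig = "<P>Regards,\n<P>Alpha Foo\n<P>"
hunks = split_into_hunks(html_sig)

per_hunk = matches_any_hunk(hunks)          # signature is split across hunks
joined = matches_any_hunk([" ".join(hunks)])  # same text as one piece
plain = matches_any_hunk(split_into_hunks("Regards,\nAlpha Foo\n"))
```

Under these assumptions the HTML signature becomes the two hunks "Regards," and "Alpha Foo", neither of which matches alone, while the plain-text version stays in one hunk and matches.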
Now, it will be argued (erroneously) that this can be handled by a meta, and
thus splitting the body isn't a problem:
body __REGARDS /\bRegards,/
body __SIG /\b[A-Z][a-z]+ [A-Z][a-z]+\b/
meta BOGUS_SIG (__REGARDS && __SIG)
I leave it to the reader to figure out why that one won't work.
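One likely reading of the problem, sketched in Python under the same per-hunk assumption: __SIG matches any two adjacent capitalized words anywhere in the message, so the meta fires on ordinary mail that merely says "Regards," somewhere. The sample messages below are made up for illustration.

```python
import re

REGARDS = re.compile(r"\bRegards,")
SIG = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def meta_bogus_sig(hunks):
    """Model of the meta rule: each sub-rule may hit in ANY hunk, in any order."""
    has_regards = any(REGARDS.search(h) for h in hunks)
    has_sig = any(SIG.search(h) for h in hunks)
    return has_regards and has_sig


# Hypothetical legitimate mail: "New York" satisfies __SIG by itself.
legit = ["Thanks for the report from New York last week.",
         "Regards,",
         "the support team"]

# Hypothetical spam with the signature under discussion.
spam = ["Buy now.", "Regards,", "Alpha Foo"]

legit_hit = meta_bogus_sig(legit)
spam_hit = meta_bogus_sig(spam)
```

Both messages hit, so the meta cannot tell the signature apart from any mail containing "Regards," and a place name or a person's name.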
Thus, the ONLY current way to detect this particular spam signature is to use
a "full" test, which means that the regex now has to parse HTML and line
breaks. It will also miss if the body is encoded in quoted-printable or
base64, hit erroneously if the text appears in a header, and probably fall to
any number of other obfuscations or erroneous hits.
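A small Python sketch of the encoding and header problems, assuming a "full" test sees the raw, undecoded message text. The messages are invented for illustration, and \s+ is used between the two lines because a raw match has to cope with the line break itself.

```python
import base64
import re

# Signature pattern loosened to allow a line break after "Regards,".
BOGUS_SIG = re.compile(r"\bRegards,\s+[A-Z][a-z]+ [A-Z][a-z]+\b")

body_text = "Buy now.\nRegards,\nAlpha Foo\n"
encoded_body = base64.b64encode(body_text.encode("ascii")).decode("ascii")

# Hypothetical raw message whose body is base64-encoded.
raw_base64_message = (
    "Subject: offer\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded_body + "\n"
)

# Hypothetical raw message where the text appears only in a header.
raw_header_message = "Subject: Regards, Alpha Foo\n\nHello.\n"

found_in_raw_base64 = BOGUS_SIG.search(raw_base64_message) is not None
found_after_decoding = BOGUS_SIG.search(body_text) is not None
found_in_header_only = BOGUS_SIG.search(raw_header_message) is not None
```

The base64 alphabet contains neither commas nor spaces, so the raw search can never see the signature; it only appears once the body is decoded. The header-only message, on the other hand, produces a hit even though its body is clean.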
It is argued that 'body' needs to be separate pieces to reduce regex overhead.
I argue that forcing simple body tests onto full (which is by definition a
larger hunk of text than the combined body of the message) INCREASES overhead,
both due to searching a larger text string, and because the tests themselves
become very convoluted to attempt to un-encode all of the various ways that a
body can be encoded. Thus the body decoding has to be done multiple times.
Another argument would of course be that we don't need to detect obvious spam
signatures, since being able to look for them would perhaps increase SA
overhead. The rejoinder to that is, what the heck is the purpose of SA if not
to detect spam by its characteristics? Do we expect the spammers to purposely
code their spams to make them easy for SA to detect?