Re: Scanning large-body spam

Mark Martinec Wed, 31 Mar 2010 14:26:19 -0700

On Wednesday March 31 2010 18:05:52 Charles Gregory wrote:
> Excuse me for not *thinking* earlier, but it occurs to me that there is a
> very big drawback to *truncating* a message before passing it to SA, as
> opposed to my original request/suggestion to *flag* (or set a config
> param?) to tell SA to *ignore* parts of a message past a certain size.
>
> I believe it is fairly common practice for MTA's to expect SA to return
> the *entire* message, complete with X-Spam header 'markup', from SA's
> standard output stream. This is particularly important where mail
> classified as *slightly* spammy is delivered to a special spam folder
> based upon the headers added by SA. Or on a system where all mail tagged
> as spam is quarantined. Having SA's markup/explanations is critical to
> analysing false positives/negatives.
> 
> So SA needs to read and write the *entire* message, but then be given a
> parameter to keep it from thrashing over the really large ones.....


Depriving SpamAssassin of the full message and letting it work on a
truncated message, appropriately marked as such, does have some drawbacks.
But the message header alone often carries half the value of score
quality, and the first 400 kB of the body adds plenty more information
about a message. It would of course be better to give SA access to full
or summarized information about the rest of the message (such as its
attachments) too, but doing without is not too bad. Compared with having
no score at all (and passing every big message as clean), a score
computed on a partial message is clearly preferable, so the decision is
trivial (it just needs to be done).

> I believe it is fairly common practice for MTA's to expect SA to return
> the *entire* message, complete with X-Spam header 'markup', from SA's
> standard output stream.

Sure, but this is an implementation detail. There is no underlying reason
that spamc could not keep the original message and only feed part of it
to spamd, then merge the results back and do the final message editing
(like inserting/editing header fields) by itself. Or to modify spamd and
let it handle arbitrary size messages by avoiding its current paradigm
of keeping the entire message in memory.

Anyway, the amavisd glue to SpamAssassin does just that: it lets
SpamAssassin see only the first 400 kB (configurable) of a large message,
then edits the original message based on the results obtained from
SpamAssassin. This offers the best of both worlds: it handles messages of
arbitrary size and avoids SpamAssassin slurping it all into memory. The
tricky details are in editing the message and in ensuring that DKIM and DK
signatures survive (which is done by using an out-of-band channel between
a caller and SA with its plugins, as provided by SA 3.3).
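
To make the truncate-then-edit idea concrete, here is a minimal sketch in
Python. It is not amavisd's code: scan_and_tag and fake_scanner are
hypothetical names, and the scanner callable is only a stand-in for a real
call into SpamAssassin. It assumes that new header fields can simply be
prepended to the untouched original, and it does not attempt the DKIM
handling mentioned above.

```python
from typing import Callable, List, Tuple

LIMIT = 400 * 1024  # bytes shown to the scanner (the 400 kB figure above)

Header = Tuple[str, str]


def scan_and_tag(raw: bytes,
                 scanner: Callable[[bytes], List[Header]],
                 limit: int = LIMIT) -> bytes:
    """Scan only the first `limit` bytes, then tag the full original."""
    # The scanner never sees more than `limit` bytes, however big `raw` is.
    partial = raw[:limit]
    new_headers = scanner(partial)

    # Reuse the message's own line ending so the result stays consistent.
    eol = b"\r\n" if b"\r\n" in raw[:1024] else b"\n"

    # Prepend the new header fields; the original bytes are left unmodified.
    prefix = b"".join(
        f"{name}: {value}".encode("ascii", "replace") + eol
        for name, value in new_headers
    )
    return prefix + raw


def fake_scanner(partial: bytes) -> List[Header]:
    """Hypothetical stand-in for SpamAssassin: flags one fixed phrase."""
    flag = "Yes" if b"buy now" in partial.lower() else "No"
    return [("X-Spam-Flag", flag)]
```

Calling scan_and_tag(message_bytes, fake_scanner) returns the complete
message with the extra header field in front, so a caller still gets the
entire message back even though only a prefix of it was scanned.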

  Mark
