[Bug 3077] spamassassin -d is too damn slow

bugzilla-daemon 26 Feb 2004 06:15:32 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3077

------- Additional Comments From [EMAIL PROTECTED]  2004-02-25 22:15 -------
Subject: RE:  spamassassin -d is too damn slow

> ------- Additional Comments From [EMAIL PROTECTED]  2004-02-25 21:18 -------
> Subject: Re:  spamassassin -d is too damn slow
> 
> On Wed, Feb 25, 2004 at 09:08:56PM -0800, 
> [EMAIL PROTECTED] wrote:
> > 
> > I don't have a strong opinion on this, except that like any user,
> > I'd like all operations to be speedy. In a one message at a time mode,
> > 'spamassassin -d' is not executed often, and probably its slow
> > execution time is not a big deal. But when users start operating
> > on large collections of messages, usually to build a corpus,
> > the overhead dominates and _is_ noticeable.
> 
> Why not just write a script that does it all for you?  It's pretty
> trivial if you use something like Mail::Mailbox to suck in a bunch of
> messages, loop over them removing their markup and saving them to a
> folder.  I do this all the time, it's pretty much how I manage my
> corpus, although most of it is via IMAP.  Then you only incur the
> startup once for a whole bunch of messages.
> 
> Michael

Hi Michael,

I actually did write a script, using procmail/formail/sed, that I posted
to the SA list. It likely runs faster than most Perl implementations, but
it doesn't do everything that 'spamassassin -d' does. 'spamassassin -d'
reads SA's configuration directives, looking for header line rewrites
and for the values of those rewrite tags. SA then attempts to reverse
the effect of those tags. So, to be a true drop-in replacement, the new
tool will have to be cognizant of the other things that 'spamassassin -d'
does.
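
To make the gap concrete, here is a rough Python sketch of the simple
part only: dropping X-Spam-* headers and undoing a subject rewrite. The
function name strip_markup and the hard-coded tag "*****SPAM*****" are
assumptions for illustration. The sketch does not read SA's configuration
at all, which is exactly the part a real replacement would have to handle.

```python
import email
from email import policy

# Assumed tag value for illustration; 'spamassassin -d' would take the
# actual value from SA's configuration instead of hard-coding it.
SUBJECT_TAG = "*****SPAM*****"


def strip_markup(raw: bytes, subject_tag: str = SUBJECT_TAG) -> bytes:
    """Remove X-Spam-* headers and a leading subject tag from one message."""
    msg = email.message_from_bytes(raw, policy=policy.compat32)

    # Delete every header whose name starts with "X-Spam-".
    for name in list(msg.keys()):
        if name.lower().startswith("x-spam-"):
            del msg[name]

    # Undo the subject rewrite if the tag is present at the start.
    subject = msg["Subject"]
    if subject is not None and subject.startswith(subject_tag):
        msg.replace_header("Subject", subject[len(subject_tag):].lstrip())

    return msg.as_bytes()
```

Because the interpreter starts once and the function is called per
message, the startup cost is paid a single time for the whole corpus.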

I've used Mailbox as well. It's a convenient, powerful package. But it has
limitations. It keeps the mail messages it reads in memory, trying to
keep track of their subject threading and such. From a practical point of
view, I've found it is best to limit the number of messages processed
by the Mail module to about 10,000. So, at a minimum, the corpus
would need to be split into chunks that the Mail-based script can handle.
Finally, Mail is not very resistant to malformed MIME message headers and
bodies, and such things are commonplace in spam. The SA developers are
implementing their own MIME/HTML parsers for this reason.
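
For the chunking step, a minimal Python sketch follows. The helper name
batches is hypothetical, and the default of 10,000 is only my rule of
thumb from above, not a documented limit. It works on any iterable, for
example a list of message file paths, and never holds more than one chunk
in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int = 10_000) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        # Pull the next `size` items; an empty list means we are done.
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be handed to a separate run of the Mail-based
script, so no single run sees more messages than it can comfortably hold.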

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
