http://bugzilla.spamassassin.org/show_bug.cgi?id=3077
------- Additional Comments From [EMAIL PROTECTED]  2004-02-25 22:15 -------
Subject: RE: spamassassin -d is too damn slow

> ------- Additional Comments From [EMAIL PROTECTED]  2004-02-25 21:18 -------
> Subject: Re: spamassassin -d is too damn slow
>
> On Wed, Feb 25, 2004 at 09:08:56PM -0800, [EMAIL PROTECTED] wrote:
> >
> > I don't have a strong opinion on this, except that like any user,
> > I'd like all operations to be speedy. In a one message at a time mode,
> > 'spamassassin -d' is not executed often, and probably its slow
> > execution time is not a big deal. But when users start operating
> > on large collections of messages, usually to build a corpus,
> > the overhead dominates and _is_ noticeable.
>
> Why not just write a script that does it all for you? It's pretty
> trivial if you use something like Mail::Mailbox to suck in a bunch of
> messages, loop over them removing their markup and saving them to a
> folder. I do this all the time, it's pretty much how I manage my
> corpus, although most of it is via IMAP. Then you only incur the
> startup once for a whole bunch of messages.
>
> Michael

Hi Michael,

I actually did write a script, using procmail/formail/sed, that I
posted to the SA list. It likely runs faster than most Perl
implementations, but it doesn't do everything that 'spamassassin -d'
does. 'spamassassin -d' reads SA's configuration directives, looking
for header line rewrites and for the values of those rewrite tags.
SA then attempts to reverse the effect of those tags. So, to be a
true drop-in replacement, the new tool would have to be cognizant of
the other things that 'spamassassin -d' does.

I've used Mailbox as well. It's a convenient, powerful package, but it
has limitations. It keeps the mail messages it reads in memory, trying
to keep track of their subject threading and such. From a practical
point of view, I've found it best to limit the number of messages
processed by the Mail module to about 10,000.
So, at a minimum, the corpus would need to be split into chunks that
the Mail-based script can handle.

Finally, Mail is not very resistant to malformed MIME message headers
and bodies, and such things are commonplace in spam. The SA developers
are implementing their own MIME/HTML parsers for this reason.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
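
For illustration, the "pay the startup cost once" idea discussed in this
comment could be sketched roughly as below. This is a minimal Python
sketch under stated assumptions, not a replacement for 'spamassassin -d':
it assumes the markup is limited to X-Spam-* headers plus one fixed
Subject tag (SUBJECT_TAG is a hypothetical value; the real tool takes it
from the SA configuration), and that the corpus is an mbox file in which
body lines never begin with an unescaped "From ". The names iter_mbox,
strip_markup and clean_mbox are made up for this example.

```python
import io

# Assumption: a fixed Subject tag. The real value lives in the SA config.
SUBJECT_TAG = "*****SPAM*****"


def iter_mbox(stream):
    """Yield one message at a time (as a list of lines) from an mbox stream."""
    current = []
    for line in stream:
        # A "From " line starts a new message; emit the one collected so far.
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def strip_markup(lines, subject_tag=SUBJECT_TAG):
    """Drop X-Spam-* headers and the Subject tag; leave the body untouched."""
    out = []
    in_headers = True
    skipping = False  # True while inside a (possibly folded) X-Spam-* header
    for line in lines:
        if in_headers:
            if line in ("\n", "\r\n"):
                # Blank line ends the header block.
                in_headers = False
                skipping = False
            elif line[:1] in (" ", "\t"):
                # Continuation of the previous header line.
                if skipping:
                    continue
            else:
                skipping = line.lower().startswith("x-spam-")
                if skipping:
                    continue
                if line.lower().startswith("subject:"):
                    line = line.replace(subject_tag + " ", "", 1)
        out.append(line)
    return out


def clean_mbox(src, dst):
    """Copy src to dst with markup removed; return the number of messages."""
    count = 0
    for message in iter_mbox(src):
        dst.writelines(strip_markup(message))
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO(
        "From a@example.com Wed Feb 25 22:15:00 2004\n"
        "X-Spam-Flag: YES\n"
        "Subject: *****SPAM***** Cheap pills\n"
        "\n"
        "body\n"
    )
    result = io.StringIO()
    clean_mbox(demo, result)
    print(result.getvalue(), end="")
```

Only one message is held in memory at a time, so the corpus would not
need to be split into chunks, and because the sketch edits raw header
lines instead of parsing MIME, malformed bodies pass through unchanged.
It does not read SA's configuration or undo any other rewrites.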
