As Bart's note to SA Talk indicated, I've been trying to use 'spamassassin -d' to remove markup from a corpus that I'm building up (85000 spams, 45000 hams). I found that this naive approach to cleaning the report headers takes a long time, even on fairly quick hardware:

    formail -s spamassassin -d < inbox_with_markup > outbox_without_markup

I've heard this could be sped up by using mass-check, but I didn't get around to learning it well enough, so I have stayed with the approach above so far. I'm about done for now and ready to move on, but thought I'd share some measurements that I took.
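Since formail -s splits the mbox on its "From " separator lines, one cheap sanity check on a run like this is to compare message counts before and after. The following is only a sketch: the function names count_msgs and same_msg_count are made up for illustration, and it assumes that body lines beginning with "From " have been escaped as ">From " (as most mbox writers do); otherwise the count will be too high.

```shell
#!/bin/sh
# Count messages in an mbox by counting "From " separator lines.
# Assumption: body lines starting with "From " are escaped (">From "),
# so every match is a real message separator.
count_msgs() {
    grep -c '^From ' "$1"
}

# Hypothetical helper: succeeds only if both mboxes hold the same
# number of messages, e.g. the marked-up input and the cleaned output.
same_msg_count() {
    [ "$(count_msgs "$1")" -eq "$(count_msgs "$2")" ]
}
```

With these defined, something like `same_msg_count inbox_with_markup outbox_without_markup || echo "message count changed"` would flag a run that dropped or merged messages.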
I started with an mbox file called spam-1000-msgs.mbox that contains 1000 marked-up mail messages, mostly safe_copy. Here it is alongside the cleaned output of the first run:

    % ls -l spam-1000-msgs.mbox spam-1000-clean-1.mbox
    -rw-rw-r-- 1 gary users 5315319 Feb 23 13:30 spam-1000-clean-1.mbox
    -rw-rw-r-- 1 gary users 9088876 Feb 23 13:13 spam-1000-msgs.mbox

First, the regular way.

    % time formail -s spamassassin -d < spam-1000-msgs.mbox > spam-1000-clean-1.mbox
    User=734.860 System=62.480 Wall=13:33.50 (U+S)/W=98.0%

That's (13*60 + 33) seconds, or 813 seconds, or about 1.23 messages/sec. This is on a 2.4GHz P4 with 2GB of memory and fast SCSI drives (RH 9 with all updates).

After some thought, I realized that I was running this test from my own user ID, whose home directory is mounted via NFS. I didn't expect that to make a big difference, but it did. I created an empty directory and an empty prefs file, and tried again:

    % time formail -s spamassassin -d -C ./empty -p ./prefs < spam-1000-msgs.mbox > spam-1000-clean-2.mbox
    User=426.530 System=37.770 Wall=7:54.77 (U+S)/W=97.7% I/O=0/0

That's a noticeable difference. We're up to 2.1 messages/sec now. (BTW, there were no local site rules, and this is SA 3.0.0-r6789, above.)

Adding 'use_bayes 0' made no significant change.

    % echo "use_bayes 0" > prefs
    % time formail -s spamassassin -L -d -C ./empty -p ./prefs < spam-1000-msgs.mbox > spam-1000-clean-2.mbox
    User=423.690 System=38.500 Wall=7:47.45 (U+S)/W=98.8% I/O=0/0

I had a copy of SA 2.61 on this system, so I tried that as well, just to eliminate the possibility of a regression:

    % time formail -s ../SA-2.61/spamassassin -L -d -C ./empty -p ./prefs < spam-1000-msgs.mbox > spam-1000-clean-2.mbox
    User=397.770 System=35.230 Wall=7:17.83 (U+S)/W=98.8% I/O=0/0

It seems that 2.61 is about 6% faster than 3.0 on this operation.

Back to 3.0. I tried running as root, because I know root's home directory is local, and it can't write anywhere via NFS even if it wanted to.
    # time formail -s spamassassin -L -d -C ./empty -p ./prefs < spam-1000-msgs.mbox > spam-1000-clean-2.mbox
    422.000u 37.420s 7:40.09 99.8% 0+0k 0+0io 574120pf+0w

No change there. Bart Schaefer recommended that I try 'pperl':

    % time formail -s pperl `which spamassassin` -L -d -C ./empty -p ./prefs < spam-1000-msgs.mbox > spam-1000-clean-2.mbox
    User=4.540 System=2.220 Wall=1:14.88

Which is quite a bit better: about 6 times faster than the runs with the local config directory, and nearly 11 times faster than our first, naive try (13:33.50). This brings the processing rate up to about 13.4 messages per second, which is still only 1/8th the rate of the formail/procmail/sed solution, but is much improved.
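For anyone who wants to redo the arithmetic on their own timings, here is a small sketch that turns the wall-clock figures reported by time into messages per second and speedup ratios. The function names (wall_to_secs, msgs_per_sec, speedup) are made up for this note, and it assumes wall times of the form M:SS.ss or plain seconds and a POSIX awk.

```shell
#!/bin/sh
# Convert a wall-clock time such as "13:33.50" (M:SS.ss, or H:MM:SS)
# or "74.88" (plain seconds) into seconds with two decimals.
wall_to_secs() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Messages per second: $1 = number of messages, $2 = wall-clock time.
msgs_per_sec() {
    awk -v n="$1" -v s="$(wall_to_secs "$2")" 'BEGIN { printf "%.2f\n", n / s }'
}

# How many times faster the second run is: $1 = old wall time, $2 = new wall time.
speedup() {
    awk -v a="$(wall_to_secs "$1")" -v b="$(wall_to_secs "$2")" 'BEGIN { printf "%.1f\n", a / b }'
}
```

Fed the numbers above, `msgs_per_sec 1000 13:33.50` prints 1.23, `msgs_per_sec 1000 1:14.88` prints 13.35, `speedup 7:54.77 1:14.88` prints 6.3, and `speedup 13:33.50 1:14.88` prints 10.9.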
