[Bug 3887] New: don't record regression test strings unless run from t/rule_tests.t
http://bugzilla.spamassassin.org/show_bug.cgi?id=3887

           Summary: don't record regression test strings unless run from
                    t/rule_tests.t
           Product: Spamassassin
           Version: 3.0.0
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: [EMAIL PROTECTED]

this is extremely trivial, and saves a little RAM and startup time. We're
recording the regression tests, which are only necessary when t/rule_tests.t
is being run!

3 runs of "spamassassin -Lt sample-spam.txt":

before: 0m0.783s / 0m0.792s / 0m0.795s
after:  0m0.785s / 0m0.780s / 0m0.786s

a mass-check, before:

Fri Oct 8 16:14:49 2004: 0 1000 13427 3353 25 0 24236 22168 - R+ pts/5 0:08 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:14:50 2004: 0 1000 13427 3353 22 0 24504 22192 - R+ pts/5 0:09 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:14:51 2004: 0 1000 13427 3353 25 0 24236 22168 - R+ pts/5 0:09 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:14:52 2004: 0 1000 13427 3353 25 0 24236 22168 - R+ pts/5 0:10 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:14:53 2004: 0 1000 13427 3353 24 0 24504 22192 - R+ pts/5 0:10 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:14:54 2004: 0 1000 13427 3353 25 0 24236 22168 - R+ pts/5 0:11 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam

and here's after:

Fri Oct 8 16:13:36 2004: 0 1000 13231 3353 25 0 24224 22116 - R+ pts/5 0:07 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:38 2004: 0 1000 13231 3353 25 0 24224 22116 - R+ pts/5 0:08 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:39 2004: 0 1000 13231 3353 25 0 24224 22116 - R+ pts/5 0:08 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:40 2004: 0 1000 13231 3353 23 0 24224 22168 - R+ pts/5 0:09 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:41 2004: 0 1000 13231 3353 24 0 24224 22168 - R+ pts/5 0:09 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:42 2004: 0 1000 13231 3353 25 0 24224 22168 - R+ pts/5 0:10 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam
Fri Oct 8 16:13:43 2004: 0 1000 13231 3353 25 0 24224 22168 - R+ pts/5 0:10 /usr/bin/perl -w ./mass-check -n -o spam:dir:~/ftp/tstcorpus/spam

--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.
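The three startup timings in the report are close together, so averaging them makes the comparison easier to read. A minimal Python sketch: the numbers are copied from the message, while the function and variable names are illustrative only.

```python
from statistics import mean

# Wall-clock seconds for "spamassassin -Lt sample-spam.txt", copied from the report.
before = [0.783, 0.792, 0.795]
after = [0.785, 0.780, 0.786]


def summarize(before_runs, after_runs):
    """Return (mean_before, mean_after, saving), all in seconds."""
    mean_before = mean(before_runs)
    mean_after = mean(after_runs)
    return mean_before, mean_after, mean_before - mean_after


mb, ma, saving = summarize(before, after)
print(f"before={mb:.4f}s after={ma:.4f}s saving={saving * 1000:.1f}ms")
```

With only three runs each, and the ranges overlapping, a difference of a few milliseconds is suggestive at best, which fits the report's own "a little" wording.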
[Bug 3887] don't record regression test strings unless run from t/rule_tests.t
http://bugzilla.spamassassin.org/show_bug.cgi?id=3887

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 16:19 ---
Created an attachment (id=2432)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2432&action=view)
patch
[Bug 3887] [review] don't record regression test strings unless run from t/rule_tests.t
http://bugzilla.spamassassin.org/show_bug.cgi?id=3887

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
           Severity|normal                      |minor
            Summary|don't record regression test|[review] don't record
                   |strings unless run from     |regression test strings
                   |t/rule_tests.t              |unless run from
                   |                            |t/rule_tests.t
   Target Milestone|Future                      |3.0.1

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 16:20 ---
applied to 3.1.0 tree; r54130. trivial enough for 3.0.1 ;)
[Bug 3883] [review] Spamcop reporting uses whole GECOS field, not just the name
http://bugzilla.spamassassin.org/show_bug.cgi?id=3883

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 18:35 ---
committed. r54147

--- You are receiving this mail because: ---
You are on the CC list for the bug, or are watching someone who is.
[Bug 3805] [review] Manual whitelist for URIDNSBL lookups
http://bugzilla.spamassassin.org/show_bug.cgi?id=3805

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 19:11 ---
Please ignore my comment #16. I got this ticket mixed up with the cctld
ticket.

http://bugzilla.spamassassin.org/show_bug.cgi?id=3827

ne.jp indeed doesn't belong on a whitelist.

For an update on some improved whitelist data, please see:

http://bugzilla.spamassassin.org/show_bug.cgi?id=3827

Within a week we should have some much better data for this purpose.
[Bug 3865] Config option w/o parameter sets value to 0
http://bugzilla.spamassassin.org/show_bug.cgi?id=3865 --- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:21 --- +1 for backporting to 3.0
[Bug 3887] [review] don't record regression test strings unless run from t/rule_tests.t
http://bugzilla.spamassassin.org/show_bug.cgi?id=3887 --- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:23 --- +1
[Bug 3828] spamd parent stops accepting requests
http://bugzilla.spamassassin.org/show_bug.cgi?id=3828 --- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:27 --- Any more ideas regarding this bug? Are we closer to a fix?
[Bug 3844] Error in BayesStore.pm
http://bugzilla.spamassassin.org/show_bug.cgi?id=3844

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:43 ---
Subject: RE: Error in BayesStore.pm

Couldn't upgrade to 3.0.0, as FreeBSD is in a port freeze.

Issue went away when I changed the to lt in BayesStore.pm

Cheers,
Richard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Saturday, 9 October 2004 3:40 PM
To: [EMAIL PROTECTED]
Subject: [Bug 3844] Error in BayesStore.pm

http://bugzilla.spamassassin.org/show_bug.cgi?id=3844

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:39 ---
Did you by chance upgrade to version 3.0.0 and then downgrade? or possibly
just test 3.0.0?

Can you run sa-learn --dump magic and paste the output to the bug?

Thanks

--- You are receiving this mail because: ---
You reported the bug, or are watching the reporter.
[Bug 3844] Error in BayesStore.pm
http://bugzilla.spamassassin.org/show_bug.cgi?id=3844

--- Additional Comments From [EMAIL PROTECTED] 2004-10-08 21:53 ---
Subject: RE: Error in BayesStore.pm

sa-learn --dump magic as requested.

Cheers,
Richard

0.000          0          2          0  non-token data: bayes db version
0.000          0        286          0  non-token data: nspam
0.000          0        362          0  non-token data: nham
0.000          0      15984          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0 1097305184          0  non-token data: newest atime
0.000          0 1097278550          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 05:53 ---
Something else is going on with this other than TextCat being slow.

When I run (on my PC under cygwin using current trunk svn)

  time spamassassin -t -L badmsg1.txt /dev/null

I get real time 16.6 seconds.

When I take the body of the malformed MIME sections and put it by itself with
some headers as the body of a mail message, that makes a 42Kbyte file, and
running

  time spamassassin -t -L message /dev/null

I get real time 6.6 seconds.

When I put in Dallas' proposed patch, I get times of 9.1 and 5.6 respectively.

When I instead change the call to create_lm($input) in TextCat.pm to be
create_lm(substr($input, 0, 1)) (which I am more comfortable with than the
proposed patch, by the way) the times go to 8.5 and 5.2 respectively.

This needs more investigation, but it seems that something else is going on in
that malformed MIME example. The times are consistent with textcat being
called more than once on the message body in the badmsg1.txt example. I don't
have time to look at this more right at this moment.

Also, the slowness of that loop in TextCat doesn't explain the memory blowup.
[Bug 3857] Spamd repeatedly causing unacceptably high system load
http://bugzilla.spamassassin.org/show_bug.cgi?id=3857 --- Additional Comments From [EMAIL PROTECTED] 2004-10-09 06:02 --- patch applied. Now to monitor and wait and see...
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 06:27 ---
Subject: Re: Memory Leak in spamd

Not having looked at the message in question, but I wonder if this would also
apply to other binary-in-the-body stuff, like old uuencoded binaries in the
body.

Loren
[Bug 3850] spamassasin with -T option consumes all memory
http://bugzilla.spamassassin.org/show_bug.cgi?id=3850

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
             Status|REOPENED                    |RESOLVED
         Resolution|                            |WORKSFORME

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 06:30 ---
I will try to be less rude and more explicit.

This is not a usable bug report.

1) Any report that is equivalent to "SpamAssassin does not work at all" is
proven to be incorrect by the thousands of installations of SpamAssassin that
do work.

2) Point #1 implies that unless you can indicate enough specifics of your
situation to indicate how it is possible that you see a problem when other
people don't, then we do not have enough information to do anything about it.

3) This place is for reporting bugs in SpamAssassin, not to help you debug
problems with your installation. Unless you have reason to suspect a bug in
SpamAssassin, ask for help with your configuration in the spamassassin users
mailing list. See http://wiki.apache.org/spamassassin/MailingLists for details
on not just the mailing list, but all of the other steps that you should take
before writing to the list.

4) The first step you should take, as outlined in the above URL, is to upgrade
to a version of SpamAssassin that is not out of date. The current release is
3.0, not 2.63 and not even 2.64.

5) There is no -T option in SpamAssassin. Did you mean -t? What else might not
be right in your configuration? Don't answer that question here. You need to
get your configuration right before you start claiming that there is a bug in
SpamAssassin.

We are happy to investigate and fix bugs in SpamAssassin when they are
reported. What you posted here is not a report of a bug in SpamAssassin, it is
a very incomplete report of your problems in trying to install SpamAssassin,
which is not of interest here. If you do find a bug, please do report it so
that it can be fixed.
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 07:04 ---
When I use create_lm(substr($input, 0, 1)) to limit the time there, the next
big delay in the debug output is between these two lines:

debug: rules: running raw-body-text per-line regexp tests; score so far=4.328
debug: rules: running full-text regexp tests; score so far=4.328

That makes sense given that the raw body message seems to include all of the
undecoded MIME parts.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 07:19 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 12:02:32AM -0700, [EMAIL PROTECTED] wrote:
> I wonder if there is any relationship between creating random huge journals
> and SA randomly growing suddenly to 250MB+.

BTW: for someone experiencing this issue, I'd be interested in getting a copy
(compressed of course!) of the extreme sized journal, and/or seeing the output
of ls -las, stat, etc.

There's only 1 way I can think of for a journal file to instantly grow to a
large size (in normal usage, it will grow at a rate relative to the amount of
mail being processed). When the journal write occurs, and a failure is
detected, the code will truncate() the file back to the original size before
the write (see BayesStore/DBM::cleanup()). The truncate() is accompanied by a
warning:

  bayes: partial write to bayes journal $path ($len of $nbytes), recovering

If this process somehow gets screwed up, the truncate could actually make the
file larger and create what is known as a sparse file. ie: a bunch of data, a
bunch of nothing (the OS typically inserts nulls when reading from the section
that doesn't actually exist on disk), and a bunch of data.

This behavior is new in 3.0.0. 2.6 would detect a partial write of journal
data, internally jump ahead to the part that wasn't written, and try again.
This could potentially lead to multiple writers clobbering each other, which
could still happen for a partial write, but at least the journal file should
be truncated() to a known good state.

For examples of creating sparse files:

$ perl -e 'open(T, ">foo"); print T "hi\n"; truncate(T,256*1048576); close(T);'
$ ls -las foo
   4 -rw-r--r--    1 tvd      wheel    268435456 Oct  9 09:47 foo
$ stat foo
  File: `foo'
  Size: 268435456   Blocks: 8   IO Block: 4096   Regular File
Device: 803h/2051d   Inode: 212997   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1113/ tvd)  Gid: ( 10/ wheel)
Access: 2004-10-09 09:47:55.0 -0400
Modify: 2004-10-09 09:47:55.0 -0400
Change: 2004-10-09 09:47:55.0 -0400

So the file only actually has a small number of blocks used on disk to store
hi\n (see below for comments about the actual size usage), but the filesystem
reports the file as 256MB. In contrast:

$ perl -e 'open(T, ">foo"); print T "hi"x(128*1048576); close(T);'
$ ls -las foo
262404 -rw-r--r--    1 tvd      wheel    268435456 Oct  9 09:50 foo
$ stat foo
  File: `foo'
  Size: 268435456   Blocks: 524808   IO Block: 4096   Regular File
Device: 803h/2051d   Inode: 212997   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1113/ tvd)  Gid: ( 10/ wheel)
Access: 2004-10-09 09:47:55.0 -0400
Modify: 2004-10-09 09:50:52.0 -0400
Change: 2004-10-09 09:50:52.0 -0400

This file actually has 256MB of data in it. Notice that the space actually
used on disk is much higher.

(the ls vs stat output may be a little confusing wrt blocks. ls reports 1k
blocks, stat reports 512 byte blocks, and the actual file system block size
(smallest amount of allocatable space in the FS) is 4096 bytes. so in the
truncate version, there is only 1 FS block allocated for the file (4 x 1k ==
1 x 4k), the bottom version has 65601 FS blocks allocated (262404 x 1k blocks
== 65601 x 4k blocks)).
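The same size-versus-allocation comparison can be made programmatically. The sketch below is in Python rather than the Perl one-liners used in the message, and the helper names (size_report, make_sparse) are made up for illustration. It assumes a POSIX system where st_blocks counts 512-byte units, the same unit as the Blocks: figure from stat.

```python
import os
import tempfile


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for path.

    st_blocks is in 512-byte units on POSIX systems, matching the
    "Blocks:" value printed by stat.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse(path, size):
    """Write a few bytes, then extend the file to `size` without writing data."""
    with open(path, "wb") as f:
        f.write(b"hi\n")
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "foo")
        make_sparse(path, 256 * 1048576)
        apparent, allocated = size_report(path)
        print(f"apparent={apparent} allocated={allocated}")
        if allocated < apparent:
            print("allocated space is smaller than the reported size: likely sparse")
```

Whether the allocated figure stays small depends on the filesystem; one without sparse-file support may allocate the full size.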
[Bug 3888] New: URIBL open redirector checking misses http-equiv Refresh
http://bugzilla.spamassassin.org/show_bug.cgi?id=3888

           Summary: URIBL open redirector checking misses http-equiv Refresh
           Product: Spamassassin
           Version: 3.0.0
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Plugins
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: [EMAIL PROTECTED]

Received a spam this morning, using Geocities sites redirecting with Meta
http-equiv=Refresh to the spamsite. I'll attach the spam and the current
contents of the site. I can see the request for the Geocities URL, and the
response coming back, so the open redirect check is happening - it just misses
this way of doing it.
[Bug 3888] URIBL open redirector checking misses http-equiv Refresh
http://bugzilla.spamassassin.org/show_bug.cgi?id=3888

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 07:45 ---
Created an attachment (id=2434)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2434&action=view)
the page from Geocities
[Bug 3888] URIBL open redirector checking misses http-equiv Refresh
http://bugzilla.spamassassin.org/show_bug.cgi?id=3888 --- Additional Comments From [EMAIL PROTECTED] 2004-10-09 07:49 --- Be sure to report these to the redirection sites. I'm sure Yahoo doesn't particularly want to have their services used to redirect spam traffic.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 07:54 ---
Sorry, I should have mentioned that I have copies of most of the occurrences.
However, I don't know if the block usage remains the same when I make a
cp -a dir dir.new. I tar-gzipped the following dir, it's now 11 MB. Do you
want to have it? Can I upload it to an FTP or shall I provide it to you via
ftp or scp?

Note that this is happening with SA 3.0 used by MailScanner, this is not a
spamd issue, at least as far as I experience it. I have MailScanner+3.0 set up
on two machines (our own mailserver and its backup MX), it occurs only on the
main mailserver. Both are set up almost identically, the backup MX gets much
less mail, though (about 10% of the main mailserver) and it's only spam. It
didn't happen with MailScanner+2.63.

(All other machines are milter+spamd2.63 because I'm reluctant to try spamd
3.0 with the memory problems getting reported. On one of the spamd machines
(with the most mail traffic supposedly) I sometimes see one or two spamd
processes collect massive amounts of memory (can be 1 GB), but this happens
very rarely, less than once per month. If it is not detected early enough it
crashes the machine eventually. I don't know of a way to limit this, AFAIK
ulimit applies to logins and this won't work for a daemon, correct?)

Here's the output for ls and stat as mentioned above.

n8:/home/spamd/bayes.4 # ls -las
total 305248
     4 drwxrwsr-x    2 spamd www       4096 Oct  8 16:54 .
     4 drwxr-xr-x   14 spamd mail      4096 Oct  8 17:01 ..
     4 -rw-------    1 root  www         16 Oct  8 16:49 bayes.lock
     4 -rw-rw-rw-    1 spamd www         42 Oct  5 15:41 bayes.mutex
 46116 -rw-rw-rw-    1 root  www   47168872 Oct  8 16:54 bayes_journal
202004 -rw-rw-rw-    1 root  www  206639592 Oct  8 16:32 bayes_journal.old
  1544 -rw-rw-rw-    1 spamd www    2482176 Oct  8 16:04 bayes_seen
 16420 -rw-rw-rw-    1 spamd www   20951040 Oct  8 16:50 bayes_toks
     4 -rwxr-xr-x    1 spamd www       3499 Apr 12 18:14 create_bayes.pl
 39136 -rw-r--r--    1 root  www   40030104 Oct  5 16:25 dump.txt
     4 -rwx------    1 spamd www        943 Apr 12 19:34 stripbadtoken.pl
     4 -rwx------    1 spamd www        229 Apr 12 19:06 timetest.pl

n8:/home/spamd/bayes.4 # stat bayes_journal
  File: `bayes_journal'
  Size: 47168872   Blocks: 92232   IO Block: 4096   regular file
Device: 802h/2050d   Inode: 785771   Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (0/root)  Gid: (8/ www)
Access: 2004-10-08 16:54:09.0 +0200
Modify: 2004-10-08 16:54:17.0 +0200
Change: 2004-10-08 17:01:07.0 +0200

n8:/home/spamd/bayes.4 # stat bayes_journal.old
  File: `bayes_journal.old'
  Size: 206639592   Blocks: 404008   IO Block: 4096   regular file
Device: 802h/2050d   Inode: 785775   Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (0/root)  Gid: (8/ www)
Access: 2004-10-08 16:51:03.0 +0200
Modify: 2004-10-08 16:32:12.0 +0200
Change: 2004-10-08 17:01:46.0 +0200
[Bug 3889] New: Debug code is hard to use in third-party code
http://bugzilla.spamassassin.org/show_bug.cgi?id=3889

           Summary: Debug code is hard to use in third-party code
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: [EMAIL PROTECTED]

I have a small bit of code used to parse a message and then do things with it:

  #!/usr/bin/perl
  use Mail::SpamAssassin;
  use Mail::SpamAssassin::Message;
  $Mail::SpamAssassin::DEBUG->{enabled} = 1;
  my $msg = new Mail::SpamAssassin::Message({parsenow => 1});

This used to work, but doesn't now due to the massive debug change. Either
way, it makes the debug output tied in to the M::SA code, even if M::SA isn't
necessary (as in this case).

I'd like to see the debug code moved into its own module separate from M::SA.
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 08:27 ---
> That makes sense given that the raw body message seems to include all of
> the undecoded MIME parts.

sorta, but not really. according to the check_language() code, the body passed
to TextCat is the rendered body, which means only text and message leaf parts,
fully decoded and HTML rendered. in this case, since the message is malformed,
the jpegs and such are all in a part marked text/plain, so that's why they're
passed in.

I'd have no problem limiting the amount of input to TextCat though. I don't
think it needs the full body of the message, the first X KB ought to be
sufficient.
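To illustrate why capping the input helps, here is a simplified character n-gram counter in Python. It is not TextCat's code: the function name, the max_n default and the max_chars parameter are assumptions made for this sketch, which only shows that the number of n-grams to count (and later sort) grows with the amount of text examined.

```python
from collections import Counter


def ngram_counts(text, max_n=5, max_chars=None):
    """Count character n-grams of length 1..max_n.

    max_chars, when given, caps how much of the input is examined,
    mirroring the idea of passing only a prefix of the message body
    to the language guesser. Illustrative only, not TextCat itself.
    """
    if max_chars is not None:
        text = text[:max_chars]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    body = "the quick brown fox jumps over the lazy dog " * 200
    full = ngram_counts(body)
    capped = ngram_counts(body, max_chars=512)
    print("n-grams counted, full body:", sum(full.values()))
    print("n-grams counted, first 512 chars:", sum(capped.values()))
```

The cap bounds the work regardless of what ends up in the body, which matters when binary data lands in a part marked text/plain.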
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 08:35 ---
> I wonder if this would also apply to other binary-in-the-body stuff, like
> old uuencoded binaries in the body.

Hi Loren, the MIME decoder in SA3 should strip out in-line uuencoded just like
it does on rfc822 attachments. The problem here is that this message has a
malformed mime structure, causing SA3 and other tools i have tried (Ripmime) to
decode it like it sees it.

> It seems that the entire message body is being passed in to create_lm, not
> just the malformed MIME part that MUAs display in the body.

Sidney, Isn't that what i said? I agree your proposed patch is the better way
to go to limit the total size that can be scanned rather than the total number
of iterations on the for loop.

> Also, the slowness of that loop in TextCat doesn't explain the memory
> blowup.

Sidney, after patching the create_lm call, i dont have any big increases in
memory consumption. the for loop takes about 6 seconds because its doing 28k
iterations. the ngrams sort in the else statement takes 4-5 seconds. the
splice (5-6) takes 10-11 seconds. the return back to classify takes 4-5
seconds! its not the for loop that causes the big increases in memory, its the
sorts and splices, and then having to return that big array back to
classify().

2004-10-09 10:31:41.246016500 debug: generic: going to textcat matches
2004-10-09 10:31:41.247064500 debug: generic: running TextCat::classify
2004-10-09 10:31:44.90500 debug: generic: count was 28758
2004-10-09 10:31:45.000113500 debug: generic: 3 else sort ngrams
2004-10-09 10:31:49.693734500 debug: generic: 4 else sort ngrams is done
2004-10-09 10:31:49.693854500 debug: generic: 5 splice sorted
2004-10-09 10:31:59.008898500 debug: generic: 6 splice sorted is done
2004-10-09 10:31:59.009022500 debug: generic: 7 return sorted to classify()
2004-10-09 10:32:03.020713500 debug: generic: done running create_lm

after patching create_lm(), the sorts and splices are very fast...

2004-10-09 10:30:07.046344500 debug: generic: going to textcat matches
2004-10-09 10:30:07.047347500 debug: generic: running TextCat::classify
2004-10-09 10:30:07.413029500 debug: generic: count was 2501
2004-10-09 10:30:07.413138500 debug: generic: 3 else sort ngrams
2004-10-09 10:30:07.568822500 debug: generic: 4 else sort ngrams is done
2004-10-09 10:30:07.568935500 debug: generic: 5 splice sorted
2004-10-09 10:30:07.591153500 debug: generic: 6 splice sorted is done
2004-10-09 10:30:07.591260500 debug: generic: 7 return sorted to classify()
2004-10-09 10:30:07.651056500 debug: generic: done running create_lm

I am taking a weekend vacation with my wife right now, so i'm not sure if i
can continue on this until monday. I agree the scan time of 4-6 seconds for
this message is still too slow, and we need to figure out what is causing that
next slow down.

thanks.
d
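The per-step costs can be read off the debug timestamps mechanically. A small Python sketch follows; the helper names are illustrative, and the sample lines are the unpatched run from the message above, starting at the first line whose timestamp is intact.

```python
from datetime import datetime, timedelta


def parse_stamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.fraction' of a log line.

    The fraction here has nine digits, more than strptime's %f accepts,
    so it is converted separately.
    """
    date, clock = line.split()[:2]
    hms, _, frac = clock.partition(".")
    base = datetime.strptime(f"{date} {hms}", "%Y-%m-%d %H:%M:%S")
    return base + timedelta(seconds=float("0." + frac) if frac else 0.0)


def step_durations(lines):
    """Return (seconds, message) pairs: elapsed time from each line to the next."""
    stamps = [parse_stamp(line) for line in lines]
    result = []
    for i in range(len(lines) - 1):
        message = lines[i].split(" ", 2)[2]
        result.append(((stamps[i + 1] - stamps[i]).total_seconds(), message))
    return result


log = [
    "2004-10-09 10:31:45.000113500 debug: generic: 3 else sort ngrams",
    "2004-10-09 10:31:49.693734500 debug: generic: 4 else sort ngrams is done",
    "2004-10-09 10:31:49.693854500 debug: generic: 5 splice sorted",
    "2004-10-09 10:31:59.008898500 debug: generic: 6 splice sorted is done",
    "2004-10-09 10:31:59.009022500 debug: generic: 7 return sorted to classify()",
    "2004-10-09 10:32:03.020713500 debug: generic: done running create_lm",
]

for seconds, message in step_durations(log):
    print(f"{seconds:8.3f}s  {message}")
```

For this run the sort, splice and return steps come out at roughly 4.7, 9.3 and 4.0 seconds.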
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 08:54 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 07:54:27AM -0700, [EMAIL PROTECTED] wrote:
> However, I don't know if the block usage remains the same when I make a
> cp -a dir dir.new. I tar-gzipped the following dir, it's now 11 MB. Do you
> want to

Hrm. I believe the issue is whether cp understands sparse files. The Linux cp
I have (it looks like you're using Linux) seems to say it possibly supports
sparse files based on a crude heuristic to determine if the file is sparse or
not. So to be safe, I wouldn't trust cp.

> have it? Can I upload it to an FTP or shall I provide it to you via ftp or
> scp?

If you can make it available, I'll grab it from you. I can make some ftp space
available if that's easier.

> crashes the machine eventually. I don't know of a way to limit this, AFAIK
> ulimit applies to logins and this won't work for a daemon, correct?)

ulimit applies to processes and their children. logins are simply a shell with
children procs. ;) (BTW: ulimit -c 0 is great for httpd and such to prevent
core files being written upon crash...)

>  46116 -rw-rw-rw-    1 root  www   47168872 Oct  8 16:54 bayes_journal
> 202004 -rw-rw-rw-    1 root  www  206639592 Oct  8 16:32 bayes_journal.old
>
>   File: `bayes_journal'
>   Size: 47168872   Blocks: 92232   IO Block: 4096   regular file
>   File: `bayes_journal.old'
>   Size: 206639592   Blocks: 404008   IO Block: 4096   regular file

Both of these seem to be non-sparse. In a cp version of the .old file, I'd
look for a bunch of text, then a bunch of nulls, then potentially a bunch more
text.
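Since resource limits are inherited by child processes, a launcher can set a cap just before starting a daemon. The Python sketch below shows the idea using the standard resource module; run_with_memory_cap and the memory-hungry child command are hypothetical and exist only for the demonstration. RLIMIT_AS is the address-space limit, the one the shell's ulimit -v adjusts. This is POSIX-only.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(argv, max_bytes):
    """Run argv with its address space capped at max_bytes.

    The limit is applied in the child just before exec, so it covers
    that process and anything it forks, but not the caller.
    """
    def set_limit():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=set_limit)


if __name__ == "__main__":
    # Hypothetical child that tries to allocate far more than the cap allows.
    hog = [sys.executable, "-c", "x = bytearray(1024 * 1024 * 1024)"]
    result = run_with_memory_cap(hog, 256 * 1024 * 1024)
    print("child exit status:", result.returncode)
```

A capped process fails its allocation instead of dragging the machine down. Picking a cap that normal operation never hits is left to the operator.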
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 08:57 ---
Subject: Re: massive memory consumption in spamd

On Sat, Oct 09, 2004 at 08:35:02AM -0700, [EMAIL PROTECTED] wrote:
> Hi Loren, the MIME decoder in SA3 should strip out in-line uuencoded just
> like it does on rfc822 attachments. The problem here is that this message
> has a malformed mime structure, causing SA3 and other tools i have tried
> (Ripmime) to

You're correct about the malformed part being the issue, but SA3 does not
specially handle uuencode. See bug 3278.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 09:23 ---
Thanks for the hint. I had a look at the copied bayes_journal and
bayes_journal.old files. They seem to contain stuff the normal bayes_journal
doesn't have. F.i. the bayes_journal of size 47168872 contains over 800.000
lines. Some of them look like the stuff I see in a normal bayes_journal (but
not exactly), but 813090 of the lines have an sa_generated. That's obviously
what blows them to the sky.

Examples:

normal bayes_journal: a few dozen or hundred lines like this:

t 1097337643 32578a6c95

blown_up bayes_journal: 99% of lines like this:

m h [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]

(it seems to start with thousands of these and eventually finishes with a mix
of the following)

the last 300 lines of the file are like this:
(... indicates I skipped lines of the same type)

...
t 1097247243 9826be506b
n 0 1
c 0 1 1097247243 65cb22ebd8
...
c 0 1 1097247243 850959cf3e
m h [EMAIL PROTECTED]
n 0 1
c 0 1 1097247243 65cb22ebd8
...
c 0 1 1097247243 850959cf3e
m h [EMAIL PROTECTED]
EOF

Does that already give you enough information or shall I still provide the
complete files?
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 09:32 ---
In addition to my last comment, I only noticed that now. The vast amount of
those sa_generated lines seems to come in doubles. F.i. the file starts with

m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]

and seems to go on this way almost up to the end.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 10:28 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 09:32:01AM -0700, [EMAIL PROTECTED] wrote:
> m h [EMAIL PROTECTED]
> m h [EMAIL PROTECTED]
> m h [EMAIL PROTECTED]
> m h [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
>
> and seems to go on this way almost up to the end.

Hrm. Those are message learn commands (m). Basically it says h/s for
ham/spam, and then the message-id, which are all sha1 hash generated based on
headers. Shouldn't be multiples together, and there should be token
information before it (iirc, n # # (change ham/spam count), c # # token atime
(learn token ham/spam, atime), m h/s msgid (message learned, ham/spam, msgid
to avoid double learning, etc.), in that order...)

Having the same entry multiple times (total) is ok, btw. sa-learn doesn't know
a message was learned already until the journal is synced, so the same message
can appear multiple times.
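A quick way to summarize a suspect journal along these lines is to tally records by their leading type letter and count "m" lines that repeat back to back. The Python sketch below assumes only what the thread describes (one record per line, the type in the first field); the function name and the sample message ids are invented for illustration.

```python
from collections import Counter


def tally_journal(lines):
    """Count journal records by type and spot immediately repeated 'm' lines.

    Record types as described in the thread:
      t = token atime update, n = ham/spam count change,
      c = token learn, m = message learned.
    Returns (Counter of types, number of 'm' lines identical to the line before).
    """
    kinds = Counter()
    repeated_m = 0
    previous = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        kinds[line.split(None, 1)[0]] += 1
        if line.startswith("m ") and line == previous:
            repeated_m += 1
        previous = line
    return kinds, repeated_m


if __name__ == "__main__":
    # Hypothetical sample; real message ids are redacted in the archive.
    sample = [
        "n 0 1",
        "c 0 1 1097247243 65cb22ebd8",
        "m h example-id-1",
        "m h example-id-1",
        "t 1097337643 32578a6c95",
    ]
    kinds, repeated = tally_journal(sample)
    print(dict(kinds), "repeated m lines:", repeated)
```

To run it on a real file, pass an open file object, for example tally_journal(open(path, errors="replace")).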
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 10:38 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 09:23:55AM -0700, [EMAIL PROTECTED] wrote:
> lines. Some of them look like the stuff I see in a normal bayes_journal
> (but not exactly), but 813090 of the lines have an sa_generated. That's
> obviously what blows them to the sky.

that's normal.

> normal bayes_journal: a few dozen or hundred lines like this:
> t 1097337643 32578a6c95

those are token atime updates. happens during scanning.

> blown_up bayes_journal: 99% of lines like this:
> m h [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
> m s [EMAIL PROTECTED]
> (it seems to start with thousands of these and eventually finishes with a
> mix of the following)

Hrm.

> the last 300 lines of the file are like this:
> (... indicates I skipped lines of the same type)
> ...
> t 1097247243 9826be506b
> n 0 1
> c 0 1 1097247243 65cb22ebd8
> ...
> c 0 1 1097247243 850959cf3e
> m h [EMAIL PROTECTED]

yeah, that's normal for learning to journal.

> Does that already give you enough information or shall I still provide the
> complete files?

I'd like to get the file if I can. If there's actually lines and no nulls,
it's not a truncate issue, so that's good. I'd like to see what they actually
look like though.
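Checking a copied journal for "lines and no nulls" can be automated by looking for the longest run of NUL bytes. This Python sketch reads the file in chunks so that a 200 MB journal does not have to fit in memory; longest_null_run is an illustrative helper, not part of SpamAssassin.

```python
import os
import re
import tempfile

NUL_RUN = re.compile(rb"\x00+")


def longest_null_run(path, chunk_size=1 << 20):
    """Return the length in bytes of the longest run of NUL bytes in a file."""
    longest = 0
    carry = 0  # length of the NUL run that reached the end of the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for match in NUL_RUN.finditer(chunk):
                run = match.end() - match.start()
                if match.start() == 0:
                    run += carry  # run continues from the previous chunk
                longest = max(longest, run)
            tail = len(chunk) - len(chunk.rstrip(b"\x00"))
            carry = carry + tail if tail == len(chunk) else tail
    return longest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "journal.copy")
        with open(path, "wb") as f:
            f.write(b"t 1097337643 32578a6c95\n" + b"\x00" * 4096 + b"n 0 1\n")
        print("longest NUL run:", longest_null_run(path))
```

A result of zero on the copied file would agree with the conclusion that this is not a truncate problem. A run of many kilobytes would point back at the sparse-file theory.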
[Bug 3803] logo contest output needed
http://bugzilla.spamassassin.org/show_bug.cgi?id=3803

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:23 ---
Created an attachment (id=2435)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2435&action=view)
SpamAssassin Web Button 1.0a
[Bug 3803] logo contest output needed
http://bugzilla.spamassassin.org/show_bug.cgi?id=3803

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:24 ---
Created an attachment (id=2436)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2436&action=view)
SpamAssassin Web Button 1.0b
[Bug 3803] logo contest output needed
http://bugzilla.spamassassin.org/show_bug.cgi?id=3803

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:29 ---
I just uploaded two web buttons for SpamAssassin. I don't know if those would work, because I remember from the logo design that SpamAssassin is one word, but after many tries I could not come up with a decent design where both the logo and SpamAssassin would fit in a small web button. So I divided the word.

Anyway, I wanted to at least have something uploaded that could be used right now, since it was taking too long.

Christian
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:39 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 11:10:07AM -0700, [EMAIL PROTECTED] wrote:
> n (1 line)
> c (many)
> m (1 line)
> t (many)
> here it should then start over with n.

well, the ncm (learn) and t (scan) are different operations, but otherwise, yeah.

> file maybe took an hour or less. Could be a problem with MailScanner - SA interaction, could it?

Possibly, it's hard to say. This is part of my problem with bugs reported when not using our provided code -- it's pretty hard for us to debug/reproduce issues in code we didn't write, we don't use, etc.

I don't see how multiple m lines from the same message would go into the journal without trying to learn the same message multiple times.
[Bug 3803] logo contest output needed
http://bugzilla.spamassassin.org/show_bug.cgi?id=3803

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:53 ---
Created an attachment (id=2437)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2437&action=view)
SpamAssassin Logo (transparent bg and transparent envelope borders)
[Bug 3803] logo contest output needed
http://bugzilla.spamassassin.org/show_bug.cgi?id=3803

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 11:54 ---
Created an attachment (id=2438)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2438&action=view)
SpamAssassin Logo Source (transparent bg and transparent envelope borders)
[Bug 3890] New: Change X-Spam-Level: character no longer supported in 3.0.0?
http://bugzilla.spamassassin.org/show_bug.cgi?id=3890

Summary: Change X-Spam-Level: character no longer supported in 3.0.0?
Product: Spamassassin
Version: 3.0.0
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P4
Component: Score Generation
AssignedTo: dev@spamassassin.apache.org
ReportedBy: [EMAIL PROTECTED]

It seems that when upgrading to 3.0.0 my preferred character (X) for spam level headers has changed to *, making all my scripts for logging and blocking mail with a high score break, since my mail server doesn't support an escaped *; it treats it as a wildcard, meaning no character or multiple characters.

I vote for re-supporting: spam_level_char X

//N
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 14:56 ---
Subject: Re: SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 01:38:56PM -0400, Theo Van Dinter wrote:
> I'd like to get the file if I can. If there's actually lines and no nulls, it's not a truncate issue, so that's good. I'd like to see what they actually look like though.

Hrm. I got a copy of an errored journal file:

       1096 c
    3562112 m
          4 n
        266 t

so yeah, the problem is the 3.5 million m lines. They do, by and large, all look duplicated for some reason. So the lines indicate ~1.8 million mails learned, but no corresponding ham/spam count updates or tokens, which is just wrong. I don't think you learned 1.8 million mails anyway. There is a ton of duplication. Individual msgid and repeat counts, respectively:

    106  33120
     16   3312
      1    736
      4    184
      2     72
     10      8
      3      4
      2      2

So something's up. I haven't seen this issue in normal spamd-type usage, so I'm tempted to blame MailScanner...

However, looking at the code, I found something that seems odd, and could very well cause the issue. In fact, allow me to go: OMG!

In 2.6x and 3.0, the sync_journal function (takes the journal data and updates the databases) calls seen_put (and seen_delete) to take the message id and store it in the database. BUT! seen_put and seen_delete check to see if learn_to_journal is set, and if so, defer the update to the journal! OMG OMG OMG!

I can even reproduce this in normal SA mode! If you use sa-learn --sync to sync the journal, the problem doesn't exist, for some reason. If you let auto-sync occur during SA runs, the behavior happens due to the reason above. I got the behavior by setting:

    bayes_learn_to_journal 1
    bayes_journal_max_size 1

then shoving messages through causes the problem to occur, and it actually keeps adding the same message over and over as well since seen is never updated. OMG!
So the easy solution is to either have a special sync seen_put and seen_delete, or kluge the learn_to_journal setting around the calls. I think the first is the right solution. Patch forthcoming, then please test it for me. :)

I have no idea how this hasn't been seen before. This code has been there for ages. The m code was new to 2.6, so it's been over a year. geez!
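To make the feedback loop described above easier to follow, here is a toy Python model of it. It is only a sketch under stated assumptions: ToyStore, learn, and the tuple-based journal are invented for illustration and are not SpamAssassin's Perl code; only the names seen_put, sync_journal, and learn_to_journal come from the comment above.

```python
class ToyStore:
    """Toy stand-in for the Bayes store (hypothetical, not SA's real code)."""

    def __init__(self, learn_to_journal):
        self.learn_to_journal = learn_to_journal
        self.journal = []  # pending records: ("m", flag, msgid)
        self.seen = {}     # msgid -> "h" or "s"

    def seen_put(self, msgid, flag):
        # Behaviour described in the report: with learn_to_journal set,
        # the update is deferred to the journal instead of the seen db.
        if self.learn_to_journal:
            self.journal.append(("m", flag, msgid))
        else:
            self.seen[msgid] = flag

    def learn(self, msgid, flag):
        # Skip messages already recorded as seen (avoids double learning).
        if msgid in self.seen:
            return False
        self.seen_put(msgid, flag)
        return True

    def sync_journal(self):
        # Drain the journal and apply each 'm' record via seen_put,
        # which (with learn_to_journal set) writes it straight back.
        pending, self.journal = self.journal, []
        for kind, flag, msgid in pending:
            if kind == "m":
                self.seen_put(msgid, flag)


store = ToyStore(learn_to_journal=True)
for _ in range(3):
    store.learn("msgid-a", "h")  # same message pushed through repeatedly
    store.sync_journal()
print(len(store.journal), store.seen)
```

In this model every sync re-queues the "m" records it just drained, and because seen is never filled in, the same message is accepted again on the next pass, so the journal grows by one copy per round.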
[Bug 3776] massive memory consumption in spamd
http://bugzilla.spamassassin.org/show_bug.cgi?id=3776

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 14:59 ---
Dallas, I know that you had said that the entire message was being sent to create_lm. I thought from Thunderbird's rendering of the message that would not be true and expressed surprise when I saw that you were correct about that. Thunderbird displays only that first picture as text in the message, and the other MIME parts show up as malformed attached jpeg files.

Theo, when I said it makes sense I meant that the running time for create_lm makes sense for an over 200K input given the other numbers I saw, not that it makes sense for this message to cause it to have that big an input.

I see several issues here:

1. Since TextCat behaves badly on large inputs that are within the size threshold we recommend for SpamAssassin, and since it doesn't need that much real text input to get a reliable result (anything over 1K works very well if that 1K does contain representative text in a modeled language), I do suggest limiting its input to, say 1 bytes. Would someone who is more of a perl expert than I comment on whether using create_lm(substr($input, 0, 1)) is the proper way to do that?

2. The memory blowup is of more concern to me than the time it takes. The code in create_lm is supposed to do the following: For every word in the input, where that is defined as delimited by digits and whitespace, count the occurrences of every length 1, 2, 3, 4, and 5 substring of the word with a start and end marker of character \000. Again, for the perl experts: Is there a better way of getting all those substrings into a hash table for counting without all the overhead of creating all the temporary strings and sorting and so on that the current code does?

3. This is coming from a malformed message. But is there something we could do to handle it better so that SpamAssassin would not put all of the message into what it thinks is the rendered message body?
If this fools other MUAs, perhaps there isn't something we can do, as we do need to duplicate behavior of MUAs, but what do MUAs do with this message? Thunderbird does not display all of it, what about Outlook Express and Eudora and some others? If in fact MUAs do not display this entire message as text, then SpamAssassin should not be using it all, no matter what Ripmime does.

4. We should see where the rest of the time is being spent in the processing of the large body in case there is another optimization to do, as it still is quite slow. It might just turn out that processing a 200K message body with our rules does take several seconds, but it would be good to take a careful look.
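As an illustration of points 1 and 2 above, the counting that create_lm is described as doing can be sketched in a few lines of Python. This is not the TextCat code and does not answer the Perl question; count_ngrams, MAX_INPUT, and the cap value are assumptions made up for the example (the cap suggested in the comment did not survive in this archive), and the slice below cuts by characters rather than bytes.

```python
import re
from collections import Counter

# Hypothetical cap, chosen only for this sketch.
MAX_INPUT = 10_000


def count_ngrams(text, max_input=MAX_INPUT, max_n=5):
    """Count all 1..max_n character substrings of each NUL-delimited word."""
    counts = Counter()
    # Truncate first so a huge body cannot blow up the table.
    truncated = text[:max_input]
    # Words are runs of characters delimited by digits and whitespace.
    for word in re.split(r"[0-9\s]+", truncated):
        if not word:
            continue
        marked = "\0" + word + "\0"  # start and end marker
        for n in range(1, max_n + 1):
            for i in range(len(marked) - n + 1):
                counts[marked[i:i + n]] += 1
    return counts


counts = count_ngrams("spam 2004 ham spam")
print(counts["spam"], counts["\0s"], counts["am\0"])
```

The table is filled in one pass with no intermediate sorting, and truncating before the split bounds the work by the cap rather than by the message size.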
[Bug 3890] Change X-Spam-Level: character no longer supported in 3.0.0?
http://bugzilla.spamassassin.org/show_bug.cgi?id=3890

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 15:00 ---
if you rtm, you'd see that:

    _STARS(*)_    one * (use any character) for each score point (note: this is limited to 50 'stars' to stay on the right side of the RFCs)

lets you change the character to anything you'd like.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
                 CC|                            |dev@spamassassin.apache.org
         AssignedTo|dev@spamassassin.apache.org |[EMAIL PROTECTED]
           Severity|normal                      |major
           Priority|P5                          |P3
   Target Milestone|Future                      |3.0.1

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 15:02 ---
taking the ticket, moving to 3.0.1 queue, etc.

--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.
You are on the CC list for the bug, or are watching someone who is.
[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

--- Additional Comments From [EMAIL PROTECTED] 2004-10-09 15:11 ---
Created an attachment (id=2439)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=2439&action=view)
suggested patch

I'm still stunned about this! OMG! Anyway, here's my patch. It makes a new seen_{put,delete}_direct function that is called directly from the sync_journal function. The normal seen_{put,delete} is still available for the public API, and calls *_direct appropriately, which I consider a private API just usable by the sync_journal function.
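The shape of that split can be shown with a small Python model. This is a sketch of the idea only, not the attached patch: PatchedToyStore, the tuple journal, and the "f" flag used here to mark a deletion record are assumptions for illustration; the seen_put, seen_delete, *_direct, and sync_journal names follow the comment above.

```python
class PatchedToyStore:
    """Toy model of the public/direct split (hypothetical, not SA's code)."""

    def __init__(self, learn_to_journal):
        self.learn_to_journal = learn_to_journal
        self.journal = []  # pending records: ("m", flag, msgid)
        self.seen = {}     # msgid -> "h" or "s"

    # Private helpers: always touch the seen db, never the journal.
    def seen_put_direct(self, msgid, flag):
        self.seen[msgid] = flag

    def seen_delete_direct(self, msgid):
        self.seen.pop(msgid, None)

    # Public API: may defer to the journal, otherwise calls *_direct.
    def seen_put(self, msgid, flag):
        if self.learn_to_journal:
            self.journal.append(("m", flag, msgid))
        else:
            self.seen_put_direct(msgid, flag)

    def seen_delete(self, msgid):
        if self.learn_to_journal:
            # "f" marks a deletion in this sketch (assumption).
            self.journal.append(("m", "f", msgid))
        else:
            self.seen_delete_direct(msgid)

    def sync_journal(self):
        # The sync path uses only the direct helpers, so nothing it
        # applies can be written back into the journal.
        pending, self.journal = self.journal, []
        for kind, flag, msgid in pending:
            if kind != "m":
                continue
            if flag == "f":
                self.seen_delete_direct(msgid)
            else:
                self.seen_put_direct(msgid, flag)


store = PatchedToyStore(learn_to_journal=True)
store.seen_put("msgid-a", "h")
store.sync_journal()
print(store.journal, store.seen)
```

With the sync path bypassing the public functions, a sync leaves the journal empty and the seen entry in place, so a later attempt to learn the same message can be recognised as a repeat.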
Re[2]: Spamd is a memory hog (?)
*** REPLY SEPARATOR ***

On 08.10.2004 at 10:52 Dallas L. Engelken wrote:
> I think the real-time auto expiry on bayes is causing some major issues.

Good job I'd say. SpamD suffered from these memory jumps under win32 (win2k) as well. Now, after reading your msg I've added

    bayes_auto_expire 0

to local.cf and let SpamAssassin do a --force-expire every night by a scheduler job. After two days of watching spamD I tend to think there might be some truth in your words, to say the least. SpamD now stays at or falls back to exactly the memory amount it took after scanning the first msgs, so it looks as if the leak is really somewhere in the auto-expiry code. Tnx for spotting this!

- Mailto: [EMAIL PROTECTED]
- No HTML mails please