Re: Suddenly load average of 15-18???
jdow wrote: From: Thomas Cameron [EMAIL PROTECTED] On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote: Usually a high load average means that a spamd child suddenly (or possibly slowly) got fat, and you are out of memory and thrashing to beat the band. [snip] Again, study what causes the problem. Experiment gently if you must to characterize it properly. Then solve it. Don't reboot. That just defers the problem. It's like paying blackmail money. The blackmailers never go away. And it's a constant drain. You will enjoy reading about the Dane-Geld; http://www.poetryloverspage.com/poets/kipling/dane_geld.html Or listen to it, as sung by Michael Longcor [snip] 4) Live happily ever after or at least until the next crisis, which most likely will not be a repeat of this one. This is one of the tricks of old age guile that allows us old folks to defeat youth and enthusiasm. {^_-} {^_^} -- Martin G. Diehl Visit my online gallery: Renderosity, a 3D Artist's Community http://www.renderosity.com/gallery.ez?ByArtist=YesArtist=MGD So much wisdom and knowledge -- so little time and bandwidth. --MGD Reality: That which remains after you stop thinking about it. --inspired by P. K. Dick
Suddenly load average of 15-18???
All - spamc is suddenly bringing my mail server to its knees. Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, dcc-1.3.0-0 and pyzor-0.4.0-0.

All of a sudden about two days ago spamc processes were chewing up the machine - sendmail was actually rejecting messages because the load average was so high! This is a machine that is only used for about 6 users... It only handles around a thousand to two thousand messages a day. I am the only admin on it and nothing has changed.

Here is my local.cf:

--- begin ---
required_score 5
report_safe 1
rewrite_header subject **SPAM** _SCORE_
ok_languages en
ok_locales en
use_dcc 1
use_pyzor 1
use_razor2 1
whitelist_from_rcvd [EMAIL PROTECTED]
whitelist_from_rcvd [EMAIL PROTECTED]
score ALL_TRUSTED 0 0 0 0
--- end ---

Here are the relevant lines from my sendmail.mc:

--- begin ---
INPUT_MAIL_FILTER(`greylist',`S=local:/var/milter-greylist/milter-greylist.sock')dnl
define(`confMILTER_MACROS_HELO', `{verify}, {cert_subject}')dnl
define(`confMILTER_MACROS_ENVFROM', `i, {auth_authen}')dnl
INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamass.sock, F=, T=C:15m;S:4m;R:4m;E:10m')dnl
define(`confMILTER_MACROS_CONNECT',`b, j, _, {daemon_name}, {if_name}, {if_addr}')dnl
INPUT_MAIL_FILTER(`clamav-milter', `S=local:/var/run/clamav/clamav-milter.sock, F=T,T=S:4m;R:4m;E:10m')
--- end ---

I have no idea why it is doing this... It was working fine and then this happened sort of out of the blue. Any pointers?

Thanks!
Thomas
Re: Suddenly load average of 15-18???
Take a look at the switches you have in /etc/init.d/spamassassin and change them to run only 5 processes and to die off after 15 or twenty scans.

-m5 --max-conn-per-child=5

Steve

Thomas Cameron wrote: All - spamc is suddenly bringing my mail server to its knees. [snip]
Re: Suddenly load average of 15-18???
Usually a high load average means that a spamd child suddenly (or possibly slowly) got fat, and you are out of memory and thrashing to beat the band. The two most common causes of this seem to be Bayes expiry runs and Awl expiry runs. Sometimes though it can seemingly happen from some unknown sequence of mail messages. How many children are you running? What is the max lifetime (messages processed) per child? Limiting to probably 5 children, or maybe even less in your case with so few users, and limiting to maybe 20-100 connections per child will probably work around your problems. Oh, I'm assuming you have at least 512M or so. If not, you might want to cut down to only a couple of children, and definitely go with the lower number of connections per child. Loren
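One rough way to test the out-of-memory theory is to look at how much swap is in use. The sketch below is an illustration only, not something posted in this thread: it parses text in the format of Linux's /proc/meminfo (the MemTotal, MemFree, SwapTotal and SwapFree fields) and reports the swap in use. The function names and the sample figures are made up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


# Hypothetical sample for a 512 MB machine that is deep into swap.
SAMPLE = """\
MemTotal:       514000 kB
MemFree:          6200 kB
SwapTotal:     1048568 kB
SwapFree:       702340 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("swap in use: %d kB" % swap_used_kb(info))
```

On the live box you would pass open("/proc/meminfo").read() instead of SAMPLE. A large and growing figure, together with MemFree near zero, would fit the thrashing explanation, but it does not by itself say which process is to blame.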
RE: [SOLVED] Re: Suddenly load average of 15-18???
From: Thomas Cameron [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 12, 2005 11:38 AM
To: spamassassin-users; spamass-milt-list@nongnu.org
Subject: [SOLVED] Re: Suddenly load average of 15-18???

OK, this is a weird solution... I rebooted the server and all the problems went away. It's chuffing along happily now. Memory leak, maybe?

What kind of hardware? Are you scanning zips? I had to just start blocking zip attachments altogether until these viruses settle down a bit.

.jon
Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 11:19 -0400, Stephen M. Przepiora wrote:

Take a look at the switches you have in /etc/init.d/spamassassin and change them to run only 5 processes and to die off after 15 or twenty scans. -m5 --max-conn-per-child=5 Steve

I just tried that and as soon as I restarted everything the load shot up to ~ 6. I had to kill everything and remove the SA milter. I'd like to figure out what the root cause is rather than band-aid the symptom. Anyone have any ideas why this would suddenly start?

Thomas
Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote: Usually a high load average means that a spamd child suddenly (or possibly slowly) got fat, and you are out of memory and thrashing to beat the band. The two most common causes of this seem to be Bayes expiry runs and Awl expiry runs. Sometimes though it can seemingly happen from some unknown sequence of mail messages. Is there something I should/could do about these expiry runs? It seems odd that it's been like this for a couple of days now... How could I know that this was the issue? How many children are you running? What is the max lifetime (messages processed) per child? Limiting to probably 5 children, or maybe even less in your case with so few users, and limiting to maybe 20-100 connections per child will probably work around your problems. My rc file has this: SPAMDOPTIONS=-d -c -m5 --max-conn-per-child=5 -H I just added the --max-conn-per-child=5 per Stephen Przepiora's suggestion but that didn't seem to help. Oh, I'm assuming you have at least 512M or so. If not, you might want to cut down to only a couple of children, and definitely go with the lower number of connections per child. Yes, I have 512M. As I said - this has been working flawlessly since the server was installed several weeks ago. It just suddenly went bonkers a couple of days ago. Thomas
Re: Suddenly load average of 15-18???
Hi,

Thomas Cameron wrote:

I just tried that and as soon as I restarted everything the load shot up to ~ 6. I had to kill everything and remove the SA milter. I'd like to figure out what the root cause is rather than band-aid the symptom. Anyone have any ideas why this would suddenly start? Thomas

Do you use the sa-blacklist? I've recently had problems with it. My load was getting very high.

Greets
Christoph
Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 18:10 +0200, Christoph Petersen wrote:

Hi, Thomas Cameron wrote: I just tried that and as soon as I restarted everything the load shot up to ~ 6. I had to kill everything and remove the SA milter. I'd like to figure out what the root cause is rather than band-aid the symptom. Anyone have any ideas why this would suddenly start?

Do you use the sa-blacklist? I've recently had problems with it. My load was getting very high.

I have done nothing past the initial installation and adding spamass-milter... This is about as vanilla an installation as you can get.

Thomas
[SOLVED] Re: Suddenly load average of 15-18???
OK, this is a weird solution... I rebooted the server and all the problems went away. It's chuffing along happily now. Memory leak, maybe? Thomas
RE: [SOLVED] Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 11:46 -0500, Jon Dossey wrote:

From: Thomas Cameron [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 12, 2005 11:38 AM
To: spamassassin-users; spamass-milt-list@nongnu.org
Subject: [SOLVED] Re: Suddenly load average of 15-18???

OK, this is a weird solution... I rebooted the server and all the problems went away. It's chuffing along happily now. Memory leak, maybe?

What kind of hardware? Are you scanning zips? I had to just start blocking zip attachments altogether until these viruses settle down a bit. .jon

It's just a plain Jane P-III 800MHz with 512MB memory on a 7-disk RAID 5 Ultra 160 SCSI array. I have not disabled scanning of zip files. It is running just fine now. Very odd.

Thomas
Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 10:53 -0500, Dan Nelson wrote:

In the last episode (May 12), Thomas Cameron said:

spamc is suddenly bringing my mail server to its knees. Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, dcc-1.3.0-0 and pyzor-0.4.0-0. All of a sudden about two days ago spamc processes were chewing up the machine - sendmail was actually rejecting messages because the load average was so high! This is a machine that is only used for about 6 users... It only handles around a thousand to two thousand messages a day. I am the only admin on it and nothing has changed.

What's the average processing time for a message, and are you using any -i flags on your spamass-milter command line? Grep your maillog for "in .* seconds" to get the timings. If they're all under 10 seconds or so and you're not using -i, check for things like mail loops, or large outgoing mail bursts.

It was up around 50-60 seconds per message. I rebooted the machine and it has cleared up. Thanks for the help!

Thomas
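The grep Dan describes can be taken one step further by pulling the numbers out and summarizing them. This is a sketch only: it assumes spamd result lines contain "in N seconds", and the sample lines below are invented for illustration rather than taken from Thomas's maillog.

```python
import re

# Matches the "in 52.0 seconds" part of a spamd result line.
TIMING_RE = re.compile(r"\bin (\d+(?:\.\d+)?) seconds")


def scan_times(lines):
    """Return the scan times (seconds, as floats) found in spamd log lines."""
    times = []
    for line in lines:
        if "spamd" not in line:
            continue  # ignore sendmail, milter and other daemons
        match = TIMING_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def summarize(times):
    """Return (count, average, maximum); (0, 0.0, 0.0) if there are none."""
    if not times:
        return (0, 0.0, 0.0)
    return (len(times), sum(times) / len(times), max(times))


# Invented sample lines, in the general shape of spamd's syslog output.
SAMPLE_LOG = [
    "May 12 10:01:07 mail spamd[2211]: clean message (1.2/5.0) for root:0 in 52.0 seconds, 3120 bytes.",
    "May 12 10:01:40 mail spamd[2213]: identified spam (14.8/5.0) for root:0 in 58.0 seconds, 9044 bytes.",
    "May 12 10:02:02 mail sendmail[2250]: j4CF1x: from=<a@example.com>, size=3120",
]

if __name__ == "__main__":
    count, average, worst = summarize(scan_times(SAMPLE_LOG))
    print("%d scans, average %.1f s, worst %.1f s" % (count, average, worst))
```

Feeding it the real maillog, oldest lines first, also shows when the times jumped, which is the point at which to start looking for a cause.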
Re: Suddenly load average of 15-18???
symptom. Anyone have any ideas why this would suddenly start? Running Awl? Running Bayes? Since it starts immediately, it sounds like a large expiry run for one or the other of them. If you aren't running either, then this may be the area where nobody really knows what is going wrong. Loren
RE: [SOLVED] Re: Suddenly load average of 15-18???
From: Thomas Cameron [mailto:[EMAIL PROTECTED]
To: spamassassin-users
Subject: RE: [SOLVED] Re: Suddenly load average of 15-18???

On Thu, 2005-05-12 at 11:46 -0500, Jon Dossey wrote:

From: Thomas Cameron [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 12, 2005 11:38 AM
To: spamassassin-users; spamass-milt-list@nongnu.org
Subject: [SOLVED] Re: Suddenly load average of 15-18???

OK, this is a weird solution... I rebooted the server and all the problems went away. It's chuffing along happily now. Memory leak, maybe?

What kind of hardware? Are you scanning zips? I had to just start blocking zip attachments altogether until these viruses settle down a bit. .jon

It's just a plain Jane P-III 800MHz with 512MB memory on a 7-disk RAID 5 Ultra 160 SCSI array. I have not disabled scanning of zip files. It is running just fine now. Very odd.

This may only be a temporary fix. Personally, rebooting a Linux machine to solve a problem just isn't acceptable. Did you try restarting spamd before rebooting?

I'd go through your maillog, and check the spamassassin processing times, and see if you can pinpoint where the processing time shoots up. Then, go through your mqueue and take a look at the offending message.

.jon
Re: Suddenly load average of 15-18???
Is there something I should/could do about these expiry runs? It seems odd that it's been like this for a couple of days now... How could I know that this was the issue?

Um, this isn't my area of expertise. I suspect Matt or Justin will be along with a workable suggestion fairly soon. I'm pretty sure that there is some logging to indicate when an expiry run happens, but I don't know precisely what to look for.

At least with bayes there is a way you can turn off the auto-expire and then use a cron job to schedule a manual expiry once a day/week/whatever. I'm not sure if similar functionality exists for awl.

Did you happen to notice if all of your spamd children get fat at once, or if just one of them got really huge? All of them getting big might indicate something changed with your rules files. A single fat child would be more indicative of an expiry run.

Loren
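For Bayes, the setting in question is the bayes_auto_expire option in local.cf, and a manual expiry is run with sa-learn --force-expire. As a sketch (the helper name and the cron schedule in the comment are assumptions, not something posted in this thread), a few lines of Python can confirm what a given local.cf currently says:

```python
def bayes_auto_expire_enabled(local_cf_text):
    """Return the effective bayes_auto_expire value from local.cf text.

    The option is on unless it is set to 0. This sketch assumes that a
    later line overrides an earlier one, so the last setting wins.
    """
    enabled = True  # on by default when the option is not mentioned
    for raw in local_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) == 2 and fields[0] == "bayes_auto_expire":
            enabled = fields[1] != "0"
    return enabled


# With auto-expiry off, a nightly cron entry along these lines would
# do the expiry at a quiet hour (the schedule is only an example):
#   30 3 * * *  sa-learn --force-expire
```

The cron job would need to run as whichever user owns the Bayes database that spamd uses; which user that is depends on how spamd is started on your system.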
RE: [SOLVED] Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 12:20 -0500, Jon Dossey wrote: This may only be a temporary fix. Personally, rebooting a linux machine to solve a problem just isn't acceptable. Did you try restarting spamd before rebooting? Several times. I restarted the entire mail suite - sendmail, clam, SA, milter-greylist, etc. I'd go through your maillog, and check the spamassassin processing times, and see if you can pinpoint where the processing time shoots up. Then, go through your mqueue and take a look at the offending message. It wasn't just one message. It was every message. Thomas
Re: Suddenly load average of 15-18???
On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:

Is there something I should/could do about these expiry runs? It seems odd that it's been like this for a couple of days now... How could I know that this was the issue?

Um, this isn't my area of expertise. I suspect Matt or Justin will be along with a workable suggestion fairly soon. I'm pretty sure that there is some logging to indicate when an expiry run happens, but I don't know precisely what to look for.

OK, I'll look for that.

At least with bayes there is a way you can turn off the auto-expire and then use a cron job to schedule a manual expiry once a day/week/whatever. I'm not sure if similar functionality exists for awl.

I don't know either.

Did you happen to notice if all of your spamd children get fat at once, or if just one of them got really huge? All of them getting big might indicate something changed with your rules files. A single fat child would be more indicative of an expiry run. Loren

It didn't really look like any of them were really fat... The machine's drives just started hammering and the load average shot up. It's all cleared up now after a reboot.

Thomas
Re: [SOLVED] Re: Suddenly load average of 15-18???
Thomas Cameron wrote: On Thu, 2005-05-12 at 12:20 -0500, Jon Dossey wrote: I'd go through your maillog, and check the spamassassin processing times, and see if you can pinpoint where the processing time shoots up. Then, go through your mqueue and take a look at the offending message. It wasn't just one message. It was every message. I think what he's getting at is that one message can consume enough CPU and memory to bog down processing all the other ones, too. I saw this with spamd and large attachments, before I started bypassing large messages around spamassassin.
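The bypass idea comes down to a size test made before the message is handed to the filter. A minimal sketch, assuming a 256 KB cut-off (the threshold and the function name are made up for illustration; spamc also has a -s max_size option for this purpose, so check the man page for your version):

```python
def partition_by_size(sizes, max_scan_bytes=256 * 1024):
    """Split message sizes (in bytes) into (to_scan, to_bypass) lists.

    Messages larger than max_scan_bytes skip the spam filter entirely,
    so one huge attachment cannot tie up a spamd child.
    """
    to_scan = [size for size in sizes if size <= max_scan_bytes]
    to_bypass = [size for size in sizes if size > max_scan_bytes]
    return to_scan, to_bypass


if __name__ == "__main__":
    # Hypothetical queue: two ordinary mails and one 4 MB attachment.
    scan, bypass = partition_by_size([3120, 9044, 4 * 1024 * 1024])
    print("scan %d message(s), bypass %d" % (len(scan), len(bypass)))
```

The trade-off is that large spam gets through unscanned, which is usually acceptable because most spam is small.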
Re: Suddenly load average of 15-18???
From: Thomas Cameron [EMAIL PROTECTED] On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote: Usually a high load average means that a spamd child suddenly (or possibly slowly) got fat, and you are out of memory and thrashing to beat the band. The two most common causes of this seem to be Bayes expiry runs and Awl expiry runs. Sometimes though it can seemingly happen from some unknown sequence of mail messages. Is there something I should/could do about these expiry runs? It seems odd that it's been like this for a couple of days now... How could I know that this was the issue? How many children are you running? What is the max lifetime (messages processed) per child? Limiting to probably 5 children, or maybe even less in your case with so few users, and limiting to maybe 20-100 connections per child will probably work around your problems. My rc file has this: SPAMDOPTIONS=-d -c -m5 --max-conn-per-child=5 -H I just added the --max-conn-per-child=5 per Stephen Przepiora's suggestion but that didn't seem to help. Oh, I'm assuming you have at least 512M or so. If not, you might want to cut down to only a couple of children, and definitely go with the lower number of connections per child. Yes, I have 512M. As I said - this has been working flawlessly since the server was installed several weeks ago. It just suddenly went bonkers a couple of days ago. I read your solved remark with some bemusement. Hammering the machine over the head to solve this sort of problem is just not the way it's done in the 'nix world. I suspect you have not really found the reason yet. If you administer that machine with KDE or GNOME running and have five spamds allowed you are overloading the machine driving it into virtual memory thrashing. Cut down the number of spamds to perhaps 3, -m3. Each spamd here with 3.02 gets up to about 60 megabytes before it is harvested by max connections and a new one created. Five of those uses up a lot of memory, to be sure. I have X running here. 
But I have a gigabyte of memory in the machine. I mostly manage to stay out of swap so VM doesn't thrash.

The thing you really needed to do and seem to have not done is isolate exactly what is causing the problem. Hammering it with a reboot just means you get to reboot often. If you spend the time to figure out what resource was exhausted on your machine and what was the chief villain with regard to exhausting that resource then you can work to mitigate the problem. And you can enjoy many year long uptimes unless you have to update the kernel. It saves wear and tear on you, freeing you to apply the same principles to solve other problems that might appear. It also frees the time to be proactive about the problems that might appear.

As my first paragraph implies I suspect memory is the resource and spamd coupled with KDE or GNOME might be the problem. It is quite sufficient to drive the machine to the edge. And any OS gets pokey when you get to the edge.

The machine that has SA 2.63 on it is a 66 MHz Pentium with 256 megs of memory. It takes nearly a couple of minutes to scan a message. It sits in console mode. It handles DNS and the firewall as well as the email. It can handle the 1200 to 1500 emails per day that Loren and I were getting while I was still on that machine.

I have since installed 3.02 on a spare Linux machine, my pet computer toy, and put my email filtering over on it. I get on the order of a total of 1000 messages a day. It handles them at under 1.5% of its potential. It has a gigabyte of memory so X's requirements are not a threat to the email filtering. Everything runs fast. I also tuned the number of spamds and connections per spamd to use only a reasonable chunk of the machine. (I untuned it recently to test a fix for a scoring bug in 3.02. It probably is time to reduce the -m value. I don't NEED it as high as I have it now. {^_-})

Again, study what causes the problem. Experiment gently if you must to characterize it properly. Then solve it.
Don't reboot. That just defers the problem. It's like paying blackmail money. The blackmailers never go away. And it's a constant drain.

1) What resource is becoming saturated? It's not always obvious when you first look at the problem. Dig to find the real bottleneck. (If a small 66MHz machine can handle nearly the volume I believe you cited then time is not where you want to look on a machine ten times faster.)
2) Find what is consuming overmuch of that resource.
3) Mitigate the excessive resource usage.
4) Live happily ever after or at least until the next crisis, which most likely will not be a repeat of this one.

This is one of the tricks of old age guile that allows us old folks to defeat youth and enthusiasm. {^_-}

{^_^}
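The memory argument in this message can be put into numbers. The sketch below uses the figures quoted in the thread (512 MB of RAM, up to five spamd children, roughly 60 MB per fully grown child); the 250 MB allowance for a desktop session plus sendmail, clamd and the rest is purely an assumption for illustration, as are the function names.

```python
def memory_headroom_mb(total_mb, children, per_child_mb, other_mb=0):
    """Memory left (MB) once every spamd child has grown to full size."""
    return total_mb - children * per_child_mb - other_mb


def max_children(total_mb, per_child_mb, other_mb=0):
    """Largest child count that still fits in RAM (never below zero)."""
    return max(0, (total_mb - other_mb) // per_child_mb)


if __name__ == "__main__":
    # spamd alone: 512 - 5 * 60 leaves 212 MB for everything else.
    print(memory_headroom_mb(512, 5, 60))
    # Add an assumed 250 MB for X, sendmail, clamd: the box is 38 MB short.
    print(memory_headroom_mb(512, 5, 60, other_mb=250))
    # With -m3 the same machine keeps 82 MB spare.
    print(memory_headroom_mb(512, 3, 60, other_mb=250))
```

A negative result means the difference has to come out of swap, which is where the thrashing starts. Measuring the real per-child size and the real "everything else" figure on the affected machine is what turns this from a guess into a diagnosis.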
Re: Suddenly load average of 15-18???
From: Thomas Cameron [EMAIL PROTECTED]

On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:

Is there something I should/could do about these expiry runs? It seems odd that it's been like this for a couple of days now... How could I know that this was the issue?

Um, this isn't my area of expertise. I suspect Matt or Justin will be along with a workable suggestion fairly soon. I'm pretty sure that there is some logging to indicate when an expiry run happens, but I don't know precisely what to look for.

OK, I'll look for that.

At least with bayes there is a way you can turn off the auto-expire and then use a cron job to schedule a manual expiry once a day/week/whatever. I'm not sure if similar functionality exists for awl.

I don't know either.

Loren's suggestion is likely a very good one. top is a nice way to find out WHAT is consuming the time. I do note that I do not use automatic learning or whitelisting here. (Me paranoid. Me not trust 'em. So me feed sa-learn manually. Me get outstanding results. Me happy. {^_-})

Did you happen to notice if all of your spamd children get fat at once, or if just one of them got really huge? All of them getting big might indicate something changed with your rules files. A single fat child would be more indicative of an expiry run. Loren

It didn't really look like any of them were really fat... The machine's drives just started hammering and the load average shot up. It's all cleared up now after a reboot.

For how long? You did not SOLVE the problem. You paid its blackmail. {^_-}
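Besides watching top, the "one fat child or all of them?" question can be answered from a ps listing. This is a sketch under assumptions: it expects output shaped like that of ps -eo pid,rss,comm, it assumes the children show up with the command name spamd (if they appear as perl on your system, adjust the match), and the sample listing is invented.

```python
def spamd_rss_kb(ps_output):
    """Return [(pid, rss_kb), ...] for spamd processes, largest first.

    Expects a header line followed by "PID RSS COMMAND" rows.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].strip() == "spamd":
            rows.append((int(fields[0]), int(fields[1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented listing: two ordinary children and one that has ballooned.
SAMPLE_PS = """\
  PID   RSS COMMAND
 2201 21480 spamd
 2211 24012 spamd
 2213 181344 spamd
 2250  3100 sendmail
"""

if __name__ == "__main__":
    for pid, rss in spamd_rss_kb(SAMPLE_PS):
        print("spamd pid %d: %d kB resident" % (pid, rss))
```

One child far larger than its siblings would fit the expiry-run explanation given earlier in the thread; all of them large at once would point toward the rules files instead.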