Re: Suddenly load average of 15-18???

2005-05-13 Thread Martin G. Diehl
jdow wrote:
From: Thomas Cameron [EMAIL PROTECTED]
On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote:
Usually a high load average means that a spamd child suddenly 
(or possibly slowly) got fat, and you are out of memory and 
thrashing to beat the band.
[snip]
Again, study what causes the problem. Experiment gently if you must
to characterize it properly. Then solve it. Don't reboot. That just
defers the problem. It's like paying blackmail money. 
The blackmailers never go away. And it's a constant drain.
You will enjoy reading about the Dane-Geld;
http://www.poetryloverspage.com/poets/kipling/dane_geld.html
Or listen to it, as sung by Michael Longcor
[snip]
4) Live happily ever after or at least until the next crisis, which
most likely will not be a repeat of this one.
This is one of the tricks of old age guile that allows us old folks to
defeat youth and enthusiasm. {^_-}
{^_^}
--
Martin G. Diehl
Visit my online gallery: Renderosity, a 3D Artist's Community
http://www.renderosity.com/gallery.ez?ByArtist=YesArtist=MGD
So much wisdom and knowledge -- so little time and bandwidth.
--MGD
Reality: That which remains after you stop thinking about it.
--inspired by P. K. Dick


Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
All -
spamc is suddenly bringing my mail server to its knees.
Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and 
spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, 
dcc-1.3.0-0 and pyzor-0.4.0-0.

All of a sudden about two days ago spamc processes were chewing up the 
machine - sendmail was actually rejecting messages because the load average 
was so high!  This is a machine that is only used for about 6 users...  It 
only handles around a thousand to two thousand messages a day.  I am the 
only admin on it and nothing has changed.

Here is my local.cf:
--- begin ---
required_score 5
report_safe 1
rewrite_header subject **SPAM** _SCORE_
ok_languages en
ok_locales en
use_dcc 1
use_pyzor 1
use_razor2 1
whitelist_from_rcvd [EMAIL PROTECTED]
whitelist_from_rcvd [EMAIL PROTECTED]
score ALL_TRUSTED 0 0 0 0
--- end ---
Here are the relevant lines from my sendmail.mc:
--- begin ---
INPUT_MAIL_FILTER(`greylist',`S=local:/var/milter-greylist/milter-greylist.sock')dnl
define(`confMILTER_MACROS_HELO', `{verify}, {cert_subject}')dnl
define(`confMILTER_MACROS_ENVFROM', `i, {auth_authen}')dnl
INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamass.sock, F=, T=C:15m;S:4m;R:4m;E:10m')dnl
define(`confMILTER_MACROS_CONNECT',`b, j, _, {daemon_name}, {if_name}, {if_addr}')dnl

INPUT_MAIL_FILTER(`clamav-milter', `S=local:/var/run/clamav/clamav-milter.sock, F=T,T=S:4m;R:4m;E:10m')

--- end ---
I have no idea why it is doing this...  It was working fine and then this 
happened sort of out of the blue.  Any pointers?

Thanks!
Thomas 



Re: Suddenly load average of 15-18???

2005-05-12 Thread Stephen M. Przepiora
Take a look at the switches you have in /etc/init.d/spamassassin and
change them to run only 5 processes, each dying off after 15 or 20 scans:
-m5 --max-conn-per-child=5
Steve





Re: Suddenly load average of 15-18???

2005-05-12 Thread Loren Wilton
Usually a high load average means that a spamd child suddenly (or possibly
slowly) got fat, and you are out of memory and thrashing to beat the band.
The two most common causes of this seem to be Bayes expiry runs and Awl
expiry runs.  Sometimes though it can seemingly happen from some unknown
sequence of mail messages.

How many children are you running?  What is the max lifetime (messages
processed) per child?  Limiting to probably 5 children, or maybe even less
in your case with so few users, and limiting to maybe 20-100 connections per
child will probably work around your problems.

Oh, I'm assuming you have at least 512M or so.  If not, you might want to
cut down to only a couple of children, and definitely go with the lower
number of connections per child.

Loren



RE: [SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread Jon Dossey
 From: Thomas Cameron [mailto:[EMAIL PROTECTED]
 Sent: Thursday, May 12, 2005 11:38 AM
 To: spamassassin-users; spamass-milt-list@nongnu.org
 Subject: [SOLVED] Re: Suddenly load average of 15-18???
 
 OK, this is a weird solution...  I rebooted the server and all the
 problems went away.  It's chuffing along happily now.
 
 Memory leak, maybe?


What kind of hardware?  Are you scanning zips?  I had to just start
blocking zip attachments altogether until these viruses settle down a
bit.


.jon



Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 11:19 -0400, Stephen M. Przepiora wrote:
 Take a look at the switches you have in /etc/init.d/spamassassin change 
 them to only run 5 processess and to die off after 15 or twenty scans.
 -m5 --max-conn-per-child=5
 Steve

I just tried that and as soon as I restarted everything the load shot up
to ~ 6.  I had to kill everything and remove the SA milter.

I'd like to figure out what the root cause is rather than band-aid the
symptom.  Anyone have any ideas why this would suddenly start?

Thomas



Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote: 
 Usually a high load average means that a spamd child suddenly (or possibly
 slowly) got fat, and you are out of memory and thrashing to beat the band.
 The two most common causes of this seem to be Bayes expiry runs and Awl
 expiry runs.  Sometimes though it can seemingly happen from some unknown
 sequence of mail messages.

Is there something I should/could do about these expiry runs?  It seems
odd that it's been like this for a couple of days now...  How could I
know that this was the issue?

 How many children are you running?  What is the max lifetime (messages
 processed) per child?  Limiting to probably 5 children, or maybe even less
 in your case with so few users, and limiting to maybe 20-100 connections per
 child will probably work around your problems.

My rc file has this:

SPAMDOPTIONS="-d -c -m5 --max-conn-per-child=5 -H"

I just added the --max-conn-per-child=5 per Stephen Przepiora's
suggestion but that didn't seem to help.

 Oh, I'm assuming you have at least 512M or so.  If not, you might want to
 cut down to only a couple of children, and definitely go with the lower
 number of connections per child.

Yes, I have 512M.  As I said - this has been working flawlessly since
the server was installed several weeks ago.  It just suddenly went
bonkers a couple of days ago.

Thomas



Re: Suddenly load average of 15-18???

2005-05-12 Thread Christoph Petersen
Hi,

Thomas Cameron schrieb:
 I just tried that and as soon as I restarted everything the load shot up
 to ~ 6.  I had to kill everything and remove the SA milter.
 
 I'd like to figure out what the root cause is rather than band-aid the
 symptom.  Anyone have any ideas why this would suddenly start?
 

Do you use the sa-blacklist? I've recently had problems with it. My load
was getting very high.

 Thomas

Greets
Christoph




Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 18:10 +0200, Christoph Petersen wrote:
 Hi,
 
 Thomas Cameron schrieb:
  I just tried that and as soon as I restarted everything the load shot up
  to ~ 6.  I had to kill everything and remove the SA milter.
  
  I'd like to figure out what the root cause is rather than band-aid the
  symptom.  Anyone have any ideas why this would suddenly start?
  
 
 Do you use the sa-blacklist? I've recently had problems with it. My load
 was getting very high.

I have done nothing past the initial installation and adding spamass-
milter...  This is about as vanilla an installation as you can get.

Thomas



[SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
OK, this is a weird solution...  I rebooted the server and all the
problems went away.  It's chuffing along happily now.

Memory leak, maybe?

Thomas



RE: [SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 11:46 -0500, Jon Dossey wrote:
  From: Thomas Cameron [mailto:[EMAIL PROTECTED]
  Sent: Thursday, May 12, 2005 11:38 AM
  To: spamassassin-users; spamass-milt-list@nongnu.org
  Subject: [SOLVED] Re: Suddenly load average of 15-18???
  
  OK, this is a weird solution...  I rebooted the server and all the
  problems went away.  It's chuffing along happily now.
  
  Memory leak, maybe?
 
 
 What kind of hardware?  Are you scanning zips?  I had to just start
 blocking zip attachments altogether until these viruses settle down a
 bit.
 
 
 .jon
 


It's just a plain Jane P-III 800MHz with 512MB memory on a 7-disk RAID 5
Ultra 160 SCSI array.  I have not disabled scanning of zip files.

It is running just fine now.  Very odd.

Thomas



Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 10:53 -0500, Dan Nelson wrote:
 In the last episode (May 12), Thomas Cameron said:
  spamc is suddenly bringing my mail server to its knees.
  
  Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and 
  spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, 
  dcc-1.3.0-0 and pyzor-0.4.0-0.
  
  All of a sudden about two days ago spamc processes were chewing up
  the machine - sendmail was actually rejecting messages because the
  load average was so high!  This is a machine that is only used for
  about 6 users...  It only handles around a thousand to two thousand
  messages a day.  I am the only admin on it and nothing has changed.
 
 What's the average processing time for a message, and are you using any
 -i flags on your spamass-milter commandline?  Grep your maillog for
 "in .* seconds," to get the timings.  If they're all under 10 seconds
 or so and you're not using -i, check for things like mail loops, or
 large outgoing mail bursts.

It was up around 50-60 seconds per message.  I rebooted the machine and
it has cleared up.

Thanks for the help!

Thomas
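
One way to pull the per-message timings Dan describes out of the maillog is sketched below. It assumes spamd result lines end in something like "in 2.5 seconds, 2048 bytes."; the exact wording may differ between versions. The function name and the sample log lines are made up for illustration.

```python
import re

# Assumed tail of a spamd result line: "... in 52.4 seconds, 3810 bytes."
TIMING = re.compile(r"\bin (\d+(?:\.\d+)?) seconds,")

def scan_times(lines):
    """Return the scan time in seconds from every line that carries one."""
    return [float(m.group(1)) for line in lines if (m := TIMING.search(line))]

# Made-up sample lines; in practice, iterate over open("/var/log/maillog").
sample = [
    "May 12 10:01:02 mail spamd[2211]: clean message (0.3/5.0) for root:0 in 2.5 seconds, 2048 bytes.",
    "May 12 10:03:40 mail spamd[2211]: identified spam (11.2/5.0) for root:0 in 57.5 seconds, 4096 bytes.",
    "May 12 10:03:41 mail sendmail[2300]: j4CF3ef1002300: Milter: to=<user@example.com>",
]

times = scan_times(sample)
print(f"{len(times)} scans, mean {sum(times) / len(times):.1f}s, max {max(times):.1f}s")
```

If nearly every scan sits far above the 10 seconds Dan mentions, the slowdown is system-wide rather than tied to one message.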



Re: Suddenly load average of 15-18???

2005-05-12 Thread Loren Wilton
 symptom.  Anyone have any ideas why this would suddenly start?

Running Awl?  Running Bayes?  Since it starts immediately, it sounds like a
large expiry run for one or the other of them.  If you aren't running
either, then this may be the area where nobody really knows what is going
wrong.

Loren



RE: [SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread Jon Dossey
 From: Thomas Cameron [mailto:[EMAIL PROTECTED]
 To: spamassassin-users
 Subject: RE: [SOLVED] Re: Suddenly load average of 15-18???
 
 On Thu, 2005-05-12 at 11:46 -0500, Jon Dossey wrote:
   From: Thomas Cameron [mailto:[EMAIL PROTECTED]
   Sent: Thursday, May 12, 2005 11:38 AM
   To: spamassassin-users; spamass-milt-list@nongnu.org
   Subject: [SOLVED] Re: Suddenly load average of 15-18???
  
   OK, this is a weird solution...  I rebooted the server and all the
   problems went away.  It's chuffing along happily now.
  
   Memory leak, maybe?
 
 
  What kind of hardware?  Are you scanning zips?  I had to just start
  blocking zip attachments altogether until these viruses settle down a
  bit.
 
 
  .jon
 
 
 
 It's just a plain Jane P-III 800MHz with 512MB memory on a 7-disk RAID 5
 Ultra 160 SCSI array.  I have not disabled scanning of zip files.
 
 It is running just fine now.  Very odd.

This may only be a temporary fix.  Personally, I don't consider rebooting
a Linux machine an acceptable way to solve a problem.  Did you try
restarting spamd before rebooting?

I'd go through your maillog, and check the spamassassin processing
times, and see if you can pinpoint where the processing time shoots up.
Then, go through your mqueue and take a look at the offending message.

.jon
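
A rough sketch of the "pinpoint where the processing time shoots up" step follows. It assumes syslog-style timestamps and spamd result lines containing "in N seconds,". The 10-second threshold is borrowed from Dan Nelson's rule of thumb earlier in the thread; the function name and sample lines are invented for illustration.

```python
import re

# Assumed layout: syslog timestamp first, "in N seconds," somewhere later on the line.
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*\bin (\d+(?:\.\d+)?) seconds,")

def first_slow_scan(lines, threshold=10.0):
    """Return (timestamp, seconds) for the first scan slower than threshold, or None."""
    for line in lines:
        m = LINE.match(line)
        if m and float(m.group(2)) > threshold:
            return m.group(1), float(m.group(2))
    return None

# Made-up sample; in practice, iterate over open("/var/log/maillog").
sample = [
    "May 10 09:15:00 mail spamd[1800]: clean message (0.1/5.0) for root:0 in 1.9 seconds, 1500 bytes.",
    "May 10 09:20:12 mail spamd[1800]: clean message (1.4/5.0) for root:0 in 48.0 seconds, 90210 bytes.",
    "May 10 09:21:30 mail spamd[1801]: clean message (0.2/5.0) for root:0 in 55.3 seconds, 1200 bytes.",
]

print(first_slow_scan(sample))
```

The messages logged just before the returned timestamp are the ones worth pulling out of the mqueue for a closer look.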



Re: Suddenly load average of 15-18???

2005-05-12 Thread Loren Wilton
 Is there something I should/could do about these expiry runs?  It seems
 odd that it's been like this for a couple of days now...  How could I
 know that this was the issue?

Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
with a workable suggestion fairly soon.  I'm pretty sure that there is some
logging to indicate when an expiry run happens, but I don't know precisely
what to look for.

At least with bayes there is a way you can turn off the auto-expire and then
use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
not sure if similar functionality exists for awl.

Did you happen to notice if all of your spamd children get fat at once, or
if just one of them got really huge?  All of them getting big might
indicate something changed with your rules files.  A single fat child would
be more indicative of an expiry run.

Loren
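
The "all fat versus one fat child" check could be sketched as below. It assumes input in the form of "pid rss" pairs in kilobytes, such as `ps -C spamd -o pid=,rss=` prints on a procps system. The 100 MB cutoff for "fat" is an arbitrary illustrative choice, not a SpamAssassin figure, and the sample numbers are invented.

```python
def classify_children(ps_output, fat_kb=100_000):
    """Classify spamd children by resident size (KB) from 'pid rss' lines."""
    sizes = {}
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            sizes[int(parts[0])] = int(parts[1])
    fat = sorted(pid for pid, rss in sizes.items() if rss > fat_kb)
    if not fat:
        verdict = "no fat children"
    elif len(fat) == len(sizes):
        # With a single child in total this case cannot be told apart from the next.
        verdict = "all children fat (rules change?)"
    elif len(fat) == 1:
        verdict = "one fat child (expiry run?)"
    else:
        verdict = "several fat children"
    return verdict, fat

# Invented sample: three children, one of them far larger than the rest.
sample = """\
 2211  41200
 2212 212400
 2213  39800
"""

print(classify_children(sample))
```

The verdict strings only restate the heuristic above; they are hints about where to look next, not a diagnosis.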



RE: [SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 12:20 -0500, Jon Dossey wrote:

 This may only be a temporary fix.  Personally, rebooting a linux machine
 to solve a problem just isn't acceptable.  Did you try restarting spamd
 before rebooting?

Several times.  I restarted the entire mail suite - sendmail, clam, SA,
milter-greylist, etc.

 I'd go through your maillog, and check the spamassassin processing
 times, and see if you can pinpoint where the processing time shoots up.
 Then, go through your mqueue and take a look at the offending message.

It wasn't just one message.  It was every message.

Thomas



Re: Suddenly load average of 15-18???

2005-05-12 Thread Thomas Cameron
On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:
  Is there something I should/could do about these expiry runs?  It seems
  odd that it's been like this for a couple of days now...  How could I
  know that this was the issue?
 
 Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
 with a workable suggestion fairly soon.  I'm pretty sure that there is some
 logging to indicate when an expiry run happens, but I don't know precisely
 what to look for.

OK, I'll look for that.

 At least with bayes there is a way you can turn off the auto-expire and then
 use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
 not sure if similar functionality exists for awl.

I don't know either.

 Did you happen to notice if all of your spamd children get fat at once, or
 if just one of them got really huge?  All of them getting big might
 indicate something changed with your rules files.  A single fat child would
 be more indicative of an expiry run.
 
 Loren

It didn't really look like any of them were really fat...  The machine's
drives just started hammering and the load average shot up.

It's all cleared up now after a reboot.

Thomas



Re: [SOLVED] Re: Suddenly load average of 15-18???

2005-05-12 Thread David Brodbeck
Thomas Cameron wrote:
On Thu, 2005-05-12 at 12:20 -0500, Jon Dossey wrote:
I'd go through your maillog, and check the spamassassin processing
times, and see if you can pinpoint where the processing time shoots up.
Then, go through your mqueue and take a look at the offending message.

It wasn't just one message.  It was every message.
I think what he's getting at is that one message can consume enough CPU 
and memory to bog down processing all the other ones, too.  I saw this 
with spamd and large attachments, before I started bypassing large 
messages around spamassassin.


Re: Suddenly load average of 15-18???

2005-05-12 Thread jdow
From: Thomas Cameron [EMAIL PROTECTED]

 On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote:
  Usually a high load average means that a spamd child suddenly (or possibly
  slowly) got fat, and you are out of memory and thrashing to beat the band.
  The two most common causes of this seem to be Bayes expiry runs and Awl
  expiry runs.  Sometimes though it can seemingly happen from some unknown
  sequence of mail messages.

 Is there something I should/could do about these expiry runs?  It seems
 odd that it's been like this for a couple of days now...  How could I
 know that this was the issue?

  How many children are you running?  What is the max lifetime (messages
  processed) per child?  Limiting to probably 5 children, or maybe even less
  in your case with so few users, and limiting to maybe 20-100 connections per
  child will probably work around your problems.

 My rc file has this:

 SPAMDOPTIONS="-d -c -m5 --max-conn-per-child=5 -H"

 I just added the --max-conn-per-child=5 per Stephen Przepiora's
 suggestion but that didn't seem to help.

  Oh, I'm assuming you have at least 512M or so.  If not, you might want to
  cut down to only a couple of children, and definitely go with the lower
  number of connections per child.

 Yes, I have 512M.  As I said - this has been working flawlessly since
 the server was installed several weeks ago.  It just suddenly went
 bonkers a couple of days ago.

I read your solved remark with some bemusement. Hammering the machine
over the head to solve this sort of problem is just not the way it's
done in the 'nix world. I suspect you have not really found the reason
yet. If you administer that machine with KDE or GNOME running and have
five spamds allowed, you are overloading the machine and driving it into
virtual memory thrashing. Cut down the number of spamds to perhaps 3,
-m3. Each spamd here with 3.0.2 gets up to about 60 megabytes before
it is harvested by max connections and a new one created. Five of those
use up a lot of memory, to be sure. I have X running here. But I have
a gigabyte of memory in the machine. I mostly manage to stay out of
swap so VM doesn't thrash.
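
The arithmetic behind that advice, as a small sketch. The 60 MB per spamd is the figure quoted above and 512 MB is the machine in question. The `other_mb` and `reserve_mb` parameters are hypothetical placeholders for whatever else runs on the box (desktop, sendmail, clamd); measure them rather than trusting any default.

```python
def spamd_budget_mb(total_mb, children, per_child_mb=60, other_mb=0):
    """Memory left (MB) after `children` spamds and other_mb of other load."""
    return total_mb - children * per_child_mb - other_mb

def max_children(total_mb, reserve_mb, per_child_mb=60):
    """Largest -m value that still leaves reserve_mb for everything else."""
    return max(0, (total_mb - reserve_mb) // per_child_mb)

for n in (5, 3):
    left = spamd_budget_mb(512, n)
    print(f"-m{n}: {n * 60} MB for spamd, {left} MB left of 512 MB")

# Hypothetical: keep half the machine free for everything that is not spamd.
print("max children with 256 MB reserved:", max_children(512, 256))
```

Five children at full size take well over half of a 512 MB machine before anything else is counted, which is why dropping to -m3 is worth trying.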

The thing you really needed to do, and seem not to have done, is isolate
exactly what is causing the problem. Hammering it with a reboot just
means you get to reboot often. If you spend the time to figure out what
resource was exhausted on your machine and what was the chief villain
with regard to exhausting that resource, then you can work to mitigate
the problem. And you can enjoy many year-long uptimes unless you have
to update the kernel. It saves wear and tear on you, freeing you to
apply the same principles to solve other problems that might appear.
It also frees the time to be proactive about the problems that might
appear.

As my first paragraph implies, I suspect memory is the resource and
spamd coupled with KDE or GNOME might be the problem. That combination
is quite sufficient to drive the machine to the edge. And any OS gets
pokey when you get to the edge. The machine that has SA 2.63 on it is a
66 MHz Pentium with 256 megs of memory. It takes nearly a couple of
minutes to scan a message. It sits in console mode. It handles DNS
and the firewall as well as the email. It can handle the 1200 to
1500 emails per day that Loren and I were getting while I was still
on that machine. I have since installed 3.0.2 on a spare Linux
machine, my pet computer toy, and put my email filtering over on
it. I get on the order of a total of 1000 messages a day. It handles
them at under 1.5% of its potential. It has a gigabyte of memory so
X's requirements are not a threat to the email filtering. Everything
runs fast. I also tuned the number of spamds and connections per
spamd to use only a reasonable chunk of the machine. (I untuned it
recently to test a fix for a scoring bug in 3.0.2. It probably is
time to reduce the -m value. I don't NEED it as high as I have it
now. {^_-})

Again, study what causes the problem. Experiment gently if you must
to characterize it properly. Then solve it. Don't reboot. That just
defers the problem. It's like paying blackmail money. The blackmailers
never go away. And it's a constant drain.

1) What resource is becoming saturated? It's not always obvious when
you first look at the problem. Dig to find the real bottleneck. (If a
small 66MHz machine can handle nearly the volume I believe you cited
then time is not where you want to look on a machine ten times faster.)

2) Find what is consuming overmuch of that resource.

3) Mitigate the excessive resource usage.

4) Live happily ever after or at least until the next crisis, which
most likely will not be a repeat of this one.
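
For step 1, a minimal sketch of checking whether memory is the saturated resource on Linux. It parses /proc/meminfo-style text and reports how much swap is in use. The sample numbers are invented, and swap in use is only a hint: sustained swap-in/swap-out activity (the si/so columns of vmstat) is the real sign of thrashing.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info

def swap_used_kb(info):
    """Kilobytes of swap currently in use."""
    return info["SwapTotal"] - info["SwapFree"]

# Invented sample; on a live Linux box use open("/proc/meminfo").read().
sample = """\
MemTotal:       515340 kB
MemFree:          6120 kB
SwapTotal:     1048568 kB
SwapFree:       412300 kB
"""

info = parse_meminfo(sample)
print(f"free RAM {info['MemFree']} kB, swap in use {swap_used_kb(info)} kB")
```

Almost no free RAM together with hundreds of megabytes in swap points at memory. Step 2 is then a matter of sorting processes by resident size to find the consumer.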

This is one of the tricks of old age guile that allows us old folks to
defeat youth and enthusiasm. {^_-}

{^_^}




Re: Suddenly load average of 15-18???

2005-05-12 Thread jdow
From: Thomas Cameron [EMAIL PROTECTED]

 On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:
   Is there something I should/could do about these expiry runs?  It seems
   odd that it's been like this for a couple of days now...  How could I
   know that this was the issue?
 
  Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
  with a workable suggestion fairly soon.  I'm pretty sure that there is some
  logging to indicate when an expiry run happens, but I don't know precisely
  what to look for.

 OK, I'll look for that.

  At least with bayes there is a way you can turn off the auto-expire and then
  use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
  not sure if similar functionality exists for awl.

 I don't know either.

Loren's suggestion is likely a very good one. top is a nice way to find
out WHAT is consuming the time. I do note that I do not use automatic
learning or whitelisting here. (Me paranoid. Me not trust 'em. So me
feed sa-learn manually. Me get outstanding results. Me happy. {^_-})

  Did you happen to notice if all of your spamd children get fat at once, or
  if just one of them got really huge?  All of them getting big might
  indicate something changed with your rules files.  A single fat child would
  be more indicative of an expiry run.
 
  Loren

 It didn't really look like any of them were really fat...  The machine's
 drives just started hammering and the load average shot up.

 It's all cleared up now after a reboot.

For how long? You did not SOLVE the problem. You paid its blackmail.

{^_-}