Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Stas Bekman

> My watch dog code is a bit of a mess, but it uses LWP::*, 
> or lwp-request to query the server, test for expected 
> output, and if it takes too long restart the server.  If 
> too many subsequent restarts fail to work, and apache is 
> still not responding in a timely manner, send an 
> email to the administrator.
> 
> Some code is posted below that you can adapt for your use.

Wow! That's quite a script :) 

I'm using a simpler one (only for httpd); see the second watchdog at:
http://perl.apache.org/guide/control.html#Monitoring_the_Server_A_watchdo

[code snipped]
___
Stas Bekman  mailto:[EMAIL PROTECTED]  www.singlesheaven.com/stas
Perl,CGI,Apache,Linux,Web,Java,PC at  www.singlesheaven.com/stas/TULARC
www.apache.org  & www.perl.com  == www.modperl.com  ||  perl.apache.org
single o-> + single o-+ = singlesheaven  http://www.singlesheaven.com



Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Joshua Chamas

Remi Fasol wrote:
> 
> hi Joshua,
> 
> is this your recommended setup when using Apache::ASP
> or is this for mod_perl in general?
> 

mod_perl in general.

> if it's for Apache::ASP, do you have a sample CPU
> limit script and/or watchdog?
> 

Check out Apache::Resource; the CPU limiting feature
is well documented there.  You also need to install
BSD::Resource on your system.

My watch dog code is a bit of a mess, but it uses LWP::*, 
or lwp-request to query the server, test for expected 
output, and if it takes too long restart the server.  If 
too many subsequent restarts fail to work, and apache is 
still not responding in a timely manner, send an 
email to the administrator.

Some code is posted below that you can adapt for your use.

--Joshua
_
Joshua Chamas   Chamas Enterprises Inc.
NODEWORKS >> free web link monitoring   Huntington Beach, CA  USA 
http://www.nodeworks.com    1-714-625-4051

## main watchdog, you can add other testing modules, I test
## web server, database, dns, smtp with this.

use Util;
use My::UserAgent;
# use LWP::Debug qw(+);

my $ua = new My::UserAgent;
my $self = new Util::Monitor;

$self->add_run(bless
   {
   'name' => 'proxy homepages test',

   # make sure both the secure and non-secure pages are working
   'test' => sub {
       my $http_doc = $ua->Request('http://www.nodeworks.com');
       my $https_doc = $ua->Request('https://www.nodeworks.com');
       ($http_doc->{success} && $https_doc->{success});
   },

   # restart www server gracefully; this also works when the server
   # isn't started at all
   'fix'  => sub {
       $self->log("attempting restart of www server");
       `/usr/local/apache/sbin/apachectl graceful`;
   },

   # allow for downtime of one to two minutes before sending a page;
   # it may take a minute to reboot the server when the machine is busy
   'period' => 30,
   'max_tests' => 3,
   'timeout' => 30,
   }, Util::Monitor::Run
   );

$self->monitor;
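
The settings above ('period', 'max_tests') imply an escalation policy: try the fix on each failure, and only page someone after several failures in a row. A minimal sketch of that policy in plain Perl follows. The function name next_action and the action labels are hypothetical, chosen for illustration; they are not taken from Util.pm.

```perl
use strict;
use warnings;

# Hypothetical policy function (not from Util.pm): given the outcome of one
# test and the number of consecutive failures so far, return the action to
# take and the updated failure count.
sub next_action {
    my ($ok, $failures, $max_tests) = @_;
    return ('recovered', 0) if $ok && $failures > 0;  # was failing, now fine
    return ('ok', 0)        if $ok;
    $failures++;
    return ('alert', $failures) if $failures >= $max_tests;  # notify the admin
    return ('fix', $failures);                               # try a restart
}

# Three failed tests in a row followed by a success.
my $failures = 0;
for my $ok (0, 0, 0, 1) {
    (my $action, $failures) = next_action($ok, $failures, 3);
    print "$action\n";    # prints fix, fix, alert, recovered (one per line)
}
```

With a period of 30 seconds and max_tests of 3, the alert in this sketch would fire after roughly one to two minutes of failures, which matches the comment in the configuration above.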

## from Util.pm

package Util::Monitor;
@ISA = qw(Util);
use Class::Struct;
use File::Basename;
use Carp qw(confess cluck carp);
use HTTP::Date;
use Net::SMTP;
use Net::Config;
use Data::Dumper;
use Tie::CPHash;
use File::Basename;
use Time::localtime;

@Mandatory = ('name', 'test', 'period');
$MaxSleep = 60;
$DefaultMaxTests = 3;

unless(keys %Util::Monitor::Run::) {
    struct(Util::Monitor::Run => {
        'name' => "\$", # string describing test
        'test' => "\$", # CODE
        'fix'  => "\$", # CODE
        'period' => "\$", # time in seconds to iterate
        'max_tests' => "\$", # number of times before erroring
        'num_tests' => "\$", # number of failed tests so far
        'last' => "\$", # time last ran
        'timeout' => "\$", # timeout for test
    });
}

sub new {
    chdir(File::Basename::dirname($0)) || die("can't change to dir for $0");
    my $self = bless { runs => [] };
    $self->write_pidfile;
    $self;
}

# add runs before monitoring
sub add_run {
    my($self, $run) = @_;
    die("no run") unless $run;
    die("run is not well defined")
        unless (@Mandatory == grep(defined $run->{$_}, @Mandatory));

    $run->max_tests || $run->max_tests($DefaultMaxTests);
    $run->last(0);
    $run->num_tests(0);
    push(@{$self->{runs}}, $run);
}

# main code to loop over
sub monitor {
    my $self = shift;
    while($self->alive) {
        $self->do_runs;
        $self->sleep;
    }
}

sub sleep {
    my $self = shift;
    my $next_time = time() + $MaxSleep;
    for(@{$self->{runs}}) {
        my $run_time = $_->period + $_->last;
        if($run_time < $next_time) {
            $next_time = $run_time;
        }
    }
    my $sleep_time = $next_time - time;
    if($sleep_time > 0) {
        $self->log("sleeping $sleep_time");
        sleep($sleep_time);
    } else {
        $sleep_time = 0;
    }

    $sleep_time;
}

sub do_runs {
    my $self = shift;
    @{$self->{runs}} > 0 or die("no runs to do");

    my $run;
    for $run (@{$self->{runs}}) {
        my $run_time = $run->period + $run->last;
        next unless ($run_time <= time);
        $self->log("doing run name ".$run->name);
        $self->do_run($run);
        $run->last(time);
    }
}

sub do_run {
    my($self, $run) = @_;

    my $name = $run->name;
    my $start = time();
    my $result = $self->try($run->test, $run->timeout);
    my $total = time - $start;
    $self->log("time for $name: $total");

    if($result) {
        # test succeeded
        $self->log("test success for $name, result $result");
        if($run->num_tests) {
            $self->sendmail({
                Subject => $run->name . " fixed",
                Body => "failed t
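
The listing is cut off here in the archive, before the definition of the try method that do_run calls. As a rough illustration only, one common way to run a test with a time limit in Perl is alarm inside an eval. The helper name run_with_timeout below is hypothetical, and this is not the code from Util.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the original try method): run the
# code ref $test and give up after $timeout seconds. Returns the test's
# result, or 0 if the test returned false, died, or timed out.
sub run_with_timeout {
    my ($test, $timeout) = @_;
    my $result = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout || 0);    # 0 means "no time limit"
        my $r = $test->();
        alarm(0);
        $r;
    };
    alarm(0);    # make sure no alarm is left pending if the test died
    return $result ? $result : 0;
}

# A test that answers immediately counts as "up".
print run_with_timeout(sub { 1 }, 5) ? "up\n" : "down\n";
```

A test that blocks longer than the timeout is interrupted by SIGALRM, the die unwinds the eval, and the caller sees a false result, the same as for any other failure.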

Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Stas Bekman

> I use Apache::Resource to set a CPU limit that only a
> runaway process would hit, so that such processes
> don't accumulate and take down my system.  I have
> MaxRequestsPerChild set to a few hundred and have found
> empirically that the children don't tend to take more than 10
> seconds of CPU time in normal use, so I give a CPU
> limit of 20-30 seconds for all my httpds.

So you use the formula:

  total_proc_cpu_time_limit =
      MaxRequestsPerChild * single_request_cpu_time_limit
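
As a worked example of this formula, here is a small sketch with made-up numbers. The per-request figure and the safety factor are assumptions for illustration: the factor is not part of the formula above and only mirrors the gap between the 10 seconds observed and the 20-30 second limit mentioned in the quoted message.

```perl
use strict;
use warnings;

# All figures are illustrative, not measurements from this thread.
# CPU time is kept in milliseconds to stay in integer arithmetic.
sub cpu_limit_seconds {
    my ($max_requests_per_child, $cpu_ms_per_request, $safety_factor) = @_;
    my $expected_ms = $max_requests_per_child * $cpu_ms_per_request;
    return $expected_ms * $safety_factor / 1000;
}

# 300 requests per child at about 30 ms of CPU each is 9 s over the child's
# lifetime; doubling that for headroom gives an 18 s limit.
printf "CPU limit: %d seconds\n", cpu_limit_seconds(300, 30, 2);
```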

Hmm, you describe a workable solution...  But it can be very problematic
to determine the limit numbers for the above formula if the environment
tends to change, I mean, when you add/remove scripts, add features...

$detection_solutions++ :) 
Anyone else?

> I also run a monitor program that watchdogs the
> server every 20-30 seconds and restarts it if
> response time is ever too slow, just in case other
> odd things go wrong. It just does a graceful
> restart; I haven't needed to fix a problem with a
> full stop / start yet.

Yup, I do the same. My watchdog also emails a report to me when this
happens, so I can monitor the whole thing and spot problems. (See the
guide for the watchdog.) Unfortunately this cannot detect that just a
few processes hang. It would only work when hanging_procs = MaxClients:
the parent process couldn't spawn any more procs, so the watchdog would
detect the failure and restart the server, killing all the hanging procs...

Thank you, Joshua




Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Remi Fasol

hi Joshua,

is this your recommended setup when using Apache::ASP
or is this for mod_perl in general?

if it's for Apache::ASP, do you have a sample CPU
limit script and/or watchdog?

thanks!
remi


--- Joshua Chamas <[EMAIL PROTECTED]> wrote:
> Stas,
> 
> I use Apache::Resource to set a CPU limit that only a
> runaway process would hit, so that such processes
> don't accumulate and take down my system.  I have
> MaxRequestsPerChild set to a few hundred and have found
> empirically that the children don't tend to take more than 10
> seconds of CPU time in normal use, so I give a CPU
> limit of 20-30 seconds for all my httpds.
> 
> I also run a monitor program that watchdogs the
> server every 20-30 seconds and restarts it if
> response time is ever too slow, just in case other
> odd things go wrong. It just does a graceful
> restart; I haven't needed to fix a problem with a
> full stop / start yet.
> 





Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Stas Bekman

> The reason why the stop button does not stop the script lies in the fact
> that your script does not produce any output while it is running. SIGPIPE
> is only raised when your script tries to write to a closed (STOPed)
> connection. No output from your script = no SIGPIPE!

That's right, Tobias.

I've checked the Apache::SIG and $r->connection->aborted, but is there a
way to "write" without actually writing, probably some control char will
do? Something like:

  while(1){
    $r->print("\0");
    last if $r->connection->aborted;
    $i++;
    sleep (1);
  }

I guess you must flush it as well, otherwise it would be buffered... so
use either $|++ or $r->rflush. This one seems to work:

  while(1){
    $r->print("\0");
    $r->rflush;
    last if $r->connection->aborted;
    $i++;
    sleep (1);
  }

But this one doesn't work (I removed "last if $r->connection->aborted"),
which seems to mean that mod_perl is broken:

  while(1){
    $r->print("$$\n");
    $r->rflush;
    $i++;
    sleep (1);
  }

See the output of strace, when I press Stop - it detects the SIGPIPE but
doesn't quit!

[snip]
nanosleep(0xb308, 0xb308, 0x401a61b4, 0xb308, 0xb41c) = 0
time([940621341])   = 940621341
write(4, "22572\n", 6)  = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) ---
time([940621341])   = 940621341
SYS_175(0, 0xb41c, 0xb39c, 0x8, 0) = 0
SYS_174(0x11, 0, 0xb1a0, 0x8, 0x11) = 0
SYS_175(0x2, 0xb39c, 0, 0x8, 0x2)   = 0
nanosleep(0xb308, 0xb308, 0x401a61b4, 0xb308, 0xb41c) = 0
[snip]
continues non-stop here

So mod_perl's default behaviour differs from what Apache::SIG sets up,
since when I add:

use Apache::SIG ();
Apache::SIG->set;

It stops right away after I press the Stop button.

I run Apache/1.3.10-dev mod_perl/1.22-dev (CVS snapshot a few days old)



> 
> Tobias
> 
> At 07:29 PM 10/22/99 +0200, Stas Bekman wrote:
> >Hi,
> >
> >Let's take a little script that obviously "hangs" the server:
> >
> >  my $r = shift;
> >  $r->send_http_header('text/plain'); 
> >  $|=1; # so we would see the $$ printed
> >  print "OK $$\n";
> >  sleep 1, $i++ while 1;
> >
> >The second question is how comes that the above little script never quits
> >after the stop button was pressed? Apache was supposed to detect SIGPIPE
> >and abort the run... but it doesn't - it's very easy to reproduce - just
> >run it... I've used $|=1 to print the $$ and check that it really hangs...
> >
> >Thanks!
> >






Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Joshua Chamas

Stas,

I use Apache::Resource to set a CPU limit that only a
runaway process would hit, so that such processes
don't accumulate and take down my system.  I have
MaxRequestsPerChild set to a few hundred and have found
empirically that the children don't tend to take more than 10
seconds of CPU time in normal use, so I give a CPU
limit of 20-30 seconds for all my httpds.

I also run a monitor program that watchdogs the
server every 20-30 seconds and restarts it if
response time is ever too slow, just in case other
odd things go wrong. It just does a graceful
restart; I haven't needed to fix a problem with a
full stop / start yet.

-- Joshua


Stas Bekman wrote:
> 
> Hi,
> 
> Recently there were a few questions regarding hanging processes, I've
> tried to reproduce this case and have found two problems.
> 
> Let's take a little script that obviously "hangs" the server:
> 
>   my $r = shift;
>   $r->send_http_header('text/plain');
>   $|=1; # so we would see the $$ printed
>   print "OK $$\n";
>   sleep 1, $i++ while 1;
> 
> First question is: how do I detect that some server hangs? I've tried top,
> ps, /server-status -- none of them helped me to find that some process
> hangs. Of course if the process uses lot of resources you can bust it, by
> watching the top(), another approach is to use /server-status and watch it
> for about 5-10 minutes spotting which process number has the same number
> of requests while its status is 'W' (Which means that it hangs), but when
> you have about 50 procs, it's quite hard to spot such a process.
> 
> Another easy spotting is when some process trashes the error_log and
> writes millions of error messages there... But you still don't know the
> PID of this process, so you just restart all of them.
> 
> So my question is, is there any way to tell that some process hangs?
> Those who reported their processes hang, how did you spot it?
> 
> If I knew a programmatical way to spot the hanging process, I'd implement
> it in Apache::VMonitor to warn the admin and it would be possible to run
> watchdogs to kill off the process and report to admin... I think it would
> be a useful addon for us...
> 
> The second question is how comes that the above little script never quits
> after the stop button was pressed? Apache was supposed to detect SIGPIPE
> and abort the run... but it doesn't - it's very easy to reproduce - just
> run it... I've used $|=1 to print the $$ and check that it really hangs...
> 
> Thanks!
> 



Re: Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Tobias Hoellrich

The reason why the stop button does not stop the script lies in the fact
that your script does not produce any output while it is running. SIGPIPE
is only raised when your script tries to write to a closed (STOPed)
connection. No output from your script = no SIGPIPE!
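
The point can be reproduced outside Apache with an ordinary pipe. The sketch below is a stand-alone illustration and not mod_perl code: closing the read end stands in for the browser's Stop button, and the function name is hypothetical. Nothing happens when the reader goes away; the signal only arrives once the script writes.

```perl
use strict;
use warnings;

# Hypothetical demo function: returns whether SIGPIPE had been seen before
# the write, whether it had been seen after the write, and whether the
# write itself succeeded.
sub write_to_closed_pipe {
    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $got_sigpipe = 0;
    local $SIG{PIPE} = sub { $got_sigpipe = 1 };

    close($reader);                       # the "client" goes away
    my $seen_before = $got_sigpipe;       # no write yet, so no signal yet

    my $written = syswrite($writer, "OK $$\n");   # first write to dead pipe
    my $seen_after = $got_sigpipe;

    close($writer);
    return ($seen_before, $seen_after, defined $written ? 1 : 0);
}

my ($before, $after, $write_ok) = write_to_closed_pipe();
print "SIGPIPE before write: $before, after write: $after, ",
      "write succeeded: $write_ok\n";
```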

Tobias

At 07:29 PM 10/22/99 +0200, Stas Bekman wrote:
>Hi,
>
>Let's take a little script that obviously "hangs" the server:
>
>  my $r = shift;
>  $r->send_http_header('text/plain'); 
>  $|=1; # so we would see the $$ printed
>  print "OK $$\n";
>  sleep 1, $i++ while 1;
>
>The second question is how comes that the above little script never quits
>after the stop button was pressed? Apache was supposed to detect SIGPIPE
>and abort the run... but it doesn't - it's very easy to reproduce - just
>run it... I've used $|=1 to print the $$ and check that it really hangs...
>
>Thanks!
>




Hanging process: detection and determination (was Re: Runaway processes)

1999-10-22 Thread Stas Bekman

Hi,

Recently there were a few questions regarding hanging processes. I've
tried to reproduce this case and have found two problems.

Let's take a little script that obviously "hangs" the server:

  my $r = shift;
  $r->send_http_header('text/plain'); 
  $|=1; # so we would see the $$ printed
  print "OK $$\n";
  sleep 1, $i++ while 1;

First question: how do I detect that some server process hangs? I've tried
top, ps, /server-status -- none of them helped me to find that some process
hangs. Of course, if the process uses a lot of resources you can catch it by
watching top. Another approach is to use /server-status and watch it
for about 5-10 minutes, spotting which process has the same number
of requests while its status is 'W' (which means that it hangs), but when
you have about 50 procs, it's quite hard to spot such a process.
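
The /server-status comparison described above could be automated along these lines. This is only a sketch: it assumes two snapshots have already been collected into hashes keyed by PID (fetching and parsing the /server-status page is not shown), and the hash layout and the function name suspected_hangs are hypothetical.

```perl
use strict;
use warnings;

# Each snapshot maps a child PID to its status letter and the number of
# requests it has served so far (hypothetical structure). A child is a
# suspect if it is in status 'W' in both snapshots and its request count
# has not moved in between.
sub suspected_hangs {
    my ($before, $after) = @_;
    my @suspects;
    for my $pid (sort { $a <=> $b } keys %$after) {
        my $old = $before->{$pid} or next;    # child started in between
        my $new = $after->{$pid};
        next unless $old->{status} eq 'W' && $new->{status} eq 'W';
        next unless $old->{requests} == $new->{requests};
        push @suspects, $pid;
    }
    return @suspects;
}

# Two made-up snapshots taken a few minutes apart.
my %before = (
    101 => { status => 'W', requests => 12 },
    102 => { status => '_', requests => 40 },
    103 => { status => 'W', requests => 7 },
);
my %after = (
    101 => { status => 'W', requests => 12 },
    102 => { status => 'W', requests => 41 },
    103 => { status => 'W', requests => 8 },
);
print "suspect: $_\n" for suspected_hangs(\%before, \%after);  # suspect: 101
```

A legitimately long-running request looks the same to this check, so the interval between snapshots would have to be longer than any request is expected to take.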

Another easy case to spot is when some process thrashes the error_log,
writing millions of error messages there... But you still don't know the
PID of this process, so you just restart all of them.

So my question is, is there any way to tell that some process hangs?
Those who reported their processes hang, how did you spot it?

If I knew a programmatic way to spot the hanging process, I'd implement
it in Apache::VMonitor to warn the admin, and it would be possible to run
watchdogs to kill off the process and report to the admin... I think it
would be a useful add-on for us...

The second question is: how come the above little script never quits
after the Stop button is pressed? Apache was supposed to detect SIGPIPE
and abort the run... but it doesn't - it's very easy to reproduce - just
run it... I've used $|=1 to print the $$ and check that it really hangs...

Thanks!
