This is *exactly* the symptoms we see, and we're just about always up to
date with Apache/Perl/modperl releases.

We've spent a fair amount of time trying to isolate the cause of these,
but haven't been able to point the finger at any one cause.  Some of the
things we've determined:

- The same behavior is displayed under Solaris (5.6 and 5.7) and Linux
  (2.2.14).
- We've seen this through through a bunch of releases of
  Apache/Perl/modperl over the past 6 months.
- When a child process goes astray, it is in a tight loop, quickly growing
  to consume 95 to 100% of the cpu cycles.
- Under Linux, running strace on the runaway results in nothing -- 
  no system calls are shown whatsoever, so it's apparently spinning in
  a tight CPU loop (though see the next bullet -- it's possible I've
  just never caught it at an early enough stage.)
- Under Solaris, I've managed to catch a few of these at an early stage
  and observed (via truss) an endless series of 'sbrk' calls, eventually
  this gets bound up tight with no system calls displayed, like the
  Linux case.
- This seems to happen more often under heavy load, but we've also seen it
  fairly regularly during low traffic periods.
- We did have some luck in doing a thorough read of our handlers that use
  DBI, making sure that all database connections are explicitly closed 
  at the end of a request (we *don't* use Apache::DBI).  This cut down on
  the number of runaways, but we still see them.

We've kept our runaways under control by running a watchdog script that
looks for modperl processes with the correct load numbers (cpu% > 10% and
run time > something), but we've all along thought that this would be a
temporary solution until we determined what we're doing wrong.

Now that I've seen this report from a couple of others on the list, I'm
wondering if it's not something we're doing, but rather something within
Apache or modperl.

If there's anything anyone on the list can recommend that I do to try to
collect more clues on the cause, I'll be happy to try it.

Or maybe if there are others who've seen the same behavior, pipe in so
that we can get a feeling for how many sites are experiencing this?

<Steve Reppucci>

On Thu, 1 Jun 2000, Gustavo Duarte wrote:

> Hi there people,
> 
> I have inherited a web server running mod_perl and I am experiencing a
> somewhat critical problem: http processes sometimes get into an infinite
> loop, using 100% cpu time, and given enough time bring the machine to a
> halt.
> 
> I've done a lot of testing, and there isn't a specific http request that
> triggers the behaviour, eventhough it always happens after a request. It
> seems to happen every few hours: the httpd process simply starts hogging
> up the CPU, and won't let go of it. After a while, I have a few of these
> processes running, and the machine's load average skyrockets. Sometimes
> it's bad enough I'm not even able to log in via console.
> 
> I'll upgrade all the software to new versions, but apparently this
> problem has been ocurring for a while, and survived a couple of
> hardware/software upgrades. I'll also be rewriting the perl code running
> there to see if it stops the problem (the code isn't too clean - lots of
> global variables, not written under strict, etc, but "it works").
> However, it would be cool if someone could enlighten me on what's going
> on, and possibly suggest a fix :).
> 
> Thanks a lot!
> 
> signed,
> gustavo
> 
> <begin debugging info>
> 
> => our OS is:
> 
> [root@blueland /root]# uname -a
> Linux blueland 2.2.14-5.0 #4 Wed Apr 12 20:28:28 MDT 2000 i586 unknown
> 
> => Apache:
> 
> Server Version: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8
> OpenSSL/0.9.2b
> 
> => let's look into one of the monster processes:
> 
> 497 ?        R    288:06 /usr/local/apache_1.3.6/bin/httpd
> 
> => (nice cpu time there...)
> => now for gdb...
> 
> [root@blueland /root]# gdb /usr/local/apache/bin/httpd 497
> GNU gdb 4.18
> Copyright 1998 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you
> are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for
> details.
> 
> Attaching to program: /usr/local/apache/bin/httpd, Pid 497
> Reading symbols from /lib/libNoVersion.so.1...done.
> Reading symbols from /lib/libm.so.6...done.
> Reading symbols from /lib/libcrypt.so.1...done.
> Reading symbols from
> /usr/lib/perl5/5.00502/i586-linux/CORE/libperl.so...done.
> Reading symbols from /lib/libnsl.so.1...done.
> Reading symbols from /lib/libdl.so.2...done.
> Reading symbols from /lib/libc.so.6...done.
> Reading symbols from /lib/ld-linux.so.2...done.
> Reading symbols from /lib/libnss_files.so.2...done.
> Reading symbols from
> /usr/lib/perl5/site_perl/5.005/i586-linux/auto/Sybase/DBlib/DBlib.none...done.
> 
> Reading symbols from /opt/sybase/lib/libsybdb.so...done.
> Reading symbols from /opt/sybase/lib/libinsck.so...done.
> Reading symbols from /lib/libnss_nisplus.so.2...done.
> Reading symbols from /lib/libnss_nis.so.2...done.
> 0x80a7407 in free_blocks ()
> 
> => let's see what the stack tells us...
> 
> (gdb) i s
> #0  0x80a7407 in free_blocks ()
> #1  0x80a7660 in ap_clear_pool ()
> #2  0x80a76a1 in ap_destroy_pool ()
> #3  0x80be71a in ap_destroy_sub_req ()
> #4  0x8070d8a in XS_Apache__SubRequest_DESTROY ()
> #5  0x400d0f45 in Perl_pp_entersub () at pp_hot.c:1965
> #6  0x4007aa8d in perl_call_sv () at perl.c:1008
> #7  0x400d9d4c in Perl_sv_clear () at sv.c:2418
> #8  0x400da451 in Perl_sv_free () at sv.c:2418
> #9  0x400d24ab in do_clean_objs (sv=0x8385744) at sv.c:338
> #10 0x400d237c in visit (f=0x400d2410 <do_clean_objs>) at sv.c:306
> #11 0x400d263f in Perl_sv_clean_objs () at sv.c:359
> #12 0x40077fa5 in perl_destruct () at perl.c:1008
> #13 0x80635a3 in perl_shutdown ()
> #14 0x80645c5 in perl_child_exit ()
> #15 0x8064430 in perl_child_exit_cleanup ()
> #16 0x80a8dd6 in run_cleanups ()
> #17 0x80a762c in ap_clear_pool ()
> #18 0x80a76a1 in ap_destroy_pool ()
> #19 0x80b3b03 in clean_child_exit ()
> #20 0x80b6773 in child_main ()
> #21 0x80b6e14 in make_child ()
> #22 0x80b716e in perform_idle_server_maintenance ()
> #23 0x80b7665 in standalone_main ()
> #24 0x80b7cd3 in main ()
> 
> => registers have:
> 
> (gdb) i r
> eax            0x8a3e408        144958472
> ecx            0x8608414        140542996
> edx            0x8a3e408        144958472
> ebx            0x8239208        136548872
> esp            0xbffff90c       0xbffff90c
> ebp            0xbffff910       0xbffff910
> esi            0x1      1
> edi            0xffffffff       -1
> eip            0x80a7407        0x80a7407
> eflags         0x202    514
> cs             0x23     35
> ss             0x2b     43
> ds             0x2b     43
> es             0x2b     43
> fs             0x0      0
> gs             0x0      0
> 
> => this is a dump of the running function
> 
> (gdb) disas
> Dump of assembler code for function free_blocks:
> 0x80a73e0 <free_blocks>:        push   %ebp
> 0x80a73e1 <free_blocks+1>:      mov    %esp,%ebp
> 0x80a73e3 <free_blocks+3>:      sub    $0x4,%esp
> 0x80a73e6 <free_blocks+6>:      cmpl   $0x0,0x8(%ebp)
> 0x80a73ea <free_blocks+10>:     jne    0x80a73f0 <free_blocks+16>
> 0x80a73ec <free_blocks+12>:     jmp    0x80a7439 <free_blocks+89>
> 0x80a73ee <free_blocks+14>:     mov    %esi,%esi
> 0x80a73f0 <free_blocks+16>:     mov    0x8146b88,%eax
> 0x80a73f5 <free_blocks+21>:     mov    %eax,0xfffffffc(%ebp)
> 0x80a73f8 <free_blocks+24>:     mov    0x8(%ebp),%eax
> 0x80a73fb <free_blocks+27>:     mov    %eax,0x8146b88
> 0x80a7400 <free_blocks+32>:     mov    0x8(%ebp),%eax
> 0x80a7403 <free_blocks+35>:     cmpl   $0x0,0x4(%eax)
> 0x80a7407 <free_blocks+39>:     jne    0x80a740c <free_blocks+44>
> 0x80a7409 <free_blocks+41>:     jmp    0x80a7424 <free_blocks+68>
> 0x80a740b <free_blocks+43>:     nop
> 0x80a740c <free_blocks+44>:     mov    0x8(%ebp),%eax
> 0x80a740f <free_blocks+47>:     mov    0x8(%ebp),%ecx
> 0x80a7412 <free_blocks+50>:     add    $0xc,%ecx
> 0x80a7415 <free_blocks+53>:     mov    %ecx,0x8(%eax)
> 0x80a7418 <free_blocks+56>:     mov    0x8(%ebp),%eax
> 0x80a741b <free_blocks+59>:     mov    0x4(%eax),%edx
> 0x80a741e <free_blocks+62>:     mov    %edx,0x8(%ebp)
> 0x80a7421 <free_blocks+65>:     jmp    0x80a7400 <free_blocks+32>
> 0x80a7423 <free_blocks+67>:     nop
> 0x80a7424 <free_blocks+68>:     mov    0x8(%ebp),%eax
> 0x80a7427 <free_blocks+71>:     mov    0x8(%ebp),%ecx
> 0x80a742a <free_blocks+74>:     add    $0xc,%ecx
> 0x80a742d <free_blocks+77>:     mov    %ecx,0x8(%eax)
> 0x80a7430 <free_blocks+80>:     mov    0x8(%ebp),%eax
> 0x80a7433 <free_blocks+83>:     mov    0xfffffffc(%ebp),%edx
> 0x80a7436 <free_blocks+86>:     mov    %edx,0x4(%eax)
> 0x80a7439 <free_blocks+89>:     leave
> 0x80a743a <free_blocks+90>:     ret
> 
> => strace doesn't show any system calls going on, which makes sense,
> since we don't ever leave free_blocks()
> 
> .EOF.
> 

=-=-=-=-=-=-=-=-=-=-  My God!  What have I done?  -=-=-=-=-=-=-=-=-=-=
Steve Reppucci                                       [EMAIL PROTECTED] |
Logical Choice Software                          http://logsoft.com/ |
508/958-0183                                                 Be Open |

Reply via email to