On 23.02.2015 at 19:03, Jesse Defer wrote:
I have a farm of Apache httpd servers proxying to Tomcat with mod_jk
and I am having issues with Apache processes getting stuck (as seen by
the W state in server-status). I am sending to this list because the
stack traces show httpd gets stuck in mod_jk.
httpd is configured for prefork, with 512 servers at start and as the
maximum. When the problem occurs we end up with nearly all 512
processes in the W state until we restart it. The problem occurs more
often when load is high but is not restricted to high load. The
problem started occurring more often when we increased the servers from
384 to 512. The hosts have enough memory and do not swap. The issue
occurs intermittently and is not tied to a particular host or Tomcat
instance (there are ~150 Tomcat instances). The JkShmFile is on tmpfs.
Why on tmpfs?
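The shared memory file is normally kept on a local filesystem; a plain
local path is usually sufficient, for example (path shown only as an
illustration):

  JkShmFile /usr/local/apache2/logs/jk-runtime-status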
Environment: RHEL5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1,
APR-UTIL 1.5.4
The stuck httpd processes all show the same stack and strace:
pstack:
#0 0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
#1 0x00002b3440911656 in jk_shm_lock () from
/usr/local/apache2/modules/mod_jk.so
#2 0x00002b3440917805 in ajp_maintain () from
/usr/local/apache2/modules/mod_jk.so
#3 0x00002b3440906cac in maintain_workers () from
/usr/local/apache2/modules/mod_jk.so
#4 0x00002b3440901140 in wc_maintain () from
/usr/local/apache2/modules/mod_jk.so
#5 0x00002b34408f40b6 in jk_handler () from
/usr/local/apache2/modules/mod_jk.so
#6 0x0000000000448eca in ap_run_handler ()
#7 0x000000000044cc92 in ap_invoke_handler ()
#8 0x000000000045e24f in ap_process_async_request ()
#9 0x000000000045e38f in ap_process_request ()
#10 0x000000000045ab65 in ap_process_http_connection ()
#11 0x00000000004530ba in ap_run_process_connection ()
#12 0x000000000046423a in child_main ()
#13 0x0000000000464544 in make_child ()
#14 0x00000000004649ae in prefork_run ()
#15 0x0000000000430634 in ap_run_mpm ()
#16 0x000000000042ad97 in main ()
So mod_jk tries to get a lock on the shared memory before reading and
updating some shared memory data. That, as one of the many things mod_jk
does, is normal. It would not be normal if most processes seemed to sit
in this stack almost all of the time.
strace:
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
time(NULL) = 1424711498
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
Any help tracking this issue down would be appreciated.
This doesn't look like mod_jk hanging in jk_shm_lock(); instead it looks
like normal processing, locking and then unlocking. Both the lock and
the unlock succeed (return code 0).
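For reference, the fcntl calls shown in the strace correspond to a
blocking exclusive lock and an unlock on the first byte of the shared
memory file. A simplified sketch of that pattern (not the actual mod_jk
source):

  #include <fcntl.h>
  #include <unistd.h>

  /* Blocking lock/unlock on the first byte of the shm file, as in the
     strace: pass F_WRLCK to acquire the lock, F_UNLCK to release it. */
  static int shm_do_lock(int fd, short type)
  {
      struct flock lck;
      lck.l_type   = type;      /* F_WRLCK or F_UNLCK */
      lck.l_whence = SEEK_SET;
      lck.l_start  = 0;
      lck.l_len    = 1;
      return fcntl(fd, F_SETLKW, &lck);  /* F_SETLKW waits if locked */
  }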
You didn't provide time stamps, so it is hard to tell whether this is
normal behavior or not. What is your request throughput (requests per
second forwarded by mod_jk while it is running well)?
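If you attach strace with timestamps to one of the stuck children you
can see how much time passes between those calls and how long each call
takes, for example:

  strace -tt -T -e trace=fcntl -p <pid of a stuck httpd child>

(-tt prints a timestamp per call, -T the time spent inside each call).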
I suspect something else is wrong. Have you checked whether mod_jk is in
fact waiting for responses to requests it has sent to the backend that
do not return quickly, e.g. by looking at a Java thread dump (not a heap
dump) of the backend during the times you have problems?
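For example (assuming the JDK tools are available on the backend host):

  jstack <tomcat pid> > /tmp/tomcat-threads.txt

or send the JVM a SIGQUIT (kill -3 <tomcat pid>), which writes the
thread dump to catalina.out.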
What are the error or warn messages in your mod_jk log file?
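As an example only, make sure warnings actually end up in a log file,
e.g.:

  JkLogFile  logs/mod_jk.log
  JkLogLevel info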
Did you use "apachectl graceful" shortly before the problems started, or
change the configuration via the mod_jk status worker?
What is special here is the use of many processes plus many workers, so
the lock is used quite a lot. Still, the "global maintain" functionality
which uses the jk_shm_lock() in the stack above should be called by each
process only every worker.maintain seconds, by default every 60 seconds.
And each process should quickly detect whether another process already
did the global maintain. During the global maintain, for any ajp13
worker there are really just a few simple local code statements. For any
lb (load balancer) worker there is a little more to do, especially
checking whether any members failed long ago and should now be
recovered. But still, those operations are all local, not going over the
network.
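The interval is controlled by the global worker.maintain property in
workers.properties, for example (60 seconds is also the default):

  worker.maintain=60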
What is your worker structure? Are the 150 workers part of a few load
balancers, or are they all in the worker.list? Can you show the
significant parts of your workers.properties? Are you using the
non-default "pessimistic" locking?
If your Apache uses a threaded APR library, you could try using
JkWatchdogInterval. That moves the maintenance from the normal request
processing code to a separate thread per Apache process.
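For example (value in seconds, shown only as an illustration):

  JkWatchdogInterval 60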