> -----Original Message-----
> From: Rainer Jung [mailto:rainer.j...@kippdata.de]
> Sent: Monday, February 23, 2015 5:00 PM
> To: Tomcat Users List
> Subject: Re: mod_jk causing stuck Apache processes
> 
> Am 24.02.2015 um 00:29 schrieb Jesse Defer:
> >> -----Original Message-----
> >> Am 23.02.2015 um 22:53 schrieb Jesse Defer:
> >>>> -----Original Message-----
> >>>> Am 23.02.2015 um 19:03 schrieb Jesse Defer:
> 
> ...
> 
> >> How many Apache web servers sit in front of the Tomcats? Is each of
> >> the Apaches configured with 512 processes?
> >
> > All Tomcats are configured the same: maxThreads is 4096, the APR
> > connector, Tomcat 6. 16 Apaches, all configured identically with 512
> > processes, the same worker.properties, etc.
> 
> OK. So fewer threads than 16*512. But the APR connector - like the NIO
> one - can handle more connections than it has threads. Not more in-flight
> requests, but for both connectors idle (keep-alive) connections are cheap
> and do not need a pool thread.
> 
> > Our busiest Tomcats peak at about 2000 connections to their AJP port
> > from the web farm.  I'm not sure how closely connections relate to
> > thread usage; I haven't looked too closely at the JVM yet.
> 
> For BIO it's one thread per connection. For APR and NIO it's roughly one
> thread per in-flight request. Connections waiting for the next request to
> arrive are cheap and handled by a few separate poller threads.
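
For reference, a minimal server.xml sketch of the kind of AJP connector
being discussed; the port shown is the Tomcat default and only the
maxThreads value is taken from this thread:

    <!-- With the APR (or NIO) implementation, idle keep-alive connections
         are parked on a poller; only in-flight requests occupy one of the
         maxThreads worker threads. With BIO every open connection would
         hold a worker thread instead. -->
    <Connector port="8009" protocol="AJP/1.3" maxThreads="4096" />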
> 
> ...
> 
> >> And only the lb is added to the worker.list? Or also the individual
> >> AJP workers?
> >
> > Just the LBs.
> 
> OK
> 
> >>>> If your Apache uses a threaded APR library, you could try using
> >>>> JkWatchdogInterval. That moves the maintenance from the normal
> >>>> request processing code to a separate thread per Apache process.
> >>
> >> I suspect that once your app (one or some of your apps) gets slow,
> >> requests pile up in Apache and Apache can no longer connect to
> >> Tomcat, because Tomcat doesn't have free threads. Some Apache children
> >> can still connect, because the connect is handled by the OS
> >> underneath Tomcat, but those connections do not get the awaited cping
> >> reply.
> >
> > Shouldn't the ping timeout cause the processes to exit the wait state
> > if the cping reply hasn't been received?  I have seen the SS column in
> > server-status grow into the thousands.
> 
> Yes it would. But it could be a mixture: lots of Apache processes waiting
> for hanging or slow Tomcat responses due to an application or GC
> bottleneck, and some on-top requests failing in cping.
> 
> >> In addition to the thread dumps, you should also activate GC logs for
> >> your Tomcat JVMs.
> >
> > I have gc logging enabled already.
> 
> ACK
> 
> >>>> - using the worker or prefork MPM in combination with
> >>>> JkWatchdogInterval would reduce the amount of shm locking due to
> >>>> global maintenance calls, because it is done per process, not per
> >>>> thread. You might have other reasons for using prefork though. Have
> >>>> you?
> >>>
> >>> Use of prefork is historical.  We have been looking at going to
> >>> event, but haven't made the switch yet.
> >>
> >> If you are using many Apaches each with 512 children, you will most
> >> likely need to use an NIO connector on the Tomcat processes and would
> >> also benefit from the worker or event MPM for Apache. Each process
> >> with e.g. 50 or 100 threads and a relatively short
> >> connection_pool_timeout, e.g. 60 seconds.
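
To make that sizing concrete, a hedged worker.properties sketch; the host,
port and member list are placeholders (only one member is shown), not this
site's real values:

    worker.list=mylb
    # shared template referenced by every balanced member
    worker.template.type=ajp13
    worker.template.host=tomcat-host
    worker.template.port=8009
    # close idle backend connections after roughly 60 seconds
    worker.template.connection_pool_timeout=60
    worker.my27.reference=worker.template
    worker.mylb.type=lb
    worker.mylb.balance_workers=my27

With a threaded MPM the AJP connection pool is sized per child process
(roughly ThreadsPerChild connections), so a short connection_pool_timeout
keeps the number of idle connections each Tomcat has to hold manageable.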
> >
> > I am going to bring up accelerating the switch to the event MPM with
> > my team.  Will NIO give us more scalability than APR?
> 
> Not really, both are good at connection scalability. Maybe don't change
> too much at the same time. I'd say NIO is a bit more mainstream than APR,
> but your mileage may vary.
> 
> >> What was the reason that you started suspecting the fcntl()
> >> underneath jk_shm_lock? Are we sure that the stack you provided is
> >> really the typical one? And was the strace dumping all system calls
> >> or only the fcntl calls?
> >
> > I can't be sure it's the typical one; I only know that every time I
> > sample the stack of a stuck process that's what I get, and the strace
> > shows no other system calls except time().  It seems like it's stuck
> > in a fairly tight loop somewhere.
> > I haven't looked too deeply into the mod_jk code though and haven't
> > collected any system call statistics.
> >
> > I don't necessarily suspect fcntl, but it's getting called a lot (the
> > strace is not filtered).  System time jumps when the issue is
> > happening and fcntl looks like the vast majority of system calls.
> > Since the issue got worse when we added more processes, it seems
> > plausible it could be lock contention.
> 
> Hmm, but lock contention should result in the fcntl call taking more and
> more time. Could it be a problem with the system time? That somehow the
> "don't run the maintenance for the next 60 seconds" rule doesn't take
> effect because the clock jumps?
> 
> I suggest:
> 
> - check whether there's a performance problem on the webapp side when it
> happens next time
>    - check thread dumps
>    - check GC log

No long pauses that I can see.

>    - activate the Tomcat access log and add "%D" to the pattern, so that
> you can analyze request duration on the Tomcat side post-mortem (a sketch
> of such a valve follows below)
>    - using the request duration logs, you can also check whether a
> potential performance problem happens on all nodes or only on some.
> Depending on the result, the GC log check and thread dump creation and
> check would be easier or harder (more work).
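
A hedged example of such a valve in server.xml; everything besides %D is
just a common access log pattern:

    <Valve className="org.apache.catalina.valves.AccessLogValve"
           directory="logs" prefix="ajp_access." suffix=".log"
           pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />

On the Tomcat side %D logs the request processing time in milliseconds.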
> 
> - run strace with time stamps plus system call duration next time the
> problem occurs
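
For reference, an strace invocation along these lines captures both; the
pid and the output path are placeholders:

    strace -f -tt -T -p <httpd-child-pid> -o jk-strace.out

-tt prints microsecond timestamps and -T appends the time spent inside
each system call.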

http://pastebin.com/raw.php?i=X09ZTW4V

> - also add %D to the LogFormat of your Apache access log, and add the
> name of the load balancer member that handled the request as well.
> Mod_jk provides "notes" that can be logged in the access log (look
> for "JK_LB_LAST_NAME" in
> http://tomcat.apache.org/connectors-doc/reference/apache.html)
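
A hedged httpd.conf sketch of such a format; the nickname and the field
order are arbitrary:

    LogFormat "%h %l %u %t \"%r\" %>s %b %D %{JK_LB_FIRST_NAME}n %{JK_LB_LAST_NAME}n" jk_combined
    CustomLog logs/access_log jk_combined

On the httpd side %D is the request duration in microseconds, and %{...}n
logs the request notes that mod_jk sets.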
> 
> - prepare switching to a threaded MPM, be it worker or event. Use
> JkWatchdogInterval there.
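
A hedged httpd.conf sketch of that combination; the numbers only
illustrate the 50-100-threads-per-child sizing mentioned earlier:

    # event MPM: a handful of threaded children instead of 512 prefork
    # processes
    <IfModule mpm_event_module>
        ServerLimit         16
        ThreadsPerChild     64
        MaxRequestWorkers 1024
    </IfModule>

    # run mod_jk maintenance in one watchdog thread per child instead of
    # inside request processing
    JkWatchdogInterval 60

(MaxRequestWorkers is the Apache 2.4 name; on 2.2 the equivalent
directive is MaxClients.)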
> 
> - put the JkShmFile on a local file system
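
For example, something along these lines, where the path is just an
illustrative local-disk location:

    JkShmFile /var/run/httpd/jk-runtime-status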
> 
> - check whether the system clock on your Apache servers works correctly
> and doesn't jump (a common virtualization problem)

NTP is configured on all hosts and we monitor drift with collectd every 10
seconds; no anomalies.

> - configure the jk status worker and check whether many of your workers
> are in ERROR state. You can also retrieve info from the JK status worker
> in a simple-to-parse text format and automate retrieval and extraction of
> data (similar to what one can do with the Apache server-status)
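
A hedged sketch of such a setup; the /jkstatus path is conventional but
arbitrary, and access to it should be restricted:

    # worker.properties
    worker.list=mylb,jkstatus
    worker.jkstatus.type=status

    # httpd.conf
    JkMount /jkstatus jkstatus

Requesting http://localhost/jkstatus?mime=txt then returns the same data
in a plain-text, script-friendly form that is easy to poll with curl.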

I haven't seen the status worker show any workers in error.

> Regards,
> 
> Rainer

Here are some logs with request duration.  The first entry in each pair is
from httpd, the second from Tomcat.  It is interesting that the Tomcat
response is very fast, but in two of the cases the Apache response is very
slow.  I wasn't able to correlate the JVM stack traces for these, maybe
because they executed quickly and weren't blocked.

10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 
"-" "Mozilla/5.0" pid=25649 tid=47017797431248 time_s=1 time_us=1174645 
x_id=VOy2LYHbdfEAAGQxXQkAAAC0 content_type="text/html" last_modified="-" 
cache_control="no-cache" worker=mylb first_name=my27 first_busy=0 
last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 
"-" "Mozilla/5.0" time_ms=115 tomcat_thread=ajp-62014-31 
x_id=VOy2LYHbdfEAAGQxXQkAAAC0 app=my

10.142.64.251 - - [24/Feb/2015:10:31:02 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" 
"Mozilla/5.0" pid=25697 tid=47017797431248 time_s=246 time_us=246422721 
x_id=VOy1VoHbdfEAAGRhuqQAAADq content_type="text/html" last_modified="-" 
cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 
last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:35:09 -0700] "GET /my/ HTTP/1.1" 200 72209 
"-" "Mozilla/5.0" time_ms=53 tomcat_thread=ajp-62014-32 
x_id=VOy1VoHbdfEAAGRhuqQAAADq app=my

10.142.64.251 - - [24/Feb/2015:10:30:41 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" 
"Mozilla/5.0" pid=25772 tid=47017797431248 time_s=371 time_us=371698757 
x_id=VOy1QYHbdfEAAGSsv-oAAAEy content_type="text/html" last_modified="-" 
cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 
last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:36:52 -0700] "GET /my/ HTTP/1.1" 200 72209 
"-" "Mozilla/5.0" time_ms=57 tomcat_thread=ajp-62014-5 
x_id=VOy1QYHbdfEAAGSsv-oAAAEy app=my

I let one server stay busy for about an hour to collect some logs.
There are some slow requests for this app in the Tomcat logs, but few over
60 seconds and only 1 over 120 seconds (across all 17 of this app's
Tomcats).  Meanwhile there are thousands of httpd requests over 120s from
just the one web server with the stuck processes.  Plotting the request
time for the entire httpd farm shows a large jump for the affected server
but no change in the other servers.

Respectfully,

Jesse
