This problem is so mysterious to me that I am hesitant to make any definitive
statements regarding the scope of the issue. In the staging environment where
the problem was originally seen the GC logs were not enabled. In that
environment the session timeouts persisted for a long time (and operations
tried several restarts to clear the issue). Here are some sample logs:
$ grep -e starting -e session time
uverse-staging-admin-0-20150423150103-kafka.log
[2015-04-23 14:23:53,837] INFO Client session timed out, have not heard from
server in 10066ms for sessionid 0x14ce69c464f0009, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:31:56,133] INFO [Kafka Server 0], starting
(kafka.server.KafkaServer)
[2015-04-23 14:32:08,636] INFO Client session timed out, have not heard from
server in 10422ms for sessionid 0x14ce6b1209f0002, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:21,641] INFO Client session timed out, have not heard from
server in 10643ms for sessionid 0x14ce6b1209f0003, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:33,614] INFO Client session timed out, have not heard from
server in 9842ms for sessionid 0x14ce6b1209f0004, closing socket connection and
attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:45,589] INFO Client session timed out, have not heard from
server in 9878ms for sessionid 0x14ce6b1209f0005, closing socket connection and
attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:57,906] INFO Client session timed out, have not heard from
server in 9823ms for sessionid 0x14ce6b1209f0006, closing socket connection and
attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:10,701] INFO Client session timed out, have not heard from
server in 10365ms for sessionid 0x14ce6b1209f0007, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:22,475] INFO Client session timed out, have not heard from
server in 10006ms for sessionid 0x14ce6b1209f0008, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:46,372] INFO Client session timed out, have not heard from
server in 11062ms for sessionid 0x14ce6b1209f0009, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:42:11,942] INFO Client session timed out, have not heard from
server in 12582ms for sessionid 0x14ce6b1209f000a, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:55:47,273] INFO Client session timed out, have not heard from
server in 12935ms for sessionid 0x14ce6b1209f000b, closing socket connection
and attempting reconnect (org.apache.zookeeper.ClientCnxn)
The specific configuration of Kafka (heap size, number of topics, ...) seem to
have a big influence on the behavior.
Using the test program I have generally seen the ParNew GCs times eventually
recover, but, it can take a long time (and sometimes I kill the process before
it recovers). Here is an example of one that persists for 2 ParNew GC cycles:
[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps
-XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 10
0.289: [GC (Allocation Failure) 0.289: [ParNew: 272640K-27707K(306688K),
0.0322014 secs] 784640K-539707K(4160256K), 0.0322659 secs] [Times: user=0.10
sys=0.03, real=0.04 secs]
0.360: [GC (Allocation Failure) 0.360: [ParNew: 300347K-34048K(306688K),
3.5932171 secs] 812347K-572709K(4160256K), 3.5932638 secs] [Times: user=14.12
sys=0.05, real=3.60 secs]
3.992: [GC (Allocation Failure) 3.992: [ParNew: 306688K-34048K(306688K),
53.1317959 secs] 845349K-599722K(4160256K), 53.1318434 secs] [Times: user=212.40
sys=0.00, real=53.13 secs]
57.162: [GC (Allocation Failure) 57.162: [ParNew: 306688K-34048K(306688K),
0.0500041 secs] 872362K-626660K(4160256K), 0.0500528 secs] [Times: user=0.17
sys=0.01, real=0.05 secs]
57.250: [GC (Allocation Failure) 57.250: [ParNew: 306688K-34048K(306688K),
0.0503560 secs] 899300K-658186K(4160256K), 0.0504036 secs] [Times: user=0.17
sys=0.01, real=0.05 secs]
57.339: [GC (Allocation Failure) 57.339: [ParNew: 306688K-34048K(306688K),
0.0510655 secs] 930826K-687994K(4160256K), 0.0511131 secs] [Times: user=0.17
sys=0.00, real=0.05 secs]
57.428: [GC (Allocation Failure) 57.428: [ParNew: 306688K-34048K(306688K),
0.0516372 secs] 960634K-720213K(4160256K), 0.0516853 secs] [Times: user=0.17
sys=0.01, real=0.05 secs]
Just changing the fraction of objects saved from 1 out of 10 to 1 out of
100 can dramatically increase the recovery time to 11 ParNew GCs:
[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc