Cluster setup stopped working after 3 months in production

Krishna Saranathan Mon, 11 Aug 2014 23:24:25 -0700

We have a J2EE WAR application deployed in a two-node cluster.
Tomcat 6.0.39 is installed on both nodes, with an identical WAR
deployed on each. It runs in Amazon AWS, and the two EC2 nodes sit
behind an ELB with session stickiness enabled on JSESSIONID. Session
replication is also enabled between the two Tomcat nodes.


Following is the Cluster configuration from our server.xml file:
=============================================================================
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
         channelSendOptions="6" channelStartOptions="3">

  <Manager className="org.apache.catalina.ha.session.DeltaManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true"/>

  <Channel className="org.apache.catalina.tribes.group.GroupChannel">

    <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
              autoBind="0" selectorTimeout="5000" maxThreads="6"
              address="x.x.x.x" port="4444"/>
    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
                 timeout="60000"
                 keepAliveTime="10"
                 keepAliveCount="0"/>
    </Sender>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"
                 staticOnly="true"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
      <Member className="org.apache.catalina.tribes.membership.StaticMember"
              host="x.x.x.x"
              port="4444"
              uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4}"/>
    </Interceptor>
  </Channel>
  <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter=""/>
  <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve"/>
  <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
  <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
</Cluster>

==========================================================================

The receiver IP, the static member IP and the unique ID are different
in the server.xml of the other node in the cluster.
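
To make that difference concrete, here is a rough sketch of the parts
that change on the second node. NODE1_IP, NODE2_IP and the uniqueId
ending in 3 are placeholders I am assuming for illustration, not the
real values; every other element and attribute is the same as in the
config above.

```xml
<!-- Sketch only: fragment of the second node's <Channel> section.
     NODE1_IP / NODE2_IP and the uniqueId are assumed placeholders. -->
<Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
          autoBind="0" selectorTimeout="5000" maxThreads="6"
          address="NODE2_IP" port="4444"/>  <!-- this node's own address -->

<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <!-- The static member entry points at the peer (the first node). -->
  <Member className="org.apache.catalina.tribes.membership.StaticMember"
          host="NODE1_IP"
          port="4444"
          uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3}"/>
</Interceptor>
```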

This ran fine in the production environment for 3 months. Then the
following exception was suddenly logged, and it has kept recurring
ever since:


==================================================
Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]] message. Will verify.
Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]]
Aug 6, 2014 12:00:39 AM org.apache.catalina.ha.tcp.SimpleTcpCluster send
SEVERE: Unable to send message through cluster sender.
org.apache.catalina.tribes.ChannelException: Operation has timed out(60000 ms.).; Faulty members:tcp://10.160.40.12:4444;
        at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:97)
        at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
        at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
        at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:76)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage(TcpFailureDetector.java:88)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:216)
        at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:175)
        at org.apache.catalina.ha.tcp.SimpleTcpCluster.send(SimpleTcpCluster.java:817)
        at org.apache.catalina.ha.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:791)
        at org.apache.catalina.ha.tcp.ReplicationValve.send(ReplicationValve.java:553)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendMessage(ReplicationValve.java:537)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:519)
        at org.apache.catalina.ha.tcp.ReplicationValve.sendReplicationMessage(ReplicationValve.java:430)
        at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:363)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)
============================================================================


After this, the web application is no longer accessible, and we have
to manually kill the Tomcat process on one node, which effectively
disables the cluster.


We are unsure why this started happening all of a sudden and why it
blocks access to the application altogether. If you have any
suggestions for a remedy, please share them.
