Re: Cluster setup stopped working after 3 months in production

Krishna Saranathan Tue, 12 Aug 2014 02:47:36 -0700

It's a Linux distro:
Linux version 2.6.32-358.14.1.el6.x86_64 (
mockbu...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313
(Red Hat 4.4.7-3) (GCC) ) #1 SMP Mon Jun 17 15:54:20 EDT 2013


Java version - 1.6 update 45.

I doubt that a security group change was suddenly applied to the port. I am
able to telnet from the server where Tomcat is shut down to port 4444 on the
server that is currently running. And yes, an OS restart was done for a
hardware upgrade (RAM and disk volume).
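
For anyone who wants to repeat the telnet check from code, here is a minimal
sketch in Java. PortCheck and canConnect are hypothetical names made up for
this example; they are not part of Tomcat or Tribes. The host and port are
passed on the command line. Note that a successful connect only shows that
the port is reachable at the TCP level. As far as I understand, it does not
prove that the receiver on the other node is actually processing replication
messages, so the 60000 ms send timeout could still occur even when this
check passes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration only; not part of Tomcat or Tribes.
class PortCheck {

    // Returns true if a TCP connection to host:port completes within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, host unreachable, or connect timed out.
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if close fails.
            }
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: java PortCheck <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        // 5 second connect timeout; an arbitrary value chosen for this sketch.
        boolean ok = canConnect(host, port, 5000);
        System.out.println(host + ":" + port + (ok ? " reachable" : " NOT reachable"));
    }
}
```

Run it on each node against the other node's receiver address and port
(4444 in our config), so that both directions are checked.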


On Tue, Aug 12, 2014 at 6:58 AM, Igor Cicimov <icici...@gmail.com> wrote:

> On 12/08/2014 4:24 PM, "Krishna Saranathan" <krishna.saran...@gmail.com>
> wrote:
> >
> > We have a J2EE WAR application deployed in a two-node cluster.
> > Tomcat 6.0.39 is installed on both nodes, with an identical WAR
> > deployed on each. It is deployed in an Amazon AWS environment, and the
>
> What distro? Win or linux? And if linux which one?
>
> > two EC2 nodes are behind an ELB, with session stickiness enabled for
> > JSESSIONID. Session replication is also enabled on the two Tomcat
> > nodes.
> >
> > The following is the server.xml file, updated with the Cluster config:
> >
>
> =============================================================================
> >  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
> > channelSendOptions="6" channelStartOptions="3">
> >
> > <Manager className="org.apache.catalina.ha.session.DeltaManager"
> > expireSessionsOnShutdown="false" notifyListenersOnReplication="true"
> > />
> >
> > <Channel className="org.apache.catalina.tribes.group.GroupChannel">
> >
> > <Receiver
> className="org.apache.catalina.tribes.transport.nio.NioReceiver"
> >                                 autoBind="0" selectorTimeout="5000"
> > maxThreads="6"
> >                                 address="x.x.x.x" port="4444" />
> > <Sender
> className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
> > <Transport
> className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
> >                                         timeout="60000"
> >                                         keepAliveTime="10"
> >                                         keepAliveCount="0"
> > />
> > </Sender>
> > <Interceptor
>
> className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"
> > staticOnly="true"/>
> > <Interceptor
>
> className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
> > <Interceptor
>
> className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
> > <Member className="org.apache.catalina.tribes.membership.StaticMember"
> >                                         host="x.x.x.x"
> >                                         port="4444"
> >
> > uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4}"/>
> > </Interceptor>
> > </Channel>
> > <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter=""
> />
> > <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />
> > <ClusterListener
> >
>
> className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
> > <ClusterListener
> > className="org.apache.catalina.ha.session.ClusterSessionListener"/>
> > </Cluster>
> >
> >
> ==========================================================================
> >
> > The receiver IP, static member IP and unique ID are different in the
> > server.xml of the other node in the cluster.
> >
> > This ran fine in the production environment for 3 months. Then the
> > following exception was suddenly logged, and it kept recurring
> > indefinitely:
> >
> >
> > ==================================================
> > Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
> > INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]] message. Will verify.
> > Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
> > INFO: Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]]
> > Aug 6, 2014 12:00:39 AM org.apache.catalina.ha.tcp.SimpleTcpCluster send
> > SEVERE: Unable to send message through cluster sender.
> > org.apache.catalina.tribes.ChannelException: Operation has timed out(60000 ms.).; Faulty members:tcp://10.160.40.12:4444;
> >         at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:97)
> >         at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
> >         at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
> >         at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:76)
> >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> >         at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage(TcpFailureDetector.java:88)
> >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> >         at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:216)
> >         at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:175)
> >         at org.apache.catalina.ha.tcp.SimpleTcpCluster.send(SimpleTcpCluster.java:817)
> >         at org.apache.catalina.ha.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:791)
> >         at org.apache.catalina.ha.tcp.ReplicationValve.send(ReplicationValve.java:553)
> >         at org.apache.catalina.ha.tcp.ReplicationValve.sendMessage(ReplicationValve.java:537)
> >         at org.apache.catalina.ha.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:519)
> >         at org.apache.catalina.ha.tcp.ReplicationValve.sendReplicationMessage(ReplicationValve.java:430)
> >         at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:363)
> >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
> >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
> >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
> >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> >         at java.lang.Thread.run(Thread.java:662)
> >
> > ============================================================================
> >
> >
> > After this, the web application is not accessible, and we have to
> > manually kill the Tomcat process on one node, thereby disabling the
> > cluster.
> >
> > We are unsure why this started happening all of a sudden and why it
> > disables application access altogether. If you have any suggestions
> > for a remedy, please share them.
>
> Firewall???
> Did you change something in the Security Group the instances belong to that
> might have affected port 4444? Can you telnet from the server where you shut
> down Tomcat to port 4444 on the server where Tomcat is running? Did you do a
> restart or an OS update/upgrade that might have pulled in some firewall
> package and activated it afterwards?
>
