Hey guys,

We seem to have proven that bug 52529 is either still rearing its head, or that something else here is causing exactly the same problem reported in that bug.
We are running Tomcat 7.0.30 on JDK 1.6 on CentOS 6.3, with mod_jk session replication and load balancing. My server.xml configuration is as follows (this is server #1 of a 3-server cluster). All IPs and passwords have been changed to protect the innocent...

<?xml version='1.0' encoding='utf-8'?>
<Server port="8005" shutdown="SHUTDOWN">
  <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
  <!-- Initialize Jasper prior to webapps being loaded. Documentation at /docs/jasper-howto.html -->
  <Listener className="org.apache.catalina.core.JasperListener" />
  <!-- Prevent memory leaks due to use of particular java/javax APIs -->
  <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" />
  <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
  <Listener className="org.apache.catalina.core.ThreadLocalLeakPreventionListener" />

  <GlobalNamingResources>
    <Resource name="UserDatabase" auth="Container"
              type="org.apache.catalina.UserDatabase"
              description="User database that can be updated and saved"
              factory="org.apache.catalina.users.MemoryUserDatabaseFactory"
              pathname="conf/tomcat-users.xml" />
  </GlobalNamingResources>

  <Service name="Catalina">
    <Connector port="8080" protocol="HTTP/1.1" address="10.10.10.10"
               connectionTimeout="20000" redirectPort="8443" />
    <Connector protocol="HTTP/1.1" address="10.10.10.10" port="8443"
               maxThreads="10" scheme="https" secure="true" SSLEnabled="true"
               keystoreFile="/opt/tomcat/.keystore" keystorePass="!Tz4S3cR3t!42"
               clientAuth="false" sslProtocol="TLS" />
    <!-- Define an AJP 1.3 Connector on port 8009 -->
    <Connector port="8009" address="10.10.10.20" protocol="AJP/1.3" redirectPort="8443" />

    <Engine name="Catalina" defaultHost="localhost" jvmRoute="app00-ems-billing-prod">
      <Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase" />

      <Host name="localhost" appBase="webapps"
            unpackWARs="true" autoDeploy="false"
            xmlValidation="false" xmlNamespaceAware="false"
            deployOnStartup="true">
        <Valve className="org.apache.catalina.ha.authenticator.ClusterSingleSignOn" />

        <!-- cluster settings -->
        <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">
          <Manager className="org.apache.catalina.ha.session.DeltaManager" />
          <Channel className="org.apache.catalina.tribes.group.GroupChannel">
            <Membership className="org.apache.catalina.tribes.membership.McastService"
                        address="228.0.0.4" port="45564" frequency="500" dropTime="3000" />
            <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                      address="auto" port="4000" autoBind="100"
                      selectorTimeout="5000" maxThreads="6" />
            <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
              <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender" />
            </Sender>
            <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector" />
            <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor" />
          </Channel>
          <Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
                 filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.htm|.*\.html|.*\.css|.*\.txt"
                 statistics="true" />
          <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />
          <Deployer className="org.apache.catalina.ha.deploy.FarmWarDeployer"
                    tempDir="/tmp/war-temp/" deployDir="/tmp/war-deploy/"
                    watchDir="/tmp/war-listen/" watchEnabled="false" />
          <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener" />
          <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener" />
        </Cluster>
      </Host>
    </Engine>
  </Service>
</Server>

The scenario is readily reproducible: if we kill -9 one server out of the cluster (ugly, yes, and it never calls destroySession) while keeping the other application servers, and the application itself, online, the user is (as expected) migrated to another server in the cluster.
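For anyone trying to reproduce this, one detail not shown above: as we understand it, DeltaManager replication also requires the webapp itself to be marked distributable in its web.xml, roughly like so (trimmed fragment, element names from the servlet spec):

<!-- web.xml: marks the app as distributable so Tomcat replicates its sessions -->
<web-app>
  <distributable/>
  <!-- ...rest of the deployment descriptor... -->
</web-app>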
Once the server that was taken out of the cluster comes back online, if a NullPointerException is thrown on any of the other already-active servers, the server that is starting up will log:

Nov 30, 2012 12:35:11 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared

and halt loading of the application. The real kicker is that if it is left alone and no further NPEs are thrown, the application on the restarting server will recover on its own after a period that varies from roughly 15 minutes to considerably longer.

Any ideas on this?

- J
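One thing we were not sure how to rule out: DeltaManager can only replicate session attributes that serialize cleanly, and we wondered whether a bad attribute could surface as an NPE during the state transfer when a node rejoins. We used a quick check along these lines (hypothetical helper, not part of our app; written against JDK 1.6, so no try-with-resources):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/**
 * Hypothetical helper: checks whether an object placed in the
 * HttpSession would survive the serialization DeltaManager performs
 * when replicating session state to another node.
 */
public class ReplicationCheck {

    public static boolean isReplicable(Object attribute) {
        if (!(attribute instanceof Serializable)) {
            return false;
        }
        ObjectOutputStream oos = null;
        try {
            // Serialize into a throwaway buffer, as the cluster sender would.
            oos = new ObjectOutputStream(new ByteArrayOutputStream());
            oos.writeObject(attribute);
            return true;
        } catch (IOException e) {
            // e.g. a Serializable wrapper holding a non-serializable field
            return false;
        } finally {
            if (oos != null) {
                try { oos.close(); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(isReplicable("plain string")); // true
        System.out.println(isReplicable(new Object()));   // false
    }
}
```

In our case the attributes all passed this check, which is part of why we suspect the clustering layer rather than the application.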