Hey guys,

We seem to have proven that either bug 52529 is still rearing its head or
something else here is causing the exact same problem as was reported in
that bug.

We are running Tomcat 7.0.30 and JDK 1.6 on CentOS 6.3, with mod_jk load
balancing and session replication.

My server.xml configuration is as follows (this is server #1 of a 3-server
cluster). All IPs and passwords have been changed to protect the
innocent.

<?xml version='1.0' encoding='utf-8'?>

<Server port="8005" shutdown="SHUTDOWN">
  <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
  <!--Initialize Jasper prior to webapps are loaded. Documentation at /docs/jasper-howto.html -->
  <Listener className="org.apache.catalina.core.JasperListener" />
  <!-- Prevent memory leaks due to use of particular java/javax APIs-->
  <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" />
  <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
  <Listener className="org.apache.catalina.core.ThreadLocalLeakPreventionListener" />

  <GlobalNamingResources>

     <Resource name="UserDatabase" auth="Container"
              type="org.apache.catalina.UserDatabase"
              description="User database that can be updated and saved"
              factory="org.apache.catalina.users.MemoryUserDatabaseFactory"
              pathname="conf/tomcat-users.xml" />
  </GlobalNamingResources>

  <Service name="Catalina">

  <Connector port="8080" protocol="HTTP/1.1" address="10.10.10.10"
               connectionTimeout="20000"
               redirectPort="8443" />

  <Connector
        protocol="HTTP/1.1" address="10.10.10.10"
        port="8443" maxThreads="10"
        scheme="https" secure="true" SSLEnabled="true"
        keystoreFile="/opt/tomcat/.keystore" keystorePass="!Tz4S3cR3t!42"
        clientAuth="false" sslProtocol="TLS"/>

    <!-- Define an AJP 1.3 Connector on port 8009 -->

    <Connector port="8009" address="10.10.10.20" protocol="AJP/1.3" redirectPort="8443" />

    <Engine name="Catalina" defaultHost="localhost" jvmRoute="app00-ems-billing-prod">

      <Realm className="org.apache.catalina.realm.UserDatabaseRealm"
             resourceName="UserDatabase"/>

      <Host name="localhost"  appBase="webapps"
            unpackWARs="true" autoDeploy="false"
        xmlValidation="false" xmlNamespaceAware="false"
        deployOnStartup="true">

        <Valve className="org.apache.catalina.ha.authenticator.ClusterSingleSignOn" />

        <!-- cluster settings  -->
        <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
                 channelSendOptions="8">
           <Manager className="org.apache.catalina.ha.session.DeltaManager" />
           <Channel className="org.apache.catalina.tribes.group.GroupChannel">
               <Membership className="org.apache.catalina.tribes.membership.McastService"
                   address="228.0.0.4"
                   port="45564"
                   frequency="500" dropTime="3000"/>
               <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                   address="auto"
                   port="4000"
                   autoBind="100"
                   selectorTimeout="5000"
                   maxThreads="6"/>
               <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
                   <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
               </Sender>
               <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
               <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
           </Channel>

           <Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
                  filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.htm|.*\.html|.*\.css|.*\.txt"
                  statistics="true" />
           <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />

           <Deployer className="org.apache.catalina.ha.deploy.FarmWarDeployer"
                        tempDir="/tmp/war-temp/"
                        deployDir="/tmp/war-deploy/"
                        watchDir="/tmp/war-listen/"
                        watchEnabled="false"/>

           <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
           <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
        </Cluster>
      </Host>
    </Engine>
  </Service>
</Server>

The scenario we have is readily reproducible:

If we kill -9 a server out of the cluster (ugly, yes, but it means
destroySession is not called) while the other application servers (and the
application itself) stay online, the user is migrated to another server in
the cluster, as expected.  While the killed server is coming back online, if
a NullPointerException is thrown on any of the other already active servers,
the following is logged on the server that is starting up, and loading of
the application halts:

Nov 30, 2012 12:35:11 PM
org.apache.catalina.tribes.group.interceptors.TcpFailureDetector
memberDisappeared
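
For anyone who wants to try this, the steps above can be sketched roughly in
shell. This is a sketch, not the exact commands used: the /opt/tomcat install
path, the catalina.out log name, the pgrep pattern and both function names
are assumptions, so adjust them to your install.

```shell
#!/bin/sh
# Sketch only: CATALINA_HOME and the log file name are assumptions about the install.
CATALINA_HOME="${CATALINA_HOME:-/opt/tomcat}"
LOG="${LOG:-$CATALINA_HOME/logs/catalina.out}"

# Count log lines that mention memberDisappeared (hypothetical helper).
count_member_disappeared() {
    grep -c 'memberDisappeared' "$1"
}

# Hard-kill the Tomcat JVM so no orderly shutdown runs, then start it again
# (hypothetical helper; the pgrep pattern assumes the stock Bootstrap main class).
kill_and_restart() {
    pid=$(pgrep -f 'org.apache.catalina.startup.Bootstrap' | head -n 1)
    [ -n "$pid" ] && kill -9 "$pid"
    "$CATALINA_HOME/bin/startup.sh"
}
```

Run kill_and_restart on one node, trigger the exception on another node while
the first is starting, then run count_member_disappeared "$LOG" on the
restarting node to see whether the message showed up.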

The real kicker is that if left alone, with no other NPEs thrown, the
application on the restarting server recovers after a period of time that
seems to vary from 15 minutes or so to longer.

Any ideas on this?

- J
