Re: problem with clustering

Daniel Mikusa Thu, 04 Apr 2013 06:28:56 -0700

On Apr 4, 2013, at 6:43 AM, Andy Pahne wrote:

> 
> An application that has been running fine for years now suddenly performs 
> inconsistently: sometimes as quick as always, but sometimes a simple page 
> request takes up to 30 seconds.


If you haven't changed anything with the application or your Tomcat 
configuration, then you'll want to look at the external resources that your 
application depends upon, such as a database, the network, shared file systems, 
etc…  If the performance of an external resource is suffering, it could 
definitely be causing problems for your application.


> 
> Since the performance degraded, we regularly find log entries like the 
> following one in catalina.out (many of them, about 100 to 300 per hour on 
> each host):
> 
> 04.04.2013 11:51:53 
> org.apache.catalina.tribes.group.interceptors.TcpFailureDetector 
> memberDisappeared
> INFO: Verification complete. Member still 
> alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 6, 
> 21}:4000,{-64, -88, 6, 21},4000, alive=1706334,id={-99 120 -58 21 -84 121 74 
> 45 -104 -73 -123 -40 10 -76 70 59 }, payload={}, command={}, domain={}, ]]

I think you'll typically see these when there is a network issue, but they 
can show up any time a member times out.

The connections between the nodes in your cluster are monitored with a 
heartbeat.  When a node doesn't respond to the heartbeat the node is considered 
to have left the cluster.  To protect against false positives you can configure 
a TcpFailureDetector.  This listens for "memberDisappeared" events and when one 
occurs, it will connect to the member via TCP to try to confirm its 
disappearance.  

In your case, the message that you are seeing is indicating that the heartbeat 
failed, but that the TcpFailureDetector was able to verify the node still 
exists.  In other words, this is a false positive.
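
For reference, a TcpFailureDetector is normally declared as an interceptor 
inside the cluster's Channel element.  This is only a sketch using the class 
names from the Tomcat 6 clustering documentation; I haven't seen your 
server.xml, so treat the surrounding elements as assumptions:

```xml
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
  <Channel className="org.apache.catalina.tribes.group.GroupChannel">
    <!-- Membership, Receiver and Sender elements omitted for brevity -->

    <!-- Double-checks "memberDisappeared" events over TCP before the
         member is actually dropped from the cluster view -->
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
  </Channel>
</Cluster>
```

Since your log already shows the "Verification complete" message from this 
class, you most likely have it configured already.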

In addition to the TcpFailureDetector, you can also adjust the "frequency" and 
"dropTime" attributes on the Membership element.  "frequency" controls how 
often heartbeats are sent, and "dropTime" controls how long a member can go 
without sending a heartbeat before it is timed out.  You might try adjusting 
these settings to make the configuration more tolerant of your network.

  https://tomcat.apache.org/tomcat-6.0-doc/config/cluster-membership.html
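
As a rough illustration, both attributes are set on the Membership element.  
The address, port and frequency below are the documented defaults; the 
dropTime of 10000 ms is just an example value picked to show a more tolerant 
setting (the default is 3000 ms), not a recommendation for your network:

```xml
<!-- Example only: dropTime raised from the default 3000 ms to 10000 ms -->
<Membership className="org.apache.catalina.tribes.membership.McastService"
            address="228.0.0.4"
            port="45564"
            frequency="500"
            dropTime="10000"/>
```

Keep in mind that a larger dropTime also means a node that has really gone 
down takes longer to be noticed.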


> We ruled out that the recent changes to said application are the cause of 
> the poor performance by simulating all sorts of heavy load on various test 
> systems. It just works nicely in the test environment. However, on production 
> it does not.
> 
> We are using the SimpleTcpCluster solution for clustering on Tomcat 6. The 
> cluster has two nodes.

It would be helpful to post your configuration, minus comments, as well as the 
exact version of Tomcat that you are running.


> 
> I am NOT suspecting a tomcat bug. And as I said I am not suspecting a 
> performance bottleneck in our application or in the db queries it performs. 
> At the moment I am thinking of a hardware failure of some kind (network 
> interface, router etc.).
> 
> Do you have any experience with this problem and what did you do to resolve 
> it?

If you suspect a network issue, you could try capturing the network packets 
with Wireshark or tcpdump.  Analysis of the packets could show whether there is 
a problem.  Another option would be to use a tool like iperf to put a high load 
on your network and possibly trigger the problem.

Dan



> 
> Thanks,
> Andy
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
