Re: BackupManager start fails under heavy load
2013/6/28 Patrick Savage patrick.sav...@3pillarglobal.com

> We have an issue in our Tomcat 7.0.30 clustered production environment on RHEL 5 where Tomcat fails to start our application when other nodes in the cluster are under extremely heavy load. It fails because the BackupManager cannot start the replicated map due to timeouts trying to connect to all the other nodes. The only way to recover from this seems to be shutting down almost all of the nodes and then starting them again. The cluster has 9 nodes, but we have also had the problem with 6 nodes.
>
> Is there a way to ensure the application will start even if the BackupManager cannot connect to the other nodes?

No. If the replicated map fails to start, the associated context will fail to start. I will implement a feature to ensure the application will start even if the replicated map fails to start.

-- Keiichi.Fujino
Re: BackupManager start fails under heavy load
Hi Patrick,

A similar problem has been reported before:
http://tomcat.10.n6.nabble.com/org-apache-catalina-tribes-ChannelException-Operation-has-timed-out-3000-ms-Faulty-members-tcp-64-88-td4656393.html

The important error message from your log output is:

    Caused by: org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001; tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001; tcp://{10, 230, 20, 95}:4001; tcp://{10, 230, 20, 70}:4001; tcp://{10, 230, 20, 89}:4001;
        at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:109)
        ...

I am familiar with the code that generates this message; the problem is that the sending operation is abandoned for any sender object which has not been drained of data within timeout milliseconds. The timeout parameter is declared in the AbstractSender class as (long) 3000. By my reckoning, a small change to the timeout value could result in a large reduction in messaging failures.

According to information from this page:
http://tomcat.apache.org/tomcat-7.0-doc/config/cluster-sender.html
you should be able to increase the timeout parameter by setting a transport attribute thus:

    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender" timeout="4000"></Transport>
    </Sender>

However, I cannot find the code where the system reads the configuration to override the default value; if you make the alteration and find the error message still reports 3000 ms, this would indicate an oversight in the coding which could be reported.

BTW, your configuration for the receiver has selectorTimeout=100. The code suggests that this should be the same value as the sender/transport timeout (i.e. 3000). The documentation says the default is 5000. My examination of the code suggests that the PooledParallelSender class does not read the configuration but always uses 5000.
Nevertheless, you could try setting that value to 5000 and seeing what happens.

BTW, my own interest was to implement tribes at Internet connection speed; by manipulating the parameter in question, my system copes with data transfers that take multiple seconds.
http://tomcat.10.x6.nabble.com/overcoming-a-message-size-limitation-in-tribes-parallel-messaging-with-NioSender-tt4995446.html
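The timeout behaviour described in this reply can be pictured with a small simulation. This is only a sketch of the idea, not Tribes code: `SendTimeoutSketch`, `FakeSender`, the member names and the drain times are all hypothetical, and a simulated clock stands in for real socket I/O. It shows how a member that needs slightly longer than the timeout to drain ends up in the "faulty members" list, and how a modest increase of the timeout removes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SendTimeoutSketch {

    /** Hypothetical stand-in for one per-member sender; not a Tribes class. */
    static final class FakeSender {
        final String member;
        final long drainAtMillis; // simulated time at which all data has been written

        FakeSender(String member, long drainAtMillis) {
            this.member = member;
            this.drainAtMillis = drainAtMillis;
        }

        boolean isDrained(long elapsedMillis) {
            return elapsedMillis >= drainAtMillis;
        }
    }

    /**
     * Polls the senders on a simulated clock until all are drained or the
     * timeout has elapsed, then returns the members that are still pending.
     */
    static List<String> send(List<FakeSender> senders, long timeoutMillis, long stepMillis) {
        List<FakeSender> pending = new ArrayList<>(senders);
        long elapsed = 0;
        while (!pending.isEmpty() && elapsed < timeoutMillis) {
            elapsed += stepMillis;
            final long now = elapsed;
            pending.removeIf(s -> s.isDrained(now));
        }
        // Anything left was not drained in time and is given up on.
        List<String> faulty = new ArrayList<>();
        for (FakeSender s : pending) {
            faulty.add(s.member);
        }
        return faulty;
    }

    public static void main(String[] args) {
        List<FakeSender> senders = Arrays.asList(
                new FakeSender("node-a", 500),    // idle node, drains quickly
                new FakeSender("node-b", 3500));  // loaded node, needs 3.5 s
        System.out.println(send(senders, 3000, 100)); // [node-b]
        System.out.println(send(senders, 4000, 100)); // []
    }
}
```

In this model the send as a whole is judged against a single deadline, so one slow peer is enough to produce a failure even when every other peer responded promptly.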
BackupManager start fails under heavy load
We have an issue in our Tomcat 7.0.30 clustered production environment on RHEL 5 where Tomcat fails to start our application when other nodes in the cluster are under extremely heavy load. It fails because the BackupManager cannot start the replicated map due to timeouts trying to connect to all the other nodes. The only way to recover from this seems to be shutting down almost all of the nodes and then starting them again. The cluster has 9 nodes, but we have also had the problem with 6 nodes.

Is there a way to ensure the application will start even if the BackupManager cannot connect to the other nodes? Or is there some configuration we can change to prevent the connection failure? E.g. should we set some of the attributes on Transport such as timeout, maxRetryAttempts, poolSize, or maxWait?

We see many messages like this when we have relatively heavy load:

    WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms.

Should we increase maxThreads on Receiver to avoid this message and potentially prevent the connection failures?
This is the server.xml:

    <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">
      <Manager className="org.apache.catalina.ha.session.BackupManager"
               mapSendOptions="8"
               expireSessionsOnShutdown="false"
               notifyListenersOnReplication="true"/>
      <Channel className="org.apache.catalina.tribes.group.GroupChannel">
        <Membership className="org.apache.catalina.tribes.membership.McastService"
                    address="228.0.0.4" port="45564" frequency="500" dropTime="3000"/>
        <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                  address="auto" port="4001" selectorTimeout="100" maxThreads="6"/>
        <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
          <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
        </Sender>
        <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
        <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
        <Interceptor className="org.apache.catalina.tribes.group.interceptors.ThroughputInterceptor"/>
      </Channel>
      <Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
             filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.htm|.*\.html|.*\.css|.*\.txt"/>
      <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve"/>
      <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
      <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
    </Cluster>

This is the log for the tcp://{10, 230, 20, 85} node, which failed to start up because it timed out connecting to 6 of the 8 other nodes:

    Jun 26, 2013 5:29:01 PM org.apache.catalina.tribes.tipis.AbstractReplicatedMap init
    WARNING: Unable to send map start message.
    Jun 26, 2013 5:29:01 PM org.apache.catalina.ha.session.BackupManager startInternal
    SEVERE: Unable to start BackupManager: [/appv5]
    java.lang.RuntimeException: Unable to start replicated map.
        at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicatedMap.java:234)
        at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicatedMap.java:176)
        at org.apache.catalina.tribes.tipis.LazyReplicatedMap.init(LazyReplicatedMap.java:104)
        at org.apache.catalina.ha.session.BackupManager.startInternal(BackupManager.java:163)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:963)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1600)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
    Caused by: org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001; tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001; tcp://{10, 230,