Re: BackupManager start fails under heavy load

2013-07-02 Thread Keiichi Fujino
2013/6/28 Patrick Savage <patrick.sav...@3pillarglobal.com>

 We have an issue in our Tomcat 7.0.30 clustered production environment on
 RHEL 5 where Tomcat fails to start our application when other nodes in the
 cluster are under extremely heavy load. It fails because the BackupManager
 cannot start the replicated map due to timeouts trying to connect to all
 the
 other nodes. The only way to recover from this seems to be shutting down
 almost all of the nodes and then starting them again. The cluster has 9
 nodes, but we have also had the problem with 6 nodes.



 Is there a way to ensure the application will start even if the
 BackupManager cannot connect to the other nodes?


No.
If the replicated map fails to start, the associated context will fail to
start.

I will implement a feature to ensure the application starts even if the
replicated map fails to start.

-- 
Keiichi.Fujino


Re: BackupManager start fails under heavy load

2013-06-28 Thread Vince Stewart
Hi Patrick,
A similar problem has been reported before:
http://tomcat.10.n6.nabble.com/org-apache-catalina-tribes-ChannelException-Operation-has-timed-out-3000-ms-Faulty-members-tcp-64-88-td4656393.html
The important error message from your log output is:

  Caused by: org.apache.catalina.tribes.ChannelException: Operation has
  timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001;
  tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001;
  tcp://{10, 230, 20, 95}:4001; tcp://{10, 230, 20, 70}:4001;
  tcp://{10, 230, 20, 89}:4001;
    at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:109)
    ...


I am familiar with the code that generates this message; the problem is
that the sending operation is abandoned for any sender object that has not
been drained of data within timeout milliseconds. The timeout parameter is
declared in the AbstractSender class as (long) 3000. By my reckoning, a small
change to the timeout value could result in a large reduction in messaging
failures.

According to information from this page:
http://tomcat.apache.org/tomcat-7.0-doc/config/cluster-sender.html

you should be able to increase the timeout parameter by setting a transport
attribute, thus:

  <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
    <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
               timeout="4000"/>
  </Sender>

However, I cannot find the code where the system reads the configuration
to override the default value; if you make the alteration and find the
error message still reports 3000 ms, this would indicate an oversight in
the code that could be reported.

BTW, your receiver configuration has
selectorTimeout="100"

The code suggests that this should be the same value as the sender/transport
timeout (i.e. 3000). The documentation says the default is 5000. My
examination of the code suggests that the PooledParallelSender class does
not read the configuration but always uses 5000. Nevertheless, you could
try setting that value to 5000 and see what happens.
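
For illustration, that change applied to your existing receiver element
would look like this (untested; only the selectorTimeout value differs from
your current configuration):

  <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
            address="auto"
            port="4001"
            selectorTimeout="5000"
            maxThreads="6"/>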

BTW, my own interest was in running Tribes at Internet connection speeds;
by manipulating the parameter in question, my system copes with data
transfers that take multiple seconds.
http://tomcat.10.x6.nabble.com/overcoming-a-message-size-limitation-in-tribes-parallel-messaging-with-NioSender-tt4995446.html


BackupManager start fails under heavy load

2013-06-27 Thread Patrick Savage
We have an issue in our Tomcat 7.0.30 clustered production environment on
RHEL 5 where Tomcat fails to start our application when other nodes in the
cluster are under extremely heavy load. It fails because the BackupManager
cannot start the replicated map due to timeouts trying to connect to all the
other nodes. The only way to recover from this seems to be shutting down
almost all of the nodes and then starting them again. The cluster has 9
nodes, but we have also had the problem with 6 nodes.

 

Is there a way to ensure the application will start even if the
BackupManager cannot connect to the other nodes? Or is there some
configuration we can change to prevent the connection failure? E.g. should
we set some of the attributes on Transport such as timeout,
maxRetryAttempts, poolSize, or maxWait?
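
For example, we imagine something like the following on the Transport
element (the values here are made up purely for illustration, not tested
recommendations):

  <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
    <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
               timeout="15000"
               maxRetryAttempts="2"
               poolSize="25"
               maxWait="3000"/>
  </Sender>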

 

We see many "WARNING: Channel key is registered, but has had no interest ops
for the last 3000 ms." messages when we have relatively heavy load.
Should we increase maxThreads on the Receiver to avoid this message and
potentially prevent the connection failures?
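
If increasing maxThreads is the right approach, the change would be to our
existing Receiver element, e.g. (the value 12 is just a guess for
illustration):

  <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
            address="auto"
            port="4001"
            selectorTimeout="100"
            maxThreads="12"/>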

 

This is the server.xml:

 

  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
           channelSendOptions="8">

    <Manager className="org.apache.catalina.ha.session.BackupManager"
             mapSendOptions="8"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>

    <Channel className="org.apache.catalina.tribes.group.GroupChannel">

      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.4"
                  port="45564"
                  frequency="500"
                  dropTime="3000"/>

      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="auto"
                port="4001"
                selectorTimeout="100"
                maxThreads="6"/>

      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
      </Sender>

      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.ThroughputInterceptor"/>

    </Channel>

    <Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
           filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.htm|.*\.html|.*\.css|.*\.txt"/>

    <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve"/>

    <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
    <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>

  </Cluster>

 

-

 

This is the log for the tcp://{10, 230, 20, 85} node that failed to start up
because it timed out connecting to 6 of the 8 other nodes:

 

  Jun 26, 2013 5:29:01 PM org.apache.catalina.tribes.tipis.AbstractReplicatedMap init
  WARNING: Unable to send map start message.
  Jun 26, 2013 5:29:01 PM org.apache.catalina.ha.session.BackupManager startInternal
  SEVERE: Unable to start BackupManager: [/appv5]
  java.lang.RuntimeException: Unable to start replicated map.
    at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicatedMap.java:234)
    at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.<init>(AbstractReplicatedMap.java:176)
    at org.apache.catalina.tribes.tipis.LazyReplicatedMap.<init>(LazyReplicatedMap.java:104)
    at org.apache.catalina.ha.session.BackupManager.startInternal(BackupManager.java:163)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:963)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1600)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
  Caused by: org.apache.catalina.tribes.ChannelException: Operation has
  timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001;
  tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001; tcp://{10, 230,