Sorry for not replying for a while, but I was analyzing the logfiles and trying to reproduce the behaviour we see on our production system. Thanks to the answers here I think I now understand better what is going on, and I did indeed find a way to reproduce the behaviour.
First, I was wrong in my assumption that the channels are never rebound to JNDI when the master node fails. Here is what happens: initially node 210 is the master node, and node 211 is a "slave" (I hope the terminology is correct). At 08:14:24 node 211 begins to receive new views. Taken from 211's logfile:

  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 201, delta: -2) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])

As node 211 is now the master node and node 210 is in the list of dead members, node 211 deploys all channels, as it should. Taken from 211's logfile:

  2006-06-21 08:14:25,496 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
  2006-06-21 08:14:26,916 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI name: topic/sgw/MOCacheInvalidationTopic
  2006-06-21 08:14:26,917 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
  [...]

But: node 210 did not receive view 201 at all, so this node still has all the channels deployed as well.
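The core of the problem can be illustrated with a small sketch: every node applies the same deterministic election rule to its local membership view, so a node that misses a view change keeps electing itself. The rule below (master = first member of the view) is a simplification of JBoss's actual HASingleton election, and the member order of the stale view held by node 210 is an assumption for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class MasterElection {
    // Assumption: the master is the node at a fixed position (index 0)
    // of the node's *local* view; every node evaluates this independently.
    static boolean isMaster(List<String> view, String self) {
        return !view.isEmpty() && view.get(0).equals(self);
    }

    public static void main(String[] args) {
        // View 201 as seen by node 211 (taken from the log above):
        List<String> view201 = Arrays.asList(
            "62.50.43.211:1099", "62.50.43.213:1099",
            "62.50.43.216:1099", "62.50.43.215:1099");
        // The older 6-member view still held by node 210, which never
        // received view 201 (member order here is an assumption):
        List<String> staleView = Arrays.asList(
            "62.50.43.210:1099", "62.50.43.211:1099", "62.50.43.213:1099",
            "62.50.43.216:1099", "62.50.43.215:1099", "62.50.43.214:1099");

        System.out.println(isMaster(view201, "62.50.43.211:1099"));   // true
        System.out.println(isMaster(staleView, "62.50.43.210:1099")); // true
        // Both nodes conclude they are master -> two masters at once.
    }
}
```

With two different local views, both checks return true, which is exactly the "two master nodes" window described below.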
The next thing I see in 211's logfile is that node 214 is still sending messages, although from 211's point of view it is no longer a cluster member. I do not know if this is of any relevance, but I wanted to mention it to give you a complete picture. Taken from 211's logfile:

  2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr 62.50.43.214:54923 (additional data: 17 bytes) is not a member !
  2006-06-21 08:14:29,987 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)

Next, 211 receives two more view changes (ids 202 and 203). Taken from 211's logfile:

  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 202, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099]
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 203, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099]
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

Node 210 did not receive view 202, but it did receive view 203. After receiving view 203 node 210 is aware that it is no longer the master node, and it undeploys the channels. Taken from 210's logfile:

  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
  2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
  2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
  2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
  2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue

And exactly from this point onwards the nodes cannot look up any channel anymore, although the channels are bound on 211 and 211 is the master node according to view messages 201 - 203. The following messages appear on all nodes in the cluster:

  javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
      at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
      at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)

I guess it has something to do with the short period of time during which, in effect, two master nodes existed in the cluster. After view 201 node 211 thinks it is the master node, but because node 210 did not receive that view, 210 also still thinks it is the master node. This situation lasted for ~10 seconds, until view 203 was received by node 210. How does JGroups/JBoss handle such a scenario?

In order to reproduce the behaviour we did the following: we set up a cluster with 4 nodes on 2 machines, with the same configuration as on the production system. We installed a script which blocks UDP traffic periodically: it runs every second and, with a probability of 50%, blocks incoming UDP traffic for one second. That way we wanted to simulate network "jitter", because we suspect that UDP packets somehow get lost on the production system. By running this script I can see the same behaviour on our test cluster.
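For reference, the jitter script described above can be sketched roughly like this. This is a minimal reconstruction under assumptions (Linux, iptables, run as root; the firewall rule and chain your real script uses may differ), not our exact script:

```shell
#!/bin/bash
# Sketch of the UDP "jitter" simulator: once per second, with ~50%
# probability, drop all incoming UDP traffic for one second.

coin_flip() {
    # succeeds ("heads") with ~50% probability; $RANDOM is 0..32767
    [ "$RANDOM" -lt 16384 ]
}

block_udp()   { iptables -I INPUT -p udp -j DROP; }  # insert DROP rule
unblock_udp() { iptables -D INPUT -p udp -j DROP; }  # delete it again

# Main loop (commented out so the file can be sourced without root):
# while true; do
#     if coin_flip; then
#         block_udp
#         sleep 1
#         unblock_udp
#     fi
#     sleep 1
# done
```

Note that `-I INPUT` drops UDP on all interfaces; to disturb only cluster traffic, the rule could be narrowed to the JGroups multicast port.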
View change messages get lost from time to time, and after that happens the nodes fail to look up the channels, although the channels are always present on one of the nodes. Because I was able to reproduce the problem I thought it was important to let you know. For now I do not want to switch to the TCP stack, because we are not yet 100% sure that UDP packets really do get lost. According to our hoster, everything is fine with the network (but, hey, they always tell you that ;)

Anyway, if you have any more hints on how to solve this problem, or any questions about our test setup for reproducing the behaviour, please let me know. Thanks very much for your answers so far; I really appreciate the time you guys put into answering questions here in the forum!

Thanks,
Jochen

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955918#3955918

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955918