Sorry for not replying for a while, but I was analyzing the logfiles and trying
to reproduce the behaviour we have on our production system. Thanks to the
answers here I think I now understand better what is going on, and I indeed
found a way to reproduce the behaviour.
First, I was wrong in my assumption that the channels are never rebound to JNDI
when the master node fails. Here's what happens:
Initially node 210 is the master node and node 211 is a "slave" (I hope the
terminology is correct). At 08:14:24 node 211 begins to receive new views.
Taken from 211's logfile:
2006-06-21 08:14:24,757 INFO
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New
cluster view for partition StagePartition (id: 201, delta: -2) :
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
2006-06-21 08:14:24,757 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:24,757 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,757 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
New Members : 0 ([])
2006-06-21 08:14:24,757 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099,
62.50.43.215:1099])
As node 211 is now the master node and node 210 is in the list of dead members,
node 211 deploys all channels, as it should.
Taken from 211's logfile:
2006-06-21 08:14:25,496 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy,
ctxPath=/jbossmq-httpil,
warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
2006-06-21 08:14:26,916 INFO
[org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI
name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:26,917 INFO
[org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI
name: topic/sgw/CdaHtmlCacheInvalidationTopic
[...]
But: node 210 did not receive view 201 at all, so it still has all the
channels deployed as well. The next thing I see in 211's logfile is that
node 214 is still sending messages, even though, from 211's viewpoint, it is
no longer a cluster member. I do not know whether this is of any relevance,
but I wanted to mention it to give you a complete picture.
Taken from 211's logfile:
2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr
62.50.43.214:54923 (additional data: 17 bytes) is not a member !
2006-06-21 08:14:29,987 INFO
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition]
Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)
Next, 211 receives two more view changes (ids 202 and 203).
Taken from 211's logfile:
2006-06-21 08:14:34,867 INFO
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New
cluster view for partition StagePartition (id: 202, delta: 1) :
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099]
2006-06-21 08:14:34,867 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:34,867 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
Dead members: 0 ([])
2006-06-21 08:14:34,867 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
New Members : 1 ([62.50.43.214:1099])
2006-06-21 08:14:34,867 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099,
62.50.43.215:1099, 62.50.43.214:1099])
2006-06-21 08:14:35,021 INFO
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New
cluster view for partition StagePartition (id: 203, delta: 1) :
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099,
62.50.43.210:1099]
2006-06-21 08:14:35,021 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:35,021 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
Dead members: 0 ([])
2006-06-21 08:14:35,021 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
New Members : 1 ([62.50.43.210:1099])
2006-06-21 08:14:35,021 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099,
62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
Node 210 did not receive view 202, but it did receive view 203. After
receiving view 203, node 210 is aware that it is no longer the master node,
and it undeploys the channels:
Taken from 210's logfile:
2006-06-21 08:14:35,049 INFO
[org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view
for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099,
62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
2006-06-21 08:14:35,049 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
I am (62.50.43.210:1099) received membershipChanged event:
2006-06-21 08:14:35,049 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
Dead members: 0 ([])
2006-06-21 08:14:35,049 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
New Members : 0 ([])
2006-06-21 08:14:35,049 INFO
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition]
All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099,
62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
2006-06-21 08:14:35,329 INFO
[org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI
name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:35,465 INFO
[org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding
JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
2006-06-21 08:14:35,466 INFO
[org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name:
queue/sgw/AlertUserQueue
2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue]
Unbinding JNDI name: queue/sgw/UserQueue
And from exactly this point onwards the two nodes cannot look up any channel
anymore, although the channels are bound on 211, and 211 is the master node
according to view messages 201 - 203. The following messages appear on all
nodes in the cluster:
javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
	at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
	at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
	at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
	at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)
I guess it has something to do with the short period of time during which, in
effect, two master nodes existed in the cluster. After view 201, node 211
thinks it is the master node, but because node 210 never received that view,
it also still thinks it is the master node. This situation lasted for ~10
seconds, until node 210 received view 203. How does JGroups/JBoss handle such
a scenario?
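As far as I understand, a JGroups stack normally includes a MERGE2 protocol, which is supposed to detect coexisting subgroups (i.e. two coordinators, like 210 and 211 here) and merge them back into one group. Purely for illustration, such a stack fragment might look like this; the parameter values below are made-up examples, not taken from our configuration:

```xml
<!-- Illustrative JGroups stack excerpt; parameter values are examples only -->
<Config>
  <UDP mcast_addr="228.1.2.3" mcast_port="45566"/>
  <PING timeout="2000" num_initial_members="3"/>
  <!-- MERGE2 periodically checks for multiple coordinators and
       triggers a merge of the subgroups when it finds them -->
  <MERGE2 min_interval="10000" max_interval="20000"/>
  <FD timeout="2500" max_tries="5"/>
  <pbcast.GMS join_timeout="3000" shun="true" print_local_addr="true"/>
</Config>
```

What I cannot tell from the logs is whether such a merge also re-triggers the HA singleton master election, or whether the channels simply stay unbound after the merge, which is what we seem to observe.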
In order to reproduce the behaviour we did the following: We set up a cluster
with 4 nodes on 2 machines with the same configuration as on the production
system. We installed a script which blocks the UDP traffic periodically. The
script runs every second and blocks the incoming UDP traffic for another second
with a probability of 50%. That way we wanted to simulate network "jitter",
because we suspect that UDP packets get lost somehow on the production system.
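To make the test setup concrete, here is a minimal sketch of what such a jitter script can look like, assuming Linux with iptables and root privileges; our actual script may differ in details:

```shell
#!/bin/bash
# Sketch of the UDP "jitter" script described above: every second, with
# ~50% probability, drop all incoming UDP for one second.

# Succeeds (exit 0) with ~50% probability, using bash's $RANDOM (0..32767).
coin_flip() { [ $((RANDOM % 2)) -eq 0 ]; }

jitter_loop() {
    while true; do
        if coin_flip; then
            iptables -I INPUT 1 -p udp -j DROP   # start dropping incoming UDP
            sleep 1
            iptables -D INPUT -p udp -j DROP     # remove the rule again
        else
            sleep 1
        fi
    done
}

# jitter_loop   # uncomment to run; stop with Ctrl-C
```

The effect is bursts of one-second windows in which all incoming UDP (including the JGroups multicast traffic) is silently dropped on that machine, which is apparently enough to make individual view changes disappear on a node.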
I can see the same behaviour on our test cluster by running this script. View
change messages get lost from time to time, and when that happens the nodes
fail to look up the channels, although the channels are always present on one
of the nodes. Because I was able to reproduce the problem, I thought it was
important to let you know. Currently I do not want to switch to the TCP
stack, because we are not yet 100% sure that UDP packets really get lost.
According to our hoster, everything is fine with the network (but, hey, they
always tell you that ;)
Anyway, if you have any more hints on how to solve this problem, or any
questions about our test setup for reproducing the behaviour, please let me
know. Thanks very much for your answers so far; I really appreciate the time
you guys put into answering questions here in the forum!
Thanks,
Jochen
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955918#3955918
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955918
_______________________________________________
JBoss-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jboss-user