Sorry for not replying for a while, but I was analyzing the logfiles and trying to reproduce the behaviour we see on our production system. Thanks to the answers here I think I now understand better what is going on, and I did indeed find a way to reproduce the behaviour.
First, I was wrong in my assumption that the channels are never rebound to JNDI when the master node fails. Here is what happens: initially node 210 is the master node, and node 211 is a "slave" (I hope the terminology is correct). At 08:14:24 node 211 begins to receive new views. Taken from 211's logfile:

  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 201, delta: -2) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
  2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])

As node 211 is now the master node and node 210 is in the list of dead members, node 211 deploys all channels, as it should. Taken from 211's logfile:

  2006-06-21 08:14:25,496 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
  2006-06-21 08:14:26,916 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI name: topic/sgw/MOCacheInvalidationTopic
  2006-06-21 08:14:26,917 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
  [...]

But: node 210 did not receive view 201 at all, so this node still has all the channels deployed as well.
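The core of the problem can be illustrated with a small sketch: every node applies the same deterministic election rule to its local membership view, so a node that misses a view change keeps electing itself. The rule below (master = first member of the view) is a simplification of JBoss's actual HASingleton election, and the member order of the stale view held by node 210 is an assumption for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class MasterElection {
    // Assumption: the master is the node at a fixed position (index 0)
    // of the node's *local* view; every node evaluates this independently.
    static boolean isMaster(List<String> view, String self) {
        return !view.isEmpty() && view.get(0).equals(self);
    }

    public static void main(String[] args) {
        // View 201 as seen by node 211 (taken from the log above):
        List<String> view201 = Arrays.asList(
            "62.50.43.211:1099", "62.50.43.213:1099",
            "62.50.43.216:1099", "62.50.43.215:1099");
        // The older 6-member view still held by node 210, which never
        // received view 201 (member order here is an assumption):
        List<String> staleView = Arrays.asList(
            "62.50.43.210:1099", "62.50.43.211:1099", "62.50.43.213:1099",
            "62.50.43.216:1099", "62.50.43.215:1099", "62.50.43.214:1099");

        System.out.println(isMaster(view201, "62.50.43.211:1099"));   // true
        System.out.println(isMaster(staleView, "62.50.43.210:1099")); // true
        // Both nodes conclude they are master -> two masters at once.
    }
}
```

With two different local views, both checks return true, which is exactly the "two master nodes" window described below.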
The next thing I see in 211's logfile is that node 214 is still sending messages, although from 211's point of view it is no longer a cluster member. I do not know if this is of any relevance, but I wanted to mention it to give you a complete picture. Taken from 211's logfile:

  2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr 62.50.43.214:54923 (additional data: 17 bytes) is not a member !
  2006-06-21 08:14:29,987 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)

Next, 211 receives two more view changes (ids 202 and 203). Taken from 211's logfile:

  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 202, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099]
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
  2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 203, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099]
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
  2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

Node 210 did not receive view 202, but it did receive view 203. After receiving view 203 node 210 is aware that it is no longer the master node, and it undeploys the channels. Taken from 210's logfile:

  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
  2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])
  2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
  2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
  2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
  2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue

And exactly from this point onwards the nodes cannot look up any channel anymore, although the channels are bound on 211 and 211 is the master node according to view messages 201 - 203. The following messages appear on all nodes in the cluster:

  javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
      at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
      at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
      at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)

I guess it has something to do with the short period of time during which, in effect, two master nodes existed in the cluster. After view 201 node 211 thinks it is the master node, but because node 210 did not receive that view, 210 also still thinks it is the master node. This situation lasted for ~10 seconds, until view 203 was received by node 210. How does JGroups/JBoss handle such a scenario?

In order to reproduce the behaviour we did the following: we set up a cluster with 4 nodes on 2 machines, with the same configuration as on the production system. We installed a script which blocks UDP traffic periodically: it runs every second and, with a probability of 50%, blocks incoming UDP traffic for one second. That way we wanted to simulate network "jitter", because we suspect that UDP packets somehow get lost on the production system. By running this script I can see the same behaviour on our test cluster.
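For reference, the jitter script described above can be sketched roughly like this. This is a minimal reconstruction under assumptions (Linux, iptables, run as root; the firewall rule and chain your real script uses may differ), not our exact script:

```shell
#!/bin/bash
# Sketch of the UDP "jitter" simulator: once per second, with ~50%
# probability, drop all incoming UDP traffic for one second.

coin_flip() {
    # succeeds ("heads") with ~50% probability; $RANDOM is 0..32767
    [ "$RANDOM" -lt 16384 ]
}

block_udp()   { iptables -I INPUT -p udp -j DROP; }  # insert DROP rule
unblock_udp() { iptables -D INPUT -p udp -j DROP; }  # delete it again

# Main loop (commented out so the file can be sourced without root):
# while true; do
#     if coin_flip; then
#         block_udp
#         sleep 1
#         unblock_udp
#     fi
#     sleep 1
# done
```

Note that `-I INPUT` drops UDP on all interfaces; to disturb only cluster traffic, the rule could be narrowed to the JGroups multicast port.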
View change messages get lost from time to time, and after that happens the nodes fail to look up the channels, although the channels are always present on one of the nodes. Because I was able to reproduce the problem I thought it was important to let you know. For now I do not want to switch to the TCP stack, because we are not yet 100% sure that UDP packets really do get lost. According to our hoster, everything is fine with the network (but, hey, they always tell you that ;)

Anyway, if you have any more hints on how to solve this problem, or any questions about our test setup for reproducing the behaviour, please let me know. Thanks very much for your answers so far; I really appreciate the time you guys put into answering questions here in the forum!

Thanks,
Jochen

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3955918#3955918

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3955918