[JBoss-user] [Clustering/JBoss] - Re: HA-JMS fails, Master node undeploying channels, no failo
Sorry for not replying for a while, but I was analyzing the logfiles and trying to reproduce the behaviour we have on our production system. Thanks to the answers here I think I now understand better what is going on, and I did find a way to reproduce the behaviour.

First, I was wrong in my assumption that the channels are never rebound to JNDI when the master node fails. Here's what happens:

Initially node 210 is the master node, and node 211 is a slave (hope the terminology is correct). At 08:14:24 node 211 begins to receive new views. Taken from 211's logfile:

2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 201, delta: -2) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
2006-06-21 08:14:24,757 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])

As node 211 is now the master node and node 210 is in the list of dead members, node 211 deploys all channels, like it should.
Taken from 211's logfile:

2006-06-21 08:14:25,496 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
2006-06-21 08:14:26,916 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:26,917 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
[...]

But: Node 210 did not receive view 201 at all, so this node still has all the channels deployed as well.

The next thing I see in the logfile of 211 is that node 214 is still sending messages, although from the viewpoint of 211 it is no longer a cluster member. I do not know if this is of any relevance, but to give you a complete picture I wanted to mention it. Taken from 211's logfile:

2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr 62.50.43.214:54923 (additional data: 17 bytes) is not a member !
2006-06-21 08:14:29,987 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)

Next, 211 receives two more view changes (id 202 and 203).
Taken from 211's logfile:

2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 202, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099]
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.214:1099])
2006-06-21 08:14:34,867 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099])
2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New cluster view for partition StagePartition (id: 203, delta: 1) : [62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099]
2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 1 ([62.50.43.210:1099])
2006-06-21 08:14:35,021 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

Node 210 did not receive view 202, but it did receive view 203.
After receiving view 203 node 210 is aware
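
The view-by-view comparison described in this post can be automated. The following is a minimal sketch, assuming the two layouts of the "New cluster view" message quoted in this thread are the only ones to handle; the function names parse_view and missed_view_ids are made up for illustration and are not part of JBoss or JGroups. It extracts the view id and member list from each log line and reports the view ids a node never logged.

```python
import re

# The two layouts of the "New cluster view" message quoted in this thread.
VIEW_PATTERNS = [
    # New cluster view for partition StagePartition (id: 201, delta: -2) : [a, b]
    re.compile(r"New cluster view for partition (?P<partition>\S+) "
               r"\(id: (?P<id>\d+), delta: (?P<delta>-?\d+)\) : "
               r"\[(?P<members>[^\]]*)\]"),
    # New cluster view for partition StagePartition: 203 ([a, b] delta: 0)
    re.compile(r"New cluster view for partition (?P<partition>\S+): (?P<id>\d+) "
               r"\(\[(?P<members>[^\]]*)\] delta: (?P<delta>-?\d+)\)"),
]


def parse_view(line):
    """Return (view_id, members) for a view-change log line, or None."""
    for pattern in VIEW_PATTERNS:
        match = pattern.search(line)
        if match:
            members = [m.strip() for m in match.group("members").split(",")
                       if m.strip()]
            return int(match.group("id")), members
    return None


def missed_view_ids(lines):
    """List the view ids skipped between consecutive views in one node's log."""
    seen = [view[0] for view in map(parse_view, lines) if view is not None]
    missed = []
    for previous, current in zip(seen, seen[1:]):
        missed.extend(range(previous + 1, current))
    return missed


# Message parts of the view lines from 211's logfile (timestamps and logger omitted).
node_211_views = [
    "New cluster view for partition StagePartition (id: 201, delta: -2) : "
    "[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]",
    "New cluster view for partition StagePartition (id: 202, delta: 1) : "
    "[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, "
    "62.50.43.214:1099]",
    "New cluster view for partition StagePartition (id: 203, delta: 1) : "
    "[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, "
    "62.50.43.214:1099, 62.50.43.210:1099]",
]

print(missed_view_ids(node_211_views))                             # []
# A log that contains view 201 and then view 203 has skipped view 202.
print(missed_view_ids([node_211_views[0], node_211_views[2]]))     # [202]
```

Running missed_view_ids over each node's complete logfile would show which nodes skipped views and at what point, which is the comparison done by hand above.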
[JBoss-user] [Clustering/JBoss] - Re: HA-JMS fails, Master node undeploying channels, no failo
Thanks very much for your reply. I examined the logfiles again to answer your questions:

[EMAIL PROTECTED] wrote :
| 1) You refer to the master node. Please confirm that this is 62.50.43.211.

No, at that time the master node was 62.50.43.210. The first log output and the second one are from this machine, which means that the master node (62.50.43.210) produced the output Dead members: 0, New members: 0 and immediately after that undeployed all the HA-Queues and HA-Topics. Sorry, I should have made that clear in my first post.

[EMAIL PROTECTED] wrote :
| 2) On the node that produced the first bit of logging in your post, do you see log entries with this content New cluster view for partition StagePartition: 202 and New cluster view for partition StagePartition: 201?

No, these messages are not present in the logfile.

[EMAIL PROTECTED] wrote :
| 3) If you have a log entry somewhere that contains New cluster view for partition StagePartition: 200, please compare the list of nodes to the first line in the first log entry in your post. Does it have the same 6 nodes but in different order?

You are right, I can see the same nodes, but in a different order.

[EMAIL PROTECTED] wrote :
| What I'm driving at here is I wonder if the machine doing the first bit of logging lost a couple view changes, going from 200 to 203. The result would be Dead members:0, New members: 0 but a different order of members.

Thanks, now I am starting to understand what is happening. You are right that the machine did lose some of the view changes; that's a problem I probably have to investigate on the network level. But the most interesting question for me is: even if the (master) node lost some view changes, why does it suddenly undeploy the (HA-)queues and (HA-)topics? And why is the failover not happening, with no other node starting to deploy the queues and topics instead? I cannot explain how this is possible, and I also found no information in the docs or in the forums on this issue.
The critical thing is that if I run into this scenario my HA-Queues and HA-Topics are not present on any instance, leading to lost messages and therefore also lost data. This situation should not be possible at all in a cluster. I am not quite sure if this is a cluster issue (I guess so), so if it is something related to JMS please let me know so I can ask in JMS-Forum. BTW: This is the only real problem we have with the JBoss platform. Everything else is working fine and stable. Developing with JBoss really was a breeze, so thanks for this great piece of software. Thanks again for your help.

Jochen

View the original post : http://www.jboss.com/index.html?module=bbop=viewtopicp=3954296#3954296

___ JBoss-user mailing list JBoss-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/jboss-user
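
The "lost view changes" hypothesis discussed in this post can be illustrated with a small sketch. It assumes that the dead and new member lists are computed as plain set differences between the previous view a node knew and the view it has just received; that is an assumption for illustration, not a confirmed description of DistributedReplicantManagerImpl. The member order used for view 200 is hypothetical as well: the thread only states that view 200 contained the same six nodes in a different order.

```python
def membership_delta(old_view, new_view):
    """Return (dead, new) member lists when moving from old_view to new_view."""
    dead = [member for member in old_view if member not in new_view]
    new = [member for member in new_view if member not in old_view]
    return dead, new


# HYPOTHETICAL order for view 200 (the real order is not quoted in the thread).
view_200 = ["62.50.43.210:1099", "62.50.43.211:1099", "62.50.43.213:1099",
            "62.50.43.216:1099", "62.50.43.215:1099", "62.50.43.214:1099"]
# Views 201 and 203 as logged in the thread.
view_201 = ["62.50.43.211:1099", "62.50.43.213:1099", "62.50.43.216:1099",
            "62.50.43.215:1099"]
view_203 = ["62.50.43.211:1099", "62.50.43.213:1099", "62.50.43.216:1099",
            "62.50.43.215:1099", "62.50.43.214:1099", "62.50.43.210:1099"]

# A node that sees every view (like 211) notices the two dead members at view 201.
print(membership_delta(view_200, view_201))
# (['62.50.43.210:1099', '62.50.43.214:1099'], [])

# A node that jumps straight from 200 to 203 (like 210) sees no change in membership ...
print(membership_delta(view_200, view_203))   # ([], [])
# ... although the order of the members is different.
print(view_200 == view_203)                   # False
```

If the singleton master were chosen by position in the member list (again an assumption, not something confirmed in this thread), a pure reordering like this could change which node considers itself master without any member being reported dead or new.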
[JBoss-user] [Clustering/JBoss] - HA-JMS fails, Master node undeploying channels, no failover
First, sorry for the lengthy post, but I need to describe the problem in detail:

We have a cluster of 6 JBoss instances (JBoss 4.0.3SP1) on 3 physical machines. Each machine runs two JBoss instances and each JBoss instance has its own IP. Each machine has one network adapter with two IP addresses. We use UDP as the transport layer in JGroups (config below). From the range of cluster services we only use HA-JMS, that is, clustered topics and queues.

Everything works fine, but from time to time (every 2-4 days) the HA-JMS fails completely, which means that messages get lost. That should not happen at all (that's why we use a cluster).

Here's what happens: All instances are up and running, and I can see that all 6 instances participate in the cluster. Suddenly on the master node I see a log file entry like this:

2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] delta: 0)
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.210:1099) received membershipChanged event:
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 0 ([])
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
2006-06-21 08:14:35,049 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

The first strange thing is: Dead members: 0, New members: 0, which I read as nothing has changed at all ;)

Directly after this message, the master node starts to undeploy all queues and topics:

2006-06-21 08:14:35,329 INFO [org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:35,465 INFO [org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: queue/sgw/AlertUserQueue
2006-06-21 08:14:35,466 INFO [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] Unbinding JNDI name: queue/sgw/UserQueue
2006-06-21 08:14:35,467 INFO [org.jboss.mq.server.jmx.Queue.sgw/OrderQueue] Unbinding JNDI name: queue/sgw/OrderQueue
[...]
2006-06-21 08:14:35,470 INFO [org.jboss.mq.server.jmx.Queue.DLQ] Unbinding JNDI name: queue/DLQ
2006-06-21 08:14:35,546 INFO [org.jboss.web.tomcat.tc5.TomcatDeployer] undeploy, ctxPath=/jbossmq-httpil, warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/

But the instance still claims to be the master node. No other instance starts to take over the undeployed services, so whenever an instance tries to post a message we get:

javax.jms.InvalidDestinationException: This destination does not exist! TOPIC.sgw/MOCacheInvalidationTopic
    at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
    at org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
    at org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
    at org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)
    at org.jboss.mq.il.uil2.SocketManager$ReadTask.handleMsg(SocketManager.java:369)

Around the time when the master node undeploys all services, all the other instances start to go crazy as well:

2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] I am (62.50.43.215:1099) received membershipChanged event:
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] New Members : 0 ([])
2006-06-21 08:14:24,728 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099])
2006-06-21 08:14:24,798 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)
2006-06-21 08:14:26,800 INFO [org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected member: dep004174-05:54893 (additional data: 17 bytes)
2006-06-21 08:14:31,547 ERROR [com.artnology.sgw.cda.tracking.Webtracking] getObjectType() returns null for SGWID '4-102-0-0-0'
2006-06-21 08:14:34,867 INFO
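
To check from the logfiles which destinations a node still has deployed, the bind and unbind messages quoted above can be replayed. This is a rough sketch that assumes "Bound to JNDI name:" and "Unbinding JNDI name:" are the only relevant messages; the function name bound_destinations and the initially_bound parameter are invented for this example.

```python
import re

# Matches the two JBossMQ destination messages quoted in this thread.
JNDI_EVENT = re.compile(r"(?P<action>Bound to|Unbinding) JNDI name: (?P<name>\S+)")


def bound_destinations(lines, initially_bound=()):
    """Replay one node's log lines and return the JNDI names still bound."""
    bound = set(initially_bound)
    for line in lines:
        match = JNDI_EVENT.search(line)
        if match is None:
            continue  # not a bind or unbind message
        if match.group("action") == "Bound to":
            bound.add(match.group("name"))
        else:
            bound.discard(match.group("name"))
    return bound


# Message parts of lines from the master node's (210) log quoted above.
node_210_log = [
    "Unbinding JNDI name: topic/sgw/MOCacheInvalidationTopic",
    "Unbinding JNDI name: queue/sgw/OrderQueue",
    "Unbinding JNDI name: queue/DLQ",
]
# The master is taken to have had these destinations bound before the view change.
before = {"topic/sgw/MOCacheInvalidationTopic", "queue/sgw/OrderQueue", "queue/DLQ"}

print(bound_destinations(node_210_log, initially_bound=before))   # set()
```

Running this over the logfiles of all six instances and checking that exactly one of them ends up with a non-empty set would confirm, or rule out, the situation described here, in which no instance has the destinations deployed.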