[JBoss-user] [Clustering/JBoss] - Re: HA-JMS fails, Master node undeploying channels, no failover

2006-07-06 Thread jkressin
Sorry for not replying for a while; I was analyzing the logfiles and trying 
to reproduce the behaviour we see on our production system. Thanks to the 
answers here I think I now understand better what is going on, and I did 
find a way to reproduce the behaviour.

First, I was wrong in my assumption that the channels are never rebound to JNDI 
when the master node fails.  Here's what happens:

Initially node 210 is the master node and node 211 is a slave (I hope the 
terminology is correct). At 08:14:24 node 211 begins to receive new views. 
Taken from 211's logfile:

2006-06-21 08:14:24,757 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New 
cluster view for partition StagePartition (id: 201, delta: -2) : 
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099]
2006-06-21 08:14:24,757 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:24,757 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,757 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
New Members : 0 ([])
2006-06-21 08:14:24,757 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 
62.50.43.215:1099])


As node 211 is now the master node and node 210 is in the list of dead members, 
node 211 deploys all channels, like it should.
Taken from 211's logfile:

2006-06-21 08:14:25,496 INFO  [org.jboss.web.tomcat.tc5.TomcatDeployer] deploy, 
ctxPath=/jbossmq-httpil, 
warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
2006-06-21 08:14:26,916 INFO  
[org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Bound to JNDI 
name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:26,917 INFO  
[org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Bound to JNDI 
name: topic/sgw/CdaHtmlCacheInvalidationTopic
[...]

But node 210 did not receive view 201 at all, so it still has all the 
channels deployed as well. The next thing I see in the logfile of 211 is that 
node 214 is still sending messages although, from the viewpoint of 211, it is 
no longer a cluster member. I do not know whether this is relevant, but I 
mention it to give you the complete picture.
Taken from 211's logfile:
2006-06-21 08:14:29,985 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr 
62.50.43.214:54923 (additional data: 17 bytes) is not a member !
2006-06-21 08:14:29,987 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] 
Suspected member: 62.50.43.214:54923 (additional data: 17 bytes)

Next, 211 receives two more view changes (ids 202 and 203).
Taken from 211's logfile:

2006-06-21 08:14:34,867 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New 
cluster view for partition StagePartition (id: 202, delta: 1) : 
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 
62.50.43.214:1099]
2006-06-21 08:14:34,867 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:34,867 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
Dead members: 0 ([])
2006-06-21 08:14:34,867 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
New Members : 1 ([62.50.43.214:1099])
2006-06-21 08:14:34,867 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
All Members : 5 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 
62.50.43.215:1099, 62.50.43.214:1099])
2006-06-21 08:14:35,021 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.StagePartition] New 
cluster view for partition StagePartition (id: 203, delta: 1) : 
[62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 62.50.43.215:1099, 
62.50.43.214:1099, 62.50.43.210:1099]
2006-06-21 08:14:35,021 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
I am (62.50.43.211:1099) received membershipChanged event:
2006-06-21 08:14:35,021 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
Dead members: 0 ([])
2006-06-21 08:14:35,021 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
New Members : 1 ([62.50.43.210:1099])
2006-06-21 08:14:35,021 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 
62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

Node 210 did not receive view 202, but it did receive view 203. After receiving 
view 203 node 210 is aware 

[JBoss-user] [Clustering/JBoss] - Re: HA-JMS fails, Master node undeploying channels, no failover

2006-06-29 Thread jkressin
Thanks very much for your reply. I examined the logfiles again to answer your 
questions:

[EMAIL PROTECTED] wrote : 1) You refer to the master node.  Please confirm 
that this is 62.50.43.211.
  | 

No, at that time the master node was 62.50.43.210. The first and second log 
excerpts are both from this machine, which means that the master node 
(62.50.43.210) logged Dead members: 0, New members: 0 and immediately 
afterwards undeployed all the HA-Queues and HA-Topics. Sorry, I should have 
made that clear in my first post.

[EMAIL PROTECTED] wrote : 
  | 2) On the node that produced the first bit of logging in your post, do you 
see log entries with this content New cluster view for partition 
StagePartition: 202 and New cluster view for partition StagePartition: 201?
  | 

No, these messages are not present in the logfile.
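
A gap like this can also be checked mechanically. The following is only a 
minimal Python sketch, not anything shipped with JBoss: it collapses the 
wrapped log text, extracts the view ids from both message layouts that appear 
in this thread, and reports the ids that were skipped. The function names and 
the abbreviated sample text are made up for illustration.

```python
import re

# Matches both layouts seen in this thread:
#   "New cluster view for partition StagePartition (id: 202, delta: 1) :"
#   "New cluster view for partition StagePartition: 203 ([...] delta: 0)"
VIEW_RE = re.compile(r"New cluster view for partition (\w+)(?: \(id: |: )(\d+)")


def view_ids(log_text):
    """Return the view ids in the order they appear in the log text."""
    # Archived log lines are wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [int(m.group(2)) for m in VIEW_RE.finditer(flat)]


def missing_views(ids):
    """Return the ids that were skipped between consecutive views."""
    missing = []
    for prev, cur in zip(ids, ids[1:]):
        missing.extend(range(prev + 1, cur))
    return missing


# Abbreviated stand-in for the master node's logfile (member lists elided).
SAMPLE = """
New cluster view for partition StagePartition: 200 ([...] delta: 0)
New cluster view for partition StagePartition: 203 ([...] delta: 0)
"""

print(missing_views(view_ids(SAMPLE)))  # [201, 202]
```

Running the same check against each node's logfile would show which nodes 
missed which views.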

[EMAIL PROTECTED] wrote : 
  | 3) If you have a log entry somewhere that contains New cluster view for 
partition StagePartition: 200, please compare the list of nodes to the first 
line in the first log entry in your post.  Does it have the same 6 nodes but in 
different order?
  | 

You are right, I can see the same nodes, but in a different order.

[EMAIL PROTECTED] wrote : 
  | What I'm driving at here is I wonder if the machine doing the first bit of 
logging lost a couple view changes, going from 200 to 203.  The result would be 
Dead members:0, New members: 0 but a different order of members.
  | 

Thanks, now I am starting to understand what is happening. You are right that 
the machine did lose some of the view changes; that is a problem I will 
probably have to investigate at the network level. 
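
To make the "Dead members: 0, New members: 0" effect concrete, here is a small 
Python sketch of the comparison between two views. It is an illustration only, 
not the DistributedReplicantManagerImpl code. Views 201 and 203 are taken from 
the logs in this thread; the ordering used for view 200 is hypothetical, since 
the thread only establishes that it held the same six nodes in a different 
order. The helper names are made up.

```python
def addrs(*hosts):
    """Build member addresses such as 62.50.43.210:1099 from the last octet."""
    return [f"62.50.43.{h}:1099" for h in hosts]


def membership_delta(old_view, new_view):
    """Return (dead, new) members when moving from old_view to new_view."""
    old_set, new_set = set(old_view), set(new_view)
    dead = [m for m in old_view if m not in new_set]
    new = [m for m in new_view if m not in old_set]
    return dead, new


# HYPOTHETICAL order for view 200 (assumption: 210 listed first).
VIEW_200 = addrs(210, 214, 211, 213, 216, 215)
# Views as logged in this thread.
VIEW_201 = addrs(211, 213, 216, 215)
VIEW_203 = addrs(211, 213, 216, 215, 214, 210)

# A node that sees 200 -> 201 notices two dead members.
print(membership_delta(VIEW_200, VIEW_201))

# A node that jumps straight from 200 to 203 sees no change in membership,
print(membership_delta(VIEW_200, VIEW_203))  # ([], [])
# yet the first entry of the list is a different node.
print(VIEW_200[0], "->", VIEW_203[0])
```

If master status depends on a node's position in the member list (an 
assumption here, not something confirmed in this thread), a node that skips 
views could lose its place without ever being reported dead.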

But the most interesting question for me is: even if the master node lost some 
view changes, why does it suddenly undeploy the HA-Queues and HA-Topics? 
And why does no failover happen, with no other node starting to deploy the 
queues and topics instead? I cannot explain how this is possible, and I found 
no information on this issue in the docs or in the forums.

The critical point is that when I run into this scenario my HA-Queues and 
HA-Topics are not present on any instance, leading to lost messages and 
therefore lost data. This situation should not be possible at all in a 
cluster. I am not quite sure whether this is a cluster issue (I guess so); if 
it is related to JMS instead, please let me know so I can ask in the JMS forum. 

BTW: This is the only real problem we have with the JBoss platform. Everything 
else is working fine and stable. Developing with JBoss really was a breeze, so 
thanks for this great piece of software. 

Thanks again for your help.

Jochen


View the original post : 
http://www.jboss.com/index.html?module=bbop=viewtopicp=3954296#3954296

___
JBoss-user mailing list
JBoss-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/jboss-user


[JBoss-user] [Clustering/JBoss] - HA-JMS fails, Master node undeploying channels, no failover

2006-06-22 Thread jkressin
First, sorry for the lengthy post, but I need to describe the problem in detail.
We have a cluster of 6 JBoss instances (JBoss 4.0.3SP1) on 3 physical machines. 
Each machine runs two JBoss instances and each JBoss instance has its own IP. 
Each machine has one network adapter with two IP addresses. We use UDP as the 
transport layer in JGroups (config below). Of the cluster services we only 
use HA-JMS, that is, clustered topics and queues. Everything works fine, 
but from time to time (every 2-4 days) HA-JMS fails completely and 
messages get lost, which should not happen at all (that's why we use a 
cluster).

Here's what happens: All instances are up and running, and I can see that all 6 
instances participate in the cluster. Suddenly on the master node I see a log 
file entry like this:

2006-06-21 08:14:35,049 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.StagePartition] New cluster view 
for partition StagePartition: 203 ([62.50.43.211:1099, 62.50.43.213:1099, 
62.50.43.216:1099, 62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099] 
delta: 0)
2006-06-21 08:14:35,049 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
I am (62.50.43.210:1099) received membershipChanged event:
2006-06-21 08:14:35,049 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
Dead members: 0 ([])
2006-06-21 08:14:35,049 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
New Members : 0 ([])
2006-06-21 08:14:35,049 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
All Members : 6 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 
62.50.43.215:1099, 62.50.43.214:1099, 62.50.43.210:1099])

The first strange thing is: Dead members: 0, New members: 0, which I read as 
nothing has changed at all ;)

Directly after this message, the master node starts to undeploy all queues and 
topics:

2006-06-21 08:14:35,329 INFO  
[org.jboss.mq.server.jmx.Topic.sgw/MOCacheInvalidationTopic] Unbinding JNDI 
name: topic/sgw/MOCacheInvalidationTopic
2006-06-21 08:14:35,465 INFO  
[org.jboss.mq.server.jmx.Topic.sgw/CdaHtmlCacheInvalidationTopic] Unbinding 
JNDI name: topic/sgw/CdaHtmlCacheInvalidationTopic
2006-06-21 08:14:35,466 INFO  
[org.jboss.mq.server.jmx.Queue.sgw/AlertUserQueue] Unbinding JNDI name: 
queue/sgw/AlertUserQueue
2006-06-21 08:14:35,466 INFO  [org.jboss.mq.server.jmx.Queue.sgw/UserQueue] 
Unbinding JNDI name: queue/sgw/UserQueue
2006-06-21 08:14:35,467 INFO  [org.jboss.mq.server.jmx.Queue.sgw/OrderQueue] 
Unbinding JNDI name: queue/sgw/OrderQueue
[...]
2006-06-21 08:14:35,470 INFO  [org.jboss.mq.server.jmx.Queue.DLQ] Unbinding 
JNDI name: queue/DLQ
2006-06-21 08:14:35,546 INFO  [org.jboss.web.tomcat.tc5.TomcatDeployer] 
undeploy, ctxPath=/jbossmq-httpil, 
warUrl=.../deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/

But the instance still claims to be the master node. No other instance starts 
to take over the undeployed services, so whenever an instance tries to post a 
message we get:

javax.jms.InvalidDestinationException: This destination does not exist! 
TOPIC.sgw/MOCacheInvalidationTopic
at 
org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:389)
at 
org.jboss.mq.server.JMSDestinationManager.addMessage(JMSDestinationManager.java:373)
at 
org.jboss.mq.server.JMSServerInvoker.addMessage(JMSServerInvoker.java:136)
at 
org.jboss.mq.il.uil2.ServerSocketManagerHandler.handleMsg(ServerSocketManagerHandler.java:92)
at 
org.jboss.mq.il.uil2.SocketManager$ReadTask.handleMsg(SocketManager.java:369)

Exactly at the time when the master node undeploys all services, all the other 
instances start to go crazy as well:

2006-06-21 08:14:24,728 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
I am (62.50.43.215:1099) received membershipChanged event:
2006-06-21 08:14:24,728 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
Dead members: 2 ([62.50.43.210:1099, 62.50.43.214:1099])
2006-06-21 08:14:24,728 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
New Members : 0 ([])
2006-06-21 08:14:24,728 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.StagePartition] 
All Members : 4 ([62.50.43.211:1099, 62.50.43.213:1099, 62.50.43.216:1099, 
62.50.43.215:1099])
2006-06-21 08:14:24,798 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected 
member: 62.50.43.214:54923 (additional data: 17 bytes)
2006-06-21 08:14:26,800 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.StagePartition] Suspected 
member: dep004174-05:54893 (additional data: 17 bytes)
2006-06-21 08:14:31,547 ERROR [com.artnology.sgw.cda.tracking.Webtracking] 
getObjectType() returns null for SGWID '4-102-0-0-0'
2006-06-21 08:14:34,867 INFO