[JBoss-user] [Clustering/JBoss] - Re: Basic TCP Cluster of two nodes fails to recover from a n

kpandey Mon, 12 Jun 2006 19:20:02 -0700

I have more information on this issue:

I have a three-node setup using TCP for JGroups. It all works fine: if I stop
a node and restart it, or do a kill -9 and restart, the oldest node becomes
Master and all is well.
Now, while testing network error conditions, I'm running into problems. In
the normal working case I have three nodes whose
DefaultPartition CurrentView is
  [10.0.1.48:1099, 10.0.2.130:1099, 10.0.1.61:1099]



Now I unplug the network cable from 10.0.1.61.

I see the following debug trace on 10.0.2.130:

02:05:18,276 INFO  [DefaultPartition] Suspected member: 10.0.1.61:7800 
(additional data: 14 bytes)
02:05:18,278 INFO  [DefaultPartition] Suspected member: 10.0.1.48:7800 
(additional data: 14 bytes)
02:05:18,280 INFO  [DefaultPartition] New cluster view for partition 
DefaultPartition (id: 3, delta: -2) : [10.0.2.130:1099]
02:05:18,281 INFO  [DefaultPartition] I am (10.0.2.130:1099) received 
membershipChanged event:
02:05:18,281 INFO  [DefaultPartition] Dead members: 2 ([10.0.1.48:1099, 
10.0.1.61:1099])
02:05:18,282 INFO  [DefaultPartition] New Members : 0 ([])
02:05:18,282 INFO  [DefaultPartition] All Members : 1 ([10.0.2.130:1099]

I do not understand why it thought 10.0.1.48 was dead as well.

The debug trace on 10.0.1.48 is:

19:50:43,033 INFO  [DefaultPartition] Suspected member: 10.0.1.61:7800 
(additional data: 14 bytes)
19:50:44,611 WARN  [FD] I was suspected, but will not remove myself from 
membership (waiting for EXIT message)
19:50:45,533 INFO  [DefaultPartition] Suspected member: 10.0.1.61:7800 
(additional data: 14 bytes)
19:50:46,122 WARN  [CoordGmsImpl] I am the coord and I'm being am suspected -- 
will probably leave shortly
19:50:46,132 WARN  [GMS] checkSelfInclusion() failed, 10.0.1.48:7810 is not a 
member of view [10.0.2.130:7810|3] [10.0.2.130:7810]; discarding view
19:50:46,517 WARN  [FD] I was suspected, but will not remove myself from 
membership (waiting for EXIT message)
19:50:48,023 WARN  [GMS] checkSelfInclusion() failed, 10.0.1.48:7800 
(additional data: 14 bytes) is not a member of view [10.0.2.130:7800 
(additional data: 15 bytes)|3] [10.0.2.130:7800 (additional data: 15 bytes)]; 
discarding view
19:50:48,032 WARN  [CoordGmsImpl] I am the coord and I'm being am suspected -- 
will probably leave shortly
19:50:48,033 INFO  [DefaultPartition] Suspected member: 10.0.1.61:7800 
(additional data: 14 bytes)
19:50:48,034 INFO  [DefaultPartition] Suspected member: vallance-lnx:7800 
(additional data: 14 bytes)


Why is 10.0.1.48 a suspect?

The result is that both 10.0.1.48 and 10.0.2.130 now run in Master mode, each 
on its own rather than in a cluster.

Upon reconnecting the network cable to 10.0.1.61, the cluster goes through 
several view changes and finally settles on the following view on all three 
nodes:
[10.0.2.130:1099, 10.0.1.61:1099, 10.0.1.48:1099]

How do I troubleshoot this? I would expect 10.0.2.130 and 10.0.1.48 never to 
lose the cluster group, and 10.0.1.61 to rejoin at the end as the newest member.
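
To make that expectation concrete, here is a minimal sketch in Python that compares the view I expected on 10.0.2.130 after the unplug with the view it actually logged (id 3). The helper name expected_view is hypothetical and exists only for this illustration; it is not part of JBoss or JGroups.

```python
# Hypothetical helper for illustration only; not part of JBoss or JGroups.
def expected_view(view, unplugged):
    """Return the view with the unplugged nodes removed, order preserved."""
    return [member for member in view if member not in unplugged]


# CurrentView of DefaultPartition before the cable was pulled.
initial = ["10.0.1.48:1099", "10.0.2.130:1099", "10.0.1.61:1099"]

# Only 10.0.1.61 was disconnected, so only it should leave the view.
expected = expected_view(initial, {"10.0.1.61:1099"})

# View id 3 as logged on 10.0.2.130 after the unplug.
observed = ["10.0.2.130:1099"]

# Members dropped from the view although their cable was never unplugged.
wrongly_dropped = [member for member in expected if member not in observed]

print("expected view:  ", expected)
print("observed view:  ", observed)
print("wrongly dropped:", wrongly_dropped)
```

The only member in the last list is 10.0.1.48, which is exactly the node I cannot explain being suspected.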

Tested on jboss-3.2.8sp1 and jdk1.5.

Thanks
Kumar



View the original post : 
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3950578#3950578
