I have more information on this issue. I have a three-node setup using TCP for JGroups. It all works fine: if I stop a node and restart it, or do a kill -9 and restart, the oldest node becomes Master and all is well. While testing network error conditions, however, I'm running into problems. In the normal working case I have three nodes whose DefaultPartition CurrentView is [10.0.1.48:1099, 10.0.2.130:1099, 10.0.1.61:1099].
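
To make the expected behaviour concrete, here is a minimal sketch of the membership model I have in mind. It is only an illustration, not JGroups or JBoss code: the helper names `master_of` and `remove_member` are made up, and it assumes the view is ordered oldest-first with the first member acting as Master, which matches the "oldest becomes Master" behaviour described above.

```python
# Illustrative model only: not JGroups or JBoss code.
# Assumption: the view is ordered oldest-first and the first member is Master.

def master_of(view):
    """Return the first (oldest) member of the view, or None if it is empty."""
    return view[0] if view else None


def remove_member(view, dead):
    """Return a new view without the dead member, preserving join order."""
    return [member for member in view if member != dead]


# The normal three-node view reported by DefaultPartition.
normal_view = ["10.0.1.48:1099", "10.0.2.130:1099", "10.0.1.61:1099"]

# The view I would expect once only 10.0.1.61 becomes unreachable.
expected_view = remove_member(normal_view, "10.0.1.61:1099")

print(master_of(normal_view))
print(expected_view)
print(master_of(expected_view))
```

Under this model, losing only 10.0.1.61 should leave a two-node view with the Master unchanged, which is not what the traces show.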

Now I unplug the network cable from 10.0.1.61, and I see the following debug trace on 10.0.2.130:

  02:05:18,276 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
  02:05:18,278 INFO [DefaultPartition] Suspected member: 10.0.1.48:7800 (additional data: 14 bytes)
  02:05:18,280 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: -2) : [10.0.2.130:1099]
  02:05:18,281 INFO [DefaultPartition] I am (10.0.2.130:1099) received membershipChanged event:
  02:05:18,281 INFO [DefaultPartition] Dead members: 2 ([10.0.1.48:1099, 10.0.1.61:1099])
  02:05:18,282 INFO [DefaultPartition] New Members : 0 ([])
  02:05:18,282 INFO [DefaultPartition] All Members : 1 ([10.0.2.130:1099])

I do not understand why it thought 10.0.1.48 was dead as well.

The debug trace on 10.0.1.48 is:

  9:50:43,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
  19:50:44,611 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
  19:50:45,533 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
  19:50:46,122 WARN [CoordGmsImpl] I am the coord and I'm being am suspected -- will probably leave shortly
  19:50:46,132 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7810 is not a member of view [10.0.2.130:7810|3] [10.0.2.130:7810]; discarding view
  19:50:46,517 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
  19:50:48,023 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7800 (additional data: 14 bytes) is not a member of view [10.0.2.130:7800 (additional data: 15 bytes)|3] [10.0.2.130:7800 (additional data: 15 bytes)]; discarding view
  19:50:48,032 WARN [CoordGmsImpl] I am the coord and I'm being am suspected -- will probably leave shortly
  19:50:48,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
  19:50:48,034 INFO [DefaultPartition] Suspected member: vallance-lnx:7800 (additional data: 14 bytes)

Why is 10.0.1.48 a suspect? The result is that both 10.0.1.48 and 10.0.2.130 now run in Master mode and not in a cluster. Upon connecting the network cable back to 10.0.1.61, the cluster goes through several group changes and finally settles down to the following view on all three nodes:

  [10.0.2.130:1099, 10.0.1.61:1099, 10.0.1.48:1099]

How do I troubleshoot this? I would expect 10.0.2.130 and 10.0.1.48 never to lose the cluster group, and 10.0.1.61 to join at the end as the newest.

Testing on jboss-3.2.8sp1 and jdk1.5.

Thanks,
Kumar

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3950578#3950578