I recently had a problem in production: one of the instances in the cluster failed and had to be terminated (via kill -9). The instance is part of a cluster of 4 servers running 12 cache instances (1 JVM per server, 3 caches per JVM) in REPL_ASYNC mode.
After the node failure, we restarted one of the JVMs, and then restarted 2 of the remaining JVMs. Put simply: we first restarted B, then A and D, but left C running. We noticed the following messages in the logs of A, B and D after the restart:

06/11/2006 14:10:24 WARN [ClientGmsImpl.java:126] - join(A:32937) sent to B:32955 timed out, retrying

B:32955 was the coordinator before B was killed with kill -9. It seems that C (the remaining member) incorrectly thinks that B:32955 is still the coordinator.

Here's the protocol stack I am using:

UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
MERGE2(max_interval=20000;min_interval=10000):\
FD_SOCK:\
VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
UNICAST(down_thread=false;;timeout=600,1200,2400):\
FRAG(down_thread=false;frag_size=8192;up_thread=false):\
pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):\
pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)

When I tried to replicate this scenario on my dev system, failure detection worked and a new coordinator was successfully elected, so I think I may have hit a borderline condition. Any idea what could be going on?
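In case it helps anyone checking their own logs for the same symptom, here is a rough sketch of how the join timeouts could be tallied per target address. It uses only the JDK, not JGroups; the class name JoinTimeoutScan and the method countTargets are made up for illustration, and the pattern assumes the WARN text always looks like the single line quoted above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JGroups or JBoss Cache.
public class JoinTimeoutScan {
    // Assumed message shape: "join(<joiner>) sent to <target> timed out"
    private static final Pattern JOIN_TIMEOUT =
        Pattern.compile("join\\((\\S+?)\\) sent to (\\S+) timed out");

    /** Counts timed-out join attempts per target (presumed coordinator) address. */
    public static Map<String, Integer> countTargets(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = JOIN_TIMEOUT.matcher(line);
            if (m.find()) {
                // group(2) is the address the join request was sent to
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            // the line from the logs quoted above
            "06/11/2006 14:10:24 WARN [ClientGmsImpl.java:126] - join(A:32937) sent to B:32955 timed out, retrying",
            // hypothetical unrelated line, only here to show it is skipped
            "some other log line");
        System.out.println(countTargets(sample));
    }
}
```

If every restarted node only ever reports the same target, and that target is the address of the killed process, that would be consistent with the surviving member still advertising a stale coordinator.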
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3983716#3983716