Hi All,

I would appreciate your input on some behavior seen in a Geo cluster setup when a node is isolated and then un-isolated.
Suppose the Geo cluster has nodes A, B and C in the primary data center, which are voting, and nodes D, E and F in the secondary data center, which are non-voting.

If a node is isolated, say Node B, then all nodes in the cluster immediately become unreachable to each other. All nodes wait for a threshold amount of time before marking Node B as quarantined, and only then is reachability within the cluster restored. What is this threshold, i.e. how long do the nodes wait? This behavior is not seen if the node goes down or is stopped; it is seen only when the node is isolated. How is isolation different from node down?

Log excerpt from Node A when Node B is isolated:

130.103:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-02-15 19:53:56,109 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]

If a shard leader is isolated, say Node A is made shard leader for all shards and data stores. On isolating and un-isolating Node A, I see the following: the primary voting nodes are unreachable to the secondary nodes and vice versa. The cluster never recovers, and all nodes need to be restarted to get the cluster working again. Is this a bug? Also, the isolated node, once un-isolated, remains unreachable to the primary voting nodes and never recovers.

Log excerpt:

2018-02-15 19:32:47,174 | INFO | lt-dispatcher-19 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated [Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Down seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 WeaklyUp seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]

If a cluster leader is isolated, then a "DataStoreUnavailableException: Shard member-2-shard-default-config currently has no leader" exception is seen on nodes where COMMIT fails. Transactions done during this threshold time fail because there is no leader. Is this acceptable, given that the threshold time is sometimes very long? Also, when the isolated node is un-isolated, the cluster sometimes does not recover and all nodes need to be restarted. Is this a bug?

Log excerpt on Node F:

2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 1.0.0.SNAPSHOT | DataStore Tx encountered error TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-6-shard-default-operational currently has no leader. Try again later.]]}

If a follower is isolated and un-isolated, the shard leader is re-elected. The cluster already had a shard leader, so should re-election happen?

Thanks,
Chethana
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev