Hi All,

I would appreciate your input on some behavior seen in a Geo cluster setup when 
a node is isolated and then un-isolated.

Suppose the Geo cluster has nodes A, B and C in the primary data center, which 
are voting, and nodes D, E and F in the secondary data center, which are 
non-voting.
If a node is isolated, say Node B, all nodes in the cluster immediately become 
unreachable to each other.
All nodes wait for a threshold amount of time before marking Node B as 
quarantined, and then reachability within the cluster is restored.

What is the threshold amount of time the nodes need to wait?
If the node goes down or is stopped, this behavior is not seen. It is seen only 
when the node is isolated. How is isolation different from a node going down?

Log excerpt from Node A when Node B is isolated:
130.103:2550] has failed, address is now gated for [5000] ms. Reason: 
[Disassociated] 
2018-02-15 19:53:56,109 | INFO  | lt-dispatcher-22 | 
kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | 
Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550] - Leader 
can currently not perform its duties, reachability status: 
[akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable 
[Unreachable] (1)], member status: 
[akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false, 
akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
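
When comparing excerpts like this one across nodes, it can help to pull out just 
the "member status" list. Below is a minimal Python sketch for that. It is not 
part of OpenDaylight or Akka; the names parse_member_status and unseen_members 
are hypothetical, and it assumes the log format is exactly as shown in the 
excerpt above (including the line wrapping added by the mail client).

```python
import re

# Matches entries such as
# "akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false"
# (format assumed from the excerpt above).
MEMBER_RE = re.compile(
    r"akka\.tcp://[^@\s]+@(?P<host>[\d.]+):(?P<port>\d+)\s+"
    r"(?P<status>\w+)\s+seen=(?P<seen>true|false)"
)


def parse_member_status(log_text):
    """Return (address, status, seen) tuples from the 'member status' part."""
    # The log lines are wrapped in the mail, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    # Everything after "member status:" is the member list; if the marker is
    # missing, members is empty and no entries are returned.
    _, _, members = flat.partition("member status:")
    return [
        (f"{m['host']}:{m['port']}", m["status"], m["seen"] == "true")
        for m in MEMBER_RE.finditer(members)
    ]


def unseen_members(log_text):
    """Return the addresses of members reported with seen=false."""
    return [addr for addr, _status, seen in parse_member_status(log_text) if not seen]
```

Passing the text of the Node A excerpt to unseen_members should list only 
10.18.130.103:2550, the isolated Node B.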

If a Shard Leader is isolated, say Node A is made the shard leader for all 
shards and data stores, then on isolating and un-isolating Node A I see the 
following:

Primary voting nodes are unreachable to secondary nodes and vice versa. The 
cluster never recovers, and all nodes need to be restarted to get the cluster 
working again. Is this a bug?
Also, the isolated node, once un-isolated, remains unreachable to the primary 
voting nodes and never recovers.

Log excerpt:
2018-02-15 19:32:47,174 | INFO  | lt-dispatcher-19 | 
kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | 
Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] - Leader 
can currently not perform its duties, reachability status: 
[akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable 
[Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable 
[Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated 
[Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable 
[Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable 
[Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable 
[Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable 
[Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable 
[Unreachable] (2)], member status: 
[akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Down seen=false, 
akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 WeaklyUp seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false, 
akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, 
akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]

If a Cluster Leader is isolated, then a "DataStoreUnavailableException: Shard 
member-2-shard-default-config currently has no leader" exception is seen on 
nodes where COMMIT fails:

Transactions done during this threshold time fail because there is no leader. 
Is this acceptable, given that the threshold time is sometimes very long?
Also, when the isolated node is un-isolated, the cluster sometimes does not 
recover and all nodes need to be restarted. Is this a bug?

Log excerpt on Node F:
2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper                
      | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 
1.0.0.SNAPSHOT | DataStore Tx encountered error
TransactionCommitFailedException{message=canCommit encountered an unexpected 
failure, errorList=[RpcError [message=canCommit encountered an unexpected 
failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, 
applicationTag=null, info=null, 
cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException:
 Shard member-6-shard-default-operational currently has no leader. Try again 
later.]]}

If a follower is isolated and un-isolated, the shard leader is re-elected. The 
cluster already had a shard leader, so should a re-election happen?

Thanks,
Chethana

