[controller-dev] Need Input on Geo Cluster Behavior for Node Isolation/Un-isolation

Chethana Lakshmanappa Thu, 15 Feb 2018 21:43:46 -0800

Hi All,

I would appreciate your input on some behavior seen in a Geo cluster setup 
when a node is isolated and then un-isolated.


Suppose the Geo cluster has nodes A, B and C in the primary data center, which 
are voting, and nodes D, E and F in the secondary data center, which are 
non-voting.
If a node is isolated, say Node B, then all nodes in the cluster immediately 
become unreachable to each other.
All nodes wait for a threshold amount of time before marking Node B as 
quarantined, and only then is reachability within the cluster restored.

How long is this threshold?
This behavior is not seen when the node goes down or is stopped; it is seen 
only when the node is isolated. How does isolation differ from the node being 
down?

Log excerpt from Node A when Node B is isolated:
130.103:2550] has failed, address is now gated for [5000] ms. Reason: 
[Disassociated] 
2018-02-15 19:53:56,109 | INFO  | lt-dispatcher-22 | 
kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | 
Cluster Node [akka.tcp://[email protected]:2550] - Leader 
can currently not perform its duties, reachability status: 
[akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Unreachable] (1)], member status: 
[akka.tcp://[email protected]:2550 Up seen=false, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true]

If a shard leader is isolated: say Node A is made shard leader for all shards 
and data stores. On isolating and then un-isolating Node A, I see the 
following:

The primary voting nodes become unreachable to the secondary nodes and vice 
versa. The cluster never recovers, and all nodes need to be restarted to get 
the cluster working again. Is this a bug?
Also, the node that was isolated and then un-isolated remains unreachable to 
the primary voting nodes and never recovers.

Log excerpt:
2018-02-15 19:32:47,174 | INFO  | lt-dispatcher-19 | 
kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | 
Cluster Node [akka.tcp://[email protected]:2550] - Leader 
can currently not perform its duties, reachability status: 
[akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Terminated] (1), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Unreachable] (2), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Terminated 
[Terminated] (4), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Unreachable] (2), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Terminated] (3), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Unreachable] (2), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Terminated] (3), akka.tcp://[email protected]:2550 -> 
akka.tcp://[email protected]:2550: Unreachable 
[Unreachable] (2)], member status: 
[akka.tcp://[email protected]:2550 Down seen=false, 
akka.tcp://[email protected]:2550 WeaklyUp seen=true, 
akka.tcp://[email protected]:2550 Up seen=false, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true, 
akka.tcp://[email protected]:2550 Up seen=true]
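A "member status" list like the one above is easier to compare across nodes 
when reduced to counts. The following is only a minimal sketch: the input is 
the member-status portion of the excerpt with its wrapped lines rejoined, the 
helper name summarize is hypothetical, and reading "seen=false" as "this member 
has not yet seen the current gossip state" is my understanding of Akka's 
output rather than something the log itself states.

```python
import re
from collections import Counter

# Member-status portion of the excerpt above, wrapped lines rejoined.
# Addresses are kept exactly as they appear in the archived log.
MEMBER_STATUS = (
    "[akka.tcp://[email protected]:2550 Down seen=false, "
    "akka.tcp://[email protected]:2550 WeaklyUp seen=true, "
    "akka.tcp://[email protected]:2550 Up seen=false, "
    "akka.tcp://[email protected]:2550 Up seen=true, "
    "akka.tcp://[email protected]:2550 Up seen=true, "
    "akka.tcp://[email protected]:2550 Up seen=true]"
)

# Each entry ends with "<Status> seen=<true|false>".
ENTRY = re.compile(r"\b(\w+) seen=(true|false)")


def summarize(member_status):
    """Count members per status and count entries marked seen=false."""
    entries = ENTRY.findall(member_status)
    by_status = Counter(status for status, _ in entries)
    unseen = sum(1 for _, seen in entries if seen == "false")
    return by_status, unseen


by_status, unseen = summarize(MEMBER_STATUS)
print(dict(by_status), "seen=false:", unseen)
```

For this excerpt the summary is one Down member, one WeaklyUp member and four 
Up members, with two entries marked seen=false.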

If the cluster leader is isolated, a "DataStoreUnavailableException: Shard 
member-2-shard-default-config currently has no leader" exception is seen on 
nodes where the COMMIT fails:

Transactions submitted during this threshold time fail because there is no 
leader. Is this acceptable, given that the threshold time is sometimes very 
long?
Also, when the isolated node is un-isolated, the cluster sometimes does not 
recover and all nodes need to be restarted. Is this a bug?

Log excerpt on Node F:
2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper                
      | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 
1.0.0.SNAPSHOT | DataStore Tx encountered error
TransactionCommitFailedException{message=canCommit encountered an unexpected 
failure, errorList=[RpcError [message=canCommit encountered an unexpected 
failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, 
applicationTag=null, info=null, 
cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException:
 Shard member-6-shard-default-operational currently has no leader. Try again 
later.]]}
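For context on the "no leader" window, here is a small sketch of the majority 
rule, assuming shard leader election follows the usual Raft rule in which only 
voting members count. The node names A to F come from the setup described 
above; the function has_quorum is hypothetical and is not part of any 
OpenDaylight API.

```python
def has_quorum(voting_members, reachable):
    """Return True if the reachable voting members form a strict majority."""
    voters = set(voting_members)
    needed = len(voters) // 2 + 1
    return len(voters & set(reachable)) >= needed


VOTING = {"A", "B", "C"}      # primary data center
NON_VOTING = {"D", "E", "F"}  # secondary data center, never counted

# Leader A isolated: B and C still see each other and the secondary nodes.
print(has_quorum(VOTING, {"B", "C"} | NON_VOTING))  # True

# Seen from the isolated node A: it can only reach itself.
print(has_quorum(VOTING, {"A"}))  # False

# Two voting nodes isolated: C plus all non-voting nodes is not a majority.
print(has_quorum(VOTING, {"C"} | NON_VOTING))  # False
```

Under that assumption, isolating a single voting node leaves two of three 
voters, which is enough to elect a new leader, so a long period without a 
leader would have to come from something other than a lost majority.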

If a follower is isolated and un-isolated, the shard leader is re-elected. The 
cluster already had a shard leader, so should a re-election happen?
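One possible explanation, offered as an assumption and not as a confirmed 
diagnosis, is the term rule of plain Raft: an isolated voting follower keeps 
timing out and raising its term, and when it rejoins, its higher term makes 
the current leader step down. The sketch below models only that rule. The 
class and function names are hypothetical, and OpenDaylight's own Raft 
implementation may behave differently.

```python
from dataclasses import dataclass


@dataclass
class RaftNode:
    name: str
    term: int
    role: str  # "leader", "follower" or "candidate"


def election_timeout(node):
    """An isolated follower hears no heartbeat and starts a new election."""
    node.term += 1
    node.role = "candidate"


def receive_request_vote(node, candidate_term):
    """Plain Raft: a message carrying a higher term forces a step-down."""
    if candidate_term > node.term:
        node.term = candidate_term
        node.role = "follower"


leader = RaftNode("A", term=1, role="leader")
isolated = RaftNode("B", term=1, role="follower")

# While isolated, B times out repeatedly; three rounds are chosen arbitrarily.
for _ in range(3):
    election_timeout(isolated)

# After un-isolation, B's vote request reaches A with the higher term.
receive_request_vote(leader, isolated.term)
print(leader)
```

In this model A ends up as a follower at B's term, so a fresh election has to 
take place even though a leader already existed.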

Thanks,
Chethana

_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
