Hi Dev-Team,

A gentle reminder.

Kindly provide your inputs on the mail below.

Thanks,
Chethana
> On Feb 15, 2018, at 9:42 PM, Chethana Lakshmanappa 
> <cheth...@luminanetworks.com> wrote:
> 
> Hi All,
> 
> I need your input on some behavior seen in a Geo cluster setup when a 
> node is isolated and then un-isolated.
> 
> Suppose the Geo cluster has nodes A, B, and C residing in the primary 
> data center, which is voting, and D, E, and F residing in the secondary 
> data center, which is non-voting:
> If a node is isolated, say Node B, then immediately all nodes in the 
> cluster become unreachable to each other.
> All nodes wait a threshold amount of time before marking Node B as 
> quarantined, and only then is reachability within the cluster restored.
> 
> What is the threshold amount of time they need to wait?
> If the node goes down or is stopped, this behavior is not seen; it 
> occurs only when the node is isolated. How is isolation different from 
> the node going down?
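For reference, this window is governed by the failure-detector and remoting settings in each node's akka.conf. A sketch of the relevant knobs with illustrative values (not necessarily what your ODL distribution ships — check your own config):

```hocon
# Illustrative values only; verify against the akka.conf you actually ship.
akka {
  remote {
    # Corresponds to the "gated for [5000] ms" line in the log excerpt.
    retry-gate-closed-for = 5 s
  }
  cluster {
    failure-detector {
      heartbeat-interval = 1 s
      # The main contributor to detection delay: how long heartbeats may
      # pause before the peer is suspected.
      acceptable-heartbeat-pause = 3 s
      threshold = 8.0
    }
  }
}
```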
> 
> Log excerpt from Node A when Node B is isolated:
> 130.103:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2018-02-15 19:53:56,109 | INFO  | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
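For context on why detection takes a while: Akka marks a member Unreachable via a phi accrual failure detector, which scores how overdue the next heartbeat is and suspects the peer once the score crosses the configured threshold. A minimal sketch, assuming normally distributed heartbeat arrivals (the mean/stddev values below are made-up stand-ins for the detector's learned statistics, not ODL's actual tuning):

```python
import math

# Illustrative sketch of the phi accrual failure detector (not ODL code).
def phi(ms_since_last_heartbeat, mean_ms=1000.0, stddev_ms=100.0):
    """phi = -log10(P(a heartbeat still arrives this late)), assuming
    normally distributed heartbeat inter-arrival times."""
    y = (ms_since_last_heartbeat - mean_ms) / stddev_ms
    p_later = 0.5 * math.erfc(y / math.sqrt(2.0))  # normal survival function
    return -math.log10(max(p_later, 1e-300))

THRESHOLD = 8.0  # akka.cluster.failure-detector.threshold default
for elapsed in (1000, 1500, 2000, 3000):
    print(f"{elapsed} ms since last heartbeat: phi={phi(elapsed):.1f}, "
          f"suspect={phi(elapsed) > THRESHOLD}")
```

The practical upshot: the "threshold amount of time" is not a single timer but the point where phi exceeds the threshold, which stretches with heartbeat-interval, acceptable-heartbeat-pause, and observed jitter.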
> 
> If a shard leader is isolated: say Node A is made shard leader for all 
> shards and data stores. On isolating and then un-isolating Node A, I 
> see the following:
> 
> Primary voting nodes are unreachable to the secondary nodes and vice 
> versa. The cluster never recovers, and all nodes need to be restarted 
> to get it working again. Is this a bug?
> Also, the node that was isolated and then un-isolated remains 
> unreachable to the primary voting nodes and never recovers.
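When the cluster wedges like this, one recovery path short of a full restart is to manually Down the stuck member: Akka exposes a down() operation on its "akka:type=Cluster" JMX MBean (enabled by default via akka.cluster.jmx.enabled). The sketch below builds such a call through Karaf's Jolokia bridge; the endpoint, port, and credentials are assumptions to verify for your install:

```python
import json
import urllib.request

# Hypothetical sketch: build a Jolokia 'exec' request invoking
# down(<member>) on Akka's "akka:type=Cluster" MBean of the local node.
# The Jolokia URL is an assumption (Karaf commonly serves it on 8181).
def build_down_request(member_address,
                       jolokia_url="http://127.0.0.1:8181/jolokia"):
    payload = {
        "type": "exec",
        "mbean": "akka:type=Cluster",
        "operation": "down",
        "arguments": [member_address],
    }
    return urllib.request.Request(
        jolokia_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_down_request(
    "akka.tcp://opendaylight-cluster-data@10.18.130.103:2550")
# urllib.request.urlopen(req)  # uncomment to actually send it
```

Downing the member lets the remaining nodes finish a new leader election; the downed node then has to rejoin with a restart of its own rather than dragging the whole cluster down with it.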
> 
> Log excerpt:
> 2018-02-15 19:32:47,174 | INFO  | lt-dispatcher-19 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated [Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Down seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 WeaklyUp seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
> 
> If a cluster leader is isolated, then a "DataStoreUnavailableException: 
> Shard member-2-shard-default-config currently has no leader" exception 
> is seen on the nodes where the COMMIT fails:
> 
> Transactions attempted during this threshold time fail because there is 
> no leader. Is this acceptable? (The threshold time is sometimes very 
> long.)
> Also, when the isolated node is un-isolated, the cluster sometimes does 
> not recover and all nodes need to be restarted. Is this a bug?
> 
> Log excerpt on Node F:
> 2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper              
>         | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 
> 1.0.0.SNAPSHOT | DataStore Tx encountered error
> TransactionCommitFailedException{message=canCommit encountered an unexpected 
> failure, errorList=[RpcError [message=canCommit encountered an unexpected 
> failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, 
> applicationTag=null, info=null, 
> cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException:
>  Shard member-6-shard-default-operational currently has no leader. Try again 
> later.]]}
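On the application side, the usual way to ride out this leaderless window is to retry the commit with backoff rather than give up on the first failed canCommit. A generic sketch (submit_tx and the exception type are illustrative stand-ins, not the actual MD-SAL API):

```python
import time

class NoShardLeaderError(Exception):
    """Stand-in for DataStoreUnavailableException ('no leader')."""

def submit_with_retry(submit_tx, attempts=5, initial_delay_s=0.5,
                      backoff=2.0):
    """Call submit_tx(), retrying with exponential backoff while the
    shard reports no leader; re-raise once the budget is exhausted."""
    delay = initial_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return submit_tx()
        except NoShardLeaderError:
            if attempt == attempts:
                raise  # leader never came back within the retry budget
            time.sleep(delay)
            delay *= backoff  # 0.5 s, 1 s, 2 s, 4 s, ...
```

Whether retrying is acceptable depends on how long the election window can get; if it regularly exceeds any reasonable retry budget, that points back at failure-detector tuning rather than at the application.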
> 
> If a follower is isolated and un-isolated, the shard leader is 
> re-elected. The cluster already had a shard leader, so should 
> re-election happen?
> 
> Thanks,
> Chethana
> 
> 

_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
