Re: [controller-dev] Need Input on Geo Cluster Behavior for Node Isolation/Un-isolation

Tom Pantelis Tue, 27 Feb 2018 20:13:41 -0800

On Fri, Feb 16, 2018 at 12:42 AM, Chethana Lakshmanappa <
cheth...@luminanetworks.com> wrote:


> Hi All,
>
> Kindly need your input on some of the behavior seen in Geo cluster setup
> when a node is isolated and un-isolated.
>
> Suppose Geo cluster has nodes A, B and C residing in one primary data
> center which is voting and D, E & F residing in secondary data center which
> is non-voting:
>
>    - If a node is Isolated, let's say Node B, then immediately in the
>    cluster all nodes are unreachable to each other.
>
>
That is odd. How do you know that all nodes became unreachable to each
other? The log excerpt below just indicates that 10.18.130.105 lost
reachability with 10.18.130.103 (Node B I assume) which is expected. The
message "Leader can currently not perform its duties" means that the akka
cluster leader cannot allow new nodes to be added to the cluster or nodes
removed until the lost node comes back or is downed.
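
To check which node pairs actually lost reachability, the "reachability
status" part of that message can be pulled apart mechanically. Below is a
minimal Python sketch, assuming the log line has been un-wrapped into a single
string and that each record has the "observer -> subject: Status [Status] (n)"
shape seen in the excerpts in this thread. The helper names
(parse_reachability, host) are made up for illustration, and the member
status list in the sample is cut short with "[...]".

```python
import re

# One akka.tcp address as it appears in the excerpts, e.g.
# akka.tcp://opendaylight-cluster-data@10.18.130.105:2550
ADDR = r"akka\.tcp://[\w.-]+@[\d.]+:\d+"

# "observer -> subject: Status [Status] (n)"
PAIR = re.compile(rf"({ADDR})\s*->\s*({ADDR}):\s*(\w+)\s*\[(\w+)\]\s*\((\d+)\)")


def parse_reachability(message):
    """Return (observer, subject, status) for every record in the message."""
    return [(m.group(1), m.group(2), m.group(3)) for m in PAIR.finditer(message)]


def host(addr):
    """Strip the protocol and actor-system name, keeping ip:port."""
    return addr.split("@", 1)[1]


# Sample built from the Node A excerpt, member status list shortened.
sample = (
    "Leader can currently not perform its duties, reachability status: ["
    "akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> "
    "akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable "
    "[Unreachable] (1)], member status: [...]"
)

for observer, subject, status in parse_reachability(sample):
    print(f"{host(observer)} sees {host(subject)} as {status}")
# prints: 10.18.130.105:2550 sees 10.18.130.103:2550 as Unreachable
```

If every node had really become unreachable to every other node, the list
would contain a record for each pair rather than the single 105 -> 103 entry.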

>
>    - All nodes wait for a threshold amount of time before making Node B
>    as quarantined and then reachability within the cluster is restored.
>
>    - What is the threshold amount of time it needs to wait?
>       - If the node goes down or stopped, this behavior is not seen. It
>       is seen only when it is isolated. How is this different from node down?
>
>
> *Log excerpt from Node A when Node B is isolated:*
> 130.103:2550] has failed, address is now gated for [5000] ms. Reason:
> [Disassociated]
> 2018-02-15 19:53:56,109 | INFO  | lt-dispatcher-22 |
> kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18
> | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550]
> - Leader can currently not perform its duties, reachability status: [
> akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 ->
> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable
> [Unreachable] (1)], member status: [akka.tcp://opendaylight-
> cluster-data@10.18.130.103:2550 Up seen=false, akka.tcp://opendaylight-
> cluster-data@10.18.130.105:2550 Up seen=true, akka.tcp://opendaylight-
> cluster-data@10.18.130.84:2550 Up seen=true, akka.tcp://opendaylight-
> cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-
> cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-
> cluster-data@10.18.131.39:2550 Up seen=true]
>
>
>
>    - If a *Shard Leader* is Isolated, let’s say you make Node A as shard
>    leader for all shards and data store. On isolating and un-isolating Node A,
>    I see the following:
>
>    - Primary voting nodes are unreachable to secondary nodes and vice
>       versa. Cluster never recovers and all nodes need to be restarted to have
>       cluster working. *Is this a bug?*
>       - Also the isolated node which is un-isolated is unreachable to
>       primary voting nodes and never recovers.
>
>
It may be that, on un-isolation, split brain occurred in akka with 2
cluster leaders. I assume that Node A was the akka cluster leader when it
was isolated. It would be interesting to see whether this also occurs when a
node that is not the cluster leader is isolated.

Also make sure you do not have the auto-down-unreachable-after option
enabled in the akka.conf.
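
A quick way to check that on each node is to scan the akka.conf text for the
setting. This is only a rough Python sketch: it matches lines with a regular
expression rather than parsing HOCON properly, it treats the value "off" as
disabled, and the nested layout in the sample is illustrative rather than
copied from a real deployment. The function names are hypothetical.

```python
import re

# Matches e.g. 'auto-down-unreachable-after = 10s' or '... = off'
SETTING = re.compile(r'auto-down-unreachable-after\s*[=:]\s*"?([\w.]+)')


def auto_down_values(conf_text):
    """Return every value assigned to auto-down-unreachable-after."""
    values = []
    for line in conf_text.splitlines():
        # Drop '#' and '//' comments so commented-out settings are ignored.
        line = re.split(r"#|//", line, maxsplit=1)[0]
        match = SETTING.search(line)
        if match:
            values.append(match.group(1))
    return values


def auto_down_enabled(conf_text):
    """True if the option is set to anything other than 'off'."""
    return any(value.lower() != "off" for value in auto_down_values(conf_text))


# Illustrative config text (assumed layout, not a real akka.conf).
sample = """
odl-cluster-data {
  akka {
    cluster {
      # auto-down-unreachable-after = 10s
      auto-down-unreachable-after = off
    }
  }
}
"""

print(auto_down_enabled(sample))  # prints: False
```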


>
> *Log excerpt:*
> 2018-02-15 19:32:47,174 | INFO  | lt-dispatcher-19 |
> kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18
> | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] -
> Leader can currently not perform its duties, reachability status: [
> akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 ->
> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable
> [Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable
> [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated
> [Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable
> [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable
> [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable
> [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable
> [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550
> -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable
> [Unreachable] (2)], member status: [akka.tcp://opendaylight-
> cluster-data@10.18.130.103:2550 Down seen=false, akka.tcp://opendaylight-
> cluster-data@10.18.130.105:2550 WeaklyUp seen=true,
> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false,
> akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true,
> akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true,
> akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
>
>
>
>    - If a Cluster Leader is Isolated, then "DataStoreUnavailableException:
>    Shard member-2-shard-default-config currently has no leader” exception
>    is seen on nodes where COMMIT fails:
>
>    - Transactions done during this threshold time fail as there is no
>       leader. Is this acceptable? (as threshold time sometimes is very long)
>
Transactions will fail if there is no shard leader, although the datastore
does make every attempt with timeouts and retries. But at some point it gives up.


>    - Also when the Isolated node is un-isolated, sometimes cluster does
>       not recover and all nodes need to be restarted. *Is this a bug?*
>
>
> *Log excerpt on Node F:*
> 2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper
>             | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl -
> 1.0.0.SNAPSHOT | DataStore Tx encountered error
> TransactionCommitFailedException{message=canCommit encountered an
> unexpected failure, errorList=[RpcError [message=canCommit encountered an
> unexpected failure, severity=ERROR, errorType=APPLICATION,
> tag=operation-failed, applicationTag=null, info=null,
> cause=org.opendaylight.controller.md.sal.common.api.data.*DataStoreUnavailableException:
> Shard member-6-shard-default-operational currently has no leader*. Try
> again later.]]}
>
>
>
>    - If a follower is isolated and un-isolated, shard leader is
>    re-elected. Cluster already had a shard leader, so, should re-election
>    happen?
>
It can happen if the follower is able to send out a RequestVote after
un-isolation. From the follower's perspective there is no leader, so it
tries to become leader. This is the way Raft works.
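
To illustrate the mechanism, here is a small Python sketch of textbook Raft
term handling. It is not the controller's actual implementation: the Node
class and its fields are made up for the example, and the log comparison is
reduced to the last index only (real Raft compares the last log term first).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    term: int
    role: str = "follower"
    last_log_index: int = 0
    voted_for: Optional[str] = None

    def election_timeout(self):
        """No heartbeat from a leader: bump the term and stand for election."""
        self.term += 1
        self.role = "candidate"
        self.voted_for = self.name

    def on_request_vote(self, candidate, cand_term, cand_last_index):
        """Handle a RequestVote; return True if the vote is granted."""
        if cand_term > self.term:
            # A higher term always wins: adopt it and step down,
            # even if this node was the leader.
            self.term = cand_term
            self.role = "follower"
            self.voted_for = None
        grant = (
            cand_term == self.term
            and self.voted_for in (None, candidate)
            and cand_last_index >= self.last_log_index
        )
        if grant:
            self.voted_for = candidate
        return grant


leader = Node("A", term=3, role="leader", last_log_index=10)
isolated = Node("B", term=3, last_log_index=7)

# While isolated, B hears no leader and times out twice.
isolated.election_timeout()
isolated.election_timeout()

# After un-isolation, B's RequestVote reaches the current leader.
granted = leader.on_request_vote(isolated.name, isolated.term,
                                 isolated.last_log_index)
print(leader.role, leader.term, granted)  # prints: follower 5 False
```

In this sketch the old leader steps down on seeing the higher term even though
it refuses the vote because B's log is behind, so a new election follows.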



>
> Thanks,
> Chethana
>
>
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
>

