Re: [controller-dev] [mdsal-dev] cluster - recovery from dual failure

Sela, Guy Sun, 22 Jan 2017 07:07:26 -0800

It sounds like, for production, we could use this commercial add-on to handle 
this situation:
http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/java/split-brain-resolver.html

Or should we develop something similar inside ODL?

From: mdsal-dev-boun...@lists.opendaylight.org 
[mailto:mdsal-dev-boun...@lists.opendaylight.org] On Behalf Of Sela, Guy
Sent: Sunday, January 22, 2017 4:38 PM
To: Tom Pantelis <tompante...@gmail.com>; Alfasi, Shlomi <shlomi.alf...@hpe.com>
Cc: controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

Hi,
I just read the documentation; very interesting.
So that means an ODL cluster can’t automatically recover from more than a 
single concurrent failure.
Even in a cluster of 10 nodes, if one node becomes unreachable, none of the 
others can restart and rejoin until the first one is reachable again.
That sounds like a serious restriction for production.
Are there any best practices for dealing with this situation (without manual 
intervention)?

From: mdsal-dev-boun...@lists.opendaylight.org 
[mailto:mdsal-dev-boun...@lists.opendaylight.org] On Behalf Of Tom Pantelis
Sent: Sunday, January 22, 2017 4:30 PM
To: Alfasi, Shlomi <shlomi.alf...@hpe.com>
Cc: controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

This is a side effect of how Akka clustering works. Before a restarted node 
can rejoin, all unreachable nodes must first become reachable again, or the 
status of the unreachable nodes must be changed to 'Down', either manually or 
by auto-downing. You can enable auto-downing, but Akka doesn't recommend it in 
production (http://doc.akka.io/docs/akka/current/java/cluster-usage.html).
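
To make the rule above concrete, here is a toy Python model of it. This is 
not Akka code: the Member class, the helper names and the member-1/2/3 labels 
are all made up for illustration. It only assumes what is described above, 
namely that a joining node is not admitted while any member is unreachable, 
unless that member has been marked Down.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical, simplified view of one cluster member."""
    address: str
    status: str            # "Joining", "Up" or "Down"
    reachable: bool = True


def blocking_members(members):
    """Return members that are unreachable and have not been marked Down."""
    return [m for m in members if not m.reachable and m.status != "Down"]


def leader_can_admit(members):
    """A joining node can only be admitted when nothing is blocking."""
    return not blocking_members(members)


def mark_down(members, address):
    """Model a manual (or automatic) 'Down' of the given member."""
    for m in members:
        if m.address == address:
            m.status = "Down"


if __name__ == "__main__":
    # Scenario from this thread: 3 nodes, one failed, another restarted.
    cluster = [
        Member("member-1", "Up"),                    # stayed up
        Member("member-2", "Up", reachable=False),   # failed node
        Member("member-3", "Joining"),               # restarted node
    ]
    print(leader_can_admit(cluster))   # False: member-2 blocks the join
    mark_down(cluster, "member-2")
    print(leader_can_admit(cluster))   # True: member-3 can now be admitted
```

In this model the restarted node stays stuck until member-2 either comes back 
or is downed, which matches the behaviour reported below.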

On Sun, Jan 22, 2017 at 8:53 AM, Alfasi, Shlomi <shlomi.alf...@hpe.com> wrote:
Hi All,

I configured a clustered setup with 3 nodes (the akka.conf of one of the nodes 
is attached).
At one point, one of the cluster members was down, and I then restarted 
another node.
The restarted node fails to read from the datastore and repeatedly throws 
exceptions [1].
On the node that stayed up the whole time, a message is logged every 10 
seconds implying that the restarted node cannot join [2].

What is the expected behavior in this case? Is this state recoverable?

Shlomi

[1]
WARN  | ult-dispatcher-2 | DataStoreAppConfigMetadata       | 153 - 
org.opendaylight.controller.blueprint - 0.5.2.SNAPSHOT | 
org.opendaylight.netvirt.elanmanager-impl (elanConfig): Read of app config 
org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.elan.config.rev150710.ElanConfig 
failed - retrying
ReadFailedException{message=Error executeRead ReadData for path 
/(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, 
errorList=[RpcError [message=Error executeRead ReadData for path 
/(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, 
severity=ERROR, errorType=APPLICATION, tag=operation-failed, 
applicationTag=null, info=null, 
cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: 
Shard member-3-shard-default-config currently has no leader. Try again later.]]}

[2]
2017-01-22 15:19:56,290 | INFO  | lt-dispatcher-22 | 
kka://opendaylight-cluster-data) | 159 - com.typesafe.akka.slf4j - 2.4.7 | 
Cluster Node [akka.tcp://opendaylight-cluster-data@10.0.77.33:2550] - New 
incarnation of existing member [Member(address = 
akka.tcp://opendaylight-cluster-data@10.0.97.128:2550, status = Down)] is 
trying to join. Existing will be removed from the cluster and then new member 
will be allowed to join.

_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev

