Re: [controller-dev] [mdsal-dev] cluster - recovery from dual failure

Sela, Guy Sun, 22 Jan 2017 07:28:58 -0800

Sorry, I just read the weakly-up documentation.
It sounds like it will solve Shlomi’s problem.
What did you mean when you said it gets us only “partly” to the way we want it 
to work? What’s missing?

From: Tom Pantelis [mailto:[email protected]]
Sent: Sunday, January 22, 2017 5:08 PM
To: Sela, Guy <[email protected]>
Cc: Alfasi, Shlomi <[email protected]>; 
[email protected]; [email protected]
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

That's the way it works and the akka designers have reasons for it. They added 
"weakly-up" which gets it partly to the way we would want it to work and 
they've said they may add more options to better control the behavior.

You can enable auto-down in your setup, or use an external script that monitors 
the process and, if it goes down, sends a "down" request (via jolokia) to the 
cluster leader.
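
A minimal sketch of the external-script option, under stated assumptions: 
Jolokia is taken to be reachable at /jolokia on the leader's HTTP port, the 
akka cluster MBean is taken to be registered as akka:type=Cluster with a 
down(address) operation, and the URL, host and member address shown are 
hypothetical. Authentication and the process-monitoring loop are left out.

```python
import json
import urllib.request

# Assumptions (not from this thread): endpoint URL and MBean name.
JOLOKIA_URL = "http://127.0.0.1:8181/jolokia"  # hypothetical leader endpoint
CLUSTER_MBEAN = "akka:type=Cluster"            # assumed akka cluster MBean name


def build_down_request(member_address):
    """Build a Jolokia 'exec' payload that calls down(address) on the cluster MBean."""
    return {
        "type": "exec",
        "mbean": CLUSTER_MBEAN,
        "operation": "down",
        "arguments": [member_address],
    }


def send_down_request(member_address, url=JOLOKIA_URL, timeout=5.0):
    """POST the 'down' request to Jolokia and return the decoded JSON reply."""
    body = json.dumps(build_down_request(member_address)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A monitoring script would call send_down_request with the akka address of the 
failed member (for example a hypothetical 
"akka.tcp://opendaylight-cluster-data@10.0.0.3:2550") once it has decided the 
process is really gone. Downing a node that is only partitioned, not dead, can 
split the cluster, so that decision is the hard part.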

On Sun, Jan 22, 2017 at 9:37 AM, Sela, Guy <[email protected]> wrote:
Hi,
Just read the documentation, very interesting.
So that means that an ODL cluster can’t automatically recover from more than a 
single concurrent failure.
Even if we had a cluster of 10 nodes, if one becomes unreachable, none of the 
others can restart until the first one is reachable again.
That sounds like a serious restriction for production.
Are there any best practices for dealing with this situation (without manual 
intervention)?

From: [email protected] [mailto:[email protected]] On Behalf Of Tom Pantelis
Sent: Sunday, January 22, 2017 4:30 PM
To: Alfasi, Shlomi <[email protected]>
Cc: [email protected]; [email protected]
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

This is a side effect of how akka clustering works. All unreachable nodes must 
first become reachable again, or the status of the unreachable nodes must be 
changed to 'Down', either manually or through auto-downing. You can enable 
auto-downing, but akka doesn't recommend it in production 
(http://doc.akka.io/docs/akka/current/java/cluster-usage.html).
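
The rule above can be sketched as a toy model. This is a simplification for 
illustration only, not akka's implementation: the Member class and both 
function names are hypothetical, and only the statuses mentioned in this 
thread are considered.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    status: str      # e.g. "Up", "Joining", "Down"
    reachable: bool


def blocking_members(members):
    """Return the members that are unreachable and not yet marked Down."""
    return [m for m in members if not m.reachable and m.status != "Down"]


def can_admit_joining(members):
    """True when no unreachable, non-Down member blocks a restarted node from joining."""
    return not blocking_members(members)


# Scenario from this thread: member-2 is down (unreachable) and member-3 restarts.
cluster = [
    Member("member-1", "Up", True),
    Member("member-2", "Up", False),
    Member("member-3", "Joining", True),
]
blocked_before = can_admit_joining(cluster)   # False: member-2 blocks the join

cluster[1].status = "Down"                    # manual or automatic downing
allowed_after = can_admit_joining(cluster)    # True: nothing blocks the join
```

In this model the restarted member-3 stays blocked until member-2 either 
comes back or is marked Down, which matches the behavior described above.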

On Sun, Jan 22, 2017 at 8:53 AM, Alfasi, Shlomi <[email protected]> wrote:
Hi All,

I configured a clustered setup with 3 nodes (the akka.conf of one of the nodes 
is attached).
At one point, one of the members in the cluster was down, and then I restarted 
another node.
On the restarted node, I see that it fails to read information from the 
datastore and repeatedly throws exceptions [1].
On the node that was always up, a log entry appears every 10 seconds implying 
that the restarted node cannot join [2].

What is the expected behavior in this case? Is this state recoverable?

Shlomi

[1]
WARN  | ult-dispatcher-2 | DataStoreAppConfigMetadata       | 153 - org.opendaylight.controller.blueprint - 0.5.2.SNAPSHOT | org.opendaylight.netvirt.elanmanager-impl (elanConfig): Read of app config org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.elan.config.rev150710.ElanConfig failed - retrying
ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-3-shard-default-config currently has no leader. Try again later.]]}

[2]
2017-01-22 15:19:56,290 | INFO  | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 159 - com.typesafe.akka.slf4j - 2.4.7 | Cluster Node [akka.tcp://[email protected]:2550] - New incarnation of existing member [Member(address = akka.tcp://[email protected]:2550, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev

