Sorry, I just read the weakly-up documentation. It sounds like it will solve Shlomi’s problem. What did you mean when you said it gets us only “partly” to the way we want it to work? What’s missing?
From: Tom Pantelis [mailto:tompante...@gmail.com]
Sent: Sunday, January 22, 2017 5:08 PM
To: Sela, Guy <guy.s...@hpe.com>
Cc: Alfasi, Shlomi <shlomi.alf...@hpe.com>; controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

That's the way it works and the akka designers have reasons for it. They added "weakly-up", which gets it partly to the way we would want it to work, and they've said they may add more options to better control the behavior.

You can enable auto-down in your setup. Or use an external script to monitor the process and, if it goes down, send a "down" request (via jolokia) to the cluster leader.

On Sun, Jan 22, 2017 at 9:37 AM, Sela, Guy <guy.s...@hpe.com> wrote:

Hi,
Just read the documentation, very interesting. So that means that an ODL cluster can’t automatically recover from more than a single concurrent failure. Even if we had a cluster of 10 nodes, if one becomes unreachable, none of the others can restart until the first one is reachable again. That sounds like a serious restriction for production. Are there any best practices for dealing with this situation (without manual intervention)?

From: mdsal-dev-boun...@lists.opendaylight.org [mailto:mdsal-dev-boun...@lists.opendaylight.org] On Behalf Of Tom Pantelis
Sent: Sunday, January 22, 2017 4:30 PM
To: Alfasi, Shlomi <shlomi.alf...@hpe.com>
Cc: controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
Subject: Re: [mdsal-dev] [controller-dev] cluster - recovery from dual failure

This is a side effect of how akka clustering works.
All unreachable nodes must first become reachable again, or the status of the unreachable nodes must be changed to 'Down', either manually or by auto-downing. You can enable auto-downing but akka doesn't recommend it in production (http://doc.akka.io/docs/akka/current/java/cluster-usage.html).

On Sun, Jan 22, 2017 at 8:53 AM, Alfasi, Shlomi <shlomi.alf...@hpe.com> wrote:

Hi All,
I configured a clustered setup with 3 nodes (attached is the akka.conf of one of the nodes).
At a specific time one of the members in the cluster was down, and then I restarted another node.
In the restarted node I see that it fails to read information from the datastore and repeatedly throws exceptions [1].
In the node that was always up, every 10 seconds there is a log entry implying that the restarted node doesn’t manage to join [2].
What is the expected behavior in this case? Is this state recoverable?

Shlomi

[1]
WARN | ult-dispatcher-2 | DataStoreAppConfigMetadata | 153 - org.opendaylight.controller.blueprint - 0.5.2.SNAPSHOT | org.opendaylight.netvirt.elanmanager-impl (elanConfig): Read of app config org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.elan.config.rev150710.ElanConfig failed - retrying
ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-3-shard-default-config currently has no leader. Try again later.]]}

[2]
2017-01-22 15:19:56,290 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 159 - com.typesafe.akka.slf4j - 2.4.7 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.0.77.33:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.0.97.128:2550, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
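The external-script option mentioned in the thread (send a "down" request via jolokia to the cluster leader when a member's process dies) could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not a tested ODL procedure: the MBean name `akka:type=Cluster` and its `down` operation are assumed to be what Akka exposes over JMX, the Jolokia URL is whatever endpoint your installation serves, and the helper names `build_down_request` and `send_down` are made up for this example. Authentication and process monitoring are left out.

```python
import json
import urllib.request

# Assumption: Akka registers its cluster management MBean under this name.
CLUSTER_MBEAN = "akka:type=Cluster"


def build_down_request(member_address):
    """Build a Jolokia 'exec' payload asking the cluster to mark a member Down.

    member_address is the Akka address of the unreachable member, in the same
    form the log prints it (akka.tcp://<system>@<host>:<port>).
    """
    return {
        "type": "exec",
        "mbean": CLUSTER_MBEAN,
        "operation": "down",
        "arguments": [member_address],
    }


def send_down(jolokia_url, member_address, timeout=5.0):
    """POST the 'down' request to a Jolokia endpoint and return the decoded reply.

    jolokia_url is an assumption about your deployment (for example, the
    Jolokia endpoint of the current cluster leader). No credentials are sent.
    """
    body = json.dumps(build_down_request(member_address)).encode("utf-8")
    request = urllib.request.Request(
        jolokia_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A monitoring script would call `send_down(...)` with the leader's Jolokia URL and the address of the member whose process it saw exit. Keeping the payload construction in its own function makes it possible to check the request without touching a live cluster.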