Re: [controller-dev] [mdsal-dev] cluster - recovery from dual failure

Tom Pantelis Sun, 22 Jan 2017 08:47:57 -0800

There are no plans right now that I know of.  Of course contributions are
welcome from anyone :)


The script I mentioned isn't equivalent to akka's auto-down - it would
utilize manual downing but with the knowledge that the process is actually
down. The reason auto down isn't recommended is b/c it makes an assumption
that the remote process is down w/o knowing for sure.
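
A minimal sketch of the "down" step of such a script, in Python. It rests on
assumptions that are not stated in this thread: Jolokia is reachable at a URL
such as http://<host>:8181/jolokia, the akka cluster MBean is registered as
akka:type=Cluster and exposes a down(address) operation (as the akka JMX
documentation describes), and authentication is left out. The function names
and the process_is_dead callback are hypothetical. How to confirm that the
process is really gone is left to the caller.

```python
import json
import urllib.request

# Assumption: object name of the akka cluster MBean as exposed over JMX.
AKKA_CLUSTER_MBEAN = "akka:type=Cluster"


def build_down_request(member_address):
    """Build a Jolokia 'exec' payload that invokes down(address) on the MBean."""
    return {
        "type": "exec",
        "mbean": AKKA_CLUSTER_MBEAN,
        "operation": "down",
        "arguments": [member_address],
    }


def send_down(jolokia_url, member_address, timeout=5.0):
    """POST the 'down' request to a Jolokia endpoint and return the parsed reply.

    Authentication is omitted here; a real deployment will likely need it.
    """
    body = json.dumps(build_down_request(member_address)).encode("utf-8")
    request = urllib.request.Request(
        jolokia_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def down_if_confirmed_dead(jolokia_url, member_address, process_is_dead):
    """Send 'down' only when the caller has positively confirmed the process is gone.

    process_is_dead is a hypothetical zero-argument callable supplied by the
    caller (for example, a check against the local process table on that host).
    Returns None without contacting Jolokia when the process is not confirmed dead.
    """
    if not process_is_dead():
        return None
    return send_down(jolokia_url, member_address)
```

The point of the callback is the difference from auto-down: nothing is sent
unless something with direct knowledge of the process says it is down.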

On Sun, Jan 22, 2017 at 10:15 AM, Sela, Guy <guy.s...@hpe.com> wrote:

> Auto-down or an external script (which is the equivalent of auto-down) are
> not recommended for production, as you mentioned.
>
> So for production, do we have a plan to extend some capabilities that will
> give us some of the power of
> http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/java/split-brain-resolver.html?
>
>
>
>
>
> *From:* Tom Pantelis [mailto:tompante...@gmail.com]
> *Sent:* Sunday, January 22, 2017 5:08 PM
> *To:* Sela, Guy <guy.s...@hpe.com>
> *Cc:* Alfasi, Shlomi <shlomi.alf...@hpe.com>;
> controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
>
> *Subject:* Re: [mdsal-dev] [controller-dev] cluster - recovery from dual
> failure
>
>
>
> That's the way it works and the akka designers have reasons for it. They
> added "weakly-up" which gets it partly to the way we would want it to work
> and they've said they may add more options to better control the behavior.
>
>
>
> You can enable auto-down in your setup, or use an external script that
> monitors the process and, if it goes down, sends a "down" request (via
> jolokia) to the cluster leader.
>
>
>
> On Sun, Jan 22, 2017 at 9:37 AM, Sela, Guy <guy.s...@hpe.com> wrote:
>
> Hi,
>
> Just read the documentation, very interesting.
>
> So that means that ODL Cluster can’t automatically recover from more than
> a single concurrent failure.
>
> Even if we had a cluster of 10 nodes, if one becomes unreachable, none of
> the others can restart until the first one is reachable again.
>
> Sounds like a serious restriction for production.
>
> Are there any best practices for dealing with this situation (without
> manual intervention)?
>
>
>
> *From:* mdsal-dev-boun...@lists.opendaylight.org [mailto:
> mdsal-dev-boun...@lists.opendaylight.org] *On Behalf Of *Tom Pantelis
> *Sent:* Sunday, January 22, 2017 4:30 PM
> *To:* Alfasi, Shlomi <shlomi.alf...@hpe.com>
> *Cc:* controller-dev@lists.opendaylight.org;
> mdsal-dev@lists.opendaylight.org
> *Subject:* Re: [mdsal-dev] [controller-dev] cluster - recovery from dual
> failure
>
>
>
> This is a side effect of how akka clustering works. All unreachable nodes
> must first become reachable again, or the status of the unreachable nodes
> must be changed to 'Down', either manually or by auto-downing. You can
> enable auto-downing but akka doesn't recommend it in production
> (http://doc.akka.io/docs/akka/current/java/cluster-usage.html).
>
>
>
> On Sun, Jan 22, 2017 at 8:53 AM, Alfasi, Shlomi <shlomi.alf...@hpe.com>
> wrote:
>
> Hi All,
>
>
>
> I configured a clustered setup with 3 nodes (attached the akka.conf of one
> of the nodes).
>
> At some point one of the members in the cluster was down, and then I
> restarted another node.
>
> On the restarted node I see that it fails to read information from the
> datastore and repeatedly throws exceptions [1]
>
> On the node that was always up, every 10 seconds there is a log message
> implying that the restarted node is unable to join [2]
>
>
>
> What is the expected behavior in this case? Is this state recoverable?
>
>
>
> Shlomi
>
>
>
> [1]
>
> WARN  | ult-dispatcher-2 | DataStoreAppConfigMetadata       | 153 -
> org.opendaylight.controller.blueprint - 0.5.2.SNAPSHOT |
> org.opendaylight.netvirt.elanmanager-impl (elanConfig): Read of app
> config
> org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.elan.config.rev150710.ElanConfig
> failed - retrying
>
> ReadFailedException{message=Error executeRead ReadData for path
> /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config,
> errorList=[RpcError [message=Error executeRead ReadData for path
> /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config,
> severity=ERROR, errorType=APPLICATION, tag=operation-failed,
> applicationTag=null, info=null,
> cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException:
> Shard member-3-shard-default-config currently has no leader. Try again
> later.]]}
>
>
>
> [2]
>
> 2017-01-22 15:19:56,290 | INFO  | lt-dispatcher-22 |
> kka://opendaylight-cluster-data) | 159 - com.typesafe.akka.slf4j - 2.4.7
> | Cluster Node [akka.tcp://opendaylight-cluster-data@10.0.77.33:2550] -
> New incarnation of existing member
> [Member(address = akka.tcp://opendaylight-cluster-data@10.0.97.128:2550,
> status = Down)] is trying to join. Existing will be removed from the
> cluster and then new member will be allowed to join.
>
>
>
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
>
>
>
>
