Hi Anton,

The scenario and sequence you describe will always result in this issue because you are starting up an old instance first. You should consider the following options:
1. Always have at least part of the cluster running before bringing members online or offline. That way, when new members join, there will never be an issue with potentially stale data.

2. Given that your scenario doesn't really reflect HA (at some point all your nodes are down), you could also have shut down the whole cluster at once (thereby ensuring consistency) and then bring up whatever members need to be brought up first.

If you need to actually recover this cluster, you might have to export the cluster configuration diskstore and re-import it into a new locator in order to get things going. I'd suggest you reach out to your Support contact to determine whether there is such a procedure. (A rough sketch of the related gfsh commands follows after the quoted message below.)

--Jens

On Tue, Nov 28, 2017 at 6:54 AM, Anton Mironenko <[email protected]> wrote:

> Hello,
>
> There is one use case which can seriously affect the High Availability of
> Geode.
>
> The topology is 2 hosts, with 1 locator and 1 GF server on each host.
> “enable-cluster-configuration=true” is used.
>
> Here is the flow:
>
> 1) The GF cluster was up and running;
>
> 2) host1 was brought down due to VM issues. As a result, locator1 and
> server1 were down;
>
> 3) then host2 was brought down due to VM issues. As a result, locator2 and
> server2 were down;
>
> 4) host1 was brought back to life, but locator1 started with the following
> message:
>
> "Cluster configuration service is waiting for other locators with newer
> shared configuration data.
> This locator might have stale cluster configuration data.
> Following locators contain potentially newer cluster configuration data"
>
> server1 tried to join locator1 and exited with the following error in
> cacheserver.log:
>
> [error 2017/11/28 17:44:34.417 MSK host1-server-1 <main> tid=0x1]
> org.apache.geode.GemFireConfigException: cluster configuration service
> not available
>
> [severe 2017/11/28 17:44:34.428 MSK host1-server-1 <main> tid=0x1] Cache
> server error
> org.apache.geode.GemFireConfigException: cluster configuration service
> not available
>   at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1058)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:817)
>   …
> Caused by: org.apache.geode.internal.process.ClusterConfigurationNotAvailableException:
> Unable to retrieve cluster configuration from the locator.
>   at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:257)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1021)
>   ... 8 more
>
> Since GemFire provides HA, what is the way to bring back to life the first
> half of the cluster on host1: locator1 and server1?
>
> Let's say host2 will be down for 1 week; during this time we have to
> operate.
>
> What is the way to join the second half of the cluster on host2: locator2
> and server2?
>
> How to reproduce this issue:
>
> https://issues.apache.org/jira/secure/attachment/12870290/geode-host1.zip
> https://issues.apache.org/jira/secure/attachment/12870291/geode-host2.zip
> (these are from https://issues.apache.org/jira/browse/GEODE-3003)
>
> 1) extract geode-host1.zip to host1 and geode-host2.zip to host2
>
> 2) in start-locator.sh, adjust the locator IPs to your values:
>    --locators=10.50.3.38[20236],10.50.3.14[20236] \
>
> 3) run start-locator.sh on host1
>
> 4) run start-locator.sh on host2
>
> 5) run start-server.sh on host1
>
> 6) run start-server.sh on host2 - check the 4 members via gfsh "list members";
>    everything is fine at this point
>
> 7) kill locator-PID and server-PID on host1
>
> 8) kill locator-PID and server-PID on host2
>
> 9) run start-locator.sh on host1 - observe the "stale cluster
>    configuration data" message
>
> 10) run start-server.sh on host1 - observe "cluster configuration service
>    not available" and the server exit
>
> Anton Mironenko
> Software Architect
> Amdocs ASP team
>
> This message and the information contained herein is proprietary and
> confidential and subject to the Amdocs policy statement,
> which you may review at https://www.amdocs.com/about/email-disclaimer
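For reference, here is a rough sketch of the gfsh commands that move cluster configuration between locators, assuming a locator that still holds the configuration can be started and connected to. The file name and locator endpoints are placeholders, and these commands export the configuration itself rather than the underlying diskstore, so the exact recovery procedure Jens refers to may differ; treat this as a starting point for the conversation with Support rather than a verified recovery recipe.

    # On a locator that still holds the (newer) configuration, export it to a zip:
    gfsh> connect --locator=host2[20236]
    gfsh> export cluster-configuration --zip-file-name=/tmp/cluster-config.zip

    # On the locator that needs the configuration, import the saved zip
    # before any cache servers are started:
    gfsh> connect --locator=host1[20236]
    gfsh> import cluster-configuration --zip-file-name=/tmp/cluster-config.zip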

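For anyone reproducing the steps above: the start-locator.sh and start-server.sh scripts themselves are only available in the JIRA attachments, but a minimal pair of gfsh commands matching the described topology might look like the following. The member names, the port, and the cluster-configuration flags are assumptions for illustration; only the --locators value is taken from the original message.

    # start-locator.sh (sketch) - run on each host with its own --name
    gfsh start locator --name=locator1 --port=20236 \
      --locators=10.50.3.38[20236],10.50.3.14[20236] \
      --enable-cluster-configuration=true

    # start-server.sh (sketch) - the server fetches its configuration from the locators
    gfsh start server --name=server1 \
      --locators=10.50.3.38[20236],10.50.3.14[20236] \
      --use-cluster-configuration=true

With a setup along these lines, killing both locators and then restarting only locator1 (steps 7-10) leaves locator1 waiting for the potentially newer configuration held by locator2, which is what produces the "stale cluster configuration data" message and the server's "cluster configuration service not available" exit.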