RE: Geode Exception: cluster configuration service not available

2017-05-25 Thread Anton Mironenko
Hi Aravind,
Did you perhaps miss the locator property

  --enable-cluster-configuration=true \
?
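
For reference, a minimal sketch of starting the members with the cluster configuration service enabled, if gfsh is used (member names, hosts and ports are placeholders):

  gfsh start locator --name=locator1 --port=20236 --enable-cluster-configuration=true
  gfsh start server --name=server1 --locators=host1[20236],host2[20236] --use-cluster-configuration=true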

Anton Mironenko
Software Architect
Amdocs ASP team

-----Original Message-----
From: Aravind Musigumpula 
Sent: Thursday, May 25, 2017 10:54
To: dev@geode.apache.org
Subject: RE: Geode Exception: cluster configuration service not available

Hi,
Currently I am using the following configuration :

locator-specific-props (used by both locators):
--J=-Dgemfire.jmx-manager=true
--J=-Dgemfire.jmx-manager-start=true
--J=-Dgemfire.jmx-manager-port=(portno.)
--J=-Dgemfire.jmx-manager-http-port=(portno.)
locator-common-props (used by both locators):
--initial-heap=256M --max-heap=256M
--mcast-port=0
--locators=address(port), address(port)
--dir=.
--J=-XX:+UseParNewGC --J=-XX:+UseConcMarkSweepGC
--J=-XX:CMSInitiatingOccupancyFraction=60
--J=-XX:+UseCMSInitiatingOccupancyOnly --J=-XX:+CMSParallelRemarkEnabled
--J=-XX:+DisableExplicitGC --J=-XX:+CMSClassUnloadingEnabled --J=-verbose:gc
--J=-Xloggc:locator-gc.log --J=-XX:+PrintGCDateStamps --J=-XX:+PrintGCDetails
--J=-XX:+PrintTenuringDistribution --J=-XX:+PrintGCApplicationConcurrentTime
--J=-XX:+PrintGCApplicationStoppedTime
server-common-props (used by both servers):
locators=address(port), address(port)
cache-xml-file=./Server.xml
-J-Dgemfire.ALLOW_PERSISTENT_TRANSACTIONS=true
enable-network-partition-detection=true
disable-auto-reconnect=true
statistic-archive-file=statistics.log
server-specific-props-1 (used by both servers):
jmx-manager=false
jmx-manager-start=false
server-jvm-props (used by both servers):
-J-Xmx256M -J-Xms256M
-J-XX:+UseParNewGC -J-XX:+UseConcMarkSweepGC
-J-XX:CMSInitiatingOccupancyFraction=60 -J-XX:+UseCMSInitiatingOccupancyOnly
-J-XX:+CMSParallelRemarkEnabled -J-XX:+DisableExplicitGC
-J-XX:+CMSClassUnloadingEnabled -J-verbose:gc
-J-Xloggc:server-gc.log -J-XX:+PrintGCDateStamps -J-XX:+PrintGCDetails
-J-XX:+PrintTenuringDistribution -J-XX:+PrintGCApplicationConcurrentTime
-J-XX:+PrintGCApplicationStoppedTime

Initially it was set to true; since that was not working, I set it to false and
tried again, but that did not work either.



Thanks,
Aravind Musigumpula 

-----Original Message-----
From: Jinmei Liao [mailto:jil...@pivotal.io]
Sent: Wednesday, May 24, 2017 8:58 PM
To: dev@geode.apache.org
Subject: Re: Geode Exception: cluster configuration service not available

Aravind, can you provide us with your startup script and the relevant 
locator/server properties file? Is there any reason you want to set the 
server's "disable-auto-reconnect" to false?

On Wed, May 24, 2017 at 4:41 AM, Aravind Musigumpula < 
aravind.musigump...@amdocs.com> wrote:

>
> Hi,
>
> I am using cluster configuration in Geode 1.1.1. I am starting two 
> locators on different hosts and one server for each locator. When I 
> stop them and restart the cluster, I can see that one of the locators' 
> membership views contains only one locator. In gfsh list members, 
> I can see only that host's locator, but not its server, nor the other 
> locator and its server.
>
> I tried enabling the following parameters:
> In locator-specific-props: I have set "enable-cluster-configuration=true"
> In server-common-props: I have set "disable-auto-reconnect=false", 
> "use-cluster-configuration=true"
>
> In the server cache log, I am getting the exception:
> Cache server error
> org.apache.geode.GemFireConfigException: cluster configuration service 
> not available
> at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1067)
> at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1200)
> at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:798)
> at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:783)
> at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:178)
> at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:171)
> at org.apache.geode.internal.cache.CacheServerLauncher.createCache(CacheServerLauncher.java:813)
> at org.apache.geode.internal.cache.CacheServerLauncher.server(CacheServerLauncher.java:657)
> at org.apache.geode.internal.cache.CacheServerLauncher.main(CacheServerLauncher.java:201)
> Caused by: org.apache.geode.internal.process.ClusterConfigurationNotAvailableException: Unable to retrieve cluster 
> configuration from the locator.
> at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.

2 unit tests fail in geode-core

2017-08-01 Thread Anton Mironenko
Hello,
Two unit tests fail in geode-core on the latest commit 
9d59402b71beea84199c79399aa0260955a19d2c (August 1), whereas all unit tests 
passed on aa4878ef19d27a78454dc10b451199f088a2f37d (July 18).

What is the easiest way to fix this?

gradlew geode-core:test
...
org.apache.geode.distributed.LocatorLauncherTest > 
testSetBindAddressToNonLocalHost FAILED
java.lang.Exception: Unexpected exception, 
expected but 
was

Caused by:
org.junit.ComparisonFailure: expected:<[yahoo.com is not an address for 
this machine].> but was:<[The hostname/IP address to which the Locator will be 
bound is unknown].>
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.geode.distributed.LocatorLauncherTest.testSetBindAddressToNonLocalHost(LocatorLauncherTest.java:167)

org.apache.geode.distributed.ServerLauncherTest > 
testSetServerBindAddressToNonLocalHost FAILED
   java.lang.Exception: Unexpected exception, 
expected but 
was

Caused by:
org.junit.ComparisonFailure: expected:<[yahoo.com is not an address for 
this machine].> but was:<[The hostname/IP address to which the Server will be 
bound is unknown].>

Anton Mironenko
Software Architect
Amdocs ASP team



RE: 2 unit tests fail in geode-core

2017-08-01 Thread Anton Mironenko
Hi Nabarun,
My connectivity to the Internet goes through an HTTP proxy; maybe that is the 
reason. But that is not an excuse for a test to fail.
Two weeks ago the tests passed, and now they don't. It seems something was 
broken in the code.

BR, 
Anton

-----Original Message-----
From: Nabarun Nag [mailto:n...@apache.org] 
Sent: Tuesday, August 01, 2017 18:06
To: dev@geode.apache.org
Subject: Re: 2 unit tests fail in geode-core

Hi Anton,

I was able to reproduce the issue when I shut down Wi-Fi and unplugged the 
ethernet cable from my Mac (no network connections active). Once Wi-Fi is 
switched on or ethernet is connected to the machine, the tests pass.
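
To confirm, once a network interface is up again the two tests can be re-run on their own; a sketch using Gradle's standard test filter:

  gradlew geode-core:test --tests org.apache.geode.distributed.LocatorLauncherTest --tests org.apache.geode.distributed.ServerLauncherTest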

Regards
Nabarun Nag

> On Aug 1, 2017, at 7:48 AM, Anton Mironenko  wrote:
> 
> gradlew geode-core:test




[GitHub] geode pull request #677: GEODE-3038: A server process shuts down quietly whe...

2017-08-02 Thread anton-mironenko
GitHub user anton-mironenko opened a pull request:

https://github.com/apache/geode/pull/677

GEODE-3038: A server process shuts down quietly when path to cache.xml is 
incorrect

[GEODE-3038](https://issues.apache.org/jira/browse/GEODE-3038)
The error 
"Declarative Cache XML file/resource [path-to-cache-xml] does not exist"
is written to the log from GemFireCacheImpl.basicCreate() only after 
GemFireCacheImpl.close() has been called, so by that point writing to the log 
is no longer possible.
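
A quick way to observe the symptom is to point a server at a cache.xml path that does not exist; a sketch, assuming a gfsh-started server (name and path are placeholders):

  gfsh start server --name=server1 --cache-xml-file=/no/such/path/cache.xml

Per the ticket, before the fix the process shuts down without the "does not exist" error reaching the log.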

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in 
the commit message?

- [ ] Has your PR been rebased against the latest commit within the target 
branch (typically `develop`)?

- [x] Is your initial contribution a single, squashed commit?

- [x] Does `gradlew build` run cleanly?

- [ ] Have you written or updated unit tests to verify your changes?

- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and
submit an update to your PR as soon as possible. If you need help, please 
send an
email to dev@geode.apache.org.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/anton-mironenko/geode develop

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/geode/pull/677.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #677


commit f119fb381d4212e8714d26f2388860206daa3d8a
Author: Anton Mironenko 
Date:   2017-08-01T12:18:12Z

write cachexml-not-found exception into the log in a proper place

commit bf0b059035751b3d7cba82a145546edb7efce824
Author: Anton Mironenko 
Date:   2017-08-01T12:52:14Z

Merge remote-tracking branch 'upstream/develop' into GEODE-3038






[GitHub] geode issue #677: GEODE-3038: A server process shuts down quietly when path ...

2017-08-03 Thread anton-mironenko
Github user anton-mironenko commented on the issue:

https://github.com/apache/geode/pull/677
  
@dschneider-pivotal Thank you for your feedback. I've replaced 
CacheXmlException with RuntimeException. Sorry for the two duplicate commits 
instead of just one.




[GitHub] geode pull request #677: GEODE-3038: A server process shuts down quietly whe...

2017-08-03 Thread anton-mironenko
Github user anton-mironenko commented on a diff in the pull request:

https://github.com/apache/geode/pull/677#discussion_r131154689
  
--- Diff: geode-core/src/main/java/org/apache/geode/internal/cache/GemFireCacheImpl.java ---
@@ -1208,6 +1208,9 @@ private void initialize() {
   this.system.getConfig());
   initializeDeclarativeCache();
   completedCacheXml = true;
+} catch (CacheXmlException e) {
--- End diff --

@dschneider-pivotal Thank you for your feedback. I've replaced 
CacheXmlException with RuntimeException. Sorry for the two duplicate commits 
instead of just one.




[GitHub] geode issue #677: GEODE-3038: A server process shuts down quietly when path ...

2017-08-10 Thread anton-mironenko
Github user anton-mironenko commented on the issue:

https://github.com/apache/geode/pull/677
  
Hello, is there anything I can do to move this pull request forward?
I was asked to use RuntimeException instead of CacheXmlException, which is 
what I did.

This PR contains unsquashed commits, and it was opened from the main 
"develop" branch of my forked repository on GitHub instead of from a proper 
"GEODE-3038" branch.
Maybe that is why the PR is stuck? If so, I could create another PR with a 
single squashed commit from the right branch.
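
If a fresh PR is the way to go, a minimal sketch of preparing it (remote and branch names are assumptions about my local setup):

  git fetch upstream                            # upstream = apache/geode
  git checkout -b GEODE-3038 upstream/develop   # dedicated feature branch
  git merge --squash develop                    # stage my changes from the forked develop as one pending commit
  git commit                                    # write a single commit message referencing GEODE-3038
  git push origin GEODE-3038                    # then open the new PR from this branch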




[GitHub] geode issue #677: GEODE-3038: A server process shuts down quietly when path ...

2017-08-29 Thread anton-mironenko
Github user anton-mironenko commented on the issue:

https://github.com/apache/geode/pull/677
  
Actually, there is already a test that covers exactly this bug:
org.apache.geode.cache30.CacheXml66DUnitTest#testNonExistentFile
But I see that the whole org.apache.geode.cache30 package is currently not 
included in the unit test run. What is the best way to get this test executed?




[GitHub] geode issue #677: GEODE-3038: A server process shuts down quietly when path ...

2017-08-29 Thread anton-mironenko
Github user anton-mironenko commented on the issue:

https://github.com/apache/geode/pull/677
  
Well, this test's category is DistributedTest, not Test; that is why I didn't 
see it in the regular unit test run.
Now the question is: why doesn't this test fail, when it should fail 
without my fix?
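
For reference, a sketch of running that single DUnit test on its own, assuming the Geode build's distributedTest task and Gradle's standard --tests filter:

  gradlew geode-core:distributedTest --tests org.apache.geode.cache30.CacheXml66DUnitTest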




RE: [GitHub] geode issue #677: GEODE-3038: A server process shuts down quietly when path ...

2017-09-11 Thread Anton Mironenko
Hello,
How can we move forward with this pull request?

Somehow my latest comment was not published in the mailing list:

" I've added the test 
org.apache.geode.cache30.CacheXmlNotFoundUnitTest#testCacheXmlNotFoundInRealLog.
 
It tests that an error about missing cache-xml file is indeed printed in the 
text log file, specified via "log-file" parameter.
The existing test 
org.apache.geode.cache30.CacheXml66DUnitTest#testNonExistentFile() is supposed 
to test the same, 
but actually it doesn't. It only checks for an CacheXmlException exception to 
be thrown."

BR, 
Anton

-----Original Message-
From: anton-mironenko [mailto:g...@git.apache.org] 
Sent: Tuesday, August 29, 2017 20:10
To: dev@geode.apache.org
Subject: [GitHub] geode issue #677: GEODE-3038: A server process shuts down 
quietly when path ...

Github user anton-mironenko commented on the issue:

https://github.com/apache/geode/pull/677
  
Well, this test category is DistributedTest, not Test. This is why I didn't 
see it in regular unit tests run. 
Now the question is - why this test doesn't fail, whereas it should fail 
without my fix. 




"existing member used the same name" - visible only in fine/debug logs

2017-11-23 Thread Anton Mironenko
Hello,
Currently, when I start two servers, there is no indication of what went wrong.
Only when I add --log-level=fine do I get a clue about what is going on:

[fine 2017/11/23 19:26:22.911 MSK host1-server-1  tid=0x1] cleaning up 
incompletely started DistributionManager due to exception
org.apache.geode.IncompatibleSystemException: Member 
10.50.3.14(host1-server-1:13008):1024 could not join this distributed 
system because the existing member 10.50.3.38(host1-server-1:6609):1025 
used the same name. Set the "name" gemfire property to a unique value.
at org.apache.geode.distributed.internal.DistributionManager.create(DistributionManager.java:593)
at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:740)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:350)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:336)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:330)
at org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:205)
at org.apache.geode.internal.cache.CacheServerLauncher.connect(CacheServerLauncher.java:792)
at org.apache.geode.internal.cache.CacheServerLauncher.server(CacheServerLauncher.java:677)
at org.apache.geode.internal.cache.CacheServerLauncher.main(CacheServerLauncher.java:214)

My question is:
Why is it a DEBUG category, and not ERROR?

https://github.com/apache/geode/blob/develop/geode-core/src/main/java/org/apache/geode/distributed/internal/DistributionManager.java#L658

} catch (RuntimeException r) {
  if (distributionManager != null) {
    if (logger.isDebugEnabled()) {
      logger.debug("cleaning up incompletely started DistributionManager due to exception", r);
    }
    distributionManager.uncleanShutdown(beforeJoined);
  }
  throw r;
}

Anton Mironenko
Software Architect
Amdocs ASP team



RE: "existing member used the same name" - visible only in fine/debug logs

2017-11-28 Thread Anton Mironenko
Thanks Kirk,
Indeed, when I start a server via gfsh, I explicitly see this error in 
stdout/stderr:

Exception in thread "main" org.apache.geode.IncompatibleSystemException: Member 
10.50.3.14(host1-server-1:19737):1025 could not join this distributed 
system because the existing member 10.50.3.38(host1-server-1:19808):1025 
used the same name. Set the "name" gemfire property to a unique value.
at org.apache.geode.distributed.internal.DistributionManager.create(DistributionManager.java:593)
at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:740)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:350)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:338)
at org.apache.geode.distributed.internal.InternalDistributedSystem.newInstance(InternalDistributedSystem.java:330)
at org.apache.geode.distributed.DistributedSystem.connect(DistributedSystem.java:205)
at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:217)
at org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
at org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:845)
at org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:757)
at org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:684)
at org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:217)

and in the logs:
[info 2017/11/28 19:15:59.112 MSK host1-server-1  tid=0x1] Performing 
final check for suspect member 10.50.3.38(host1-server-1:19808):1025 
reason=member is using the name of 10.50.3.14(host1-server-1:19737):1025
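
For completeness, the collision goes away once each server gets a unique name; a sketch, if gfsh is used (names and locator addresses are placeholders):

  gfsh start server --name=host1-server-1 --locators=host1[20236],host2[20236]
  gfsh start server --name=host2-server-1 --locators=host1[20236],host2[20236]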

-----Original Message-----
From: Kirk Lund [mailto:kl...@apache.org] 
Sent: Monday, November 27, 2017 20:33
To: geode 
Subject: Re: "existing member used the same name" - visible only in fine/debug 
logs

Side note: org.apache.geode.internal.cache.CacheServerLauncher is the old 
deprecated launcher class which may be removed in an upcoming release. You 
should consider moving to org.apache.geode.distributed.ServerLauncher instead. 
GFSH uses ServerLauncher instead of CacheServerLauncher.

On Fri, Nov 24, 2017 at 2:52 PM, Bruce Schuchardt 
wrote:

> I believe this is at debug level because the exception & its text 
> ought to be visible to the person attempting to start the new node.  
> If that's not the case we should probably change this to error/severe 
> level though it likely wouldn't make it to an alert listener because 
> the node is still joining the system.  I see that you're using 
> CacheServerLauncher.  That API and the ServerLauncher API both have a 
> flaw that you should investigate - see GEODE-4013. That flaw can cause 
> a node to appear to have crashed and take a while to clear from the 
> membership view.  I recently saw this same problem of conflicting names and 
> tracked its cause down to this flaw.
>
>

[jira] [Created] (GEODE-3003) Geode doesn't start after cluster restart when using cluster-configuration

2017-05-29 Thread Anton Mironenko (JIRA)
Anton Mironenko created GEODE-3003:
--

 Summary: Geode doesn't start after cluster restart when using 
cluster-configuration
 Key: GEODE-3003
 URL: https://issues.apache.org/jira/browse/GEODE-3003
 Project: Geode
  Issue Type: Bug
  Components: configuration
Reporter: Anton Mironenko


There is a two-host Geode cluster with a locator and a server on each host.
The first start of all nodes goes well.
Then all nodes are gracefully stopped (kill [locator-PID] [server-PID]).
The second start goes wrong: the locator on the first host never joins the rest 
of the cluster, with this error in the locator log:
"Region /_ConfigurationRegion has potentially stale data. It is waiting for 
another member to recover the latest data."

And sometimes (about once in five starts) a server shuts down right after 
starting with the error 
"org.apache.geode.GemFireConfigException: cluster configuration service not 
available".

This bug started appearing only when we moved to Geode 1.1.1, and it totally 
blocks us.
On GemFire 8.2.1 there was no such bug.

This is very easy to reproduce.

Test preparation:
-
There are two attached zip files: "geode-host1.zip" and "geode-host2.zip".
1) unzip "geode-host1.zip" into some folder on your first host
2) in start-locator.sh, change the locator IPs in 
"--locators=10.50.3.38[20236],10.50.3.14[20236]" to the addresses of your host1 
and host2
3) in start-server.sh, change the locator IPs in 
"locators=10.50.3.38[20236],10.50.3.14[20236]" to the addresses of your host1 
and host2
4) repeat steps 1)-3) for host2 with "geode-host2.zip"; the folder where you 
unzip the file should be the same as on the first host

Test running:
---
1) rm -rf {locator0,server1}
2) run ./start-locator.sh; ./start-server.sh on both hosts. See that this 
cluster start is successful.
3) kill locator and server processes on both hosts
kill [locator-PID] [server-PID]
4) run ./start-locator.sh; ./start-server.sh on both hosts
5) see that there are actually two clusters, "host1-locator" and "host1-server, 
host2-locator, host2-server", instead of one cluster. And sometimes there is no 
"host1-server", because it shut down with the error "Region /_ConfigurationRegion 
has potentially stale data. It is waiting for another member to recover the 
latest data.".





[jira] [Updated] (GEODE-3003) Geode doesn't start after cluster restart when using cluster-configuration

2017-05-29 Thread Anton Mironenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Mironenko updated GEODE-3003:
---
Priority: Blocker  (was: Major)

> Geode doesn't start after cluster restart when using cluster-configuration
> --
>
> Key: GEODE-3003
> URL: https://issues.apache.org/jira/browse/GEODE-3003
> Project: Geode
>  Issue Type: Bug
>  Components: configuration
>    Reporter: Anton Mironenko
>Priority: Blocker
>





[jira] [Updated] (GEODE-3003) Geode doesn't start after cluster restart when using cluster-configuration

2017-05-29 Thread Anton Mironenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Mironenko updated GEODE-3003:
---
Attachment: readme.txt
20170522-geode-vyazma.zip
20170522-geode-klyazma.zip
geode-host2.zip
geode-host1.zip

"geode-host1.zip" and "geode-host2.zip" - bash scripts and Server.xml for 
reproducing the issue
"20170522-geode-klyazma.zip" and "20170522-geode-vyazma.zip" - Geode folder 
content with logs

> Geode doesn't start after cluster restart when using cluster-configuration
> --
>
> Key: GEODE-3003
> URL: https://issues.apache.org/jira/browse/GEODE-3003
> Project: Geode
>  Issue Type: Bug
>  Components: configuration
>Reporter: Anton Mironenko
>Priority: Blocker
> Attachments: 20170522-geode-klyazma.zip, 20170522-geode-vyazma.zip, 
> geode-host1.zip, geode-host2.zip, readme.txt
>
>


