Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Owen Nichols
Voting summary:
+1: 8
 0: 0
-1: 0

git cherry-pick -x e148cef9cb63eba283cf86bc490eb280023567ce completed on 
release/1.11.0.

> On Nov 26, 2019, at 1:35 PM, Ivan Godwin wrote:
> 
> +1
> 
> On Tue, Nov 26, 2019 at 1:31 PM Xiaojian Zhou wrote:
> 
>> +1
>> 
>> On Tue, Nov 26, 2019 at 12:48 PM Joris Melchior wrote:
>> 
>>> +1
>>> 
>>> On Tue, Nov 26, 2019 at 2:41 PM Jason Huynh wrote:
>>> 
>>>> +1
>>>> 
>>>> On Tue, Nov 26, 2019 at 11:34 AM Anilkumar Gingade <aging...@pivotal.io> wrote:
>>>> 
>>>>> +1
>>>>> 
>>>>> On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer wrote:
>>>>> 
>>>>>> This is a no-brainer
>>>>>> 
>>>>>> *+1*
>>>>>> 
>>>>>> On 11/26/19 11:27 AM, Owen Nichols wrote:
>>>>>>> I would like to propose bringing “GEODE-7465: Set eventProcessor to null
>>>>>>> in serial AEQ when it is stopped” into the 1.11 release (necessitating an
>>>>>>> RC4).
>>>>>>> 
>>>>>>> Without the fix, a sequence of ordinary gfsh commands will leave the WAN
>>>>>>> gateway in an unrecoverable hung state:
>>>>>>> stop gateway-sender
>>>>>>> start gateway-sender
>>>>>>> The only recourse is to restart the server.
>>>>>>> 
>>>>>>> This fix is critical because the distributed system fails to sync data
>>>>>>> between WAN sites as the user would expect.
>>>>>>> This issue did exist in previous releases, but recent enhancements to
>>>>>>> WAN/AEQ such as AEQ-pause are increasing user interaction with WAN-related
>>>>>>> gfsh commands.
>>>>>>> 
>>>>>>> The fix is simple, low risk, tested, and has been on develop for 5 days:
>>>>>>> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
>>> 
>>> --
>>> *Joris Melchior*
>>> CF Engineering
>>> Pivotal Toronto
>>> 416 877 5427
>>> 
>>> “Programs must be written for people to read, and only incidentally for
>>> machines to execute.” – *Hal Abelson*



Re: Cache.close is not synchronous?

2019-11-26 Thread Kirk Lund
I added a stack trace to the closing of both GemFireCacheImpl and
InternalDistributedSystem and found a difference.

The test passes when it's the test thread doing the close:

java.lang.Throwable: KIRK GemFireCacheImpl closed 1046056441
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1912)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1902)
at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:56)
java.lang.Throwable: KIRK InternalDistributedSystem closed 1311844206
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1912)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1902)
at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:56)

When the test fails and reproduces the problem, the close is apparently
completed by a different background thread:

java.lang.Throwable: KIRK GemFireCacheImpl closed 277876155
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
at java.lang.Thread.run(Thread.java:748)
java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
at java.lang.Thread.run(Thread.java:748)
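
For anyone who wants to repeat this kind of tracing, here is a minimal sketch of the technique. It is not the change made to GemFireCacheImpl or InternalDistributedSystem. It assumes the number printed after "closed" is an identity hash code, which is a guess, and the names CloseTracer and traceClose are hypothetical.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class CloseTracer {
  /**
   * Hypothetical helper: builds a Throwable at the call site so its stack
   * trace records who invoked close, and prefixes the current thread name.
   */
  static String traceClose(Object target, String label) {
    Throwable marker =
        new Throwable(label + " closed " + System.identityHashCode(target));
    StringWriter buffer = new StringWriter();
    PrintWriter writer = new PrintWriter(buffer, true);
    marker.printStackTrace(writer);
    writer.flush();
    return "[" + Thread.currentThread().getName() + "] " + buffer;
  }

  public static void main(String[] args) throws InterruptedException {
    Object cache = new Object(); // stand-in for the object being closed

    // Close traced from the calling thread.
    System.out.println(traceClose(cache, "cache"));

    // Close traced from a background thread; the trace ends in Thread.run
    // instead of the caller's frames.
    String[] fromBackground = new String[1];
    Thread closer =
        new Thread(() -> fromBackground[0] = traceClose(cache, "cache"), "background-closer");
    closer.start();
    closer.join();
    System.out.println(fromBackground[0]);
  }
}
```

Comparing the two printed traces shows at a glance whether the close ran on the caller's thread or on a background one, which is the difference reported above.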

On Tue, Nov 26, 2019 at 9:20 AM Kirk Lund  wrote:

> Seems like this must be a bug, so I filed
> https://issues.apache.org/jira/browse/GEODE-7503. I'll look into it...
>
> On Mon, Nov 25, 2019 at 3:24 PM Anilkumar Gingade 
> wrote:
>
>> Looking at the code, cache.close() and InternalCacheBuilder.create()
>> are both synchronized on "GemFireCacheImpl.class"; it's the
>> InternalCacheBuilder create that seems to be using a reference to the old
>> distributed system.
>> GemFireCacheImpl.getInstance() and getExisting() both perform an
>> "isClosing" check and return early. The InternalCacheBuilder is new;
>> not sure if it's missing the early checks.
>>
>> -Anil.
>>
>> On Mon, Nov 25, 2019 at 2:47 PM Mark Hanson wrote:
>>
>> > +1 to fix.
>> >
>> > > On Nov 25, 2019, at 2:02 PM, John Blum wrote:
>> > >
>> > > +1 ^ 64!
>> > >
>> > > I found this out the hard way some time ago, and it is why STDG exists in
>> > > the first place (i.e. usability issues, particularly with testing).
>> > >
>> > > On Mon, Nov 25, 2019 at 1:41 PM Kirk Lund wrote:
>> > >
>> > >> I found a test that closes the cache and then recreates the cache multiple
>> > >> times with a 2-second sleep between each. I tried to remove the Thread.sleep
>> > >> and found that recreating the cache
>> > >> throws DistributedSystemDisconnectedException (see below).
>> > >>
>> > >> This seems like a usability nightmare. Anyone have any ideas WHY it's this
>> > >> way?
>> > >>
>> > >> Personally, I want Cache.close() to block until both Cache and
>> > >> DistributedSystem are closed and the API is ready to create a new Cache.
>> > >>
>> > >> org.apache.geode.distributed.DistributedSystemDisconnectedException: This
>> > >> connection to a distributed system has been disconnected.
>> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.checkConnected(InternalDistributedSystem.java:945)
>> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.getDistributionManager(InternalDistributedSystem.java:1665)
>> > >> at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:791)
>> > >> at

Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Ivan Godwin
+1

On Tue, Nov 26, 2019 at 1:31 PM Xiaojian Zhou  wrote:

> +1
>
> On Tue, Nov 26, 2019 at 12:48 PM Joris Melchior 
> wrote:
>
> > +1
> >
> > On Tue, Nov 26, 2019 at 2:41 PM Jason Huynh  wrote:
> >
> > > +1
> > >
> > > On Tue, Nov 26, 2019 at 11:34 AM Anilkumar Gingade <
> aging...@pivotal.io>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer 
> wrote:
> > > >
> > > > > > This is a no-brainer
> > > > >
> > > > > *+1*
> > > > >
> > > > > On 11/26/19 11:27 AM, Owen Nichols wrote:
> > > > > > I would like to propose bringing “GEODE-7465: Set eventProcessor
> to
> > > > null
> > > > > in serial AEQ when it is stopped” into the 1.11 release
> > (necessitating
> > > an
> > > > > RC4).
> > > > > >
> > > > > > Without the fix, a sequence of ordinary gfsh commands will leave
> > the
> > > > WAN
> > > > > gateway in an unrecoverable hung state:
> > > > > > stop gateway-sender
> > > > > > start gateway-sender
> > > > > > The only recourse is to restart the server.
> > > > > >
> > > > > > This fix is critical because the distributed system fails to sync
> > > data
> > > > > between WAN sites as the user would expect.
> > > > > > This issue did exist in previous releases, but recent
> enhancements
> > to
> > > > > WAN/AEQ such as AEQ-pause are increasing user interaction with
> > > > WAN-related
> > > > > gfsh commands.
> > > > > >
> > > > > > The fix is simple, low risk, tested, and has been on develop for
> 5
> > > > days:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
> > > > >
> > > >
> > >
> >
> >
> > --
> > *Joris Melchior *
> > CF Engineering
> > Pivotal Toronto
> > 416 877 5427
> >
> > “Programs must be written for people to read, and only incidentally for
> > machines to execute.” – *Hal Abelson*
> > 
> >
>


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Xiaojian Zhou
+1

On Tue, Nov 26, 2019 at 12:48 PM Joris Melchior 
wrote:

> +1
>
> On Tue, Nov 26, 2019 at 2:41 PM Jason Huynh  wrote:
>
> > +1
> >
> > On Tue, Nov 26, 2019 at 11:34 AM Anilkumar Gingade 
> > wrote:
> >
> > > +1
> > >
> > > On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer  wrote:
> > >
> > > > This is a no-brainer
> > > >
> > > > *+1*
> > > >
> > > > On 11/26/19 11:27 AM, Owen Nichols wrote:
> > > > > I would like to propose bringing “GEODE-7465: Set eventProcessor to
> > > null
> > > > in serial AEQ when it is stopped” into the 1.11 release
> (necessitating
> > an
> > > > RC4).
> > > > >
> > > > > Without the fix, a sequence of ordinary gfsh commands will leave
> the
> > > WAN
> > > > gateway in an unrecoverable hung state:
> > > > > stop gateway-sender
> > > > > start gateway-sender
> > > > > The only recourse is to restart the server.
> > > > >
> > > > > This fix is critical because the distributed system fails to sync
> > data
> > > > between WAN sites as the user would expect.
> > > > > This issue did exist in previous releases, but recent enhancements
> to
> > > > WAN/AEQ such as AEQ-pause are increasing user interaction with
> > > WAN-related
> > > > gfsh commands.
> > > > >
> > > > > The fix is simple, low risk, tested, and has been on develop for 5
> > > days:
> > > > >
> > > >
> > >
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
> > > >
> > >
> >
>
>
> --
> *Joris Melchior *
> CF Engineering
> Pivotal Toronto
> 416 877 5427
>
> “Programs must be written for people to read, and only incidentally for
> machines to execute.” – *Hal Abelson*
> 
>


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Joris Melchior
+1

On Tue, Nov 26, 2019 at 2:41 PM Jason Huynh  wrote:

> +1
>
> On Tue, Nov 26, 2019 at 11:34 AM Anilkumar Gingade 
> wrote:
>
> > +1
> >
> > On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer  wrote:
> >
> > > This is a no-brainer
> > >
> > > *+1*
> > >
> > > On 11/26/19 11:27 AM, Owen Nichols wrote:
> > > > I would like to propose bringing “GEODE-7465: Set eventProcessor to
> > null
> > > in serial AEQ when it is stopped” into the 1.11 release (necessitating
> an
> > > RC4).
> > > >
> > > > Without the fix, a sequence of ordinary gfsh commands will leave the
> > WAN
> > > gateway in an unrecoverable hung state:
> > > > stop gateway-sender
> > > > start gateway-sender
> > > > The only recourse is to restart the server.
> > > >
> > > > This fix is critical because the distributed system fails to sync
> data
> > > between WAN sites as the user would expect.
> > > > This issue did exist in previous releases, but recent enhancements to
> > > WAN/AEQ such as AEQ-pause are increasing user interaction with
> > WAN-related
> > > gfsh commands.
> > > >
> > > > The fix is simple, low risk, tested, and has been on develop for 5
> > days:
> > > >
> > >
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
> > >
> >
>


-- 
*Joris Melchior *
CF Engineering
Pivotal Toronto
416 877 5427

“Programs must be written for people to read, and only incidentally for
machines to execute.” – *Hal Abelson*



Re: [VOTE] Release candidate for Apache Geode version 1.11.0.RC3.

2019-11-26 Thread Blake Bender
-1 from native client as well, sorry.  RC3 mistakenly picked up an
unnecessary commit, and left out the crash fix I needed.  If you revert
commit 5d012199055a9a7657563727f6e26a406b287fc3 and
cherry-pick 55da853760c200c53568fe2e6549c912ec26cc27, "GEODE-7426: Fixes
segfault in log message.", native client should be good to go.

Thanks,

Blake



On Tue, Nov 26, 2019 at 11:35 AM Lynn Hughes-Godfrey <
lhughesgodf...@pivotal.io> wrote:

> -1: Analyzing a hang that looks similar to GEODE-5307: Hang with servers
> all in waitForPrimaryMember and one server in NO_PRIMARY_HOSTING state
> https://issues.apache.org/jira/browse/GEODE-5307
>
> On Mon, Nov 25, 2019 at 9:13 PM Mark Hanson  wrote:
>
> > Hello Geode Dev Community,
> >
> > This is a release candidate for Apache Geode version 1.11.0.RC3.
> > Thanks to all the community members for their contributions to this
> > release!
> >
> > Please do a review and give your feedback, including the checks you
> > performed.
> >
> > Voting deadline:
> > 11AM PST Monday December 2 2019.
> >
> > Please note that we are voting upon the source tag:
> > rel/v1.11.0.RC3
> >
> > Release notes:
> >
> >
> https://cwiki.apache.org/confluence/display/GEODE/Release+Notes#ReleaseNotes-1.11.0
> >
> > Source and binary distributions:
> > https://dist.apache.org/repos/dist/dev/geode/1.11.0.RC3/
> >
> > Maven staging repo:
> > https://repository.apache.org/content/repositories/orgapachegeode-1063
> >
> > GitHub:
> > https://github.com/apache/geode/tree/rel/v1.11.0.RC3
> > https://github.com/apache/geode-examples/tree/rel/v1.11.0.RC3
> > https://github.com/apache/geode-native/tree/rel/v1.11.0.RC3
> >
> > Pipelines:
> >
> >
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-release-1-11-0-main
> >
> >
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-release-1-11-0-rc
> >
> > Geode's KEYS file containing PGP keys we use to sign the release:
> > https://github.com/apache/geode/blob/develop/KEYS
> >
> > Command to run geode-examples:
> > ./gradlew -PgeodeReleaseUrl=https://dist.apache.org/repos/dist/dev/geode/1.11.0.RC3 -PgeodeRepositoryUrl=https://repository.apache.org/content/repositories/orgapachegeode-1063 build runAll
> >
> > Regards
> > Mark Hanson
>


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Jason Huynh
+1

On Tue, Nov 26, 2019 at 11:34 AM Anilkumar Gingade 
wrote:

> +1
>
> On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer  wrote:
>
> > This is a no-brainer
> >
> > *+1*
> >
> > On 11/26/19 11:27 AM, Owen Nichols wrote:
> > > I would like to propose bringing “GEODE-7465: Set eventProcessor to
> null
> > in serial AEQ when it is stopped” into the 1.11 release (necessitating an
> > RC4).
> > >
> > > Without the fix, a sequence of ordinary gfsh commands will leave the
> WAN
> > gateway in an unrecoverable hung state:
> > > stop gateway-sender
> > > start gateway-sender
> > > The only recourse is to restart the server.
> > >
> > > This fix is critical because the distributed system fails to sync data
> > between WAN sites as the user would expect.
> > > This issue did exist in previous releases, but recent enhancements to
> > WAN/AEQ such as AEQ-pause are increasing user interaction with
> WAN-related
> > gfsh commands.
> > >
> > > The fix is simple, low risk, tested, and has been on develop for 5
> days:
> > >
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
> >
>


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Anilkumar Gingade
+1

On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer  wrote:

> This is a no-brainer
>
> *+1*
>
> On 11/26/19 11:27 AM, Owen Nichols wrote:
> > I would like to propose bringing “GEODE-7465: Set eventProcessor to null
> in serial AEQ when it is stopped” into the 1.11 release (necessitating an
> RC4).
> >
> > Without the fix, a sequence of ordinary gfsh commands will leave the WAN
> gateway in an unrecoverable hung state:
> > stop gateway-sender
> > start gateway-sender
> > The only recourse is to restart the server.
> >
> > This fix is critical because the distributed system fails to sync data
> between WAN sites as the user would expect.
> > This issue did exist in previous releases, but recent enhancements to
> WAN/AEQ such as AEQ-pause are increasing user interaction with WAN-related
> gfsh commands.
> >
> > The fix is simple, low risk, tested, and has been on develop for 5 days:
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
>


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Alexander Murmann
+1

On Tue, Nov 26, 2019 at 11:32 AM Udo Kohlmeyer  wrote:

> This is a no-brainer
>
> *+1*
>
> On 11/26/19 11:27 AM, Owen Nichols wrote:
> > I would like to propose bringing “GEODE-7465: Set eventProcessor to null
> in serial AEQ when it is stopped” into the 1.11 release (necessitating an
> RC4).
> >
> > Without the fix, a sequence of ordinary gfsh commands will leave the WAN
> gateway in an unrecoverable hung state:
> > stop gateway-sender
> > start gateway-sender
> > The only recourse is to restart the server.
> >
> > This fix is critical because the distributed system fails to sync data
> between WAN sites as the user would expect.
> > This issue did exist in previous releases, but recent enhancements to
> WAN/AEQ such as AEQ-pause are increasing user interaction with WAN-related
> gfsh commands.
> >
> > The fix is simple, low risk, tested, and has been on develop for 5 days:
> >
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
>


-- 
Alexander J. Murmann
(650) 283-1933


Re: [VOTE] Release candidate for Apache Geode version 1.11.0.RC3.

2019-11-26 Thread Lynn Hughes-Godfrey
-1: Analyzing a hang that looks similar to GEODE-5307: Hang with servers
all in waitForPrimaryMember and one server in NO_PRIMARY_HOSTING state
https://issues.apache.org/jira/browse/GEODE-5307

On Mon, Nov 25, 2019 at 9:13 PM Mark Hanson  wrote:

> Hello Geode Dev Community,
>
> This is a release candidate for Apache Geode version 1.11.0.RC3.
> Thanks to all the community members for their contributions to this
> release!
>
> Please do a review and give your feedback, including the checks you
> performed.
>
> Voting deadline:
> 11AM PST Monday December 2 2019.
>
> Please note that we are voting upon the source tag:
> rel/v1.11.0.RC3
>
> Release notes:
>
> https://cwiki.apache.org/confluence/display/GEODE/Release+Notes#ReleaseNotes-1.11.0
>
> Source and binary distributions:
> https://dist.apache.org/repos/dist/dev/geode/1.11.0.RC3/
>
> Maven staging repo:
> https://repository.apache.org/content/repositories/orgapachegeode-1063
>
> GitHub:
> https://github.com/apache/geode/tree/rel/v1.11.0.RC3
> https://github.com/apache/geode-examples/tree/rel/v1.11.0.RC3
> https://github.com/apache/geode-native/tree/rel/v1.11.0.RC3
>
> Pipelines:
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-release-1-11-0-main
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-release-1-11-0-rc
>
> Geode's KEYS file containing PGP keys we use to sign the release:
> https://github.com/apache/geode/blob/develop/KEYS
>
> Command to run geode-examples:
> ./gradlew -PgeodeReleaseUrl=https://dist.apache.org/repos/dist/dev/geode/1.11.0.RC3 -PgeodeRepositoryUrl=https://repository.apache.org/content/repositories/orgapachegeode-1063 build runAll
>
> Regards
> Mark Hanson


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Udo Kohlmeyer

This is a no-brainer

*+1*

On 11/26/19 11:27 AM, Owen Nichols wrote:

I would like to propose bringing “GEODE-7465: Set eventProcessor to null in 
serial AEQ when it is stopped” into the 1.11 release (necessitating an RC4).

Without the fix, a sequence of ordinary gfsh commands will leave the WAN 
gateway in an unrecoverable hung state:
stop gateway-sender
start gateway-sender
The only recourse is to restart the server.

This fix is critical because the distributed system fails to sync data between 
WAN sites as the user would expect.
This issue did exist in previous releases, but recent enhancements to WAN/AEQ 
such as AEQ-pause are increasing user interaction with WAN-related gfsh 
commands.

The fix is simple, low risk, tested, and has been on develop for 5 days:
https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce


Re: [DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Dick Cavender
+1

On Tue, Nov 26, 2019 at 11:27 AM Owen Nichols  wrote:

> I would like to propose bringing “GEODE-7465: Set eventProcessor to null
> in serial AEQ when it is stopped” into the 1.11 release (necessitating an
> RC4).
>
> Without the fix, a sequence of ordinary gfsh commands will leave the WAN
> gateway in an unrecoverable hung state:
> stop gateway-sender
> start gateway-sender
> The only recourse is to restart the server.
>
> This fix is critical because the distributed system fails to sync data
> between WAN sites as the user would expect.
> This issue did exist in previous releases, but recent enhancements to
> WAN/AEQ such as AEQ-pause are increasing user interaction with WAN-related
> gfsh commands.
>
> The fix is simple, low risk, tested, and has been on develop for 5 days:
>
> https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce


[DISCUSS/VOTE] Proposal to bring GEODE-7465 to release/1.11.0

2019-11-26 Thread Owen Nichols
I would like to propose bringing “GEODE-7465: Set eventProcessor to null in 
serial AEQ when it is stopped” into the 1.11 release (necessitating an RC4).

Without the fix, a sequence of ordinary gfsh commands will leave the WAN 
gateway in an unrecoverable hung state:
stop gateway-sender
start gateway-sender
The only recourse is to restart the server.

This fix is critical because the distributed system fails to sync data between 
WAN sites as the user would expect. 
This issue did exist in previous releases, but recent enhancements to WAN/AEQ 
such as AEQ-pause are increasing user interaction with WAN-related gfsh 
commands.

The fix is simple, low risk, tested, and has been on develop for 5 days:
https://github.com/apache/geode/commit/e148cef9cb63eba283cf86bc490eb280023567ce
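
For readers who do not follow the link, the idea in the commit title can be pictured with a toy model. This is not the code from the commit. The class below is hypothetical, and it assumes that start() only builds a new processor when the field is null, which is an illustration and not a description of Geode's serial AEQ implementation.

```java
final class SenderSketch {
  /** Hypothetical stand-in for an event processor; not a Geode type. */
  static final class Processor {
    private boolean running = true;

    boolean isRunning() {
      return running;
    }

    void stop() {
      running = false;
    }
  }

  private final boolean clearOnStop;
  private Processor eventProcessor;

  SenderSketch(boolean clearOnStop) {
    this.clearOnStop = clearOnStop;
  }

  /** Assumption for this sketch: a processor is only created when none is set. */
  void start() {
    if (eventProcessor == null) {
      eventProcessor = new Processor();
    }
  }

  void stop() {
    if (eventProcessor != null) {
      eventProcessor.stop();
      if (clearOnStop) {
        eventProcessor = null; // the pattern named in the commit title
      }
    }
  }

  boolean isDispatching() {
    return eventProcessor != null && eventProcessor.isRunning();
  }

  public static void main(String[] args) {
    for (boolean clear : new boolean[] {false, true}) {
      SenderSketch sender = new SenderSketch(clear);
      sender.start();
      sender.stop();
      sender.start();
      System.out.println("clearOnStop=" + clear + " dispatching=" + sender.isDispatching());
    }
  }
}
```

In this model, a stop followed by a start only yields a working sender when the stopped processor reference is cleared; otherwise the stale, stopped processor is kept.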

Re: Cache.close is not synchronous?

2019-11-26 Thread Kirk Lund
Seems like this must be a bug, so I filed
https://issues.apache.org/jira/browse/GEODE-7503. I'll look into it...

On Mon, Nov 25, 2019 at 3:24 PM Anilkumar Gingade 
wrote:

> Looking at the code, cache.close() and InternalCacheBuilder.create()
> are both synchronized on "GemFireCacheImpl.class"; it's the
> InternalCacheBuilder create that seems to be using a reference to the old
> distributed system.
> GemFireCacheImpl.getInstance() and getExisting() both perform an
> "isClosing" check and return early. The InternalCacheBuilder is new;
> not sure if it's missing the early checks.
>
> -Anil.
>
> On Mon, Nov 25, 2019 at 2:47 PM Mark Hanson wrote:
>
> > +1 to fix.
> >
> > > On Nov 25, 2019, at 2:02 PM, John Blum wrote:
> > >
> > > +1 ^ 64!
> > >
> > > I found this out the hard way some time ago, and it is why STDG exists in
> > > the first place (i.e. usability issues, particularly with testing).
> > >
> > > On Mon, Nov 25, 2019 at 1:41 PM Kirk Lund wrote:
> > >
> > >> I found a test that closes the cache and then recreates the cache multiple
> > >> times with a 2-second sleep between each. I tried to remove the Thread.sleep
> > >> and found that recreating the cache
> > >> throws DistributedSystemDisconnectedException (see below).
> > >>
> > >> This seems like a usability nightmare. Anyone have any ideas WHY it's this
> > >> way?
> > >>
> > >> Personally, I want Cache.close() to block until both Cache and
> > >> DistributedSystem are closed and the API is ready to create a new Cache.
> > >>
> > >> org.apache.geode.distributed.DistributedSystemDisconnectedException: This
> > >> connection to a distributed system has been disconnected.
> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.checkConnected(InternalDistributedSystem.java:945)
> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.getDistributionManager(InternalDistributedSystem.java:1665)
> > >> at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:791)
> > >> at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:187)
> > >> at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
> > >> at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
> > >>
> > >
> > >
> > > --
> > > -John
> > > john.blum10101 (skype)
> >
> >
>


Re: Cache.close is not synchronous?

2019-11-26 Thread Ivan Godwin
+1 for fixing.

On Tue, Nov 26, 2019 at 6:53 AM Alberto Bustamante Reyes
 wrote:

> +1 for fixing it.
> 
> From: Anilkumar Gingade
> Sent: Tuesday, November 26, 2019 0:24
> To: geode
> Subject: Re: Cache.close is not synchronous?
>
> Looking at the code, cache.close() and InternalCacheBuilder.create()
> are both synchronized on "GemFireCacheImpl.class"; it's the
> InternalCacheBuilder create that seems to be using a reference to the old
> distributed system.
> GemFireCacheImpl.getInstance() and getExisting() both perform an
> "isClosing" check and return early. The InternalCacheBuilder is new;
> not sure if it's missing the early checks.
>
> -Anil.
>
> On Mon, Nov 25, 2019 at 2:47 PM Mark Hanson wrote:
>
> > +1 to fix.
> >
> > > On Nov 25, 2019, at 2:02 PM, John Blum wrote:
> > >
> > > +1 ^ 64!
> > >
> > > I found this out the hard way some time ago, and it is why STDG exists in
> > > the first place (i.e. usability issues, particularly with testing).
> > >
> > > On Mon, Nov 25, 2019 at 1:41 PM Kirk Lund wrote:
> > >
> > >> I found a test that closes the cache and then recreates the cache multiple
> > >> times with a 2-second sleep between each. I tried to remove the Thread.sleep
> > >> and found that recreating the cache
> > >> throws DistributedSystemDisconnectedException (see below).
> > >>
> > >> This seems like a usability nightmare. Anyone have any ideas WHY it's this
> > >> way?
> > >>
> > >> Personally, I want Cache.close() to block until both Cache and
> > >> DistributedSystem are closed and the API is ready to create a new Cache.
> > >>
> > >> org.apache.geode.distributed.DistributedSystemDisconnectedException: This
> > >> connection to a distributed system has been disconnected.
> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.checkConnected(InternalDistributedSystem.java:945)
> > >> at org.apache.geode.distributed.internal.InternalDistributedSystem.getDistributionManager(InternalDistributedSystem.java:1665)
> > >> at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:791)
> > >> at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:187)
> > >> at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
> > >> at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
> > >>
> > >
> > >
> > > --
> > > -John
> > > john.blum10101 (skype)
> >
> >
>


Re: Odg: Odg: Restart gateway-receiver

2019-11-26 Thread Jacob Barrett
It is likely that your firewall is closing idle connections. It is very 
common for a firewall to drop the state information for a connection that 
hasn’t seen any traffic in some set period of time. It’s a way for firewalls to 
reclaim resources for connections that were likely closed on one side or the 
other without any FIN sequence.

If you don’t have any control over your firewall then we should explore a keep 
alive method.
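
As one illustration of what a keep-alive could look like at the TCP level, here is a minimal sketch using the standard java.net.Socket API. It is only an assumption that SO_KEEPALIVE would be the mechanism chosen; this thread does not say whether or how Geode exposes it for gateway-sender connections, and the class name KeepAliveSketch is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class KeepAliveSketch {
  /**
   * Opens a loopback connection, enables SO_KEEPALIVE on the client side,
   * and reports whether the option is set.
   */
  static boolean connectWithKeepAlive() {
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
      // Ask the OS to send periodic probes while the connection is idle.
      client.setKeepAlive(true);
      return client.getKeepAlive();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("SO_KEEPALIVE enabled: " + connectWithKeepAlive());
  }
}
```

Note that the probe interval is controlled by the operating system, and the default idle time before the first probe is commonly two hours on Linux, so it would likely need tuning to stay below the firewall's idle timeout.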

-Jake


> On Nov 26, 2019, at 2:06 AM, Mario Kevo  wrote:
> 
> Thanks a lot @Barry Oglesby!
> 
> It seems that the closing of inactive connections is done by Kubernetes, as we 
> run Geode on it.
> 
> BR,
> Mario
> 
> 
> From: Barry Oglesby 
> Sent: November 22, 2019, 22:35
> To: Mario Kevo 
> Cc: dev@geode.apache.org 
> Subject: Re: Odg: Restart gateway-receiver
> 
>> If we don't send any event, the connection will be closed after some time as 
>> connection is inactive.
> 
> Are you seeing this behavior? I don't think this is true by default.
> 
> AbstractRemoteGatewaySender.initProxy sets these fields on the PoolFactory:
> 
> pf.setReadTimeout(this.socketReadTimeout);
> pf.setIdleTimeout(connectionIdleTimeOut);
> 
> By default, socketReadTimeout is 0 (no timeout), and connectionIdleTimeOut is 
> -1 (disabled).
> 
> Each Event Processor thread will have its own connection to the remote site:
> 
> Event Processor for GatewaySender_ny.1: GatewaySenderEventRemoteDispatcher.initializeConnection connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@306907760
> Event Processor for GatewaySender_ny.2: GatewaySenderEventRemoteDispatcher.initializeConnection connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@608855137
> Event Processor for GatewaySender_ny.0: GatewaySenderEventRemoteDispatcher.initializeConnection connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@950613560
> Event Processor for GatewaySender_ny.4: GatewaySenderEventRemoteDispatcher.initializeConnection connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@1005378489
> Event Processor for GatewaySender_ny.3: GatewaySenderEventRemoteDispatcher.initializeConnection connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@629246640
> 
> There will be one Event Processor thread for each dispatcher thread (5 by 
> default).
> 
> There aren't any good public ways to monitor the connections other than JMX.
> 
> One way to monitor these connections is with ConnectionStats on the sender 
> side.
> 
> You can do that with vsd: ClientStats -> GatewaySenderStats -> connections
> 
> You can also do it with code like:
> 
> private int getConnectionCount(String gatewaySenderId) {
>   AbstractGatewaySender sender =
>       (AbstractGatewaySender) cache.getGatewaySender(gatewaySenderId);
>   int totalConnections = 0;
>   if (sender != null) {
>     // Sum the connection counts reported for every endpoint of the sender's pool.
>     for (ConnectionStats connectionStats :
>         sender.getProxy().getEndpointManager().getAllStats().values()) {
>       totalConnections += connectionStats.getConnections();
>     }
>     System.out.println("Sender=" + gatewaySenderId + "; connectionCount="
>         + totalConnections);
>   }
>   return totalConnections;
> }
> 
> You can also dump whether the dispatcher is connected like:
> 
> private void dumpConnected(String gatewaySenderId) {
>   AbstractGatewaySender sender =
>       (AbstractGatewaySender) cache.getGatewaySender(gatewaySenderId);
>   if (sender == null) {
>     return; // no sender with this id, mirroring the null check in getConnectionCount
>   }
>   if (sender.isParallel()) {
>     ConcurrentParallelGatewaySenderEventProcessor concurrentProcessor =
>         (ConcurrentParallelGatewaySenderEventProcessor) sender.getEventProcessor();
>     for (ParallelGatewaySenderEventProcessor processor :
>         concurrentProcessor.getProcessors()) {
>       System.out.println("Processor=" + processor + "; isConnected="
>           + processor.getDispatcher().isConnectedToRemote());
>     }
>   } else {
>     ConcurrentSerialGatewaySenderEventProcessor concurrentProcessor =
>         (ConcurrentSerialGatewaySenderEventProcessor) sender.getEventProcessor();
>     List<SerialGatewaySenderEventProcessor> processors =
>         concurrentProcessor.getProcessors();
>     for (SerialGatewaySenderEventProcessor processor : processors) {
>       System.out.println("Processor=" + processor + "; isConnected="
>           + processor.getDispatcher().isConnectedToRemote());
>     }
>   }
> }
> 
> The isConnectedToRemote method does:
> 
> return connection != null && !connection.isDestroyed();
> 
> Thanks,
> Barry Oglesby
> 
> 
> 
> On Thu, Nov 21, 2019 at 11:15 PM Mario Kevo  wrote:
> Hi,
> 
> @Barry Oglesby, thanks for the clarification.
> 
> If we don't send any events, the connection will be closed after some time
> because the connection is inactive.
> Is it possible to somehow monitor from the app if the replication is 
> established to get 

RE: Cache.close is not synchronous?

2019-11-26 Thread Alberto Bustamante Reyes
+1 for fixing it.

From: Anilkumar Gingade 
Sent: Tuesday, November 26, 2019 0:24
To: geode 
Subject: Re: Cache.close is not synchronous?

Looking at the code, cache.close() and InternalCacheBuilder.create()
are both synchronized on "GemFireCacheImpl.class"; it's the
InternalCacheBuilder create that seems to be using a reference to the old
distributed system.
GemFireCacheImpl.getInstance() and getExisting() both perform an
"isClosing" check and return early. The InternalCacheBuilder is new;
not sure if it's missing these early checks.

-Anil.

On Mon, Nov 25, 2019 at 2:47 PM Mark Hanson  wrote:

> +1 to fix.
>
> > On Nov 25, 2019, at 2:02 PM, John Blum  wrote:
> >
> > +1 ^ 64!
> >
> > I found this out the hard way some time ago, and it is why STDG exists in
> > the first place (i.e. usability issues, particularly with testing).
> >
> > On Mon, Nov 25, 2019 at 1:41 PM Kirk Lund  wrote:
> >
> >> I found a test that closes the cache and then recreates the cache multiple
> >> times with 2 second sleep between each. I tried to remove the Thread.sleep
> >> and found that recreating the cache
> >> throws DistributedSystemDisconnectedException (see below).
> >>
> >> This seems like a usability nightmare. Anyone have any ideas WHY it's this
> >> way?
> >>
> >> Personally, I want Cache.close() to block until both Cache and
> >> DistributedSystem are closed and the API is ready to create a new Cache.
> >>
> >> org.apache.geode.distributed.DistributedSystemDisconnectedException: This
> >> connection to a distributed system has been disconnected.
> >>   at org.apache.geode.distributed.internal.InternalDistributedSystem.checkConnected(InternalDistributedSystem.java:945)
> >>   at org.apache.geode.distributed.internal.InternalDistributedSystem.getDistributionManager(InternalDistributedSystem.java:1665)
> >>   at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:791)
> >>   at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:187)
> >>   at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
> >>   at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
> >>
> >
> >
> > --
> > -John
> > john.blum10101 (skype)
>
>


Re: Re: Restart gateway-receiver

2019-11-26 Thread Mario Kevo
Thanks a lot @Barry Oglesby!

It seems that these inactive connections are being closed by Kubernetes, since 
we run Geode on it.

BR,
Mario


From: Barry Oglesby 
Sent: November 22, 2019 22:35
To: Mario Kevo 
Cc: dev@geode.apache.org 
Subject: Re: Re: Restart gateway-receiver

> If we don't send any events, the connection will be closed after some time
> because the connection is inactive.

Are you seeing this behavior? I don't think this is true by default.

AbstractRemoteGatewaySender.initProxy sets these fields on the PoolFactory:

pf.setReadTimeout(this.socketReadTimeout);
pf.setIdleTimeout(connectionIdleTimeOut);

By default, socketReadTimeout is 0 (no timeout), and connectionIdleTimeOut is 
-1 (disabled).

Each Event Processor thread will have its own connection to the remote site:

Event Processor for GatewaySender_ny.1: GatewaySenderEventRemoteDispatcher.initializeConnection
  connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@306907760
Event Processor for GatewaySender_ny.2: GatewaySenderEventRemoteDispatcher.initializeConnection
  connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@608855137
Event Processor for GatewaySender_ny.0: GatewaySenderEventRemoteDispatcher.initializeConnection
  connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@950613560
Event Processor for GatewaySender_ny.4: GatewaySenderEventRemoteDispatcher.initializeConnection
  connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@1005378489
Event Processor for GatewaySender_ny.3: GatewaySenderEventRemoteDispatcher.initializeConnection
  connection=Pooled Connection to 127.0.0.1:5452: Connection[127.0.0.1:5452]@629246640

There will be one Event Processor thread for each dispatcher thread (5 by 
default).

There aren't any good public ways to monitor the connections other than JMX.

One way to monitor these connections is with ConnectionStats on the sender side.

You can do that with vsd: ClientStats -> GatewaySenderStats -> connections

You can also do it with code like:

private int getConnectionCount(String gatewaySenderId) {
  AbstractGatewaySender sender =
      (AbstractGatewaySender) cache.getGatewaySender(gatewaySenderId);
  int totalConnections = 0;
  if (sender != null) {
    for (ConnectionStats connectionStats :
        sender.getProxy().getEndpointManager().getAllStats().values()) {
      totalConnections += connectionStats.getConnections();
    }
    System.out.println("Sender=" + gatewaySenderId + "; connectionCount="
        + totalConnections);
  }
  return totalConnections;
}

You can also dump whether the dispatcher is connected like:

private void dumpConnected(String gatewaySenderId) {
  AbstractGatewaySender sender =
      (AbstractGatewaySender) cache.getGatewaySender(gatewaySenderId);
  if (sender == null) {
    return; // no sender with this id
  }
  if (sender.isParallel()) {
    ConcurrentParallelGatewaySenderEventProcessor concurrentProcessor =
        (ConcurrentParallelGatewaySenderEventProcessor) sender.getEventProcessor();
    for (ParallelGatewaySenderEventProcessor processor :
        concurrentProcessor.getProcessors()) {
      System.out.println("Processor=" + processor + "; isConnected="
          + processor.getDispatcher().isConnectedToRemote());
    }
  } else {
    ConcurrentSerialGatewaySenderEventProcessor concurrentProcessor =
        (ConcurrentSerialGatewaySenderEventProcessor) sender.getEventProcessor();
    List<SerialGatewaySenderEventProcessor> processors =
        concurrentProcessor.getProcessors();
    for (SerialGatewaySenderEventProcessor processor : processors) {
      System.out.println("Processor=" + processor + "; isConnected="
          + processor.getDispatcher().isConnectedToRemote());
    }
  }
}

The isConnectedToRemote method does:

return connection != null && !connection.isDestroyed();

Thanks,
Barry Oglesby



On Thu, Nov 21, 2019 at 11:15 PM Mario Kevo  wrote:
Hi,

@Barry Oglesby, thanks for the clarification.

If we don't send any events, the connection will be closed after some time 
because the connection is inactive.
Is it possible to monitor from the app whether the replication is established, 
so that we can tell whether there is a network problem or the connection was 
just closed due to inactivity?
Can we monitor the replication in some way other than looking at the 
"isConnected" state in JMX?

BR,
Mario

From: Barry Oglesby <bogle...@pivotal.io>
Sent: November 14, 2019 18:29
To: dev@geode.apache.org
Subject: Re: Restart gateway-receiver

Mario,

That's the current behavior. When the sender is initially started, it
attempts to connect to the receiver. If it does not connect, it won't retry
until there is a batch of events to send. Look for callers of
GatewaySenderEventRemoteDispatcher.initializeConnection to see the

Re: Proposal of new config property "ssl-server-name-extension"

2019-11-26 Thread Mario Ivanac
Hi Sai,


The security provider main class is configured through a java security file:

-Djava.security.properties=custom-security.file



Where we set:

security.provider.1=my.security.provider.class



The security provider is packaged as a .jar and added to the classpath. The 
security provider code is triggered once the Geode default context is 
initialized, so there is no room to take over the context before that.



Also, the configuration of the TLS handshake message extensions is part of the 
SSLSocket configuration. I’m not aware of a way to configure this through the 
context.
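
For reference, a minimal sketch of what setting SNI on the socket looks like 
with plain JDK classes. This is not Geode code: the class name SniExample, its 
helper methods and the host name site-b.example.com are made up for 
illustration. Note that SNIHostName only accepts a syntactically valid host 
name, so a fully generic string may need extra handling.

```java
import java.util.Collections;
import java.util.List;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SniExample {

  // Returns the given parameters with a single host_name SNI entry set.
  public static SSLParameters withServerName(SSLParameters params, String serverName) {
    List<SNIServerName> names =
        Collections.singletonList(new SNIHostName(serverName));
    params.setServerNames(names);
    return params;
  }

  // Applies the SNI value to a socket; this must happen before the handshake.
  public static void applyServerName(SSLSocket socket, String serverName) {
    socket.setSSLParameters(withServerName(socket.getSSLParameters(), serverName));
  }

  public static void main(String[] args) throws Exception {
    // Unconnected socket: nothing is sent until connect() and the handshake.
    try (SSLSocket socket =
        (SSLSocket) SSLSocketFactory.getDefault().createSocket()) {
      applyServerName(socket, "site-b.example.com"); // hypothetical host name
      System.out.println(socket.getSSLParameters().getServerNames());
    }
  }
}
```

The SNI list lives in SSLParameters, which is applied per socket (or per 
SSLEngine), which is why the sketch works on the socket rather than on the 
context.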


BR,

Mario


From: Sai Boorlagadda 
Sent: November 24, 2019 17:33
To: dev@geode.apache.org 
Subject: Re: Proposal of new config property "ssl-server-name-extension"

Hello Mario,

I would like to know whether having a custom security provider allows you to
configure the default SSL context to set the SNI.

From your proposal, I see that you have implemented a Java Security
Provider to supply a custom KeyManager implementation, which selects the
certificate based on which WAN site the peer client is connecting to.
How are you configuring this security provider? I am assuming you have some
bootstrapping code that inserts your security provider before launching
Geode, and that you also set the gemfire property `ssl-use-default-context`
to true to let Geode use the default SSL context. Can this bootstrapping code
create and configure an SSL context with SNI and set it as the default
context before launching Geode?

This may appear to be a workaround, but the rationale behind
`ssl-use-default-context` is to delegate configuring the SSL context to the
external environment, as required, and let Geode just use it.
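
As a rough sketch of the kind of bootstrapping code meant here, using only
JDK classes: the class name DefaultContextBootstrap and its method are
hypothetical, and the sketch only shows installing a default context. It does
not show setting SNI on it.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;

public class DefaultContextBootstrap {

  // Creates a TLS context and installs it as the JVM-wide default.
  public static SSLContext installDefaultContext() {
    try {
      SSLContext context = SSLContext.getInstance("TLS");
      // null arguments let the provider pick its defaults; a custom
      // KeyManager[] would be passed as the first argument instead.
      context.init(null, null, null);
      SSLContext.setDefault(context);
      return context;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Could not set up default SSL context", e);
    }
  }

  public static void main(String[] args) {
    SSLContext context = installDefaultContext();
    System.out.println("Default context protocol: " + context.getProtocol());
    // Geode would be launched after this point with
    // ssl-use-default-context=true so that it picks up this context.
  }
}
```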

Sai

On Tue, Nov 19, 2019 at 3:27 AM Mario Ivanac  wrote:

> Hi geode dev,
>
> as part of the solution for https://issues.apache.org/jira/browse/GEODE-7414
> we would like to introduce a new config property "ssl-server-name-extension".
>
> This property will contain a generic string, which will be added as the
> Server Name Indication (SNI) parameter to the Client Hello message.
>
> Do you agree with this proposal?
>
> Thanks,
> Mario
>