Re: Us vs Docker vs Gradle vs JUnit

2020-07-01 Thread Jacob Barrett


> On Jul 1, 2020, at 12:16 PM, Udo Kohlmeyer  wrote:
> 
> To think a little more left field: with our continued investment in K8s,
> maybe we can look into that area? Run tests in parallel using K8s?

It’s an interesting idea, but even more complicated than the Docker solution. 
Since k8s could schedule the job on any node, we would have to make sure the build is 
distributed to all nodes; we could do this with a shared volume or a config map. 
Result collection would be limited to the test output collected from 
log data, and getting other artifacts from the tests would be even more complicated. 
We would probably need yet another shared volume for that, or more than one. Not 
impossible, but a pretty complicated workload.

-Jake




Re: Us vs Docker vs Gradle vs JUnit

2020-07-01 Thread Udo Kohlmeyer
To think a little more left field: with our continued investment in K8s, maybe 
we can look into that area? Run tests in parallel using K8s?

But I am also supportive of fixing the tests so that we can run them in parallel 
without the extra container scaffolding.

—Udo
On Jul 1, 2020, 11:38 AM -0700, Jacob Barrett , wrote:


On Jul 1, 2020, at 11:14 AM, Kirk Lund  wrote:

I'm not a big fan of forking the Docker plugin and making it a new Geode
submodule. This approach kind of flies in the face of the intentions of OSS
in general. For example, we want folks contributing to Apache Geode rather
than forking Geode to create their own new project while never giving back
to this project.

If the original Docker plugin project refuses to maintain or take on new
contributors then some of us should help lead the creation of a new Docker
plugin repo in Github that is independent of Apache Geode. This way it
becomes a new living Docker plugin project that many contributors can
become involved in. It also becomes valuable to more projects than just
one. I've seen this happen with a number of OSS projects in which the
original maintainer of a project disappears or leaves and I think this is
generally accepted as the best approach for reviving a dead project that no
new committers can be added to.

I agree; if we want to keep using the plugin, it does make sense to open source 
the fork outside of Geode.

-Jake




Re: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Dave Barnes
Hey, Bill, you got the votes. Go ahead with the back-ports.
Thanks,
Dave

On Wed, Jul 1, 2020 at 10:54 AM Kirk Lund  wrote:

> +1
>
> On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender  wrote:
>
> > +1
> >
> > -----Original Message-----
> > From: Bruce Schuchardt 
> > Sent: Wednesday, July 1, 2020 9:49 AM
> > To: dev@geode.apache.org
> > Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
> >
> > +1
> >
> > On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:
> >
> > I'd like permission to back-port the fix for rolling upgrade bug
> > GEODE-8240
> > to support/1.12 and support/1.13
> >
> > -Bill
> >
> >
>


Re: Us vs Docker vs Gradle vs JUnit

2020-07-01 Thread Jacob Barrett



> On Jul 1, 2020, at 11:14 AM, Kirk Lund  wrote:
> 
> I'm not a big fan of forking the Docker plugin and making it a new Geode
> submodule. This approach kind of flies in the face of the intentions of OSS
> in general. For example, we want folks contributing to Apache Geode rather
> than forking Geode to create their own new project while never giving back
> to this project.
> 
> If the original Docker plugin project refuses to maintain or take on new
> contributors then some of us should help lead the creation of a new Docker
> plugin repo in Github that is independent of Apache Geode. This way it
> becomes a new living Docker plugin project that many contributors can
> become involved in. It also becomes valuable to more projects than just
> one. I've seen this happen with a number of OSS projects in which the
> original maintainer of a project disappears or leaves and I think this is
> generally accepted as the best approach for reviving a dead project that no
> new committers can be added to.

I agree; if we want to keep using the plugin, it does make sense to open source 
the fork outside of Geode.

-Jake




Re: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Mario Kevo
+1

From: Kirk Lund 
Sent: July 1, 2020 7:54 PM
To: dev@geode.apache.org 
Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13

+1

On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender  wrote:

> +1
>
> -----Original Message-----
> From: Bruce Schuchardt 
> Sent: Wednesday, July 1, 2020 9:49 AM
> To: dev@geode.apache.org
> Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
>
> +1
>
> On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:
>
> I'd like permission to back-port the fix for rolling upgrade bug
> GEODE-8240
> to support/1.12 and support/1.13
>
> -Bill
>
>


Re: Us vs Docker vs Gradle vs JUnit

2020-07-01 Thread Kirk Lund
I'm not a big fan of forking the Docker plugin and making it a new Geode
submodule. This approach kind of flies in the face of the intentions of OSS
in general. For example, we want folks contributing to Apache Geode rather
than forking Geode to create their own new project while never giving back
to this project.

If the original Docker plugin project refuses to maintain or take on new
contributors then some of us should help lead the creation of a new Docker
plugin repo in Github that is independent of Apache Geode. This way it
becomes a new living Docker plugin project that many contributors can
become involved in. It also becomes valuable to more projects than just
one. I've seen this happen with a number of OSS projects in which the
original maintainer of a project disappears or leaves and I think this is
generally accepted as the best approach for reviving a dead project that no
new committers can be added to.

On Tue, Jun 30, 2020 at 11:30 AM Jacob Barrett  wrote:

> All,
>
> We are in a bit of a pickle. As you recall, a few years back, in an effort to
> both stabilize and parallelize the integration, distributed, and other
> integration/system-like tests, we started using Docker. Many of the tests
> reused the same ports for services, which caused them to fail or interfere
> with each other when run in parallel. By using Docker to isolate each test we
> put a bandage on that issue. The plugin overrides Gradle’s default forked
> runner by starting the runners in Docker containers and marshaling the
> execution parameters to those Dockerized runners.
>
> The Docker test plugin is effectively unmaintained. The author seems content
> to keep it compatible with Gradle 4. We forked it to work with Gradle 5 and to
> address various other issues we have hit over the years. We have shared
> patches in the past with little luck getting them merged, and upstream it is
> still only compatible with Gradle 4.8 at best. I spent some time trying to
> port it to Gradle 6, but it’s going to be a larger undertaking given that
> Gradle 6 is fully Java modules compatible; they added new members throughout
> to handle modules in addition to class paths.
>
> Long story short, because our tests can’t be parallelized without a container
> system, we are stuck. We can’t go to JUnit 5 without updating the Docker
> plugin (potentially minor changes). We can’t go to Gradle 6 without updating
> the Docker plugin (potentially huge changes). Being stuck is not a good place.
> I see two paths out of this:
>
> 1) We buckle down and fix the tests so they can run in parallel via the
> normal forking mechanism of Gradle. I know some effort has already been
> expended on this by using our new rules for starting servers. We would need
> to go further.
>
> 2) Fully invest in the Docker plugin. We would need to fork this off as a
> fully maintained sub-project of Geode. We would need to add support for both
> Gradle 6 and JUnit 5.
>
> My money is on fixing the tests. It is clear, at least from my exhaustive
> searching, that nobody in the Gradle and JUnit communities is isolating their
> tests with containers. They are creating containers to host services for
> system-level testing (see the Testcontainers project); the tests themselves
> run in the local kernel space, not in a container.
>
> We made this push in the C++ and .NET tests, a much smaller set of tests,
> and it works great. The framework takes care to create clusters that do not
> interact with each other on the same host. Some things in Geode make this
> harder than others, like the http service not supporting ephemeral port
> selection, and gfsh not providing machine-readable output about ephemeral
> port selections. We use port knocking to prevent the OS from ephemerally
> assigning a port to another process: the framework knocks (opens and then
> closes) all the ports it needs for the server/locator services and then
> starts them explicitly on those ports. Because of port recycling rules in the
> OS, another ephemeral port request won’t get those ports for some time after
> they are closed. It’s not perfect, but it works. Fixing Geode to support
> ephemeral port selection, with a better reporting mechanism for those port
> choices, would be more ideal. Also, we only start the services necessary for
> the test, e.g. we don’t open the http ports if they aren’t going to be used.
>
> I would love some feedback and thoughts on this issue. Does anyone else
> see a different path forward?
>
> -Jake
>
>
>
>
>
>
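A minimal, self-contained sketch of the port-knocking idea Jake describes above, for illustration only; this is not Geode's actual test framework code, and the class name is made up:

import java.io.IOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: "knock" ports by binding and immediately closing them.
// Because of the OS port-recycling rules mentioned above, the ephemeral
// allocator is unlikely to hand the same ports to another process right away,
// so a test framework can then start locators/servers explicitly on them.
public class PortKnocker {

  public static List<Integer> knock(int count) throws IOException {
    List<ServerSocket> sockets = new ArrayList<>();
    List<Integer> ports = new ArrayList<>();
    try {
      for (int i = 0; i < count; i++) {
        ServerSocket socket = new ServerSocket(0); // port 0: let the OS pick
        sockets.add(socket);
        ports.add(socket.getLocalPort());
      }
    } finally {
      for (ServerSocket socket : sockets) {
        socket.close(); // release the port; it stays out of rotation briefly
      }
    }
    return ports;
  }
}

The returned ports would then be passed explicitly to the locator/server startup before the recycling window closes.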


Re: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Kirk Lund
+1

On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender  wrote:

> +1
>
> -----Original Message-----
> From: Bruce Schuchardt 
> Sent: Wednesday, July 1, 2020 9:49 AM
> To: dev@geode.apache.org
> Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
>
> +1
>
> On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:
>
> I'd like permission to back-port the fix for rolling upgrade bug
> GEODE-8240
> to support/1.12 and support/1.13
>
> -Bill
>
>


Re: negative ActiveCQCount

2020-07-01 Thread Kirk Lund
Yeah, https://issues.apache.org/jira/browse/GEODE-8293 sounds like a
statistic decrement bug for activeCqCount. Somewhere, each Server is
decrementing it once too many times.

You could find the statistics class containing activeCqCount and try adding
some debugging log statements, or even add some breakpoints for the debugger
if it's easily reproduced.
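To make that concrete, here is a minimal sketch of the kind of instrumentation that could be dropped in; the class and field names are hypothetical stand-ins, not Geode's actual statistics class:

import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper around the activeCqCount counter that logs every
// change (with a stack trace on decrement) so an extra decrement call site
// shows up in the server logs.
public class CqStatsInstrumentation {
  private static final Logger LOGGER = Logger.getLogger("CqStats");
  private final AtomicLong activeCqCount = new AtomicLong();

  public void incActiveCqCount() {
    LOGGER.info("activeCqCount incremented to " + activeCqCount.incrementAndGet());
  }

  public void decActiveCqCount() {
    long value = activeCqCount.decrementAndGet();
    LOGGER.log(Level.INFO, "activeCqCount decremented to " + value,
        new Exception("decrement call site"));
    if (value < 0) {
      LOGGER.warning("activeCqCount went negative: " + value);
    }
  }
}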

On Wed, Jul 1, 2020 at 5:52 AM Mario Kevo  wrote:

> Hi Kirk, thanks for the response!
>
> I just realized that I wrongly described the problem because I tried so many
> cases. Sorry!
>
> We have a system with two servers. If the redundancy is 0 then, as expected,
> activeCqCount=1 on the first server and activeCqCount=0 on the second.
> After closing the CQ we get activeCqCount=0 on the first server and
> activeCqCount=-1 on the second.
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category | Metric           | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
>
> If we set redundancy to 1 it increments properly as expected, by one on both
> servers. But when the CQ is closed we get activeCqCount=-1 on both servers,
> and the show metrics command has the following output:
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category | Metric           | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
> What I found is that when a CQ is registered on one server, that server sends
> a message with opType=REGISTER_CQ to the other servers in the system, and in
> that case a new instance of ServerCQImpl is created on the second server
> (using the empty ServerCQImpl constructor). When we close the CQ there are two
> different instances across the servers and both of them are closed, and
> because both are in the RUNNING state before closing, activeCqCount is
> decremented on both of them.
>
> BR,
> Mario
>
> 
> From: Kirk Lund 
> Sent: June 30, 2020 7:54 PM
> To: dev@geode.apache.org 
> Subject: Re: negative ActiveCQCount
>
> I think *show metrics --categories=query* is showing you the query stats
> from DistributedSystemMXBean (see
> ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean
> aggregates values across all members in the cluster, so I would have
> expected activeCQCount to initially show a value of 2 after you create a
> ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a
> value of 0.
>
> When you create a CQ on a Server, it should be reflected asynchronously on
> the CacheServerMXBean in that Server. Each Server has its own
> CacheServerMXBean. Over on the Locator (JMX Manager), the
> DistributedSystemMXBean aggregates the count of active CQs in
> ServerClusterStatsMonitor by invoking
> DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state
> is federated to the Locator (JMX Manager).
>
> Based on what I see in code and in the description on GEODE-8293, I think
> you might want to see if increment has a problem instead of decrement.
>
> I don't see anything that would limit the activeCQCount to only count the
> CQs on primaries. So, I would expect redundancy=1 to result in a value of
> 2. Does anyone else have different info about this?
>
> On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo  wrote:
>
> > Hi geode-dev,
> >
> > I have a question about CQ(
> > https://issues.apache.org/jira/browse/GEODE-8293).
> > If we run a CQ it registers the CQ on one of the servers
> > (setPoolSubscriptionRedundancy is 1) and increments activeCQCount.
> > As I understand it, the message then goes through processInputBuffer on
> > another server, where it is deserialized. If opType is REGISTER_CQ or
> > SET_CQ_STATE it will call readCq from CqServiceProvider, which in the end
> > calls the empty ServerCQImpl constructor used for deserialization.
> >
> > The problem is that when we close the CQ there is a ServerCQImpl reference
> > on both servers; both are closed, and the count is decremented on both of
> > them. In that case we get a negative value of activeCQCount in the show
> > metrics command.
> >
> > Does anyone know how, in the close method, to determine which is the
> > primary and decrement only on it?
> > Any advice is welcome!
> >
> > BR,
> > Mario
> >
>


Re: negative ActiveCQCount

2020-07-01 Thread Anilkumar Gingade
Seems like a bug to me. Can you please create a JIRA ticket?

The active CQ counts will be more meaningful at the member level; they could be 
different on different servers based on the CQs registered and the redundancy 
level set. And that helps to determine the load on each server.

-Anil. 

On 7/1/20, 5:52 AM, "Mario Kevo"  wrote:

Hi Kirk, thanks for the response!

I just realized that I wrongly described the problem because I tried so many 
cases. Sorry!

We have a system with two servers. If the redundancy is 0 then, as expected, 
activeCqCount=1 on the first server and activeCqCount=0 on the second.
After closing the CQ we get activeCqCount=0 on the first server and 
activeCqCount=-1 on the second.
gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0


If we set redundancy to 1 it increments properly as expected, by one on both 
servers. But when the CQ is closed we get activeCqCount=-1 on both servers, and 
the show metrics command has the following output:
gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

What I found is that when a CQ is registered on one server, that server sends a 
message with opType=REGISTER_CQ to the other servers in the system, and in that 
case a new instance of ServerCQImpl is created on the second server (using the 
empty ServerCQImpl constructor). When we close the CQ there are two different 
instances across the servers and both of them are closed, and because both are 
in the RUNNING state before closing, activeCqCount is decremented on both of 
them.

BR,
Mario


From: Kirk Lund 
Sent: June 30, 2020 7:54 PM
To: dev@geode.apache.org 
Subject: Re: negative ActiveCQCount

I think *show metrics --categories=query* is showing you the query stats
from DistributedSystemMXBean (see
ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean
aggregates values across all members in the cluster, so I would have
expected activeCQCount to initially show a value of 2 after you create a
ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a
value of 0.

When you create a CQ on a Server, it should be reflected asynchronously on
the CacheServerMXBean in that Server. Each Server has its own
CacheServerMXBean. Over on the Locator (JMX Manager), the
DistributedSystemMXBean aggregates the count of active CQs in
ServerClusterStatsMonitor by invoking
DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state
is federated to the Locator (JMX Manager).

Based on what I see in code and in the description on GEODE-8293, I think
you might want to see if increment has a problem instead of decrement.

I don't see anything that would limit the activeCQCount to only count the
CQs on primaries. So, I would expect redundancy=1 to result in a value of
2. Does anyone else have different info about this?

On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo  wrote:

> Hi geode-dev,
>
> I have a question about CQ(
> https://issues.apache.org/jira/browse/GEODE-8293).
> If we run a CQ it registers the CQ on one of the servers
> (setPoolSubscriptionRedundancy is 1) and increments activeCQCount.
> As I understand it, the message then goes through processInputBuffer on
> another server, where it is deserialized. If opType is REGISTER_CQ or
> SET_CQ_STATE it will call readCq from CqServiceProvider, which in the end
> calls the empty ServerCQImpl constructor used for deserialization.
>
> The problem is that when we close the CQ there is a ServerCQImpl reference on
> both servers; both are closed, and the count is decremented on both of them.
> In that case we get a negative value of activeCQCount in the show metrics
> command.
>
> Does anyone know how, in the close method, to determine which is the primary
> and decrement only on it?
> Any advice is welcome!
>
> BR,
> Mario
>



RE: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Dick Cavender
+1

-----Original Message-----
From: Bruce Schuchardt  
Sent: Wednesday, July 1, 2020 9:49 AM
To: dev@geode.apache.org
Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13

+1

On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240
to support/1.12 and support/1.13

-Bill



Re: [PROPOSAL] merge GEODE-8259 to support branches

2020-07-01 Thread Dave Barnes
OK, Gester, please merge.
Thanks,
Dave

On Wed, Jul 1, 2020 at 8:33 AM Bruce Schuchardt  wrote:

> +1
> I reviewed this PR and, as Gester said, it's low risk. If it fixes a
> problem someone is having, let's backport it.
>
> On 6/30/20, 3:51 PM, "Xiaojian Zhou"  wrote:
>
> Customer encountered a singlehop getAll failure due to a
> SerializationException which is identified as a socket error. The
> solution is to retry the getAll in this race condition (currently we do
> not retry).
>
>
> The fix is tested in both develop and support branches. The fix is
> conservative and very low risk.
>
>
>
> So it would be nice to bring it in before the 1.13.0 release.
>
>
>
> Regards
>
> Xiaojian Zhou
>
>


Re: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Bruce Schuchardt
+1

On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240
to support/1.12 and support/1.13

-Bill



Re: Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Owen Nichols
I see this fix has been well-received on develop, and getting rolling upgrade 
right definitely sounds critical to me!

+1

On 7/1/20, 9:43 AM, "Bill Burcham"  wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240
to support/1.12 and support/1.13

-Bill



Back-Port GEODE-8240 to 1.12, 1.13

2020-07-01 Thread Bill Burcham
I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240
to support/1.12 and support/1.13

-Bill


Re: [PROPOSAL] merge GEODE-8259 to support branches

2020-07-01 Thread Bruce Schuchardt
+1
I reviewed this PR and, as Gester said, it's low risk. If it fixes a problem 
someone is having, let's backport it.

On 6/30/20, 3:51 PM, "Xiaojian Zhou"  wrote:

Customer encountered a singlehop getAll failure due to a SerializationException 
which is identified as a socket error. The solution is to retry the getAll in 
this race condition (currently we do not retry).


The fix is tested in both develop and support branches. The fix is
conservative and very low risk.



So it would be nice to bring it in before the 1.13.0 release.



Regards

Xiaojian Zhou



Re: negative ActiveCQCount

2020-07-01 Thread Mario Kevo
Hi Kirk, thanks for the response!

I just realized that I wrongly described the problem because I tried so many 
cases. Sorry!

We have a system with two servers. If the redundancy is 0 then, as expected, 
activeCqCount=1 on the first server and activeCqCount=0 on the second.
After closing the CQ we get activeCqCount=0 on the first server and 
activeCqCount=-1 on the second.
gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0


If we set redundancy to 1 it increments properly as expected, by one on both 
servers. But when the CQ is closed we get activeCqCount=-1 on both servers, and 
the show metrics command has the following output:
gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

What I found is that when a CQ is registered on one server, that server sends a 
message with opType=REGISTER_CQ to the other servers in the system, and in that 
case a new instance of ServerCQImpl is created on the second server (using the 
empty ServerCQImpl constructor). When we close the CQ there are two different 
instances across the servers and both of them are closed, and because both are 
in the RUNNING state before closing, activeCqCount is decremented on both of 
them.

BR,
Mario
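
One possible direction for Mario's question, sketched here purely as an illustration: tie the decrement to the instance that actually incremented the counter, so a ServerCQImpl created only to deserialize a REGISTER_CQ message never decrements a count it never incremented. All class and method names below are hypothetical, not Geode's actual code:

// Hypothetical guard: only the instance that incremented activeCqCount on
// registration is allowed to decrement it on close.
public class CqCountGuard {

  /** Minimal stats interface assumed for this sketch. */
  public interface CqStats {
    void incActiveCqCount();
    void decActiveCqCount();
  }

  private final CqStats stats;
  private boolean countedAsActive; // did *this* instance increment the count?

  public CqCountGuard(CqStats stats) {
    this.stats = stats;
  }

  // Called when the CQ is registered and set RUNNING on the originating server.
  public void markRunning() {
    if (!countedAsActive) {
      stats.incActiveCqCount();
      countedAsActive = true;
    }
  }

  // Called from close(); a deserialization-only instance never decrements.
  public void close() {
    if (countedAsActive) {
      stats.decActiveCqCount();
      countedAsActive = false;
    }
  }
}

An alternative is to decrement only on the server hosting the primary, but that requires the close path to know which member is primary, which is exactly what the question asks.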


From: Kirk Lund 
Sent: June 30, 2020 7:54 PM
To: dev@geode.apache.org 
Subject: Re: negative ActiveCQCount

I think *show metrics --categories=query* is showing you the query stats
from DistributedSystemMXBean (see
ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean
aggregates values across all members in the cluster, so I would have
expected activeCQCount to initially show a value of 2 after you create a
ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a
value of 0.

When you create a CQ on a Server, it should be reflected asynchronously on
the CacheServerMXBean in that Server. Each Server has its own
CacheServerMXBean. Over on the Locator (JMX Manager), the
DistributedSystemMXBean aggregates the count of active CQs in
ServerClusterStatsMonitor by invoking
DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state
is federated to the Locator (JMX Manager).

Based on what I see in code and in the description on GEODE-8293, I think
you might want to see if increment has a problem instead of decrement.

I don't see anything that would limit the activeCQCount to only count the
CQs on primaries. So, I would expect redundancy=1 to result in a value of
2. Does anyone else have different info about this?
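
As an aside to the aggregation point above, here is a minimal sketch of reading the aggregated value over JMX from the locator acting as JMX manager. The ObjectName and attribute name are assumptions for illustration, not verified against the Geode source, and 1099 is assumed to be the JMX manager port:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connects to the JMX manager and reads the cluster-wide active CQ count.
public class ReadActiveCqCount {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection connection = connector.getMBeanServerConnection();
      ObjectName distributedSystem =
          new ObjectName("GemFire:service=System,type=Distributed"); // assumed MBean name
      Object count = connection.getAttribute(distributedSystem, "ActiveCQCount"); // assumed attribute
      System.out.println("activeCQCount = " + count);
    }
  }
}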

On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo  wrote:

> Hi geode-dev,
>
> I have a question about CQ(
> https://issues.apache.org/jira/browse/GEODE-8293).
> If we run a CQ it registers the CQ on one of the servers
> (setPoolSubscriptionRedundancy is 1) and increments activeCQCount.
> As I understand it, the message then goes through processInputBuffer on
> another server, where it is deserialized. If opType is REGISTER_CQ or
> SET_CQ_STATE it will call readCq from CqServiceProvider, which in the end
> calls the empty ServerCQImpl constructor used for deserialization.
>
> The problem is that when we close the CQ there is a ServerCQImpl reference on
> both servers; both are closed, and the count is decremented on both of them.
> In that case we get a negative value of activeCQCount in the show metrics
> command.
>
> Does anyone know how, in the close method, to determine which is the primary
> and decrement only on it?
> Any advice is welcome!
>
> BR,
> Mario
>