Re: Us vs Docker vs Gradle vs JUnit
> On Jul 1, 2020, at 12:16 PM, Udo Kohlmeyer wrote:
>
> To think a little more left field, with our continued investment in K8s, maybe we can look into that area? Run tests in parallel using K8s?

It's an interesting idea, but even more complicated than the Docker solution. Since k8s could schedule the job on any node, we have to make sure the build is distributed to all nodes. We could do this with a shared volume or config map. Result collection would be limited to just the test output collected from log data; getting other artifacts from the tests would be even more complicated. We'd probably need yet another shared volume for that, or more than one. Not impossible, but a pretty complicated workload.

-Jake
Re: Us vs Docker vs Gradle vs JUnit
To think a little more left field, with our continued investment in K8s, maybe we can look into that area? Run tests in parallel using K8s?

But I am also supportive of fixing the tests so that we can run them in parallel without the extra container scaffolding.

—Udo

On Jul 1, 2020, 11:38 AM -0700, Jacob Barrett wrote:

On Jul 1, 2020, at 11:14 AM, Kirk Lund wrote:

I'm not a big fan of forking the Docker plugin and making it a new Geode submodule. This approach kind of flies in the face of the intentions of OSS in general. For example, we want folks contributing to Apache Geode rather than forking Geode to create their own new project while never giving back to this project.

If the original Docker plugin project refuses to maintain or take on new contributors, then some of us should help lead the creation of a new Docker plugin repo on GitHub that is independent of Apache Geode. This way it becomes a new living Docker plugin project that many contributors can become involved in. It also becomes valuable to more projects than just one. I've seen this happen with a number of OSS projects in which the original maintainer of a project disappears or leaves, and I think this is generally accepted as the best approach for reviving a dead project that no new committers can be added to.

I agree, if we want to keep using the plugin it does make sense to open source the fork outside of Geode.

-Jake
Re: Back-Port GEODE-8240 to 1.12, 1.13
Hey, Bill, you got the votes. Go ahead with the back-ports.

Thanks,
Dave

On Wed, Jul 1, 2020 at 10:54 AM Kirk Lund wrote:
> +1
>
> On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender wrote:
>
> > +1
> >
> > -----Original Message-----
> > From: Bruce Schuchardt
> > Sent: Wednesday, July 1, 2020 9:49 AM
> > To: dev@geode.apache.org
> > Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
> >
> > +1
> >
> > On 7/1/20, 9:43 AM, "Bill Burcham" wrote:
> >
> > I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13
> >
> > -Bill
Re: Us vs Docker vs Gradle vs JUnit
> On Jul 1, 2020, at 11:14 AM, Kirk Lund wrote:
>
> I'm not a big fan of forking the Docker plugin and making it a new Geode submodule. This approach kind of flies in the face of the intentions of OSS in general. For example, we want folks contributing to Apache Geode rather than forking Geode to create their own new project while never giving back to this project.
>
> If the original Docker plugin project refuses to maintain or take on new contributors, then some of us should help lead the creation of a new Docker plugin repo on GitHub that is independent of Apache Geode. This way it becomes a new living Docker plugin project that many contributors can become involved in. It also becomes valuable to more projects than just one. I've seen this happen with a number of OSS projects in which the original maintainer of a project disappears or leaves, and I think this is generally accepted as the best approach for reviving a dead project that no new committers can be added to.

I agree, if we want to keep using the plugin it does make sense to open source the fork outside of Geode.

-Jake
Re: Back-Port GEODE-8240 to 1.12, 1.13
+1

From: Kirk Lund
Sent: July 1, 2020 7:54 PM
To: dev@geode.apache.org
Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13

+1

On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender wrote:
> +1
>
> -----Original Message-----
> From: Bruce Schuchardt
> Sent: Wednesday, July 1, 2020 9:49 AM
> To: dev@geode.apache.org
> Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
>
> +1
>
> On 7/1/20, 9:43 AM, "Bill Burcham" wrote:
>
> I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13
>
> -Bill
Re: Us vs Docker vs Gradle vs JUnit
I'm not a big fan of forking the Docker plugin and making it a new Geode submodule. This approach kind of flies in the face of the intentions of OSS in general. For example, we want folks contributing to Apache Geode rather than forking Geode to create their own new project while never giving back to this project.

If the original Docker plugin project refuses to maintain or take on new contributors, then some of us should help lead the creation of a new Docker plugin repo on GitHub that is independent of Apache Geode. This way it becomes a new living Docker plugin project that many contributors can become involved in. It also becomes valuable to more projects than just one. I've seen this happen with a number of OSS projects in which the original maintainer of a project disappears or leaves, and I think this is generally accepted as the best approach for reviving a dead project that no new committers can be added to.

On Tue, Jun 30, 2020 at 11:30 AM Jacob Barrett wrote:

> All,
>
> We are in a bit of a pickle. As you recall, a few years back, in an effort to both stabilize and parallelize integration, distributed, and other integration/system-like tests, we started using Docker. Many of the tests reused the same ports for services, which caused them to fail or interact with each other when run in parallel. By using Docker to isolate each test we put a bandage on that issue. The plugin overrides Gradle's default forked runner by starting the runners in Docker containers and marshaling the execution parameters to those Dockerized runners.
>
> The Docker test plugin is effectively unmaintained. The author seems content on keeping it compatible with Gradle 4. We forked it to work with Gradle 5 and to fix various other issues we have hit over the years. We have shared patches in the past with little luck in having them merged, and still it's only compatible with Gradle 4.8 at best. I spent some time trying to port it to Gradle 6, but it's going to be a larger undertaking given that Gradle 6 is fully Java modules compatible. They added new members throughout to handle modules in addition to class paths.
>
> Long story short, because our tests can't be parallelized without a container system, we are stuck. We can't go to JUnit 5 without updating the Docker plugin (potentially minor changes). We can't go to Gradle 6 without updating the Docker plugin (potentially huge changes). Being stuck is not a good place. I see two paths out of this:
>
> 1) We buckle down and fix the tests so they can run in parallel via the normal forking mechanism of Gradle. I know some effort has been expended on this by using our new rules for starting servers. We would need to go further.
>
> 2) Fully invest in the Docker plugin. We would need to fork this off as a fully maintained sub-project of Geode. We would need to add support for both Gradle 6 and JUnit 5.
>
> My money is on fixing the tests. It is clear, at least from my exhaustive searching, that nobody in the Gradle and JUnit communities is isolating their tests with containers. They are creating containers to host services for system-level testing (see the Testcontainers project). The tests themselves run in the local kernel space (not in a container).
>
> We made this push in the C++ and .NET tests, a much smaller set of tests, and it works great. The framework takes care to create clusters that do not interact with each other on the same host. Some things in Geode make this harder than others, like the http service not supporting ephemeral port selection, and gfsh not providing machine-readable output about ephemeral port selections. We use port knocking to prevent the OS from assigning the port ephemerally to another process. The framework knocks (opens and then closes) all the ports it needs for the server/locator services, then starts them explicitly on those ports. Because of port-recycling rules in the OS, another ephemeral port request won't get those ports for some time after they are closed. It's not perfect, but it works. Fixing Geode to support ephemeral port selection and better reporting mechanisms for those port choices would be more ideal. Also, we only start the services necessary for the test; e.g., we don't start the http ports if they aren't going to be used.
>
> I would love some feedback and thoughts on this issue. Does anyone else see a different path forward?
>
> -Jake
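The port-knocking approach Jake describes can be sketched in a few lines of Java. This is illustrative only, not Geode's actual test-framework code: binding a ServerSocket to port 0 asks the OS for a free ephemeral port, and closing it immediately relies on the OS's port-recycling (TIME_WAIT) behavior to keep other ephemeral requests off that port long enough to start the server on it explicitly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

public class PortKnock {
    // "Knock" a port: bind to port 0 so the OS picks a free ephemeral port,
    // then close the socket immediately. The OS will avoid handing the same
    // port to another ephemeral request for a while, so the caller can start
    // a service explicitly on it. Racy, but good enough in practice.
    public static int knockPort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int locatorPort = knockPort();
        int serverPort = knockPort();
        // A test framework would now launch the locator and server explicitly
        // on these ports, e.g. "start locator --port=" + locatorPort.
        System.out.println("locator=" + locatorPort + " server=" + serverPort);
    }
}
```

As the email notes, this is a workaround; having the services themselves select ephemeral ports and report them in machine-readable form would remove the race entirely.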
Re: Back-Port GEODE-8240 to 1.12, 1.13
+1

On Wed, Jul 1, 2020 at 9:59 AM Dick Cavender wrote:
> +1
>
> -----Original Message-----
> From: Bruce Schuchardt
> Sent: Wednesday, July 1, 2020 9:49 AM
> To: dev@geode.apache.org
> Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13
>
> +1
>
> On 7/1/20, 9:43 AM, "Bill Burcham" wrote:
>
> I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13
>
> -Bill
Re: negative ActiveCQCount
Yeah, https://issues.apache.org/jira/browse/GEODE-8293 sounds like a statistic decrement bug for activeCqCount. Somewhere, each server is decrementing it once too many times. You could find the statistics class containing activeCqCount and try adding some debugging log statements, or even add some breakpoints for the debugger if it's easily reproduced.

On Wed, Jul 1, 2020 at 5:52 AM Mario Kevo wrote:

> Hi Kirk, thanks for the response!
>
> I just realized that I wrongly described the problem, as I had tried so many cases. Sorry!
>
> We have a system with two servers. If the redundancy is 0, then we correctly see activeCqCount=1 on the first server and activeCqCount=0 on the second. After closing the CQ we get activeCqCount=0 on the first server and activeCqCount=-1 on the second:
>
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category | Metric           | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
> In case we set redundancy to 1, it increments properly as expected, on both servers by one. But when the CQ is closed we get activeCqCount=-1 on both servers, and the show metrics command has the following output:
>
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category | Metric           | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
> What I found is that when a server registers a CQ, it sends a message to the other servers in the system with opType=REGISTER_CQ, and in that case a new instance of ServerCqImpl is created on the second server (with the empty constructor of ServerCqImpl). When we close the CQ there are two different instances on the servers, and it closes both of them; but as they are in the RUNNING state before closing, it decrements activeCqCount on both of them.
>
> BR,
> Mario
>
> From: Kirk Lund
> Sent: June 30, 2020 7:54 PM
> To: dev@geode.apache.org
> Subject: Re: negative ActiveCQCount
>
> I think *show metrics --categories=query* is showing you the query stats from DistributedSystemMXBean (see ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean aggregates values across all members in the cluster, so I would have expected activeCQCount to initially show a value of 2 after you create a ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a value of 0.
>
> When you create a CQ on a Server, it should be reflected asynchronously on the CacheServerMXBean in that Server. Each Server has its own CacheServerMXBean. Over on the Locator (JMX Manager), the DistributedSystemMXBean aggregates the count of active CQs in ServerClusterStatsMonitor by invoking DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state is federated to the Locator (JMX Manager).
>
> Based on what I see in code and in the description on GEODE-8293, I think you might want to see if increment has a problem instead of decrement.
>
> I don't see anything that would limit the activeCQCount to only count the CQs on primaries. So, I would expect redundancy=1 to result in a value of 2. Does anyone else have different info about this?
>
> On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo wrote:
>
> > Hi geode-dev,
> >
> > I have a question about CQs (https://issues.apache.org/jira/browse/GEODE-8293). If we run a CQ, it registers the cq on one of the servers (setPoolSubscriptionRedundancy is 1) and increments activeCQCount. As I understand it, the message is then passed via processInputBuffer to another server, where it is deserialized. If opType is REGISTER_CQ or SET_CQ_STATE, it will call readCq from CqServiceProvider, which at the end calls the empty constructor of ServerCQImpl that is used for deserialization.
> >
> > The problem is that when we close the CQ, there is a ServerCqImpl reference on both servers; it closes both of them and decrements on both of them. In that case we get a negative value for activeCQCount in the show metrics command.
> >
> > Does anyone know how, in the close method, to tell which is the primary and only decrement on it? Any advice is welcome!
> >
> > BR,
> > Mario
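The double decrement described in this thread can be avoided if the stat is decremented only by the instance that actually incremented it. A minimal sketch of that idea follows; the class and method names are illustrative, not Geode's actual ServerCqImpl or statistics classes:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CqCounterSketch {
    // Stand-in for the activeCqCount statistic.
    public static final AtomicInteger activeCqCount = new AtomicInteger();

    public static class ServerCq {
        private boolean counted; // true only on the server that registered the CQ
        private boolean running;

        // Local registration path: increments the stat and remembers it did.
        public void register() {
            activeCqCount.incrementAndGet();
            counted = true;
            running = true;
        }

        // Copy created by deserializing a REGISTER_CQ message (the empty
        // constructor path): tracks state but never touches the stat.
        public void registerFromMessage() {
            running = true;
        }

        public void close() {
            if (running) {
                running = false;
                // Decrement exactly where we incremented; the replicated
                // copy has counted == false, so its close() is a stat no-op.
                if (counted) {
                    counted = false;
                    activeCqCount.decrementAndGet();
                }
            }
        }
    }

    public static void main(String[] args) {
        ServerCq primary = new ServerCq();
        ServerCq replicatedCopy = new ServerCq();
        primary.register();
        replicatedCopy.registerFromMessage();
        primary.close();
        replicatedCopy.close();
        // The counter returns to 0 instead of going negative.
        if (activeCqCount.get() != 0) throw new AssertionError();
        System.out.println("activeCqCount = " + activeCqCount.get());
    }
}
```

This sidesteps the "which server is primary?" question at close time: each instance simply undoes only its own increment.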
Re: negative ActiveCQCount
Seems like a bug to me. Can you please create a JIRA ticket?

The active CQ counts will be more meaningful at the member level; they could be different on different servers based on the CQs registered and the redundancy level set. And that helps to determine the load on each server.

-Anil.

On 7/1/20, 5:52 AM, "Mario Kevo" wrote:

Hi Kirk, thanks for the response!

I just realized that I wrongly described the problem, as I had tried so many cases. Sorry!

We have a system with two servers. If the redundancy is 0, then we correctly see activeCqCount=1 on the first server and activeCqCount=0 on the second. After closing the CQ we get activeCqCount=0 on the first server and activeCqCount=-1 on the second:

gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

In case we set redundancy to 1, it increments properly as expected, on both servers by one. But when the CQ is closed we get activeCqCount=-1 on both servers, and the show metrics command has the following output:

gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

What I found is that when a server registers a CQ, it sends a message to the other servers in the system with opType=REGISTER_CQ, and in that case a new instance of ServerCqImpl is created on the second server (with the empty constructor of ServerCqImpl). When we close the CQ there are two different instances on the servers, and it closes both of them; but as they are in the RUNNING state before closing, it decrements activeCqCount on both of them.

BR,
Mario

From: Kirk Lund
Sent: June 30, 2020 7:54 PM
To: dev@geode.apache.org
Subject: Re: negative ActiveCQCount

I think *show metrics --categories=query* is showing you the query stats from DistributedSystemMXBean (see ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean aggregates values across all members in the cluster, so I would have expected activeCQCount to initially show a value of 2 after you create a ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a value of 0.

When you create a CQ on a Server, it should be reflected asynchronously on the CacheServerMXBean in that Server. Each Server has its own CacheServerMXBean. Over on the Locator (JMX Manager), the DistributedSystemMXBean aggregates the count of active CQs in ServerClusterStatsMonitor by invoking DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state is federated to the Locator (JMX Manager).

Based on what I see in code and in the description on GEODE-8293, I think you might want to see if increment has a problem instead of decrement.

I don't see anything that would limit the activeCQCount to only count the CQs on primaries. So, I would expect redundancy=1 to result in a value of 2. Does anyone else have different info about this?

On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo wrote:

> Hi geode-dev,
>
> I have a question about CQs (https://issues.apache.org/jira/browse/GEODE-8293). If we run a CQ, it registers the cq on one of the servers (setPoolSubscriptionRedundancy is 1) and increments activeCQCount. As I understand it, the message is then passed via processInputBuffer to another server, where it is deserialized. If opType is REGISTER_CQ or SET_CQ_STATE, it will call readCq from CqServiceProvider, which at the end calls the empty constructor of ServerCQImpl that is used for deserialization.
>
> The problem is that when we close the CQ, there is a ServerCqImpl reference on both servers; it closes both of them and decrements on both of them. In that case we get a negative value for activeCQCount in the show metrics command.
>
> Does anyone know how, in the close method, to tell which is the primary and only decrement on it? Any advice is welcome!
>
> BR,
> Mario
RE: Back-Port GEODE-8240 to 1.12, 1.13
+1

-----Original Message-----
From: Bruce Schuchardt
Sent: Wednesday, July 1, 2020 9:49 AM
To: dev@geode.apache.org
Subject: Re: Back-Port GEODE-8240 to 1.12, 1.13

+1

On 7/1/20, 9:43 AM, "Bill Burcham" wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13

-Bill
Re: [PROPOSAL] merge GEODE-8259 to support branches
OK, Gester, please merge.

Thanks,
Dave

On Wed, Jul 1, 2020 at 8:33 AM Bruce Schuchardt wrote:

> +1
> I reviewed this PR and, as Gester said, it's low risk. If it fixes a problem someone is having, let's backport it.
>
> On 6/30/20, 3:51 PM, "Xiaojian Zhou" wrote:
>
> A customer encountered a single-hop getAll failure due to a SerializationException, which is identified as a socket error. The solution is to retry the getAll in this race condition (currently we do not retry).
>
> The fix is tested on both the develop and support branches. The fix is conservative and very low risk, so it would be nice to bring it in before the 1.13.0 release.
>
> Regards,
> Xiaojian Zhou
Re: Back-Port GEODE-8240 to 1.12, 1.13
+1

On 7/1/20, 9:43 AM, "Bill Burcham" wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13

-Bill
Re: Back-Port GEODE-8240 to 1.12, 1.13
I see this fix has been well-received on develop, and getting rolling upgrade right definitely sounds critical to me! +1

On 7/1/20, 9:43 AM, "Bill Burcham" wrote:

I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13

-Bill
Back-Port GEODE-8240 to 1.12, 1.13
I'd like permission to back-port the fix for rolling upgrade bug GEODE-8240 to support/1.12 and support/1.13

-Bill
Re: [PROPOSAL] merge GEODE-8259 to support branches
+1

I reviewed this PR and, as Gester said, it's low risk. If it fixes a problem someone is having, let's backport it.

On 6/30/20, 3:51 PM, "Xiaojian Zhou" wrote:

A customer encountered a single-hop getAll failure due to a SerializationException, which is identified as a socket error. The solution is to retry the getAll in this race condition (currently we do not retry).

The fix is tested on both the develop and support branches. The fix is conservative and very low risk, so it would be nice to bring it in before the 1.13.0 release.

Regards,
Xiaojian Zhou
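The proposed behavior, retrying the getAll when the transient failure is seen, can be sketched generically. The helper below is illustrative only, not the actual GEODE-8259 patch; in Geode the operation would be the single-hop getAll and the transient failure the SerializationException caused by the socket error:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class GetAllRetry {
    // Run an operation, retrying a bounded number of times on failure.
    // Treats any exception from the operation as transient; a real
    // implementation would retry only on the specific failure it expects.
    public static <T> T withRetry(Callable<T> op, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // transient failure: fall through and retry
            }
        }
        throw new RuntimeException("failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        // Simulated flaky getAll: fails twice, then succeeds.
        int[] calls = {0};
        Map<String, String> result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated SerializationException");
            }
            Map<String, String> values = new HashMap<>();
            values.put("key1", "value1");
            return values;
        }, 5);
        System.out.println("succeeded after " + calls[0] + " attempts: " + result);
    }
}
```

Bounding the attempts is what keeps the fix conservative: a persistent failure still surfaces instead of looping forever.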
Re: negative ActiveCQCount
Hi Kirk, thanks for the response!

I just realized that I wrongly described the problem, as I had tried so many cases. Sorry!

We have a system with two servers. If the redundancy is 0, then we correctly see activeCqCount=1 on the first server and activeCqCount=0 on the second. After closing the CQ we get activeCqCount=0 on the first server and activeCqCount=-1 on the second:

gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

In case we set redundancy to 1, it increments properly as expected, on both servers by one. But when the CQ is closed we get activeCqCount=-1 on both servers, and the show metrics command has the following output:

gfsh>show metrics --categories=query
Cluster-wide Metrics

Category | Metric           | Value
-------- | ---------------- | -----
query    | activeCQCount    | -1
         | queryRequestRate | 0.0

What I found is that when a server registers a CQ, it sends a message to the other servers in the system with opType=REGISTER_CQ, and in that case a new instance of ServerCqImpl is created on the second server (with the empty constructor of ServerCqImpl). When we close the CQ there are two different instances on the servers, and it closes both of them; but as they are in the RUNNING state before closing, it decrements activeCqCount on both of them.

BR,
Mario

From: Kirk Lund
Sent: June 30, 2020 7:54 PM
To: dev@geode.apache.org
Subject: Re: negative ActiveCQCount

I think *show metrics --categories=query* is showing you the query stats from DistributedSystemMXBean (see ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean aggregates values across all members in the cluster, so I would have expected activeCQCount to initially show a value of 2 after you create a ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a value of 0.

When you create a CQ on a Server, it should be reflected asynchronously on the CacheServerMXBean in that Server. Each Server has its own CacheServerMXBean. Over on the Locator (JMX Manager), the DistributedSystemMXBean aggregates the count of active CQs in ServerClusterStatsMonitor by invoking DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state is federated to the Locator (JMX Manager).

Based on what I see in code and in the description on GEODE-8293, I think you might want to see if increment has a problem instead of decrement.

I don't see anything that would limit the activeCQCount to only count the CQs on primaries. So, I would expect redundancy=1 to result in a value of 2. Does anyone else have different info about this?

On Tue, Jun 30, 2020 at 5:31 AM Mario Kevo wrote:

> Hi geode-dev,
>
> I have a question about CQs (https://issues.apache.org/jira/browse/GEODE-8293). If we run a CQ, it registers the cq on one of the servers (setPoolSubscriptionRedundancy is 1) and increments activeCQCount. As I understand it, the message is then passed via processInputBuffer to another server, where it is deserialized. If opType is REGISTER_CQ or SET_CQ_STATE, it will call readCq from CqServiceProvider, which at the end calls the empty constructor of ServerCQImpl that is used for deserialization.
>
> The problem is that when we close the CQ, there is a ServerCqImpl reference on both servers; it closes both of them and decrements on both of them. In that case we get a negative value for activeCQCount in the show metrics command.
>
> Does anyone know how, in the close method, to tell which is the primary and only decrement on it? Any advice is welcome!
>
> BR,
> Mario