Re: Flaky test caused by missing JDK dependency

2020-07-17 Thread Kirk Lund
Closing out this discussion thread about GEODE-6183:

We believe that the machine performed an update of Java during the test.
This made tools.jar unavailable only while this test was executing. Many
other tests that use the Attach API passed in this overall CI job, which
suggests that it was only a momentary problem.

I'm not sure how easy or practical it is to turn off Java updates on
instances running CI jobs. I'll leave that to others to decide.

On Wed, Jul 8, 2020 at 1:30 PM Kirk Lund  wrote:

> The Attach API is optional for users running the product.
>
> The Attach API is required to compile the classes that use the Attach API
> and to run tests that cover this feature (such as "--pid").
>
> On Wed, Jul 8, 2020 at 12:11 PM Anthony Baker  wrote:
>
>> I thought we made the dependency on the Attach API optional when we added
>> support for JDK 11?
>>
>> Anthony
>>
>>
>> > On Jul 8, 2020, at 10:17 AM, Kirk Lund  wrote:
>> >
>> > To transition away from Attach API, the community needs a proposal to do so
>> > and we'll need to deprecate the GFSH options that depend on Attach API such
>> > as "--pid" in "status server --pid 20938". Even then we're looking at a
>> > minimum of one major release before we can remove options after they are
>> > deprecated.
>> >
>> > We haven't had a major release in 4+ years so don't hold your breath! :)
>> >
>> > On Wed, Jul 8, 2020 at 9:59 AM Sean Goller  wrote:
>> >
>> >> The Liberica JDK does not include the Attach API. I'm investigating why.
>> >> Given the inherent insecurity of this API, I recommend we transition away
>> >> from using it.
>> >> 
>> >> From: Kirk Lund 
>> >> Sent: Monday, July 6, 2020 10:36 AM
>> >> To: dev@geode.apache.org 
>> >> Subject: Flaky test caused by missing JDK dependency
>> >>
>> >> CI Failure:
>> >> LocatorLauncherRemoteFileIntegrationTest.startDeletesStaleControlFiles
>> >> failed with ConditionTimeoutException
>> >>
>> >>
>> >> https://issues.apache.org/jira/browse/GEODE-6183
>> >>
>> >> I've debugged the latest occurrence of GEODE-6183 (intermittent failure
>> >> in CI):
>> >>
>> >>
>> >> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-support-1-13-main/jobs/WindowsCoreIntegrationTestOpenJDK11/builds/34
>> >>
>> >> The underlying cause is a missing dependency: the Attach API. In the Oracle
>> >> JDK, the Attach API is found in the JAVA_HOME/lib/tools.jar. In some JDKs,
>> >> including LibericaJDK, there may not be a tools.jar or it may be missing
>> >> from our image of specific JDKs or a specific OS. I confirmed that the
>> >> Attach API is actually inside a different .jar on some Mac releases of the
>> >> JDK.
>> >>
>> >> Other than JDK differences, I'm not sure why tools.jar would intermittently
>> >> be missing from our testing image, but that's definitely the cause of
>> >> WindowsCoreIntegrationTestOpenJDK11 failing. I've reviewed a couple of other
>> >> older runs and it was the same intermittent cause of failure.
>> >>
>> >> Does anyone know if LibericaJDK includes tools.jar or the Attach API?
>> >>
>> >> Does anyone know how to verify that our images all have tools.jar or its
>> >> equivalent?
>> >>
>> >> java.util.ServiceConfigurationError: com.sun.tools.attach.spi.AttachProvider: Provider sun.tools.attach.WindowsAttachProvider not found
>> >>     at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:588)
>> >>     at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1211)
>> >>     at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1220)
>> >>     at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1264)
>> >>     at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1299)
>> >>     at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1384)
>> >>     at jdk.attach/com.sun.tools.attach.spi.AttachProvider.providers(AttachProvider.java:258)
>> >>     at jdk.attach/com.sun.tools.attach.VirtualMachine.list(VirtualMachine.java:144)
>> >>     at org.apache.geode.internal.process.AttachProcessUtils.isProcessAlive(AttachProcessUtils.java:35)
>> >>     at org.apache.geode.internal.process.ProcessUtils.isProcessAlive(ProcessUtils.java:99)
>> >> at
>> >>
>> >>
>> 
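
The trace in this thread fails inside VirtualMachine.list(), during the
AttachProvider lookup. As a rough illustration of how code could detect that
condition up front, here is a minimal sketch. AttachApiProbe and both of its
methods are hypothetical names and are not part of Geode. The sketch assumes
Java 9 or later for ProcessHandle, and it uses reflection so that it still
compiles on a JDK that ships without the Attach API.

```java
/** Hypothetical helper for illustration only; not a Geode class. */
final class AttachApiProbe {

  /**
   * Returns true if the Attach API can be loaded and its provider lookup succeeds.
   * Reflection keeps this class compilable on JDKs that lack the Attach API.
   */
  static boolean isAttachApiUsable() {
    try {
      Class<?> vm = Class.forName("com.sun.tools.attach.VirtualMachine");
      // list() iterates the AttachProvider ServiceLoader, the step that failed in the trace.
      vm.getMethod("list").invoke(null);
      return true;
    } catch (ReflectiveOperationException | LinkageError e) {
      // ClassNotFoundException means the API is absent. InvocationTargetException wraps
      // anything thrown by list(), including ServiceConfigurationError.
      return false;
    }
  }

  /** Liveness check that avoids the Attach API entirely (assumes Java 9 or later). */
  static boolean isProcessAlive(long pid) {
    return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
  }

  public static void main(String[] args) {
    long self = ProcessHandle.current().pid();
    System.out.println("Attach API usable: " + isAttachApiUsable());
    System.out.println("Current process alive: " + isProcessAlive(self));
  }
}
```

A probe like this only reports the state at the moment it runs, so it would
not by itself prevent a failure caused by the JDK changing mid-test.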

Re: Non-persistent parallel gateway sender on non-persistent region (collocated with persistent region)

2020-07-17 Thread Jakov Varenina

Hi Barrett,

We suspected that this could be a bug, and it is great that you confirmed it
and created a fix so quickly.


Thank you very much for your effort!

BRs,

Jakov

On 16. 07. 2020. 21:39, Barrett Oglesby wrote:

I think you've found a bug in this scenario. The 
ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR method currently 
compares the data policy of the input region's leader region with the sender's 
persistence policy. It assumes the input region and the leader region have the 
same data policy. In this scenario, that is not the case. The input region is 
'part_a' which is not persistent, and the leader region is '_part_hidden' which 
is persistent. The sender is 'sender' which is not persistent. So, instead of 
comparing the data policy of 'part_a' to the sender which would succeed since 
they are both not persistent, it compares the data policy of '_part_hidden' to 
the sender which fails since one is persistent and one is not.

I made a small change to 
ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR to address the 
issue. I'll file a JIRA and run CI on it to see if it is a valid change. I'll 
also add a test for this scenario.
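
The difference between the two comparisons described above can be illustrated
with a small, self-contained model. This is only a sketch: Region, rejects and
leaderOrSelf are hypothetical stand-ins, not Geode classes, and the rule
"reject a persistent region with a non-persistent sender" is an assumption
inferred from the exception message in this thread. It is not the actual
change made to ParallelGatewaySenderQueue.

```java
/** Hypothetical, simplified model; none of these types are Geode classes. */
final class SenderPersistenceCheck {

  /** Stand-in for a partitioned region: name, persistence flag, optional colocation leader. */
  static final class Region {
    final String name;
    final boolean persistent;
    final Region leader; // null when the region is its own leader

    Region(String name, boolean persistent, Region leader) {
      this.name = name;
      this.persistent = persistent;
      this.leader = leader;
    }

    /** Follows the colocation chain up to the leader region. */
    Region leaderOrSelf() {
      return leader == null ? this : leader.leaderOrSelf();
    }
  }

  /** Assumed rule: a persistent region cannot be served by a non-persistent sender. */
  static boolean rejects(Region checked, boolean senderPersistent) {
    return checked.persistent && !senderPersistent;
  }

  public static void main(String[] args) {
    Region hidden = new Region("_part_hidden", true, null);
    Region partA = new Region("part_a", false, hidden);
    boolean senderPersistent = false;

    // Checking the leader region reproduces the reported rejection.
    System.out.println("leader check rejects: " + rejects(partA.leaderOrSelf(), senderPersistent));
    // Checking the input region itself accepts the same configuration.
    System.out.println("input check rejects:  " + rejects(partA, senderPersistent));
  }
}
```

In this model the leader check prints true and the input check prints false,
which matches the scenario with 'part_a', '_part_hidden' and 'sender'.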


From: Jakov Varenina 
Sent: Friday, July 10, 2020 3:34 AM
To: dev@geode.apache.org 
Subject: Re: Non-persistent parallel gateway sender on non-persistent region 
(collocated with persistent region)

Hi devs,

Just a kind reminder. We would be really grateful if you could take a look
at the question in the mail below.

BRs,

Jakov

On 06. 07. 2020. 15:50, Jakov Varenina wrote:

Hi all,


We are trying to set up a non-persistent parallel gateway sender
(‘sender’) on a non-persistent partitioned region (‘part_a’). This
works OK.
But when this same region ‘part_a’ is colocated with another
persistent region (‘_part_hidden’), Geode throws an exception:

Exception in thread "main" org.apache.geode.internal.cache.wan.GatewaySenderException: Non persistent gateway sender sender can not be attached to persistent region /_part_hidden
    at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:461)
    at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:451)
    at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:191)
    at org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:177)
    at org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1174)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3010)
    at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2869)
    at org.apache.geode.internal.cache.xmlcache.RegionCreation.createRoot(RegionCreation.java:237)
    at org.apache.geode.internal.cache.xmlcache.CacheCreation.initializeRegions(CacheCreation.java:658)
    at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:592)
    at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:338)
    at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4081)
    at org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1535)
    at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1374)
    at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
    at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:158)
    at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:142)
    at org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
    at org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:894)
    at org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:809)
    at org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:739)
    at org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:256)


This is the cache.xml used:

http://geode.apache.org/schema/cache


Re: Re: negative ActiveCQCount

2020-07-17 Thread Mario Kevo
Hi devs,

Just a reminder, in case someone is familiar with this or has an idea how to
resolve this issue.

Thanks and BR,
Mario

From: Mario Kevo 
Sent: July 7, 2020 15:24
To: dev@geode.apache.org 
Subject: Re: Re: negative ActiveCQCount

Hi,

Thank you all for the response!

What I have found so far is that when I register a CQ on one server, it sends
a message (processMessage) to the other server through FilterProfile, and the
opType in that message is REGISTER_CQ.
The fromData() method in FilterProfile.java contains the following:
if (isCqOp(this.opType)) {
  this.serverCqName = in.readUTF();
  if (this.opType == operationType.REGISTER_CQ || this.opType == operationType.SET_CQ_STATE) {
    this.cq = CqServiceProvider.readCq(in);
  }
}
There it registers the CQ on the other server without incrementing
cqActiveCount, which is OK since redundancy is 0. But now the two servers hold
different instances of ServerCqImpl for the same CQ: one created by the
constructor with arguments when the CQ is executed, and another created by the
empty constructor while deserializing the message with opType=REGISTER_CQ. To
me this is OK, since we need to track all changes on both servers in case
something on the other server fulfills the CQ condition. Correct me if I'm
wrong.

But when the CQ is closed, the close is executed on both servers, which seems
OK to me: what was started should be closed. However, the close method
decrements the count if stateBeforeClosing is RUNNING. So it would be good if
we could somehow take into account the cq_state of the ServerCqImpl instance
created by the constructor with parameters before closing the one created by
deserialization.
Does anyone have an idea how to do this, or some other idea to solve this issue?

BR,
Mario
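
The double decrement described above can be reproduced in a small standalone
model. This is a sketch under stated assumptions, not Geode code: ServerStats,
Cq, executed and deserialized are hypothetical names standing in for the
per-server statistics and the two ways a ServerCqImpl instance is said to be
created. It models only the redundancy 0 case with two servers.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the reported behavior; none of these types are Geode classes. */
final class CqCountModel {

  enum State { RUNNING, CLOSED }

  /** Stand-in for one server's statistics. */
  static final class ServerStats {
    final AtomicInteger activeCqCount = new AtomicInteger();
  }

  /** Stand-in for a per-server CQ instance. */
  static final class Cq {
    private final ServerStats stats;
    private State state = State.RUNNING;

    private Cq(ServerStats stats) {
      this.stats = stats;
    }

    /** Models the CQ executed by the client: counted when it is created. */
    static Cq executed(ServerStats stats) {
      stats.activeCqCount.incrementAndGet();
      return new Cq(stats);
    }

    /** Models the copy built while deserializing REGISTER_CQ: RUNNING but never counted. */
    static Cq deserialized(ServerStats stats) {
      return new Cq(stats);
    }

    /** Decrements whenever the state before closing was RUNNING, as described in the thread. */
    void close() {
      State stateBeforeClosing = state;
      state = State.CLOSED;
      if (stateBeforeClosing == State.RUNNING) {
        stats.activeCqCount.decrementAndGet();
      }
    }
  }

  /** Returns the per-server counts after registering and closing one CQ. */
  static int[] run() {
    ServerStats server1 = new ServerStats();
    ServerStats server2 = new ServerStats();
    Cq onServer1 = Cq.executed(server1);
    Cq onServer2 = Cq.deserialized(server2);
    onServer1.close();
    onServer2.close();
    return new int[] {server1.activeCqCount.get(), server2.activeCqCount.get()};
  }

  public static void main(String[] args) {
    int[] counts = run();
    System.out.println("server1=" + counts[0] + " server2=" + counts[1]
        + " cluster=" + (counts[0] + counts[1]));
  }
}
```

In this model the imbalance comes from close() keying off the state alone,
while only one of the two instances was ever counted. Whether the real fix
belongs on the increment side or the decrement side is the open question in
this thread.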


From: Kirk Lund 
Sent: July 1, 2020 19:52
To: dev@geode.apache.org 
Subject: Re: Re: negative ActiveCQCount

Yeah, https://issues.apache.org/jira/browse/GEODE-8293 sounds like a
statistic decrement bug for activeCqCount. Somewhere, each Server is
decrementing it once too many times.

You could find the statistics class containing activeCqCount and try adding
some debugging log statements or even add some breakpoints for debugger if
it's easily reproduced.

On Wed, Jul 1, 2020 at 5:52 AM Mario Kevo  wrote:

> Hi Kirk, thanks for the response!
>
> I just realized that I described the problem incorrectly, as I had tried so
> many cases. Sorry!
>
> We have a system with two servers. If the redundancy is 0, then we correctly
> get activeCqCount=1 on the first server and activeCqCount=0 on the second.
> After closing the CQ, we get activeCqCount=0 on the first server and
> activeCqCount=-1 on the second.
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category |      Metric      | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
>
> If we set redundancy to 1, it increments properly as expected, by one on both
> servers. But when the CQ is closed, we get activeCqCount=-1 on both servers,
> and the show metrics command has the following output:
> gfsh>show metrics --categories=query
> Cluster-wide Metrics
>
> Category |      Metric      | Value
> -------- | ---------------- | -----
> query    | activeCQCount    | -1
>          | queryRequestRate | 0.0
>
> What I found is that when a CQ is registered on one server, that server sends
> a message with opType=REGISTER_CQ to the other servers in the system, which
> creates a new instance of ServerCqImpl on the second server (using the empty
> constructor of ServerCqImpl). When we close the CQ, there are two different
> instances on the servers and both are closed; since both are in the RUNNING
> state before closing, activeCqCount is decremented on both of them.
>
> BR,
> Mario
>
> 
> From: Kirk Lund 
> Sent: June 30, 2020 19:54
> To: dev@geode.apache.org 
> Subject: Re: negative ActiveCQCount
>
> I think *show metrics --categories=query* is showing you the query stats
> from DistributedSystemMXBean (see
> ShowMetricsCommand#writeSystemWideMetricValues). DistributedSystemMXBean
> aggregates values across all members in the cluster, so I would have
> expected activeCQCount to initially show a value of 2 after you create a
> ServerCQImpl in 2 servers. Then after closing the CQ, it should drop to a
> value of 0.
>
> When you create a CQ on a Server, it should be reflected asynchronously on
> the CacheServerMXBean in that Server. Each Server has its own
> CacheServerMXBean. Over on the Locator (JMX Manager), the
> DistributedSystemMXBean aggregates the count of active CQs in
> ServerClusterStatsMonitor by invoking
> DistributedSystemBridge#updateCacheServer when the CacheServerMXBean state
> is federated to the Locator (JMX Manager).
>
> Based on what I see in code and in the description on GEODE-8293, I think
> you might want to see if increment has a problem instead of decrement.
>
> I don't see anything