Re: Supporting "resumable" operations on a large tree
Hi,

For re-indexing, there are actually two problems:

* Indexing can take multiple days, so being able to resume would be nice.
* For synchronous indexes, indexing creates a large commit, which is problematic (especially for MongoDB).

To solve both problems ("kill two birds with one stone"), we could instead try to split indexing into multiple commits, for example using a "fromPath" .. "toPath" range and only re-indexing part of the repository at a time.

See also https://issues.apache.org/jira/browse/OAK-5324?focusedCommentId=15837941&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15837941

Regards,
Thomas

On 20/02/17 13:13, "Chetan Mehrotra" wrote:

>Hi Team,
>
>In Oak we often perform operations that traverse the tree as part of
>some processing, e.g. commit hooks, sidegrade, indexing etc. For a
>small tree this works fine, and in case of failure the processing can
>be done again from the start.
>
>However, for large operations like reindexing the whole repository for
>some index this poses a problem. For example, consider a Mongo setup
>with 100M+ nodes where we need to provision a new index. This would
>trigger an IndexUpdate which would go through all the nodes in the
>repository (in some depth-first manner) and then build up the index.
>This process can take a long time, say 1-2 days, for a Mongo based
>setup.
>
>As with any remote setup, such a flow may get interrupted due to some
>network issue or an outage on the Mongo/RDB side. In such a case the
>whole traversal is started again from the beginning.
>
>The same would be the case for any sidegrade operation where we
>convert a big repository from one form to another.
>
>To improve the resiliency of such operations (OAK-2063) we need a way
>to "resume" traversal of a tree from some last known point. For
>operations performed on a sorted list such a "resume" is easy, but
>doing that over a tree traversal looks tricky.
>
>Thoughts on what approach can be taken for enabling this?
>Maybe if we can expect a stable traversal order at a given revision,
>then we can keep track of paths to a certain depth and, on retry, skip
>processing of subtrees until we reach that path.
>
>Chetan Mehrotra
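The "stable order + last known path" idea above could be sketched roughly as follows. This is hypothetical illustration code, not the Oak API: the `Node` class is a stand-in for Oak's node tree, and the string comparison assumes node names where '/' sorts before the name characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of "resume from last known path": walk the tree in a
// stable depth-first order and skip processing of all nodes that come
// before the checkpoint path. The Node class below is a stand-in for
// Oak's node tree, NOT the actual Oak API.
public class ResumableTraversal {

    static class Node {
        final Map<String, Node> children = new LinkedHashMap<>();

        Node child(String name) { // get-or-create a child node
            return children.computeIfAbsent(name, n -> new Node());
        }
    }

    /**
     * Depth-first traversal that processes only paths at or after
     * 'resumeFrom'. Comparing paths as strings matches the depth-first
     * visiting order here because '/' sorts before typical node-name
     * characters (an assumption of this sketch).
     */
    static void traverse(Node node, String path, String resumeFrom,
                         List<String> processed) {
        if (path.compareTo(resumeFrom) >= 0) {
            processed.add(path); // e.g. feed this node to the indexer
        }
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            String childPath =
                path.equals("/") ? "/" + e.getKey() : path + "/" + e.getKey();
            traverse(e.getValue(), childPath, resumeFrom, processed);
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        root.child("a").child("b");
        root.child("c");
        List<String> processed = new ArrayList<>();
        // resume after a crash whose last recorded checkpoint was /c
        traverse(root, "/", "/c", processed);
        System.out.println(processed);
    }
}
```

Note that this sketch still *visits* the skipped subtrees and only skips *processing* them, matching the "skip processing of subtrees until we get that path" suggestion; a real implementation would additionally want to prune whole subtrees that lie entirely before the checkpoint.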
Re: [Observation] Should listeners require constant inflow of commits to get all events?
>>But agreed, this is a bug and we should fix it.
>
>Actually, I'm not too sure, as long as we concretely document the
>behavior and potentially have a sample abstract commit-creator/listener
>which does the job well (maybe similar to the hack I used).

I've created OAK-5740 and attached a test case that reproduces this. We can follow up there on if/when/how we want to fix this.

Cheers,
Stefan
Flaky tests due to timing issues
Hi,

I assume that at least some of the tests that sporadically fail on the Apache Jenkins fail because of timing issues. To address this we could either

a) skip these tests on Jenkins,
b) increase the time-outs, or
c) apply platform-dependent time-outs.

I would prefer b). I presume that there is no impact on the build time unless the build fails anyway because it is running into one of these time-outs. If this is not acceptable we could go for c) and provision platform-dependent time-outs through the CIHelper class. I somewhat dislike the additional complexity, though. As a last resort we can still do a).

WDYT?

Michael
Re: Flaky tests due to timing issues
Hi,

I assume with b) you mean: change tests to use loops, combined with very high timeouts. Example:

Before:

    save();
    Thread.sleep(1000);
    assertTrue(abc());

After:

    save();
    for (int i = 0; !abc() && i < 600; i++) {
        Thread.sleep(100);
    }
    assertTrue(abc());

The additional benefit of this logic is that on a fast machine the test is faster (only 100 ms of sleep instead of 1 second). Disadvantage: additional complexity, as you wrote (which could be avoided with Java 8 lambda expressions).

Regards,
Thomas

On 21/02/17 13:49, "Michael Dürig" wrote:
>[...]
>a) skip these tests on Jenkins,
>b) increase the time-outs, or
>c) apply platform-dependent time-outs.
>[...]
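The Java 8 lambda variant Thomas alludes to could look roughly like this. The `waitUntil` helper below is a hypothetical sketch, not an existing Oak utility:

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper sketching the "loop instead of one long sleep"
// pattern from the thread: poll a condition in short intervals up to a
// generous overall timeout. Not an existing Oak utility.
public class WaitUntil {

    /** Returns true as soon as the condition holds, false on timeout. */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(100); // short poll keeps fast machines fast
        }
        return true;
    }
}
```

A test would then read `assertTrue(waitUntil(() -> abc(), 60_000));`, keeping the generous timeout without always paying for it.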
Re: Flaky tests due to timing issues
Hi,

If in this context b) actually means 'fix the tests to be more lenient', I agree this should be the way to go.

However, I think the failing tests are not being given enough priority currently, and if people aren't able to carve out the time for investigation, then we're stuck with very unreliable builds, which can hide more problems down the line.

I would disable the tests that have failed more than a few times, on Jenkins only, with the goal of returning to a consistently green build again.

alex

On Tue, Feb 21, 2017 at 2:09 PM, Thomas Mueller wrote:
> Hi,
>
> I assume with b) you mean: change tests to use loops, combined with
> very high timeouts.
> [...]
Re: Flaky tests due to timing issues
On 21.02.17 14:09, Thomas Mueller wrote:
> Hi,
>
> I assume with b) you mean: change tests to use loops, combined with
> very high timeouts. Example:

No, I actually meant getting individual time-out values (or a scaling factor for time-outs) from CIHelper. That class already provides the means to skip tests based on where they are running, so it should be relatively straightforward to have it supply scaling factors for time-outs in a similar manner.

Michael

> [...]
Re: Flaky tests due to timing issues
On 21.02.17 15:32, Alex Parvulescu wrote:
> Hi,
>
> If in this context b) actually means 'fix the tests to be more
> lenient', I agree this should be the way to go.

Yes, and this is not necessarily bad, as I presume that some of the time-outs might well be overly tight (at least in some environments).

> However, I think the failing tests are not being given enough priority
> currently, and if people aren't able to carve out the time for
> investigation, then we're stuck with very unreliable builds, which can
> hide more problems down the line. I would disable the tests that have
> failed more than a few times, on Jenkins only, with the goal of
> returning to a consistently green build again.

+1. Please have a look at the CIHelper class and its usage for disabling tests on Jenkins only.

Michael

> [...]
Re: Flaky tests due to timing issues
Hi,

>No I actually meant getting individual time-out values (or a scaling
>factor for time-outs) from CIHelper. That class already provides the
>means to skip tests based on where they are running. So it should be
>relatively straight forward to have it supply scaling factors for
>time-outs in a similar manner.

Do you have an example? I think timeouts on the order of seconds are problematic, and I don't think that "scaling" them to 5 seconds or so will fully solve the problem. Timeouts on the order of minutes are better, but I wouldn't want to _always_ delay tests that long. That's why I believe using loops is better. But in that case, configuration seems unnecessary.

Regards,
Thomas
[Oak origin/1.6] Apache Jackrabbit Oak matrix - Build # 1441 - Still Failing
The Apache Jenkins build system has built Apache Jackrabbit Oak matrix (build #1441).

Status: Still Failing

Check console output at https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1441/ to view the results.

Changes:
[reschke] OAK-5738: Potential NPE in LargeLdapProviderTest (ported to 1.6)

Test results: 1 test failed.

FAILED: org.apache.jackrabbit.oak.run.osgi.SecurityProviderRegistrationTest.testRequiredUserAuthenticationFactoryNotAvailable

Error Message:
assert securityProviderServiceReferences != null
       |                                  |
       null                               false

Stack Trace:
Assertion failed:
assert securityProviderServiceReferences != null
       |                                  |
       null                               false
    at org.codehaus.groovy.runtime.InvokerHelper.assertFailed(InvokerHelper.java:398)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.assertFailed(ScriptBytecodeAdapter.java:646)
    at org.apache.jackrabbit.oak.run.osgi.SecurityProviderRegistrationTest.testRequiredService(SecurityProviderRegistrationTest.groovy:296)
    at org.apache.jackrabbit.oak.run.osgi.SecurityProviderRegistrationTest.this$3$testRequiredService(SecurityProviderRegistrationTest.groovy)
    at org.apache.jackrabbit.oak.run.osgi.SecurityProviderRegistrationTest$this$3$testRequiredService.callCurrent(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
    at org.apache.jackrabbit.oak.run.osgi.SecurityProviderRegistrationTest.testRequiredUserAuthenticationFactoryNotAvailable(SecurityProviderRegistrationTest.groovy:139)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
[Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 1442 - Still Failing
The Apache Jenkins build system has built Apache Jackrabbit Oak matrix (build #1442).

Status: Still Failing

Check console output at https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1442/ to view the results.

Changes:
[reschke] OAK-5738: Potential NPE in LargeLdapProviderTest (ported to 1.4)

Test results: 1 test failed.

FAILED: org.apache.jackrabbit.oak.plugins.segment.standby.ExternalSharedStoreIT.testProxyFlippedIntermediateByte

Error Message:
expected:<{ root = { ... } }> but was:<{ root : { } }>

Stack Trace:
java.lang.AssertionError: expected:<{ root = { ... } }> but was:<{ root : { } }>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:834)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:144)
    at org.apache.jackrabbit.oak.plugins.segment.standby.DataStoreTestBase.useProxy(DataStoreTestBase.java:224)
    at org.apache.jackrabbit.oak.plugins.segment.standby.DataStoreTestBase.testProxyFlippedIntermediateByte(DataStoreTestBase.java:174)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
[Oak origin/1.2] Apache Jackrabbit Oak matrix - Build # 1443 - Still Failing
The Apache Jenkins build system has built Apache Jackrabbit Oak matrix (build #1443).

Status: Still Failing

Check console output at https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1443/ to view the results.

Changes:
[reschke] OAK-5738: Potential NPE in LargeLdapProviderTest (ported to 1.2)

Test results: All tests passed
[Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 1444 - Still Failing
The Apache Jenkins build system has built Apache Jackrabbit Oak matrix (build #1444).

Status: Still Failing

Check console output at https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1444/ to view the results.

Changes:
[reschke] OAK-5738: Potential NPE in LargeLdapProviderTest
[reschke] OAK-5612: Test failure:
[thomasm] OAK-4888 Warn or fail queries above a configurable cost value
[stefanegli] OAK-5742 : logging more details in case stopAndWait returns false - to
[angela] OAK-5743 : UserQueryManager: omits nt-name when searching for properties
[angela] OAK-5689 : AbstractSecurityTest: enforce test-failure for traversal

Test results: All tests passed