Thanks a lot Laurent, much appreciated! BTW, were you able to reproduce the problem?
On Wed, May 29, 2019 at 11:29 PM Laurent Goujon <laur...@dremio.com> wrote: > Here's the pull request: https://github.com/apache/calcite/pull/1240 > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon <laur...@dremio.com> wrote: > > > Looks like most {{RelNode#create()}} access the RelOptCluster/RelBuilder > > instance from their child, and some then, perform metadata operation, > which > > would cause CyclicMetadataException for example. I'll create a fixup > patch. > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <laur...@dremio.com> > wrote: > > > >> I actually did the change from using a static relbuilder to a mix of > >> static (to create the scans) and per test because of concurrency issues > :( > >> Maybe the test should only been using a per test-case relbuilder > instead, > >> but it would mean that there's some code relying on the static > relbuilder > >> (by accessing it from the scan rel nodes) in a non-static context. I'll > try > >> to debug it too (but so far was unlucky reproducing the jenkins > problem). > >> > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang < > francischu...@apache.org> > >> wrote: > >> > >>> In Go, there's a built in race detector that can be used when running > >>> tests using `go test`. > >>> > >>> There's RacerD [1] from Facebook that can detect races in Java. Perhaps > >>> this is something that can be looked in to to find the race. > >>> > >>> [1] https://fbinfer.com/docs/racerd.html > >>> > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: > >>> > Good insights, Ruben, Danny! > >>> > > >>> > Assuming that is indeed a concurrency problem, it will be difficult > to > >>> > identify since many parts in Calcite are not thread safe. Not being > >>> able to > >>> > reproduce the problem makes the things even worse. > >>> > > >>> > Given that intermittent test failures occur often on Jenkins, how > about > >>> > creating a new branch dedicated to debugging? > >>> > It can be associated with a new Jenkins jobs (that obviously doesn't > >>> send > >>> > mails to everybody when there are failures) that can be launched > >>> on-demand > >>> > by the person who is looking into the problem. > >>> > > >>> > On the other hand, we do not really make an effort to have tests that > >>> can > >>> > be executed concurrently so another alternative would be to run tests > >>> only > >>> > sequentially. > >>> > I am not very fan of this approach since it will rather hide problems > >>> than > >>> > solve them. > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <yuzhao....@gmail.com> > >>> wrote: > >>> > > >>> >> Thanks Ruben for your good analysis. > >>> >> > >>> >> What I’m confused is that isn’t the static REL_BUILDER more prone to > >>> have > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN and DEPT_SCAN) > >>> are all > >>> >> nodes(immutable), how could this be a problem ? > >>> >> > >>> >> Best, > >>> >> Danny Chan > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com>,写道: > >>> >>> I'm checking the commit [1] and I see something strange in > >>> >> RelOptUtilTest. > >>> >>> Maybe I'm wrong and it is nothing, but just in case it may help: > >>> >>> > >>> >>> With the latest modification, it seems that we have two > >>> RelBuilder(s) in > >>> >>> place: > >>> >>> - A static one that is created ad-hoc on a static block to generate > >>> the > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2] > >>> >>> - An instance one to be used in the tests, that is initialized on > >>> >>> the @Before public void setUp() method [3] > >>> >>> > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read > >>> their > >>> >>> rowTypes to test some join auxiliary methods. But the new > >>> >>> tests testPushDownJoinConditions* actually build a plan and push > >>> these > >>> >>> scans into the RelBuilder to be tested [4] (which is a different > one > >>> than > >>> >>> the static RelBuider that created the scans). > >>> >>> Maybe this is no problem generally, but it can potentially be under > >>> >> certain > >>> >>> circumstances?, which would explain the randomness of the issue. > >>> >>> Could this explain the exception? > >>> >>> > >>> >>> [1] > >>> >>> > >>> >> > >>> > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > >>> >>> > >>> >>> [2] > >>> >>> > >>> >> > >>> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 > >>> >>> [3] > >>> >>> > >>> >> > >>> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 > >>> >>> [4] > >>> >>> > >>> >> > >>> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 > >>> >>> > >>> >>> > >>> >>> > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <jh...@apache.org> a > écrit > >>> : > >>> >>> > >>> >>>> It’s a tough call. It is probable that the problem existed already > >>> and > >>> >> the > >>> >>>> change merely surfaced it. > >>> >>>> > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis < > >>> zabe...@gmail.com> > >>> >>>> wrote: > >>> >>>>> > >>> >>>>> It is not the only test that is failing after commit [1] but all > >>> the > >>> >> new > >>> >>>>> tests that were added. > >>> >>>>> > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot > reproduce > >>> >> it > >>> >>>>> locally. > >>> >>>>> I guess we have to do with a race condition most likely due to > the > >>> >>>>> concurrent execution of tests with surefire. > >>> >>>>> > >>> >>>>> Should we revert the commit till we find a solution? > >>> >>>>> > >>> >>>>> [1] > >>> >>>>> > >>> >>>> > >>> >> > >>> > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > >>> >>>>> > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <jh...@apache.org> > >>> >> wrote: > >>> >>>>> > >>> >>>>>> I have seen this intermittent failure 3 times in the last week: > >>> >>>>>> > >>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time > >>> >> elapsed: > >>> >>>>>> 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest > >>> >>>>>> [ERROR] > >>> >>>>>> > >>> >>>> > >>> >> > >>> > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR! > >>> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException > >>> >>>>>> at > >>> >>>>>> > >>> >>>> > >>> >> > >>> > org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) > >>> >>>>>> > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was > only > >>> >> added > >>> >>>> on > >>> >>>>>> May 22 so I assume that it will continue to fail intermittently > >>> >> until > >>> >>>> we do > >>> >>>>>> something. > >>> >>>>>> > >>> >>>>>> Anyone have any ideas? > >>> >>>>>> > >>> >>>>>> Laurent, As you added the test can you please look into it? > >>> >>>>>> > >>> >>>>>> Julian > >>> >>>>>> > >>> >>>>>> > >>> >>>> > >>> >>>> > >>> >> > >>> > > >>> > >> >