Looks like most {{RelNode#create()}} access the RelOptCluster/RelBuilder instance from their child, and some then, perform metadata operation, which would cause CyclicMetadataException for example. I'll create a fixup patch.
On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <laur...@dremio.com> wrote: > I actually did the change from using a static relbuilder to a mix of > static (to create the scans) and per test because of concurrency issues :( > Maybe the test should only been using a per test-case relbuilder instead, > but it would mean that there's some code relying on the static relbuilder > (by accessing it from the scan rel nodes) in a non-static context. I'll try > to debug it too (but so far was unlucky reproducing the jenkins problem). > > On Wed, May 29, 2019 at 5:02 AM Francis Chuang <francischu...@apache.org> > wrote: > >> In Go, there's a built in race detector that can be used when running >> tests using `go test`. >> >> There's RacerD [1] from Facebook that can detect races in Java. Perhaps >> this is something that can be looked in to to find the race. >> >> [1] https://fbinfer.com/docs/racerd.html >> >> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: >> > Good insights, Ruben, Danny! >> > >> > Assuming that is indeed a concurrency problem, it will be difficult to >> > identify since many parts in Calcite are not thread safe. Not being >> able to >> > reproduce the problem makes the things even worse. >> > >> > Given that intermittent test failures occur often on Jenkins, how about >> > creating a new branch dedicated to debugging? >> > It can be associated with a new Jenkins jobs (that obviously doesn't >> send >> > mails to everybody when there are failures) that can be launched >> on-demand >> > by the person who is looking into the problem. >> > >> > On the other hand, we do not really make an effort to have tests that >> can >> > be executed concurrently so another alternative would be to run tests >> only >> > sequentially. >> > I am not very fan of this approach since it will rather hide problems >> than >> > solve them. >> > >> > >> > >> > >> > >> > >> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <yuzhao....@gmail.com> >> wrote: >> > >> >> Thanks Ruben for your good analysis. >> >> >> >> What I’m confused is that isn’t the static REL_BUILDER more prone to >> have >> >> concurrency problems ? And the pushed scans(EMP_SCAN and DEPT_SCAN) >> are all >> >> nodes(immutable), how could this be a problem ? >> >> >> >> Best, >> >> Danny Chan >> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com>,写道: >> >>> I'm checking the commit [1] and I see something strange in >> >> RelOptUtilTest. >> >>> Maybe I'm wrong and it is nothing, but just in case it may help: >> >>> >> >>> With the latest modification, it seems that we have two RelBuilder(s) >> in >> >>> place: >> >>> - A static one that is created ad-hoc on a static block to generate >> the >> >>> EMP_SCAN and DEPT_SCAN RelNodes [2] >> >>> - An instance one to be used in the tests, that is initialized on >> >>> the @Before public void setUp() method [3] >> >>> >> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read >> their >> >>> rowTypes to test some join auxiliary methods. But the new >> >>> tests testPushDownJoinConditions* actually build a plan and push these >> >>> scans into the RelBuilder to be tested [4] (which is a different one >> than >> >>> the static RelBuider that created the scans). >> >>> Maybe this is no problem generally, but it can potentially be under >> >> certain >> >>> circumstances?, which would explain the randomness of the issue. >> >>> Could this explain the exception? >> >>> >> >>> [1] >> >>> >> >> >> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >> >>> >> >>> [2] >> >>> >> >> >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 >> >>> [3] >> >>> >> >> >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 >> >>> [4] >> >>> >> >> >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 >> >>> >> >>> >> >>> >> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <jh...@apache.org> a écrit : >> >>> >> >>>> It’s a tough call. It is probable that the problem existed already >> and >> >> the >> >>>> change merely surfaced it. >> >>>> >> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <zabe...@gmail.com >> > >> >>>> wrote: >> >>>>> >> >>>>> It is not the only test that is failing after commit [1] but all the >> >> new >> >>>>> tests that were added. >> >>>>> >> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot reproduce >> >> it >> >>>>> locally. >> >>>>> I guess we have to do with a race condition most likely due to the >> >>>>> concurrent execution of tests with surefire. >> >>>>> >> >>>>> Should we revert the commit till we find a solution? >> >>>>> >> >>>>> [1] >> >>>>> >> >>>> >> >> >> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >> >>>>> >> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <jh...@apache.org> >> >> wrote: >> >>>>> >> >>>>>> I have seen this intermittent failure 3 times in the last week: >> >>>>>> >> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest >> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time >> >> elapsed: >> >>>>>> 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest >> >>>>>> [ERROR] >> >>>>>> >> >>>> >> >> >> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) >> >>>>>> Time elapsed: 0.349 s <<< ERROR! >> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException >> >>>>>> at >> >>>>>> >> >>>> >> >> >> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) >> >>>>>> >> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was only >> >> added >> >>>> on >> >>>>>> May 22 so I assume that it will continue to fail intermittently >> >> until >> >>>> we do >> >>>>>> something. >> >>>>>> >> >>>>>> Anyone have any ideas? >> >>>>>> >> >>>>>> Laurent, As you added the test can you please look into it? >> >>>>>> >> >>>>>> Julian >> >>>>>> >> >>>>>> >> >>>> >> >>>> >> >> >> > >> >