just run that single test would not reproduce. It also passes on my machine if run it separately. mvn clean install can repro every time.
Thanks~ Haisheng Yuan------------------------------------------------------------------ 发件人:Ruben Q L<rube...@gmail.com> 日 期:2019年05月31日 14:03:44 收件人:<dev@calcite.apache.org> 抄 送:Stamatis Zampetakis<zabe...@gmail.com> 主 题:Re: Re: CyclicMetadataException in testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase Thanks for the info. I'm running EnumerableRepeatUnionHierarchyTest, and in my local environment (macOS 10.14.5, JDK 8) it ends successfully. I'll continue digging into it... Le ven. 31 mai 2019 à 04:20, Haisheng Yuan <h.y...@alibaba-inc.com> a écrit : > The test suite stuck on my laptop for ever. macOS 10.14.4, JDK 8. > Ignoring test case EnumerableRepeatUnionHierarchyTest makes the whole test > pass again. > > Thanks ~ > Haisheng Yuan > ------------------------------------------------------------------ > 发件人:Stamatis Zampetakis<zabe...@gmail.com> > 日 期:2019年05月30日 22:49:22 > 收件人:<dev@calcite.apache.org> > 主 题:Re: CyclicMetadataException in > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase > > Thanks for noticing Laurent! > > The last builds were successful so I guess we have to do with another > intermittent failure. > > If the problem persists, I will create a JIRA and follow up there 😉 > > On Thu, May 30, 2019, 6:43 AM Laurent Goujon <laur...@dremio.com> wrote: > > > Looks like Jenkins started to timeout with the following change: > > > > > https://github.com/apache/calcite/commit/73e6d05fa65f16485caca80571d1fe4fda5c7468 > > . > > > > Ruben, Stamatis: any idea? > > > > On Wed, May 29, 2019 at 8:22 PM Laurent Goujon <laur...@dremio.com> > wrote: > > > > > It looks like Jenkins build with JDK8 now takes more than 40min and is > > > being aborted, with the core module taking something like 38min and > being > > > stopped during tests. It seems to be independent of my test change > since > > > RelOptUtilTest shows to be running successfully in 1second. > > > That said, I don't see a test still be running from the logs, so I > guess > > > it would be something keeping the jvm from stopping even if all tests > are > > > completed? > > > > > > On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis <zabe...@gmail.com > > > > > wrote: > > > > > >> Makes sense, I was expecting something like that. > > >> > > >> +1 for the PR. Even if we are not 100% that will resolve the problem, > > >> getting rid of the statics seems like a good idea and beneficial in > any > > >> case. > > >> > > >> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon <laur...@dremio.com> > > >> wrote: > > >> > > >> > Unfortunately, not locally, so I ended in crude debugging and code > > >> analysis > > >> > to understand where the static reloptcluster created for the static > > >> > relbuilder was used in a context which would cause issue in a > > >> multithreaded > > >> > context. Once I saw several rel nodes getting the parent cluster > (the > > >> > static table scan) to access the metadata query (which cannot be > > reused > > >> > between threads), that's where I assumed this would be the problem. > > >> > > > >> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis < > > zabe...@gmail.com> > > >> > wrote: > > >> > > > >> > > Thanks a lot Laurent, much appreciated! > > >> > > > > >> > > BTW, were you able to reproduce the problem? > > >> > > > > >> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon < > laur...@dremio.com > > > > > >> > > wrote: > > >> > > > > >> > > > Here's the pull request: > > >> https://github.com/apache/calcite/pull/1240 > > >> > > > > > >> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon < > > laur...@dremio.com> > > >> > > wrote: > > >> > > > > > >> > > > > Looks like most {{RelNode#create()}} access the > > >> > > RelOptCluster/RelBuilder > > >> > > > > instance from their child, and some then, perform metadata > > >> operation, > > >> > > > which > > >> > > > > would cause CyclicMetadataException for example. I'll create a > > >> fixup > > >> > > > patch. > > >> > > > > > > >> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon < > > >> laur...@dremio.com> > > >> > > > wrote: > > >> > > > > > > >> > > > >> I actually did the change from using a static relbuilder to a > > >> mix of > > >> > > > >> static (to create the scans) and per test because of > > concurrency > > >> > > issues > > >> > > > :( > > >> > > > >> Maybe the test should only been using a per test-case > > relbuilder > > >> > > > instead, > > >> > > > >> but it would mean that there's some code relying on the > static > > >> > > > relbuilder > > >> > > > >> (by accessing it from the scan rel nodes) in a non-static > > >> context. > > >> > > I'll > > >> > > > try > > >> > > > >> to debug it too (but so far was unlucky reproducing the > jenkins > > >> > > > problem). > > >> > > > >> > > >> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang < > > >> > > > francischu...@apache.org> > > >> > > > >> wrote: > > >> > > > >> > > >> > > > >>> In Go, there's a built in race detector that can be used > when > > >> > running > > >> > > > >>> tests using `go test`. > > >> > > > >>> > > >> > > > >>> There's RacerD [1] from Facebook that can detect races in > > Java. > > >> > > Perhaps > > >> > > > >>> this is something that can be looked in to to find the race. > > >> > > > >>> > > >> > > > >>> [1] https://fbinfer.com/docs/racerd.html > > >> > > > >>> > > >> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: > > >> > > > >>> > Good insights, Ruben, Danny! > > >> > > > >>> > > > >> > > > >>> > Assuming that is indeed a concurrency problem, it will be > > >> > difficult > > >> > > > to > > >> > > > >>> > identify since many parts in Calcite are not thread safe. > > Not > > >> > being > > >> > > > >>> able to > > >> > > > >>> > reproduce the problem makes the things even worse. > > >> > > > >>> > > > >> > > > >>> > Given that intermittent test failures occur often on > > Jenkins, > > >> how > > >> > > > about > > >> > > > >>> > creating a new branch dedicated to debugging? > > >> > > > >>> > It can be associated with a new Jenkins jobs (that > obviously > > >> > > doesn't > > >> > > > >>> send > > >> > > > >>> > mails to everybody when there are failures) that can be > > >> launched > > >> > > > >>> on-demand > > >> > > > >>> > by the person who is looking into the problem. > > >> > > > >>> > > > >> > > > >>> > On the other hand, we do not really make an effort to have > > >> tests > > >> > > that > > >> > > > >>> can > > >> > > > >>> > be executed concurrently so another alternative would be > to > > >> run > > >> > > tests > > >> > > > >>> only > > >> > > > >>> > sequentially. > > >> > > > >>> > I am not very fan of this approach since it will rather > hide > > >> > > problems > > >> > > > >>> than > > >> > > > >>> > solve them. > > >> > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen < > > >> > yuzhao....@gmail.com> > > >> > > > >>> wrote: > > >> > > > >>> > > > >> > > > >>> >> Thanks Ruben for your good analysis. > > >> > > > >>> >> > > >> > > > >>> >> What I’m confused is that isn’t the static REL_BUILDER > more > > >> > prone > > >> > > to > > >> > > > >>> have > > >> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN and > > >> > > DEPT_SCAN) > > >> > > > >>> are all > > >> > > > >>> >> nodes(immutable), how could this be a problem ? > > >> > > > >>> >> > > >> > > > >>> >> Best, > > >> > > > >>> >> Danny Chan > > >> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com > > >,写道: > > >> > > > >>> >>> I'm checking the commit [1] and I see something strange > in > > >> > > > >>> >> RelOptUtilTest. > > >> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case it > may > > >> > help: > > >> > > > >>> >>> > > >> > > > >>> >>> With the latest modification, it seems that we have two > > >> > > > >>> RelBuilder(s) in > > >> > > > >>> >>> place: > > >> > > > >>> >>> - A static one that is created ad-hoc on a static block > to > > >> > > generate > > >> > > > >>> the > > >> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2] > > >> > > > >>> >>> - An instance one to be used in the tests, that is > > >> initialized > > >> > on > > >> > > > >>> >>> the @Before public void setUp() method [3] > > >> > > > >>> >>> > > >> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only > > used > > >> to > > >> > > read > > >> > > > >>> their > > >> > > > >>> >>> rowTypes to test some join auxiliary methods. But the > new > > >> > > > >>> >>> tests testPushDownJoinConditions* actually build a plan > > and > > >> > push > > >> > > > >>> these > > >> > > > >>> >>> scans into the RelBuilder to be tested [4] (which is a > > >> > different > > >> > > > one > > >> > > > >>> than > > >> > > > >>> >>> the static RelBuider that created the scans). > > >> > > > >>> >>> Maybe this is no problem generally, but it can > potentially > > >> be > > >> > > under > > >> > > > >>> >> certain > > >> > > > >>> >>> circumstances?, which would explain the randomness of > the > > >> > issue. > > >> > > > >>> >>> Could this explain the exception? > > >> > > > >>> >>> > > >> > > > >>> >>> [1] > > >> > > > >>> >>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > > >> > > > >>> >>> > > >> > > > >>> >>> [2] > > >> > > > >>> >>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 > > >> > > > >>> >>> [3] > > >> > > > >>> >>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 > > >> > > > >>> >>> [4] > > >> > > > >>> >>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 > > >> > > > >>> >>> > > >> > > > >>> >>> > > >> > > > >>> >>> > > >> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde < > > jh...@apache.org> > > >> a > > >> > > > écrit > > >> > > > >>> : > > >> > > > >>> >>> > > >> > > > >>> >>>> It’s a tough call. It is probable that the problem > > existed > > >> > > already > > >> > > > >>> and > > >> > > > >>> >> the > > >> > > > >>> >>>> change merely surfaced it. > > >> > > > >>> >>>> > > >> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis < > > >> > > > >>> zabe...@gmail.com> > > >> > > > >>> >>>> wrote: > > >> > > > >>> >>>>> > > >> > > > >>> >>>>> It is not the only test that is failing after commit > [1] > > >> but > > >> > > all > > >> > > > >>> the > > >> > > > >>> >> new > > >> > > > >>> >>>>> tests that were added. > > >> > > > >>> >>>>> > > >> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I > > cannot > > >> > > > reproduce > > >> > > > >>> >> it > > >> > > > >>> >>>>> locally. > > >> > > > >>> >>>>> I guess we have to do with a race condition most > likely > > >> due > > >> > to > > >> > > > the > > >> > > > >>> >>>>> concurrent execution of tests with surefire. > > >> > > > >>> >>>>> > > >> > > > >>> >>>>> Should we revert the commit till we find a solution? > > >> > > > >>> >>>>> > > >> > > > >>> >>>>> [1] > > >> > > > >>> >>>>> > > >> > > > >>> >>>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > > >> > > > >>> >>>>> > > >> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde < > > >> > jh...@apache.org> > > >> > > > >>> >> wrote: > > >> > > > >>> >>>>> > > >> > > > >>> >>>>>> I have seen this intermittent failure 3 times in the > > last > > >> > > week: > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest > > >> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, > Skipped: > > >> 0, > > >> > > Time > > >> > > > >>> >> elapsed: > > >> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in > > >> > > org.apache.calcite.plan.RelOptUtilTest > > >> > > > >>> >>>>>> [ERROR] > > >> > > > >>> >>>>>> > > >> > > > >>> >>>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) > > >> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR! > > >> > > > >>> >>>>>> > org.apache.calcite.rel.metadata.CyclicMetadataException > > >> > > > >>> >>>>>> at > > >> > > > >>> >>>>>> > > >> > > > >>> >>>> > > >> > > > >>> >> > > >> > > > >>> > > >> > > > > > >> > > > > >> > > > >> > > > org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The > > test > > >> was > > >> > > > only > > >> > > > >>> >> added > > >> > > > >>> >>>> on > > >> > > > >>> >>>>>> May 22 so I assume that it will continue to fail > > >> > > intermittently > > >> > > > >>> >> until > > >> > > > >>> >>>> we do > > >> > > > >>> >>>>>> something. > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> Anyone have any ideas? > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> Laurent, As you added the test can you please look > into > > >> it? > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> Julian > > >> > > > >>> >>>>>> > > >> > > > >>> >>>>>> > > >> > > > >>> >>>> > > >> > > > >>> >>>> > > >> > > > >>> >> > > >> > > > >>> > > > >> > > > >>> > > >> > > > >> > > >> > > > > > >> > > > > >> > > > >> > > > > > > >