My guess here is that: EnumerableRepeatUnionHierarchyTest is a Parameterized test, if the different tests are executed in parallel, since they all use the same plan with minor modifications, but especially since they all share the same transient table name (the temporary table that is used to accumulate results in the iterative process), they might interfere on each other's results and make some tests run forever. I can prepare a patch modifying EnumerableRepeatUnionHierarchyTest to ensure that the transient table is also parameterized, and see if that solves the problem.
Le ven. 31 mai 2019 à 09:26, Stamatis Zampetakis <zabe...@gmail.com> a écrit : > @Haisheng: since you can reproduce it every time locally can you share the > stacktrace at the moment that it is stucked? > > I think it will help a lot in finding the problem. > > On Fri, May 31, 2019, 8:22 AM Ruben Q L <rube...@gmail.com> wrote: > >> Ok, I understand. I'll see if I can reproduce it on my side. >> Thanks >> >> >> Le ven. 31 mai 2019 à 08:19, Haisheng Yuan <h.y...@alibaba-inc.com> a >> écrit : >> >>> just run that single test would not reproduce. >>> It also passes on my machine if run it separately. mvn clean install can >>> repro every time. >>> >>> >>> >>> >>> >>> Thanks~ >>> Haisheng >>> Yuan------------------------------------------------------------------ >>> 发件人:Ruben Q L<rube...@gmail.com> >>> 日 期:2019年05月31日 14:03:44 >>> 收件人:<dev@calcite.apache.org> >>> 抄 送:Stamatis Zampetakis<zabe...@gmail.com> >>> 主 题:Re: Re: CyclicMetadataException in >>> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase >>> >>> Thanks for the info. >>> I'm running EnumerableRepeatUnionHierarchyTest, and in my local >>> environment >>> (macOS 10.14.5, JDK 8) it ends successfully. >>> I'll continue digging into it... >>> >>> >>> Le ven. 31 mai 2019 à 04:20, Haisheng Yuan <h.y...@alibaba-inc.com> a >>> écrit : >>> >>> > The test suite stuck on my laptop for ever. macOS 10.14.4, JDK 8. >>> > Ignoring test case EnumerableRepeatUnionHierarchyTest makes the whole >>> test >>> > pass again. >>> > >>> > Thanks ~ >>> > Haisheng Yuan >>> > ------------------------------------------------------------------ >>> > 发件人:Stamatis Zampetakis<zabe...@gmail.com> >>> > 日 期:2019年05月30日 22:49:22 >>> > 收件人:<dev@calcite.apache.org> >>> > 主 题:Re: CyclicMetadataException in >>> > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase >>> > >>> > Thanks for noticing Laurent! >>> > >>> > The last builds were successful so I guess we have to do with another >>> > intermittent failure. >>> > >>> > If the problem persists, I will create a JIRA and follow up there 😉 >>> > >>> > On Thu, May 30, 2019, 6:43 AM Laurent Goujon <laur...@dremio.com> >>> wrote: >>> > >>> > > Looks like Jenkins started to timeout with the following change: >>> > > >>> > > >>> > >>> https://github.com/apache/calcite/commit/73e6d05fa65f16485caca80571d1fe4fda5c7468 >>> > > . >>> > > >>> > > Ruben, Stamatis: any idea? >>> > > >>> > > On Wed, May 29, 2019 at 8:22 PM Laurent Goujon <laur...@dremio.com> >>> > wrote: >>> > > >>> > > > It looks like Jenkins build with JDK8 now takes more than 40min >>> and is >>> > > > being aborted, with the core module taking something like 38min and >>> > being >>> > > > stopped during tests. It seems to be independent of my test change >>> > since >>> > > > RelOptUtilTest shows to be running successfully in 1second. >>> > > > That said, I don't see a test still be running from the logs, so I >>> > guess >>> > > > it would be something keeping the jvm from stopping even if all >>> tests >>> > are >>> > > > completed? >>> > > > >>> > > > On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis < >>> zabe...@gmail.com >>> > > >>> > > > wrote: >>> > > > >>> > > >> Makes sense, I was expecting something like that. >>> > > >> >>> > > >> +1 for the PR. Even if we are not 100% that will resolve the >>> problem, >>> > > >> getting rid of the statics seems like a good idea and beneficial >>> in >>> > any >>> > > >> case. >>> > > >> >>> > > >> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon < >>> laur...@dremio.com> >>> > > >> wrote: >>> > > >> >>> > > >> > Unfortunately, not locally, so I ended in crude debugging and >>> code >>> > > >> analysis >>> > > >> > to understand where the static reloptcluster created for the >>> static >>> > > >> > relbuilder was used in a context which would cause issue in a >>> > > >> multithreaded >>> > > >> > context. Once I saw several rel nodes getting the parent cluster >>> > (the >>> > > >> > static table scan) to access the metadata query (which cannot be >>> > > reused >>> > > >> > between threads), that's where I assumed this would be the >>> problem. >>> > > >> > >>> > > >> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis < >>> > > zabe...@gmail.com> >>> > > >> > wrote: >>> > > >> > >>> > > >> > > Thanks a lot Laurent, much appreciated! >>> > > >> > > >>> > > >> > > BTW, were you able to reproduce the problem? >>> > > >> > > >>> > > >> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon < >>> > laur...@dremio.com >>> > > > >>> > > >> > > wrote: >>> > > >> > > >>> > > >> > > > Here's the pull request: >>> > > >> https://github.com/apache/calcite/pull/1240 >>> > > >> > > > >>> > > >> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon < >>> > > laur...@dremio.com> >>> > > >> > > wrote: >>> > > >> > > > >>> > > >> > > > > Looks like most {{RelNode#create()}} access the >>> > > >> > > RelOptCluster/RelBuilder >>> > > >> > > > > instance from their child, and some then, perform metadata >>> > > >> operation, >>> > > >> > > > which >>> > > >> > > > > would cause CyclicMetadataException for example. I'll >>> create a >>> > > >> fixup >>> > > >> > > > patch. >>> > > >> > > > > >>> > > >> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon < >>> > > >> laur...@dremio.com> >>> > > >> > > > wrote: >>> > > >> > > > > >>> > > >> > > > >> I actually did the change from using a static relbuilder >>> to a >>> > > >> mix of >>> > > >> > > > >> static (to create the scans) and per test because of >>> > > concurrency >>> > > >> > > issues >>> > > >> > > > :( >>> > > >> > > > >> Maybe the test should only been using a per test-case >>> > > relbuilder >>> > > >> > > > instead, >>> > > >> > > > >> but it would mean that there's some code relying on the >>> > static >>> > > >> > > > relbuilder >>> > > >> > > > >> (by accessing it from the scan rel nodes) in a non-static >>> > > >> context. >>> > > >> > > I'll >>> > > >> > > > try >>> > > >> > > > >> to debug it too (but so far was unlucky reproducing the >>> > jenkins >>> > > >> > > > problem). >>> > > >> > > > >> >>> > > >> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang < >>> > > >> > > > francischu...@apache.org> >>> > > >> > > > >> wrote: >>> > > >> > > > >> >>> > > >> > > > >>> In Go, there's a built in race detector that can be used >>> > when >>> > > >> > running >>> > > >> > > > >>> tests using `go test`. >>> > > >> > > > >>> >>> > > >> > > > >>> There's RacerD [1] from Facebook that can detect races >>> in >>> > > Java. >>> > > >> > > Perhaps >>> > > >> > > > >>> this is something that can be looked in to to find the >>> race. >>> > > >> > > > >>> >>> > > >> > > > >>> [1] https://fbinfer.com/docs/racerd.html >>> > > >> > > > >>> >>> > > >> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: >>> > > >> > > > >>> > Good insights, Ruben, Danny! >>> > > >> > > > >>> > >>> > > >> > > > >>> > Assuming that is indeed a concurrency problem, it >>> will be >>> > > >> > difficult >>> > > >> > > > to >>> > > >> > > > >>> > identify since many parts in Calcite are not thread >>> safe. >>> > > Not >>> > > >> > being >>> > > >> > > > >>> able to >>> > > >> > > > >>> > reproduce the problem makes the things even worse. >>> > > >> > > > >>> > >>> > > >> > > > >>> > Given that intermittent test failures occur often on >>> > > Jenkins, >>> > > >> how >>> > > >> > > > about >>> > > >> > > > >>> > creating a new branch dedicated to debugging? >>> > > >> > > > >>> > It can be associated with a new Jenkins jobs (that >>> > obviously >>> > > >> > > doesn't >>> > > >> > > > >>> send >>> > > >> > > > >>> > mails to everybody when there are failures) that can >>> be >>> > > >> launched >>> > > >> > > > >>> on-demand >>> > > >> > > > >>> > by the person who is looking into the problem. >>> > > >> > > > >>> > >>> > > >> > > > >>> > On the other hand, we do not really make an effort to >>> have >>> > > >> tests >>> > > >> > > that >>> > > >> > > > >>> can >>> > > >> > > > >>> > be executed concurrently so another alternative would >>> be >>> > to >>> > > >> run >>> > > >> > > tests >>> > > >> > > > >>> only >>> > > >> > > > >>> > sequentially. >>> > > >> > > > >>> > I am not very fan of this approach since it will >>> rather >>> > hide >>> > > >> > > problems >>> > > >> > > > >>> than >>> > > >> > > > >>> > solve them. >>> > > >> > > > >>> > >>> > > >> > > > >>> > >>> > > >> > > > >>> > >>> > > >> > > > >>> > >>> > > >> > > > >>> > >>> > > >> > > > >>> > >>> > > >> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen < >>> > > >> > yuzhao....@gmail.com> >>> > > >> > > > >>> wrote: >>> > > >> > > > >>> > >>> > > >> > > > >>> >> Thanks Ruben for your good analysis. >>> > > >> > > > >>> >> >>> > > >> > > > >>> >> What I’m confused is that isn’t the static >>> REL_BUILDER >>> > more >>> > > >> > prone >>> > > >> > > to >>> > > >> > > > >>> have >>> > > >> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN >>> and >>> > > >> > > DEPT_SCAN) >>> > > >> > > > >>> are all >>> > > >> > > > >>> >> nodes(immutable), how could this be a problem ? >>> > > >> > > > >>> >> >>> > > >> > > > >>> >> Best, >>> > > >> > > > >>> >> Danny Chan >>> > > >> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L < >>> rube...@gmail.com >>> > > >,写道: >>> > > >> > > > >>> >>> I'm checking the commit [1] and I see something >>> strange >>> > in >>> > > >> > > > >>> >> RelOptUtilTest. >>> > > >> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case >>> it >>> > may >>> > > >> > help: >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> With the latest modification, it seems that we have >>> two >>> > > >> > > > >>> RelBuilder(s) in >>> > > >> > > > >>> >>> place: >>> > > >> > > > >>> >>> - A static one that is created ad-hoc on a static >>> block >>> > to >>> > > >> > > generate >>> > > >> > > > >>> the >>> > > >> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2] >>> > > >> > > > >>> >>> - An instance one to be used in the tests, that is >>> > > >> initialized >>> > > >> > on >>> > > >> > > > >>> >>> the @Before public void setUp() method [3] >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were >>> only >>> > > used >>> > > >> to >>> > > >> > > read >>> > > >> > > > >>> their >>> > > >> > > > >>> >>> rowTypes to test some join auxiliary methods. But >>> the >>> > new >>> > > >> > > > >>> >>> tests testPushDownJoinConditions* actually build a >>> plan >>> > > and >>> > > >> > push >>> > > >> > > > >>> these >>> > > >> > > > >>> >>> scans into the RelBuilder to be tested [4] (which >>> is a >>> > > >> > different >>> > > >> > > > one >>> > > >> > > > >>> than >>> > > >> > > > >>> >>> the static RelBuider that created the scans). >>> > > >> > > > >>> >>> Maybe this is no problem generally, but it can >>> > potentially >>> > > >> be >>> > > >> > > under >>> > > >> > > > >>> >> certain >>> > > >> > > > >>> >>> circumstances?, which would explain the randomness >>> of >>> > the >>> > > >> > issue. >>> > > >> > > > >>> >>> Could this explain the exception? >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> [1] >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> [2] >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 >>> > > >> > > > >>> >>> [3] >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 >>> > > >> > > > >>> >>> [4] >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde < >>> > > jh...@apache.org> >>> > > >> a >>> > > >> > > > écrit >>> > > >> > > > >>> : >>> > > >> > > > >>> >>> >>> > > >> > > > >>> >>>> It’s a tough call. It is probable that the problem >>> > > existed >>> > > >> > > already >>> > > >> > > > >>> and >>> > > >> > > > >>> >> the >>> > > >> > > > >>> >>>> change merely surfaced it. >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis < >>> > > >> > > > >>> zabe...@gmail.com> >>> > > >> > > > >>> >>>> wrote: >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>> It is not the only test that is failing after >>> commit >>> > [1] >>> > > >> but >>> > > >> > > all >>> > > >> > > > >>> the >>> > > >> > > > >>> >> new >>> > > >> > > > >>> >>>>> tests that were added. >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I >>> > > cannot >>> > > >> > > > reproduce >>> > > >> > > > >>> >> it >>> > > >> > > > >>> >>>>> locally. >>> > > >> > > > >>> >>>>> I guess we have to do with a race condition most >>> > likely >>> > > >> due >>> > > >> > to >>> > > >> > > > the >>> > > >> > > > >>> >>>>> concurrent execution of tests with surefire. >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>> Should we revert the commit till we find a >>> solution? >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>> [1] >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde < >>> > > >> > jh...@apache.org> >>> > > >> > > > >>> >> wrote: >>> > > >> > > > >>> >>>>> >>> > > >> > > > >>> >>>>>> I have seen this intermittent failure 3 times in >>> the >>> > > last >>> > > >> > > week: >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> [INFO] Running >>> org.apache.calcite.plan.RelOptUtilTest >>> > > >> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, >>> > Skipped: >>> > > >> 0, >>> > > >> > > Time >>> > > >> > > > >>> >> elapsed: >>> > > >> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in >>> > > >> > > org.apache.calcite.plan.RelOptUtilTest >>> > > >> > > > >>> >>>>>> [ERROR] >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) >>> > > >> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR! >>> > > >> > > > >>> >>>>>> >>> > org.apache.calcite.rel.metadata.CyclicMetadataException >>> > > >> > > > >>> >>>>>> at >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > >>> > >>> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. >>> The >>> > > test >>> > > >> was >>> > > >> > > > only >>> > > >> > > > >>> >> added >>> > > >> > > > >>> >>>> on >>> > > >> > > > >>> >>>>>> May 22 so I assume that it will continue to fail >>> > > >> > > intermittently >>> > > >> > > > >>> >> until >>> > > >> > > > >>> >>>> we do >>> > > >> > > > >>> >>>>>> something. >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> Anyone have any ideas? >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> Laurent, As you added the test can you please >>> look >>> > into >>> > > >> it? >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> Julian >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>>>> >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >>>> >>> > > >> > > > >>> >> >>> > > >> > > > >>> > >>> > > >> > > > >>> >>> > > >> > > > >> >>> > > >> > > > >>> > > >> > > >>> > > >> > >>> > > >> >>> > > > >>> > > >>> > >>> > >>> >>>