Good insights, Ruben, Danny!

Assuming it is indeed a concurrency problem, it will be difficult to
identify, since many parts of Calcite are not thread safe. Not being able to
reproduce the problem makes things even worse.

Given that intermittent test failures occur often on Jenkins, how about
creating a new branch dedicated to debugging?
It can be associated with a new Jenkins job (one that obviously doesn't send
mails to everybody when there are failures) that can be launched on demand
by the person who is looking into the problem.
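
On such a branch, one way to make an intermittent failure easier to hit
might be a small stress harness that runs a suspect test body many times
from several threads. Below is a minimal sketch in plain Java; it uses no
Calcite or JUnit classes, and the names StressRunner and countFailures are
hypothetical, not existing APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper for debugging only; not part of Calcite or JUnit. */
final class StressRunner {
  private StressRunner() {}

  /**
   * Submits {@code body} {@code runs} times to a pool of {@code threads}
   * threads and returns how many of those runs threw an exception or error.
   */
  static int countFailures(Runnable body, int runs, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < runs; i++) {
        futures.add(pool.submit(body));
      }
      int failures = 0;
      for (Future<?> future : futures) {
        try {
          future.get(); // waits for this run to finish
        } catch (ExecutionException e) {
          failures++; // the body threw (exception or AssertionError)
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting", e);
        }
      }
      return failures;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Placeholder body; a real run would call the suspect test method here.
    int failures = countFailures(() -> { }, 100, 8);
    System.out.println("failures: " + failures);
  }
}
```

In practice the body would be a lambda that creates a fresh test instance
and calls the suspect test method. A non-zero count only shows that the
failure can be triggered under concurrency; it does not by itself identify
the cause.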

On the other hand, we do not really make an effort to have tests that can
be executed concurrently, so another alternative would be to run tests only
sequentially.
I am not a big fan of this approach, since it would hide problems rather
than solve them.

On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <yuzhao....@gmail.com> wrote:

> Thanks Ruben for your good analysis.
>
> What confuses me is this: isn't the static REL_BUILDER more prone to
> concurrency problems? And the pushed scans (EMP_SCAN and DEPT_SCAN) are all
> nodes (immutable), so how could this be a problem?
>
> Best,
> Danny Chan
> On May 29, 2019 at 5:37 PM +0800, Ruben Q L <rube...@gmail.com> wrote:
> > I'm checking the commit [1] and I see something strange in RelOptUtilTest.
> > Maybe I'm wrong and it is nothing, but just in case it may help:
> >
> > With the latest modification, it seems that we have two RelBuilder(s) in
> > place:
> > - A static one that is created ad-hoc on a static block to generate the
> > EMP_SCAN and DEPT_SCAN RelNodes [2]
> > - An instance one to be used in the tests, that is initialized on
> > the @Before public void setUp() method [3]
> >
> > Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read their
> > rowTypes to test some join auxiliary methods. But the new
> > tests testPushDownJoinConditions* actually build a plan and push these
> > scans into the RelBuilder to be tested [4] (which is a different one than
> > the static RelBuilder that created the scans).
> > Maybe this is generally not a problem, but it could be one under certain
> > circumstances, which would explain the randomness of the issue.
> > Could this explain the exception?
> >
> > [1]
> >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> >
> > [2]
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
> > [3]
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
> > [4]
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
> >
> >
> >
> > On Wed, May 29, 2019 at 02:20, Julian Hyde <jh...@apache.org> wrote:
> >
> > > It’s a tough call. It is probable that the problem existed already and
> > > the change merely surfaced it.
> > >
> > > > On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <zabe...@gmail.com>
> > > wrote:
> > > >
> > > > It is not only this test that fails after commit [1]; all the new
> > > > tests that were added fail.
> > > >
> > > > I've seen the problem on Jenkins on all JDKs but I cannot reproduce it
> > > > locally.
> > > > I guess we are dealing with a race condition, most likely due to the
> > > > concurrent execution of tests with surefire.
> > > >
> > > > Should we revert the commit till we find a solution?
> > > >
> > > > [1]
> > > >
> > >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> > > >
> > > > On Tue, May 28, 2019 at 7:57 PM Julian Hyde <jh...@apache.org>
> wrote:
> > > >
> > > > > I have seen this intermittent failure 3 times in the last week:
> > > > >
> > > > > [INFO] Running org.apache.calcite.plan.RelOptUtilTest
> > > > > [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time
> elapsed:
> > > > > 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest
> > > > > [ERROR]
> > > > >
> > >
> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
> > > > > Time elapsed: 0.349 s <<< ERROR!
> > > > > org.apache.calcite.rel.metadata.CyclicMetadataException
> > > > > at
> > > > >
> > >
> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
> > > > >
> > > > > I have seen it on Oracle JDK 12 and OpenJDK 10. The test was only
> added
> > > on
> > > > > May 22 so I assume that it will continue to fail intermittently
> until
> > > we do
> > > > > something.
> > > > >
> > > > > Anyone have any ideas?
> > > > >
> > > > > Laurent, As you added the test can you please look into it?
> > > > >
> > > > > Julian
> > > > >
> > > > >
> > >
> > >
>
