Looks like most {{RelNode#create()}} access the RelOptCluster/RelBuilder
instance from their child, and some then, perform metadata operation, which
would cause CyclicMetadataException for example. I'll create a fixup patch.

On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <laur...@dremio.com> wrote:

> I actually did the change from using a static relbuilder to a mix of
> static (to create the scans) and per test because of concurrency issues :(
> Maybe the test should only been using a per test-case relbuilder instead,
> but it would mean that there's some code relying on the static relbuilder
> (by accessing it from the scan rel nodes) in a non-static context. I'll try
> to debug it too (but so far was unlucky reproducing the jenkins problem).
>
> On Wed, May 29, 2019 at 5:02 AM Francis Chuang <francischu...@apache.org>
> wrote:
>
>> In Go, there's a built in race detector that can be used when running
>> tests using `go test`.
>>
>> There's RacerD [1] from Facebook that can detect races in Java. Perhaps
>> this is something that can be looked in to to find the race.
>>
>> [1] https://fbinfer.com/docs/racerd.html
>>
>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
>> > Good insights, Ruben, Danny!
>> >
>> > Assuming that is indeed a concurrency problem, it will be difficult to
>> > identify since many parts in Calcite are not thread safe. Not being
>> able to
>> > reproduce the problem makes the things even worse.
>> >
>> > Given that intermittent test failures occur often on Jenkins, how about
>> > creating a new branch dedicated to debugging?
>> > It can be associated with a new Jenkins jobs (that obviously doesn't
>> send
>> > mails to everybody when there are failures) that can be launched
>> on-demand
>> > by the person who is looking into the problem.
>> >
>> > On the other hand, we do not really make an effort to have tests that
>> can
>> > be executed concurrently so another alternative would be to run tests
>> only
>> > sequentially.
>> > I am not very fan of this approach since it will rather hide problems
>> than
>> > solve them.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <yuzhao....@gmail.com>
>> wrote:
>> >
>> >> Thanks Ruben for your good analysis.
>> >>
>> >> What I’m confused is that isn’t the static REL_BUILDER more prone to
>> have
>> >> concurrency problems ? And the pushed scans(EMP_SCAN and DEPT_SCAN)
>> are all
>> >> nodes(immutable), how could this be a problem ?
>> >>
>> >> Best,
>> >> Danny Chan
>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com>,写道:
>> >>> I'm checking the commit [1] and I see something strange in
>> >> RelOptUtilTest.
>> >>> Maybe I'm wrong and it is nothing, but just in case it may help:
>> >>>
>> >>> With the latest modification, it seems that we have two RelBuilder(s)
>> in
>> >>> place:
>> >>> - A static one that is created ad-hoc on a static block to generate
>> the
>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
>> >>> - An instance one to be used in the tests, that is initialized on
>> >>> the @Before public void setUp() method [3]
>> >>>
>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read
>> their
>> >>> rowTypes to test some join auxiliary methods. But the new
>> >>> tests testPushDownJoinConditions* actually build a plan and push these
>> >>> scans into the RelBuilder to be tested [4] (which is a different one
>> than
>> >>> the static RelBuider that created the scans).
>> >>> Maybe this is no problem generally, but it can potentially be under
>> >> certain
>> >>> circumstances?, which would explain the randomness of the issue.
>> >>> Could this explain the exception?
>> >>>
>> >>> [1]
>> >>>
>> >>
>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
>> >>>
>> >>> [2]
>> >>>
>> >>
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
>> >>> [3]
>> >>>
>> >>
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
>> >>> [4]
>> >>>
>> >>
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
>> >>>
>> >>>
>> >>>
>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <jh...@apache.org> a écrit :
>> >>>
>> >>>> It’s a tough call. It is probable that the problem existed already
>> and
>> >> the
>> >>>> change merely surfaced it.
>> >>>>
>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <zabe...@gmail.com
>> >
>> >>>> wrote:
>> >>>>>
>> >>>>> It is not the only test that is failing after commit [1] but all the
>> >> new
>> >>>>> tests that were added.
>> >>>>>
>> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot reproduce
>> >> it
>> >>>>> locally.
>> >>>>> I guess we have to do with a race condition most likely due to the
>> >>>>> concurrent execution of tests with surefire.
>> >>>>>
>> >>>>> Should we revert the commit till we find a solution?
>> >>>>>
>> >>>>> [1]
>> >>>>>
>> >>>>
>> >>
>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
>> >>>>>
>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <jh...@apache.org>
>> >> wrote:
>> >>>>>
>> >>>>>> I have seen this intermittent failure 3 times in the last week:
>> >>>>>>
>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest
>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time
>> >> elapsed:
>> >>>>>> 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest
>> >>>>>> [ERROR]
>> >>>>>>
>> >>>>
>> >>
>> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
>> >>>>>> Time elapsed: 0.349 s <<< ERROR!
>> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException
>> >>>>>> at
>> >>>>>>
>> >>>>
>> >>
>> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
>> >>>>>>
>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was only
>> >> added
>> >>>> on
>> >>>>>> May 22 so I assume that it will continue to fail intermittently
>> >> until
>> >>>> we do
>> >>>>>> something.
>> >>>>>>
>> >>>>>> Anyone have any ideas?
>> >>>>>>
>> >>>>>> Laurent, As you added the test can you please look into it?
>> >>>>>>
>> >>>>>> Julian
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>
>> >
>>
>

Reply via email to