Thanks for noticing Laurent!

The last builds were successful so I guess we have to do with another
intermittent failure.

If the problem persists, I will create a JIRA and follow up there 😉

On Thu, May 30, 2019, 6:43 AM Laurent Goujon <laur...@dremio.com> wrote:

> Looks like Jenkins started to timeout with the following change:
>
> https://github.com/apache/calcite/commit/73e6d05fa65f16485caca80571d1fe4fda5c7468
> .
>
> Ruben, Stamatis: any idea?
>
> On Wed, May 29, 2019 at 8:22 PM Laurent Goujon <laur...@dremio.com> wrote:
>
> > It looks like Jenkins build with JDK8 now takes more than 40min and is
> > being aborted, with the core module taking something like 38min and being
> > stopped during tests. It seems to be independent of my test change since
> > RelOptUtilTest shows to be running successfully in 1second.
> > That said, I don't see a test still be running from the logs, so I guess
> > it would be something keeping the jvm from stopping even if all tests are
> > completed?
> >
> > On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis <zabe...@gmail.com>
> > wrote:
> >
> >> Makes sense, I was expecting something like that.
> >>
> >> +1 for the PR. Even if we are not 100% that will resolve the problem,
> >> getting rid of the statics seems like a good idea and beneficial in any
> >> case.
> >>
> >> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon <laur...@dremio.com>
> >> wrote:
> >>
> >> > Unfortunately, not locally, so I ended in crude debugging and code
> >> analysis
> >> > to understand where the static reloptcluster created for the static
> >> > relbuilder was used in a context which would cause issue in a
> >> multithreaded
> >> > context. Once I saw several rel nodes getting the parent cluster (the
> >> > static table scan) to access the metadata query (which cannot be
> reused
> >> > between threads), that's where I assumed this would be the problem.
> >> >
> >> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis <
> zabe...@gmail.com>
> >> > wrote:
> >> >
> >> > > Thanks a lot Laurent, much appreciated!
> >> > >
> >> > > BTW, were you able to reproduce the problem?
> >> > >
> >> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon <laur...@dremio.com
> >
> >> > > wrote:
> >> > >
> >> > > > Here's the pull request:
> >> https://github.com/apache/calcite/pull/1240
> >> > > >
> >> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon <
> laur...@dremio.com>
> >> > > wrote:
> >> > > >
> >> > > > > Looks like most {{RelNode#create()}} access the
> >> > > RelOptCluster/RelBuilder
> >> > > > > instance from their child, and some then, perform metadata
> >> operation,
> >> > > > which
> >> > > > > would cause CyclicMetadataException for example. I'll create a
> >> fixup
> >> > > > patch.
> >> > > > >
> >> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <
> >> laur...@dremio.com>
> >> > > > wrote:
> >> > > > >
> >> > > > >> I actually did the change from using a static relbuilder to a
> >> mix of
> >> > > > >> static (to create the scans) and per test because of
> concurrency
> >> > > issues
> >> > > > :(
> >> > > > >> Maybe the test should only been using a per test-case
> relbuilder
> >> > > > instead,
> >> > > > >> but it would mean that there's some code relying on the static
> >> > > > relbuilder
> >> > > > >> (by accessing it from the scan rel nodes) in a non-static
> >> context.
> >> > > I'll
> >> > > > try
> >> > > > >> to debug it too (but so far was unlucky reproducing the jenkins
> >> > > > problem).
> >> > > > >>
> >> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang <
> >> > > > francischu...@apache.org>
> >> > > > >> wrote:
> >> > > > >>
> >> > > > >>> In Go, there's a built in race detector that can be used when
> >> > running
> >> > > > >>> tests using `go test`.
> >> > > > >>>
> >> > > > >>> There's RacerD [1] from Facebook that can detect races in
> Java.
> >> > > Perhaps
> >> > > > >>> this is something that can be looked in to to find the race.
> >> > > > >>>
> >> > > > >>> [1] https://fbinfer.com/docs/racerd.html
> >> > > > >>>
> >> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
> >> > > > >>> > Good insights, Ruben, Danny!
> >> > > > >>> >
> >> > > > >>> > Assuming that is indeed a concurrency problem, it will be
> >> > difficult
> >> > > > to
> >> > > > >>> > identify since many parts in Calcite are not thread safe.
> Not
> >> > being
> >> > > > >>> able to
> >> > > > >>> > reproduce the problem makes the things even worse.
> >> > > > >>> >
> >> > > > >>> > Given that intermittent test failures occur often on
> Jenkins,
> >> how
> >> > > > about
> >> > > > >>> > creating a new branch dedicated to debugging?
> >> > > > >>> > It can be associated with a new Jenkins jobs (that obviously
> >> > > doesn't
> >> > > > >>> send
> >> > > > >>> > mails to everybody when there are failures) that can be
> >> launched
> >> > > > >>> on-demand
> >> > > > >>> > by the person who is looking into the problem.
> >> > > > >>> >
> >> > > > >>> > On the other hand, we do not really make an effort to have
> >> tests
> >> > > that
> >> > > > >>> can
> >> > > > >>> > be executed concurrently so another alternative would be to
> >> run
> >> > > tests
> >> > > > >>> only
> >> > > > >>> > sequentially.
> >> > > > >>> > I am not very fan of this approach since it will rather hide
> >> > > problems
> >> > > > >>> than
> >> > > > >>> > solve them.
> >> > > > >>> >
> >> > > > >>> >
> >> > > > >>> >
> >> > > > >>> >
> >> > > > >>> >
> >> > > > >>> >
> >> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <
> >> > yuzhao....@gmail.com>
> >> > > > >>> wrote:
> >> > > > >>> >
> >> > > > >>> >> Thanks Ruben for your good analysis.
> >> > > > >>> >>
> >> > > > >>> >> What I’m confused is that isn’t the static REL_BUILDER more
> >> > prone
> >> > > to
> >> > > > >>> have
> >> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN and
> >> > > DEPT_SCAN)
> >> > > > >>> are all
> >> > > > >>> >> nodes(immutable), how could this be a problem ?
> >> > > > >>> >>
> >> > > > >>> >> Best,
> >> > > > >>> >> Danny Chan
> >> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com
> >,写道:
> >> > > > >>> >>> I'm checking the commit [1] and I see something strange in
> >> > > > >>> >> RelOptUtilTest.
> >> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case it may
> >> > help:
> >> > > > >>> >>>
> >> > > > >>> >>> With the latest modification, it seems that we have two
> >> > > > >>> RelBuilder(s) in
> >> > > > >>> >>> place:
> >> > > > >>> >>> - A static one that is created ad-hoc on a static block to
> >> > > generate
> >> > > > >>> the
> >> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
> >> > > > >>> >>> - An instance one to be used in the tests, that is
> >> initialized
> >> > on
> >> > > > >>> >>> the @Before public void setUp() method [3]
> >> > > > >>> >>>
> >> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only
> used
> >> to
> >> > > read
> >> > > > >>> their
> >> > > > >>> >>> rowTypes to test some join auxiliary methods. But the new
> >> > > > >>> >>> tests testPushDownJoinConditions* actually build a plan
> and
> >> > push
> >> > > > >>> these
> >> > > > >>> >>> scans into the RelBuilder to be tested [4] (which is a
> >> > different
> >> > > > one
> >> > > > >>> than
> >> > > > >>> >>> the static RelBuider that created the scans).
> >> > > > >>> >>> Maybe this is no problem generally, but it can potentially
> >> be
> >> > > under
> >> > > > >>> >> certain
> >> > > > >>> >>> circumstances?, which would explain the randomness of the
> >> > issue.
> >> > > > >>> >>> Could this explain the exception?
> >> > > > >>> >>>
> >> > > > >>> >>> [1]
> >> > > > >>> >>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> >> > > > >>> >>>
> >> > > > >>> >>> [2]
> >> > > > >>> >>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
> >> > > > >>> >>> [3]
> >> > > > >>> >>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
> >> > > > >>> >>> [4]
> >> > > > >>> >>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
> >> > > > >>> >>>
> >> > > > >>> >>>
> >> > > > >>> >>>
> >> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <
> jh...@apache.org>
> >> a
> >> > > > écrit
> >> > > > >>> :
> >> > > > >>> >>>
> >> > > > >>> >>>> It’s a tough call. It is probable that the problem
> existed
> >> > > already
> >> > > > >>> and
> >> > > > >>> >> the
> >> > > > >>> >>>> change merely surfaced it.
> >> > > > >>> >>>>
> >> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <
> >> > > > >>> zabe...@gmail.com>
> >> > > > >>> >>>> wrote:
> >> > > > >>> >>>>>
> >> > > > >>> >>>>> It is not the only test that is failing after commit [1]
> >> but
> >> > > all
> >> > > > >>> the
> >> > > > >>> >> new
> >> > > > >>> >>>>> tests that were added.
> >> > > > >>> >>>>>
> >> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I
> cannot
> >> > > > reproduce
> >> > > > >>> >> it
> >> > > > >>> >>>>> locally.
> >> > > > >>> >>>>> I guess we have to do with a race condition most likely
> >> due
> >> > to
> >> > > > the
> >> > > > >>> >>>>> concurrent execution of tests with surefire.
> >> > > > >>> >>>>>
> >> > > > >>> >>>>> Should we revert the commit till we find a solution?
> >> > > > >>> >>>>>
> >> > > > >>> >>>>> [1]
> >> > > > >>> >>>>>
> >> > > > >>> >>>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> >> > > > >>> >>>>>
> >> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <
> >> > jh...@apache.org>
> >> > > > >>> >> wrote:
> >> > > > >>> >>>>>
> >> > > > >>> >>>>>> I have seen this intermittent failure 3 times in the
> last
> >> > > week:
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest
> >> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped:
> >> 0,
> >> > > Time
> >> > > > >>> >> elapsed:
> >> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in
> >> > > org.apache.calcite.plan.RelOptUtilTest
> >> > > > >>> >>>>>> [ERROR]
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
> >> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR!
> >> > > > >>> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException
> >> > > > >>> >>>>>> at
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>
> >> > > > >>> >>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >>
> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The
> test
> >> was
> >> > > > only
> >> > > > >>> >> added
> >> > > > >>> >>>> on
> >> > > > >>> >>>>>> May 22 so I assume that it will continue to fail
> >> > > intermittently
> >> > > > >>> >> until
> >> > > > >>> >>>> we do
> >> > > > >>> >>>>>> something.
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>> Anyone have any ideas?
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>> Laurent, As you added the test can you please look into
> >> it?
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>> Julian
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>>>
> >> > > > >>> >>>>
> >> > > > >>> >>>>
> >> > > > >>> >>
> >> > > > >>> >
> >> > > > >>>
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> >
>

Reply via email to