Hi Chiwan!

I think the ExecutionEnvironment is not shared, because what the
TestEnvironment sets is a context environment factory. Every time you call
"ExecutionEnvironment.getExecutionEnvironment()", you get a new environment.

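Roughly, in code (just a sketch of the mechanism, not the actual test code):

    import org.apache.flink.api.scala.ExecutionEnvironment

    // The test base installs a context environment factory (via the
    // TestEnvironment), so each call below hands back a fresh environment
    // that merely points at the same test mini cluster:
    val env1 = ExecutionEnvironment.getExecutionEnvironment
    val env2 = ExecutionEnvironment.getExecutionEnvironment
    // env1 and env2 are separate environment objects; only the cluster is shared.
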
Stephan


On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:

> I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR
> for it.
>
> From my investigation [2], the cluster for ML tests has only one taskmanager
> with 4 slots. Is 2048 insufficient for the total number of network buffers? I
> still think the problem is sharing the ExecutionEnvironment between test cases.
>
> [1]: https://issues.apache.org/jira/browse/FLINK-3994
> [2]:
> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>
> Regards,
> Chiwan Park
>
> > On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
> >
> > Thanks Stephan for the synopsis of the last weeks' test instability
> > madness. It's sad to see the shortcomings of Maven test plugins but
> > another lesson learned is that our testing infrastructure should get a
> > bit more attention. We have reached a point several times where our
> > tests were inherently unstable. Now we saw that even more problems
> > were hidden in the dark. I would like to see more maintenance
> > dedicated to testing.
> >
> > @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
> > request with a systematic fix. Those things are too crucial to be
> > fixed on the go. The problem is that Travis reports the number of
> > processors to be "32" (which is used for the number of task slots in
> > local execution). The network buffers are not adjusted accordingly. We
> > should set them correctly in the MiniCluster. Also, we could define an
> > upper limit to the number of task slots for tests.
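> >
> > For illustration, roughly this kind of change in the test cluster setup (a
> > sketch only, using the raw 1.x config keys; the concrete values are up for
> > discussion):
> >
> >   import org.apache.flink.configuration.Configuration
> >
> >   // Cap the slots instead of blindly using the reported core count (32 on Travis).
> >   val slots = math.min(Runtime.getRuntime.availableProcessors(), 4)
> >
> >   val config = new Configuration()
> >   config.setInteger("taskmanager.numberOfTaskSlots", slots)
> >   // Rule of thumb from the docs: #slots^2 * #TMs * 4 buffers. With 32 slots
> >   // that would be 32 * 32 * 4 = 4096, which exceeds the default of 2048.
> >   config.setInteger("taskmanager.network.numberOfBuffers",
> >     math.max(2048, slots * slots * 4)) // never go below the default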
> >
> > On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org>
> wrote:
> >> I think that the tests fail because of sharing the ExecutionEnvironment
> between test cases. I’m not sure why it is a problem, but it is the only
> difference from the other ML tests.
> >>
> >> I created a hotfix and pushed it to my repository. Once it seems fixed
> [1], I’ll merge the hotfix into the master branch.
> >>
> >> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
> >>
> >> Regards,
> >> Chiwan Park
> >>
> >>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org>
> wrote:
> >>>
> >>> It is probably about the KNN test case which was merged yesterday.
> I’ll look into the ML test.
> >>>
> >>> Regards,
> >>> Chiwan Park
> >>>
> >>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
> >>>>
> >>>> Currently, an ML test is reliably failing and occasionally some HA
> >>>> tests. Is someone looking into the ML test?
> >>>>
> >>>> For HA, I will revert a commit which might be causing the HA
> >>>> instabilities. Till is working on a proper fix, as far as I know.
> >>>>
> >>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org>
> wrote:
> >>>>> Thanks for the great work! :-)
> >>>>>
> >>>>> Regards,
> >>>>> Chiwan Park
> >>>>>
> >>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <
> pomperma...@okkam.it> wrote:
> >>>>>>
> >>>>>> Awesome work guys!
> >>>>>> And even more thanks for the detailed report... This troubleshooting
> summary
> >>>>>> will undoubtedly be useful for all our Maven projects!
> >>>>>>
> >>>>>> Best,
> >>>>>> Flavio
> >>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
> >>>>>>
> >>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
> light again.
> >>>>>>>
> >>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org>
> wrote:
> >>>>>>>> Hi all!
> >>>>>>>>
> >>>>>>>> After a few weeks of terrible build issues, I am happy to
> announce that
> >>>>>>> the
> >>>>>>>> build works properly again, and we actually get meaningful CI
> results.
> >>>>>>>>
> >>>>>>>> Here is a story in many acts, from builds deep red to bright
> green joy.
> >>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening,
> Max and
> >>>>>>>> I debugged the final issue and got the build back on track.
> >>>>>>>>
> >>>>>>>> ------------------
> >>>>>>>> The Journey
> >>>>>>>> ------------------
> >>>>>>>>
> >>>>>>>> (1) Failsafe Plugin
> >>>>>>>>
> >>>>>>>> The Maven Failsafe Plugin had a critical bug due to which
> failed
> >>>>>>>> tests did not result in a failed build.
> >>>>>>>>
> >>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
> tests and
> >>>>>>>> fail the build if a test fails.
> >>>>>>>>
> >>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (2) Failsafe Plugin Dependency Issues
> >>>>>>>>
> >>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and
> did not
> >>>>>>>> interoperate with Dependency Shading any more.
> >>>>>>>>
> >>>>>>>> Because of that, we switched to the Surefire Plugin.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (3) Fixing all the issues introduced in the meantime
> >>>>>>>>
> >>>>>>>> Naturally, a number of test instabilities had been introduced,
> which
> >>>>>>> needed
> >>>>>>>> to be fixed.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (4) Yarn Tests and Test Scope Refactoring
> >>>>>>>>
> >>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn
> Tests to
> >>>>>>> the
> >>>>>>>> test scope.
> >>>>>>>> Because the configuration searched for tests in the "main" scope,
> no Yarn
> >>>>>>>> tests were executed for a while, until the scope was fixed.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (5) Yarn Tests and JMX Metrics
> >>>>>>>>
> >>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
> warnings
> >>>>>>>> created by the newly introduced metrics code. We could fix that by
> >>>>>>> updating
> >>>>>>>> the metrics code and temporarily not registering JMX beans for all
> >>>>>>> metrics.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> (6) Yarn / Surefire Deadlock
> >>>>>>>>
> >>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
> the
> >>>>>>> IDE).
> >>>>>>>> It turned out that those tests exercise a command line interface that
> interacts
> >>>>>>> with
> >>>>>>>> the standard input stream.
> >>>>>>>>
> >>>>>>>> The newly deployed Surefire Plugin uses standard input as well,
> for
> >>>>>>>> communication with forked JVMs. Since Surefire internally locks
> the
> >>>>>>>> standard input stream, the Yarn CLI cannot poll the standard
> input stream
> >>>>>>>> without locking up and stalling the tests.
> >>>>>>>>
> >>>>>>>> We adjusted the tests and now the build happily builds again.
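> >>>>>>>>
> >>>>>>>> (For illustration only, not the actual patch: the kind of adjustment
> >>>>>>>> is to hand the CLI its own input stream instead of letting it poll
> >>>>>>>> the System.in that Surefire has locked. The class and method names
> >>>>>>>> below are made up.)
> >>>>>>>>
> >>>>>>>>   import java.io.{ByteArrayInputStream, InputStream}
> >>>>>>>>
> >>>>>>>>   // Hypothetical CLI wrapper that reads from an injected stream.
> >>>>>>>>   class InteractiveCli(in: InputStream) {
> >>>>>>>>     def run(): Unit = {
> >>>>>>>>       val lines = scala.io.Source.fromInputStream(in).getLines()
> >>>>>>>>       val first = if (lines.hasNext) lines.next() else ""
> >>>>>>>>       println(s"got command: $first")
> >>>>>>>>     }
> >>>>>>>>   }
> >>>>>>>>
> >>>>>>>>   // The test feeds canned input rather than touching System.in:
> >>>>>>>>   new InteractiveCli(new ByteArrayInputStream("stop\n".getBytes("UTF-8"))).run()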
> >>>>>>>>
> >>>>>>>> -----------------
> >>>>>>>> Conclusions:
> >>>>>>>> -----------------
> >>>>>>>>
> >>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
> having a
> >>>>>>>> period of unreliable CI.
> >>>>>>>>
> >>>>>>>> - Maven could do a better job. A bug as crucial as the one that
> started
> >>>>>>>> our problem should not occur in a test plugin like Surefire.
> Also, the
> >>>>>>>> constant change of semantics and dependency scopes is annoying.
> The
> >>>>>>>> semantic changes are subtle, but for a build as complex as Flink,
> they
> >>>>>>> make
> >>>>>>>> a difference.
> >>>>>>>>
> >>>>>>>> - File-based communication is rarely a good idea. The bug in the
> >>>>>>> Failsafe
> >>>>>>>> plugin was caused by improper file-based communication, and so were
> some of
> >>>>>>>> the instabilities we discovered.
> >>>>>>>>
> >>>>>>>> Greetings,
> >>>>>>>> Stephan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
> allow our
> >>>>>>>> metrics subsystem to register JMX beans, we see some tests
> failing due to
> >>>>>>>> spontaneous JVM process kills. Whoever has a pointer there,
> please ping
> >>>>>>> us!
> >>>>>>>
> >>>>>
> >>>
> >>
>
>
