I thought this had been fixed by Chiwan in the meantime. Could you
post a build log?

On Thu, Jun 2, 2016 at 1:14 PM, Ufuk Celebi <u...@apache.org> wrote:
> With the recent fixes, the builds are more stable, but I still see
> many of them failing because of the Scala shell tests, which lead to
> JVM crashes. I've researched this a little bit, but didn't find an
> obvious solution to the problem.
>
> Does it make sense to disable the tests until someone has time to look into 
> it?
>
> – Ufuk
>
> On Tue, May 31, 2016 at 1:46 PM, Stephan Ewen <se...@apache.org> wrote:
>> You are right, Chiwan.
>>
>> I think that the pattern you use should be supported, though. It would be
>> good to check whether the job executes more often than necessary at the
>> point of the "collect()" calls.
>> That would explain the network buffer issue then...
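>>
>> A minimal sketch of what I mean (hypothetical example, not taken from our
>> code base): in the batch API each collect() is an eager action, so a chained
>> pipeline can silently run the same operators several times.
>>
>>   import org.apache.flink.api.scala._
>>
>>   val env = ExecutionEnvironment.getExecutionEnvironment
>>   val data = env.fromElements(1, 2, 3).map(_ * 2)
>>   val all = data.collect()                 // triggers job #1
>>   val big = data.filter(_ > 2).collect()   // triggers job #2, re-running the map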
>>
>> On Tue, May 31, 2016 at 12:18 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>
>>> Hi Stephan,
>>>
>>> Yes, right. But KNNITSuite calls
>>> ExecutionEnvironment.getExecutionEnvironment only once [1]. I’m testing a
>>> change that moves the getExecutionEnvironment call into each test case;
>>> see the sketch below.
>>>
>>> [1]:
>>> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
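>>>
>>> Roughly like this (just a sketch of the change I’m testing, with the
>>> actual test body elided):
>>>
>>>   import org.apache.flink.api.scala._
>>>   import org.apache.flink.test.util.FlinkTestBase
>>>   import org.scalatest.{FlatSpec, Matchers}
>>>
>>>   class KNNITSuite extends FlatSpec with Matchers with FlinkTestBase {
>>>
>>>     behavior of "the kNN join"
>>>
>>>     it should "run with a fresh environment per test case" in {
>>>       // per-test call instead of a shared suite-level val
>>>       val env = ExecutionEnvironment.getExecutionEnvironment
>>>       env.setParallelism(1)
>>>       // ... build the training/test DataSets and run the assertions here ...
>>>     }
>>>   }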
>>>
>>> Regards,
>>> Chiwan Park
>>>
>>> > On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>>> >
>>> > Hi Chiwan!
>>> >
>>> > I think the Execution environment is not shared, because what the
>>> > TestEnvironment sets is a Context Environment Factory. Every time you call
>>> > "ExecutionEnvironment.getExecutionEnvironment()", you get a new
>>> > environment.
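>>> >
>>> > Roughly this mechanism, in simplified form (not the real class names,
>>> > just to illustrate why the environments are not shared):
>>> >
>>> >   import org.apache.flink.api.scala.ExecutionEnvironment
>>> >
>>> >   // the test base installs a factory; every lookup asks it for a NEW env
>>> >   object EnvLookup {
>>> >     @volatile private var factory: Option[() => ExecutionEnvironment] = None
>>> >     def setAsContext(f: () => ExecutionEnvironment): Unit = factory = Some(f)
>>> >     def getExecutionEnvironment: ExecutionEnvironment =
>>> >       factory.map(f => f()).getOrElse(ExecutionEnvironment.createLocalEnvironment())
>>> >   }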
>>> >
>>> > Stephan
>>> >
>>> >
>>> > On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> >
>>> >> I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR
>>> >> for it.
>>> >>
>>> >> From my investigation [2], the cluster for the ML tests has only one
>>> >> taskmanager with 4 slots. Is 2048 an insufficient total number of network
>>> >> buffers? I still think the problem is sharing the ExecutionEnvironment
>>> >> between test cases.
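>>> >>
>>> >> As a back-of-the-envelope check (using the sizing rule of thumb from the
>>> >> configuration docs, slots-per-TM^2 * #TMs * 4, so treat the numbers as an
>>> >> estimate only):
>>> >>
>>> >>   val slotsPerTm   = 4   // what FlinkTestBase configures
>>> >>   val taskManagers = 1
>>> >>   val recommended  = slotsPerTm * slotsPerTm * taskManagers * 4  // = 64
>>> >>   // 2048 configured buffers >> 64, so 4 slots alone should be fine;
>>> >>   // they only run out if the effective slot count is much higher.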
>>> >>
>>> >> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>>> >> [2]:
>>> >> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>>> >>
>>> >> Regards,
>>> >> Chiwan Park
>>> >>
>>> >>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>>> >>>
>>> >>> Thanks Stephan for the synopsis of the last weeks' test instability
>>> >>> madness. It's sad to see the shortcomings of the Maven test plugins, but
>>> >>> another lesson learned is that our testing infrastructure should get a
>>> >>> bit more attention. Several times we have reached a point where our
>>> >>> tests were inherently unstable, and now we saw that even more problems
>>> >>> were hidden in the dark. I would like to see more maintenance
>>> >>> dedicated to testing.
>>> >>>
>>> >>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>>> >>> request with a systematic fix. Those things are too crucial to be
>>> >>> fixed on the go. The problem is that Travis reports the number of
>>> >>> processors as "32" (which is used as the number of task slots in
>>> >>> local execution), but the network buffers are not adjusted accordingly.
>>> >>> We should set them correctly in the MiniCluster. Also, we could define
>>> >>> an upper limit on the number of task slots for tests.
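>>> >>>
>>> >>> Something along these lines, as a sketch only (the key names should
>>> >>> match the current config keys, please double-check, and the concrete
>>> >>> numbers are made up):
>>> >>>
>>> >>>   import org.apache.flink.configuration.Configuration
>>> >>>
>>> >>>   val slots = math.min(Runtime.getRuntime.availableProcessors(), 4)
>>> >>>   val config = new Configuration()
>>> >>>   config.setInteger("taskmanager.numberOfTaskSlots", slots)
>>> >>>   // size the buffer pool to the capped slot count instead of a fixed 2048
>>> >>>   config.setInteger("taskmanager.network.numberOfBuffers", slots * slots * 4 * 8)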
>>> >>>
>>> >>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> >>>> I think that the tests fail because of sharing the ExecutionEnvironment
>>> >>>> between test cases. I’m not sure why it is a problem, but it is the only
>>> >>>> difference from the other ML tests.
>>> >>>>
>>> >>>> I created a hotfix and pushed it to my repository. When the issue seems
>>> >>>> fixed [1], I’ll merge the hotfix into the master branch.
>>> >>>>
>>> >>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>>> >>>>
>>> >>>> Regards,
>>> >>>> Chiwan Park
>>> >>>>
>>> >>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> >>>>>
>>> >>>>> It is probably about the KNN test case which was merged yesterday.
>>> >>>>> I’ll look into the ML test.
>>> >>>>>
>>> >>>>> Regards,
>>> >>>>> Chiwan Park
>>> >>>>>
>>> >>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>> >>>>>>
>>> >>>>>> Currently, an ML test is reliably failing and occasionally some HA
>>> >>>>>> tests. Is someone looking into the ML test?
>>> >>>>>>
>>> >>>>>> For HA, I will revert a commit, which might cause the HA
>>> >>>>>> instabilities. Till is working on a proper fix as far as I know.
>>> >>>>>>
>>> >>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> >>>>>>> Thanks for the great work! :-)
>>> >>>>>>>
>>> >>>>>>> Regards,
>>> >>>>>>> Chiwan Park
>>> >>>>>>>
>>> >>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Awesome work guys!
>>> >>>>>>>> And even more thanks for the detailed report... This troubleshooting
>>> >>>>>>>> summary will undoubtedly be useful for all our Maven projects!
>>> >>>>>>>>
>>> >>>>>>>> Best,
>>> >>>>>>>> Flavio
>>> >>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
>>> >>>>>>>>> light again.
>>> >>>>>>>>>
>>> >>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>> >>>>>>>>>> Hi all!
>>> >>>>>>>>>>
>>> >>>>>>>>>> After a few weeks of terrible build issues, I am happy to announce
>>> >>>>>>>>>> that the build works properly again, and we actually get meaningful
>>> >>>>>>>>>> CI results.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Here is a story in many acts, from deep-red builds to bright green
>>> >>>>>>>>>> joy. Kudos to Max, who did most of this troubleshooting. This
>>> >>>>>>>>>> evening, Max and I debugged the final issue and got the build back
>>> >>>>>>>>>> on track.
>>> >>>>>>>>>>
>>> >>>>>>>>>> ------------------
>>> >>>>>>>>>> The Journey
>>> >>>>>>>>>> ------------------
>>> >>>>>>>>>>
>>> >>>>>>>>>> (1) Failsafe Plugin
>>> >>>>>>>>>>
>>> >>>>>>>>>> The Maven Failsafe Plugin had a critical bug due to which failed
>>> >>>>>>>>>> tests did not result in a failed build.
>>> >>>>>>>>>>
>>> >>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>>> >>>>>>>>>> tests and fail the build if a test fails.
>>> >>>>>>>>>>
>>> >>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>> >>>>>>>>>>
>>> >>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>> >>>>>>>>>>
>>> >>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did
>>> >>>>>>>>>> not interoperate with Dependency Shading any more.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>> >>>>>>>>>>
>>> >>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>>> >>>>>>>>>> which needed to be fixed.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>> >>>>>>>>>>
>>> >>>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn
>>> >>>>>>>>>> Tests to the test scope.
>>> >>>>>>>>>> Because the configuration searched for tests in the "main" scope,
>>> >>>>>>>>>> no Yarn tests were executed for a while, until the scope was fixed.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> (5) Yarn Tests and JMX Metrics
>>> >>>>>>>>>>
>>> >>>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
>>> >>>>>>>>>> warnings created by the newly introduced metrics code. We could fix
>>> >>>>>>>>>> that by updating the metrics code and temporarily not registering
>>> >>>>>>>>>> JMX beans for all metrics.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> (6) Yarn / Surefire Deadlock
>>> >>>>>>>>>>
>>> >>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>>> >>>>>>>>>> the IDE). It turned out that those tests exercise a command line
>>> >>>>>>>>>> interface that interacts with the standard input stream.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>> >>>>>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>> >>>>>>>>>> standard input stream, the Yarn CLI cannot poll standard input
>>> >>>>>>>>>> without locking up and stalling the tests.
>>> >>>>>>>>>>
>>> >>>>>>>>>> We adjusted the tests and now the build happily builds again.
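>>> >>>>>>>>>>
>>> >>>>>>>>>> For the curious, the adjustment is roughly of this shape (a sketch
>>> >>>>>>>>>> only, not the exact code we merged): the tests feed the CLI from an
>>> >>>>>>>>>> in-memory stream instead of letting it poll the real System.in.
>>> >>>>>>>>>>
>>> >>>>>>>>>>   import java.io.ByteArrayInputStream
>>> >>>>>>>>>>   import java.nio.charset.StandardCharsets
>>> >>>>>>>>>>
>>> >>>>>>>>>>   val fakeIn = new ByteArrayInputStream("stop\n".getBytes(StandardCharsets.UTF_8))
>>> >>>>>>>>>>   val originalIn = System.in
>>> >>>>>>>>>>   try {
>>> >>>>>>>>>>     System.setIn(fakeIn)     // keep the CLI away from Surefire's stdin
>>> >>>>>>>>>>     // ... run the Yarn CLI code under test here ...
>>> >>>>>>>>>>   } finally {
>>> >>>>>>>>>>     System.setIn(originalIn) // always restore the real stream
>>> >>>>>>>>>>   }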
>>> >>>>>>>>>>
>>> >>>>>>>>>> -----------------
>>> >>>>>>>>>> Conclusions:
>>> >>>>>>>>>> -----------------
>>> >>>>>>>>>>
>>> >>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout
>>> >>>>>>>>>> of having a period of unreliable CI.
>>> >>>>>>>>>>
>>> >>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>>> >>>>>>>>>> started our problems should not occur in a test plugin like
>>> >>>>>>>>>> Surefire. Also, the constant change of semantics and dependency
>>> >>>>>>>>>> scopes is annoying. The semantic changes are subtle, but for a
>>> >>>>>>>>>> build as complex as Flink's, they make a difference.
>>> >>>>>>>>>>
>>> >>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>> >>>>>>>>>> Failsafe Plugin was caused by improper file-based communication,
>>> >>>>>>>>>> as were some of the instabilities we discovered.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Greetings,
>>> >>>>>>>>>> Stephan
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> PS: Some issues and mysteries remain for us to solve: when we allow
>>> >>>>>>>>>> our metrics subsystem to register JMX beans, we see some tests
>>> >>>>>>>>>> failing due to spontaneous JVM process kills. Whoever has a pointer
>>> >>>>>>>>>> there, please ping us!
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>
>>> >>
>>> >>
>>>
>>>
