Thanks, Stephan, for the synopsis of the last weeks' test instability
madness. It's sad to see the shortcomings of the Maven test plugins, but
another lesson learned is that our testing infrastructure should get a
bit more attention. We have reached a point several times where our
tests were inherently unstable. Now we saw that even more problems
were hidden in the dark. I would like to see more maintenance
dedicated to testing.

@Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
request with a systematic fix. Those things are too crucial to be
fixed on the go. The problem is that Travis reports the number of
processors as "32" (which is used as the number of task slots in
local execution). The network buffers are not adjusted accordingly. We
should set them correctly in the MiniCluster. Also, we could define an
upper limit on the number of task slots for tests.
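
As a rough illustration of the sizing problem (a sketch only, not Flink
code): the Flink configuration documentation gives the rule of thumb
slots-per-TaskManager^2 * number-of-TaskManagers * 4 for the number of
network buffers. Assuming that rule also applies to the MiniCluster, and
using a hypothetical cap of 4 slots (the class, the method names, and
the cap are made up for this example and are not existing Flink APIs),
the arithmetic looks like this:

```java
// Sketch only: hypothetical helper, not part of Flink.
final class TestSlotSizing {

    // Assumed upper limit for task slots in tests (illustrative value).
    static final int MAX_TEST_SLOTS = 4;

    // Caps the processor count reported by the machine, never below 1.
    static int cappedSlots(int reportedProcessors, int maxSlots) {
        return Math.max(1, Math.min(reportedProcessors, maxSlots));
    }

    // Rule of thumb from the Flink configuration docs:
    // slots-per-TM^2 * number-of-TMs * 4.
    static long requiredNetworkBuffers(int slotsPerTaskManager, int taskManagers) {
        return (long) slotsPerTaskManager * slotsPerTaskManager * taskManagers * 4L;
    }

    public static void main(String[] args) {
        // On Travis this value was reported as 32.
        int reported = Runtime.getRuntime().availableProcessors();
        int slots = cappedSlots(reported, MAX_TEST_SLOTS);
        System.out.println("reported processors:   " + reported);
        System.out.println("uncapped buffer need:  " + requiredNetworkBuffers(reported, 1));
        System.out.println("capped slots:          " + slots);
        System.out.println("capped buffer need:    " + requiredNetworkBuffers(slots, 1));
    }
}
```

With 32 slots on a single TaskManager the rule asks for 4096 buffers;
capped at 4 slots it asks for 64.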

On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> I think that the tests fail because the ExecutionEnvironment is shared
> between test cases. I’m not sure why that is a problem, but it is the only
> difference from the other ML tests.
>
> I created a hotfix and pushed it to my repository. Once the build looks
> fixed [1], I’ll merge the hotfix into the master branch.
>
> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>
> Regards,
> Chiwan Park
>
>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>
>> Maybe it is related to the KNN test case that was merged yesterday. I’ll
>> look into the ML test.
>>
>> Regards,
>> Chiwan Park
>>
>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>
>>> Currently, an ML test is reliably failing and occasionally some HA
>>> tests. Is someone looking into the ML test?
>>>
>>> For HA, I will revert a commit that might be causing the HA
>>> instabilities. Till is working on a proper fix as far as I know.
>>>
>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>> Thanks for the great work! :-)
>>>>
>>>> Regards,
>>>> Chiwan Park
>>>>
>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>
>>>>> Awesome work, guys!
>>>>> And even more thanks for the detailed report... This troubleshooting
>>>>> summary will undoubtedly be useful for all our Maven projects!
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light 
>>>>>> again.
>>>>>>
>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>> Hi all!
>>>>>>>
>>>>>>> After a few weeks of terrible build issues, I am happy to announce that
>>>>>>> the build works again properly, and we actually get meaningful CI
>>>>>>> results.
>>>>>>>
>>>>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max
>>>>>>> and I debugged the final issue and got the build back on track.
>>>>>>>
>>>>>>> ------------------
>>>>>>> The Journey
>>>>>>> ------------------
>>>>>>>
>>>>>>> (1) Failsafe Plugin
>>>>>>>
>>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>>>>> tests did not result in a failed build.
>>>>>>>
>>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests
>>>>>>> and fail the build if a test fails.
>>>>>>>
>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>
>>>>>>>
>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>
>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>>>>> interoperate with Dependency Shading any more.
>>>>>>>
>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>
>>>>>>>
>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>
>>>>>>> Naturally, a number of test instabilities had been introduced, which
>>>>>>> needed to be fixed.
>>>>>>>
>>>>>>>
>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>
>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to
>>>>>>> the test scope.
>>>>>>> Because the configuration searched for tests in the "main" scope, no
>>>>>>> Yarn tests were executed for a while, until the scope was fixed.
>>>>>>>
>>>>>>>
>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>
>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to warnings
>>>>>>> created by the newly introduced metrics code. We could fix that by
>>>>>>> updating the metrics code and temporarily not registering JMX beans for
>>>>>>> all metrics.
>>>>>>>
>>>>>>>
>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>
>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the
>>>>>>> IDE). It turned out that those tests exercise a command line interface
>>>>>>> that interacts with the standard input stream.
>>>>>>>
>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>>>>>> standard input stream, the Yarn CLI cannot poll the standard input
>>>>>>> stream without locking up and stalling the tests.
>>>>>>>
>>>>>>> We adjusted the tests and now the build happily builds again.
>>>>>>>
>>>>>>> -----------------
>>>>>>> Conclusions:
>>>>>>> -----------------
>>>>>>>
>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
>>>>>>> a period of unreliable CI.
>>>>>>>
>>>>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>>>>> our problem should not occur in a test plugin like surefire. Also, the
>>>>>>> constant change of semantics and dependency scopes is annoying. The
>>>>>>> semantic changes are subtle, but for a build as complex as Flink, they
>>>>>>> make a difference.
>>>>>>>
>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>>>>>> failsafe plugin was caused by improper file-based communication, as
>>>>>>> were some of the instabilities we discovered.
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>>>>> metrics subsystem to register JMX beans, we see some tests failing due
>>>>>>> to spontaneous JVM process kills. Whoever has a pointer there, please
>>>>>>> ping us!
>>>>>>
>>>>
>>
>
