I thought this had been fixed by Chiwan in the meantime. Could you post a build log?
On Thu, Jun 2, 2016 at 1:14 PM, Ufuk Celebi <u...@apache.org> wrote:
> With the recent fixes, the builds are more stable, but I still see
> many failing because of the Scala shell tests, which lead to the JVMs
> crashing. I've researched this a little bit, but didn't find an
> obvious solution to the problem.
>
> Does it make sense to disable the tests until someone has time to look
> into it?
>
> – Ufuk
>
> On Tue, May 31, 2016 at 1:46 PM, Stephan Ewen <se...@apache.org> wrote:
>> You are right, Chiwan.
>>
>> I think that this pattern you use should be supported, though. It would
>> be good to check whether the job executes more often than necessary at
>> the point of the "collect()" calls.
>> That would explain the network buffer issue then...
>>
>> On Tue, May 31, 2016 at 12:18 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>
>>> Hi Stephan,
>>>
>>> Yes, right. But KNNITSuite calls
>>> ExecutionEnvironment.getExecutionEnvironment only once [1]. I’m testing
>>> with moving the getExecutionEnvironment call into each test case.
>>>
>>> [1]:
>>> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
>>>
>>> Regards,
>>> Chiwan Park
>>>
>>>> On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>> Hi Chiwan!
>>>>
>>>> I think the ExecutionEnvironment is not shared, because what the
>>>> TestEnvironment sets is a context environment factory. Every time you
>>>> call "ExecutionEnvironment.getExecutionEnvironment()", you get a new
>>>> environment.
>>>>
>>>> Stephan
>>>>
>>>> On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>
>>>>> I’ve created a JIRA issue [1] related to the KNN test cases. I will
>>>>> send a PR for it.
>>>>>
>>>>> From my investigation [2], the cluster for the ML tests has only one
>>>>> taskmanager with 4 slots. Is 2048 insufficient as the total number of
>>>>> network buffers?
>>>>> I still think the problem is sharing the ExecutionEnvironment between
>>>>> test cases.
>>>>>
>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>>>>> [2]:
>>>>> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>>>>>
>>>>> Regards,
>>>>> Chiwan Park
>>>>>
>>>>>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>>>>>>
>>>>>> Thanks, Stephan, for the synopsis of our last weeks' test instability
>>>>>> madness. It's sad to see the shortcomings of the Maven test plugins,
>>>>>> but another lesson learned is that our testing infrastructure should
>>>>>> get a bit more attention. We have reached a point several times where
>>>>>> our tests were inherently unstable. Now we saw that even more
>>>>>> problems were hidden in the dark. I would like to see more
>>>>>> maintenance dedicated to testing.
>>>>>>
>>>>>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>>>>>> request with a systematic fix. Those things are too crucial to be
>>>>>> fixed on the go. The problem is that Travis reports the number of
>>>>>> processors as "32" (which is used as the number of task slots in
>>>>>> local execution). The network buffers are not adjusted accordingly.
>>>>>> We should set them correctly in the MiniCluster. Also, we could
>>>>>> define an upper limit on the number of task slots for tests.
>>>>>>
>>>>>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>>>> I think that the tests fail because of sharing the
>>>>>>> ExecutionEnvironment between test cases. I’m not sure why it is a
>>>>>>> problem, but it is the only difference from the other ML tests.
>>>>>>>
>>>>>>> I created a hotfix and pushed it to my repository. When it seems
>>>>>>> fixed [1], I’ll merge the hotfix to the master branch.
>>>>>>>
>>>>>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>>>>>>>
>>>>>>> Regards,
>>>>>>> Chiwan Park
>>>>>>>
>>>>>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>>>>>
>>>>>>>> It seems to be about the KNN test case which was merged yesterday.
>>>>>>>> I’ll look into the ML test.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Chiwan Park
>>>>>>>>
>>>>>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Currently, an ML test is reliably failing, and occasionally some
>>>>>>>>> HA tests. Is someone looking into the ML test?
>>>>>>>>>
>>>>>>>>> For HA, I will revert a commit which might cause the HA
>>>>>>>>> instabilities. Till is working on a proper fix as far as I know.
>>>>>>>>>
>>>>>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>>>>>>> Thanks for the great work! :-)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Chiwan Park
>>>>>>>>>>
>>>>>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Awesome work, guys!
>>>>>>>>>>> And even more thanks for the detailed report... This
>>>>>>>>>>> troubleshooting summary will be undoubtedly useful for all our
>>>>>>>>>>> Maven projects!
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Flavio
>>>>>>>>>>>
>>>>>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
>>>>>>>>>>>> light again.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>>>>>>> Hi all!
>>>>>>>>>>>>>
>>>>>>>>>>>>> After a few weeks of terrible build issues, I am happy to
>>>>>>>>>>>>> announce that the build works again properly, and we actually
>>>>>>>>>>>>> get meaningful CI results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a story in many acts, from builds deep red to bright
>>>>>>>>>>>>> green joy. Kudos to Max, who did most of this troubleshooting.
>>>>>>>>>>>>> This evening, Max and I debugged the final issue and got the
>>>>>>>>>>>>> build back on track.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>> The Journey
>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Failsafe Plugin
>>>>>>>>>>>>>
>>>>>>>>>>>>> The Maven Failsafe Plugin had a critical bug due to which
>>>>>>>>>>>>> failed tests did not result in a failed build.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to
>>>>>>>>>>>>> run tests and fail the build if a test fails.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and
>>>>>>>>>>>>> did not interoperate with dependency shading any more.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>>>>>>>
>>>>>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>>>>>>>>>>>>> which needed to be fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the meantime, a pull request was merged that moved the Yarn
>>>>>>>>>>>>> tests to the test scope.
>>>>>>>>>>>>> Because the configuration searched for tests in the "main"
>>>>>>>>>>>>> scope, no Yarn tests were executed for a while, until the
>>>>>>>>>>>>> scope was fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the Yarn tests were re-activated, we saw them fail due
>>>>>>>>>>>>> to warnings created by the newly introduced metrics code. We
>>>>>>>>>>>>> could fix that by updating the metrics code and temporarily
>>>>>>>>>>>>> not registering JMX beans for all metrics.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not
>>>>>>>>>>>>> in the IDE). It turned out that those tests exercise a command
>>>>>>>>>>>>> line interface that interacts with the standard input stream.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The newly deployed Surefire Plugin uses standard input as
>>>>>>>>>>>>> well, for communication with forked JVMs. Since Surefire
>>>>>>>>>>>>> internally locks the standard input stream, the Yarn CLI
>>>>>>>>>>>>> cannot poll the standard input stream without locking up and
>>>>>>>>>>>>> stalling the tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We adjusted the tests, and now the build happily builds again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>> Conclusions:
>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the
>>>>>>>>>>>>> fallout of a period of unreliable CI.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Maven could do a better job. A bug as crucial as the one
>>>>>>>>>>>>> that started our problems should not occur in a test plugin
>>>>>>>>>>>>> like Surefire. Also, the constant change of semantics and
>>>>>>>>>>>>> dependency scopes is annoying. The semantic changes are
>>>>>>>>>>>>> subtle, but for a build as complex as Flink's, they make a
>>>>>>>>>>>>> difference.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - File-based communication is rarely a good idea.
>>>>>>>>>>>>> The bug in the Failsafe plugin was caused by improper
>>>>>>>>>>>>> file-based communication, and so were some of our discovered
>>>>>>>>>>>>> instabilities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greetings,
>>>>>>>>>>>>> Stephan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS: Some issues and mysteries remain for us to solve: when we
>>>>>>>>>>>>> allow our metrics subsystem to register JMX beans, we see some
>>>>>>>>>>>>> tests failing due to spontaneous JVM process kills. Whoever
>>>>>>>>>>>>> has a pointer there, please ping us!
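To make Stephan's remark about the environment factory concrete: a minimal sketch in plain Java of the pattern he describes, where the test base installs a single shared factory but every "getExecutionEnvironment()" call yields a fresh environment. All names here (EnvFactory, Env, ContextFactorySketch) are hypothetical stand-ins, not Flink's actual classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the context-environment-factory pattern: the factory is
// installed once and shared, but the environments it hands out are not.
public class ContextFactorySketch {

    interface EnvFactory {
        Env create();
    }

    static class Env {
        final int id;
        Env(int id) { this.id = id; }
    }

    // What "ExecutionEnvironment" consults on every call.
    static EnvFactory contextFactory;

    static Env getExecutionEnvironment() {
        // A new environment is produced per call.
        return contextFactory.create();
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();
        // The test base sets the factory once (as TestEnvironment does)...
        contextFactory = () -> new Env(counter.incrementAndGet());

        // ...but each test case calling getExecutionEnvironment() gets
        // its own instance.
        Env first = getExecutionEnvironment();
        Env second = getExecutionEnvironment();
        System.out.println(first == second);            // false: not shared
        System.out.println(first.id + " " + second.id); // 1 2
    }
}
```

This is why a suite that fetches the environment once and reuses it across test cases behaves differently from one that fetches it per test case.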
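Max's point about Travis reporting 32 processors can be sanity-checked against the rule of thumb the Flink configuration docs of that era gave for sizing the network buffer pool, roughly slots-per-TaskManager² × #TaskManagers × 4. This is only a back-of-the-envelope sketch (the true requirement depends on the job's shuffle pattern), but it shows why 2048 buffers are plenty for the intended 4-slot mini cluster yet too few once local execution jumps to 32 slots:

```java
// Back-of-the-envelope check of the network buffer budget, using the
// rule of thumb from the Flink configuration docs:
//   #buffers ~= slotsPerTaskManager^2 * numTaskManagers * 4
public class NetworkBufferCheck {

    static int requiredBuffers(int slotsPerTaskManager, int numTaskManagers) {
        return slotsPerTaskManager * slotsPerTaskManager * numTaskManagers * 4;
    }

    public static void main(String[] args) {
        int configured = 2048;

        // The ML test mini cluster: 1 taskmanager with 4 slots.
        System.out.println(requiredBuffers(4, 1));                // 64
        // Travis reporting 32 processors -> 32 local task slots.
        System.out.println(requiredBuffers(32, 1));               // 4096
        System.out.println(requiredBuffers(32, 1) > configured);  // true
    }
}
```

The mismatch supports Max's suggestion: either scale the buffers with the slot count in the MiniCluster, or cap the number of task slots used by tests.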
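The act (6) deadlock boils down to two parties contending for the monitor of the same stream. Below is a simplified, self-contained model in plain Java, not Surefire's actual mechanism: one thread stands in for Surefire's command reader holding the stream's lock, the other for the Yarn CLI trying to poll it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;

// Simplified model of the Surefire/Yarn stall: while one party holds the
// monitor of the shared input stream, any other party that tries to read
// from it blocks on that monitor.
public class StdinLockSketch {

    public static void main(String[] args) throws Exception {
        InputStream sharedStdin = new ByteArrayInputStream("quit\n".getBytes());
        CountDownLatch holderReady = new CountDownLatch(1);

        // "Surefire": grabs the stream's monitor and keeps it for a while.
        Thread holder = new Thread(() -> {
            synchronized (sharedStdin) {
                holderReady.countDown();
                try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            }
        });

        // "Yarn CLI": polls the same stream, but must first acquire the
        // same monitor, so it stalls until the holder lets go.
        Thread poller = new Thread(() -> {
            synchronized (sharedStdin) {
                try { sharedStdin.read(); } catch (Exception ignored) {}
            }
        });

        holder.start();
        holderReady.await();
        poller.start();
        Thread.sleep(100);

        // While the holder owns the monitor, the poller sits in BLOCKED.
        System.out.println(poller.getState());

        holder.join();
        poller.join();
        System.out.println(poller.getState()); // TERMINATED once released
    }
}
```

In the real build the holder never releases the lock while tests run, which is why the CLI tests had to be adjusted rather than waited out.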