Thanks, Bogdan. Your analysis is much appreciated. According to your findings, it looks like it must be an infrastructure issue, but that doesn't explain why the test is a lot more stable on the 3.4 branch. I'd like to set up new Jenkins jobs for branch-3.5 with Java 8 on private infrastructure (outside Apache) and see if there's any difference.
Increasing the timeout further is a good idea, I think. Test runtimes are quite close to 10 minutes, and it would be nice to see whether the test just takes ages to finish or has fallen into a deadlock.

Andor

On 07/19/2018 08:46 AM, Bogdan Kanivets wrote:
> Hi Andor,
>
> For testManyChildWatchersAutoReset, that was me who put 10min timeout on
> the test itself. I wanted to see the logs and the problem is that when test
> is timed out by ant (default 15min) logs aren't captured.
> I agree that it became much flakier. I've pushed the PR right now to
> increase to 14min.
>
> Also, I'm still looking at the slowness and posted some thoughts in jira
> https://issues.apache.org/jira/browse/ZOOKEEPER-3046
>
> On Wed, Jul 18, 2018 at 9:09 PM Michael Han <h...@apache.org> wrote:
>
>> Thanks Pat for promptly fixing this!
>>
>> I have no idea of the "failed to get" symptoms. Probably we could give it
>> more days and see if the pattern recurs? If not might be a transient infra
>> issue...
>>
>> On Wed, Jul 18, 2018 at 11:16 AM, Patrick Hunt <ph...@apache.org> wrote:
>>
>>> Ok, I committed a change that seems to address the main failure:
>>> https://github.com/apache/zookeeper/commit/06b9507ab78a1a055b8f467846c15791600b72ee
>>>
>>> https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html
>>>
>>> However I do notice some oddness in the sense that for some jobs/runs it
>>> fails to get the information from the REST interface, even though it's fine
>>> for most of them, take a look, any ideas?
>>> https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-Find-Flaky-Tests/456/console
>>>
>>> [ZooKeeper-Find-Flaky-Tests] $ /bin/bash /tmp/jenkins4452773653790031730.sh
>>> ERROR:__main__:failed to get:
>>> https://builds.apache.org/job/ZooKeeper-trunk/108/testReport/api/json?tree=suites%5Bname%2Ccases%5BclassName%2Cname%2Cstatus%5D%5D
>>> ERROR:__main__:failed to get:
>>> https://builds.apache.org/job/ZooKeeper-trunk/104/testReport/api/json?tree=suites%5Bname%2Ccases%5BclassName%2Cname%2Cstatus%5D%5D
>>> ERROR:__main__:failed to get:
>>> https://builds.apache.org/job/ZooKeeper-trunk/100/testReport/api/json?tree=suites%5Bname%2Ccases%5BclassName%2Cname%2Cstatus%5D%5D
>>>
>>>
>>> Notice that it doesn't complain about job 107 (etc...)
>>>
>>> Any ideas on this? Have you seen this before? Perhaps we should open an
>>> INFRA jira?
>>>
>>> Patrick
>>>
>>> On Wed, Jul 18, 2018 at 10:52 AM Patrick Hunt <ph...@apache.org> wrote:
>>>
>>>> FYI, created this:
>>>> https://issues.apache.org/jira/browse/INFRA-16785
>>>> for the security warnings, not sure if that's causing the issue. Likely
>>>> it's the recent jenkins upgrade, looking into it a bit...
>>>>
>>>> Patrick
>>>>
>>>>
>>>> On Wed, Jul 18, 2018 at 9:48 AM Michael Han <h...@apache.org> wrote:
>>>>
>>>>> Hi Andor,
>>>>>
>>>>>>> I suspect it should succeed eventually if we were to increase the
>>>>> timeout even more. But is that correct? Bug or infrastructure issue?
>>>>>
>>>>> You could set up a dedicated git branch with all patches (e.g. the one in
>>>>> ZOOKEEPER-2251) you want to apply and I can set up a dedicated Jenkins job
>>>>> that points to this branch and stress test the entire unit test suite.
>>>>> Some tests are only flaky when they ran on Apache infrastructure and when
>>>>> they ran together.
>>>>>
>>>>> It would be interesting to figure out what cause this test fail.
>>>>> Since same test works reliably in 3.4, there must be some commits in 3.5
>>>>> that we could possibly blame...
>>>>>
>>>>>>> I'm going to raise a ticket on that if somebody willing to fix it.
>>>>> I just had a brief look before Jenkins is down. Looks like python was
>>>>> complaining about some SSL stuff and I suspect if we upgrade to use later
>>>>> version of python (3.x) it might work. I'll try that later when Jenkins is
>>>>> back.
>>>>>
>>>>>
>>>>> On Wed, Jul 18, 2018 at 8:42 AM, Andor Molnar <an...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> *branch-3.4*
>>>>>>
>>>>>> I've taken a quick look at our Jenkins builds and in terms of flaky tests,
>>>>>> it looks like branch-3.4 is in a pretty good shape. The build hasn't failed
>>>>>> for 5-6 days on all JDKs which I think is pretty awesome.
>>>>>>
>>>>>> *branch-3.5*
>>>>>>
>>>>>> This branch is in very bad condition. Which is quite unfortunate given
>>>>>> we're in the middle of stabilising it. :)
>>>>>> Especially on JDK8, last successful build was 11 days ago. JDK9 (50%
>>>>>> failing) and JDK10 (30% failing) are looking better in the last 10 builds.
>>>>>> Interestingly (apart from a few quite rare ones) it looks there's only 1
>>>>>> test which is quite nasty on this branch: testManyChildWatchersAutoReset
>>>>>> There's a Jira about fixing it and a fix has been merged by increasing the
>>>>>> timeout of the test, but having a bug on the branch is also possible
>>>>>> causing the test to fail even with 10 min timeout.
>>>>>>
>>>>>> I wasn't able to repro the failing test on my machine (Mac and CentOS7), it
>>>>>> always finished in 30-40 seconds maximum. On jenkins slaves it shows the
>>>>>> following:
>>>>>>
>>>>>> *JDK 8:*
>>>>>>
>>>>>> Report creation timed out.
>>>>>>
>>>>>>
>>>>>> *JDK 9:*
>>>>>>
>>>>>> testManyChildWatchersAutoReset (runtime in seconds per build)
>>>>>> Per-build report: https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch35_java9/<build>/testReport/org.apache.zookeeper.test/DisconnectedWatcherTest/testManyChildWatchersAutoReset
>>>>>>
>>>>>> Build   Runtime (s)
>>>>>> 351      45.604
>>>>>> 350     600.337
>>>>>> 349      21.904
>>>>>> 348     583.063
>>>>>> 347     600.325
>>>>>> 346     600.383
>>>>>> 345     600.362
>>>>>> 344      21.139
>>>>>> 343      24.031
>>>>>> 342     584.200
>>>>>> 341     600.327
>>>>>> 340     600.323
>>>>>> 339      23.737
>>>>>> 338     600.406
>>>>>> 337     547.004
>>>>>> 336     600.393
>>>>>> 335     N/A (https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch35_java9/test_results_analyzer/)
>>>>>> 334     373.955
>>>>>>
>>>>>>
>>>>>> *JDK 10:*
>>>>>>
>>>>>> testManyChildWatchersAutoReset (runtime in seconds per build)
>>>>>> Per-build report: https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch35_java10/<build>/testReport/org.apache.zookeeper.test/DisconnectedWatcherTest/testManyChildWatchersAutoReset
>>>>>>
>>>>>> Build   Runtime (s)
>>>>>> 110     364.945
>>>>>> 109     543.983
>>>>>> 108     388.182
>>>>>> 107     600.446
>>>>>> 106     600.025
>>>>>> 105     535.046
>>>>>> 104     600.306
>>>>>> 103     474.005
>>>>>> 102     560.925
>>>>>> 101     600.328
>>>>>> 100     558.547
>>>>>> 99      600.397
>>>>>> 98      600.414
>>>>>> 97      430.383
>>>>>> 96      564.064
>>>>>> 95      600.357
>>>>>> 94      432.435
>>>>>> 93      596.378
>>>>>> 92       39.242
>>>>>>
>>>>>>
>>>>>> It takes ages to complete on Jenkins for some reason and it looks like it
>>>>>> ends quite frequently close to the limit, so I suspect it should succeed
>>>>>> eventually if we were to increase the timeout even more. But is that
>>>>>> correct?
>>>>>> Bug or infrastructure issue?
>>>>>>
>>>>>> *master / 3.6*
>>>>>>
>>>>>> Pretty much the same as 3.5. I haven't seen testManyChildWatchersAutoReset
>>>>>> failing on this branch with JDK8 which is a bit confusing, but other then
>>>>>> that I see the same pattern on JDK9 and JDK10. Unable to generate the above
>>>>>> reports here, because Test Result Analyzer keep timeouting for me, but I'll
>>>>>> follow-up when I have them.
>>>>>>
>>>>>> Btw. Flaky Test report has been broken for 10 days, I'm going to raise a
>>>>>> ticket on that if somebody willing to fix it. (I'm planning to do so.)
>>>>>> It would be nice to see the report working again, because if my
>>>>>> observations are correct, we don't have too many annoying tests apart from
>>>>>> the one mentioned.
>>>>>>
>>>>>> Thanks,
>>>>>> Andor
>>>>>>
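
To make the "close to the limit" observation a bit more concrete, here is a rough Python sketch that buckets the testManyChildWatchersAutoReset runtimes quoted above. The runtimes are copied from the JDK 9 and JDK 10 reports in this thread; the bucket names and the two thresholds (`TIMEOUT_SLACK_S`, `FAST_S`) are my own assumptions for illustration, not anything Jenkins or the test defines.

```python
from collections import Counter

TIMEOUT_S = 600.0       # the 10 min per-test timeout discussed in this thread
TIMEOUT_SLACK_S = 1.0   # assumption: a run within 1 s of the limit was cut off by it
FAST_S = 60.0           # assumption: "fast" is roughly the 30-40 s seen on a local machine

# Runtimes in seconds per build, copied from the quoted reports.
# JDK 9 build 335 is left out because its runtime is reported as N/A.
JDK9_RUNTIMES = {
    351: 45.604, 350: 600.337, 349: 21.904, 348: 583.063, 347: 600.325,
    346: 600.383, 345: 600.362, 344: 21.139, 343: 24.031, 342: 584.200,
    341: 600.327, 340: 600.323, 339: 23.737, 338: 600.406, 337: 547.004,
    336: 600.393, 334: 373.955,
}
JDK10_RUNTIMES = {
    110: 364.945, 109: 543.983, 108: 388.182, 107: 600.446, 106: 600.025,
    105: 535.046, 104: 600.306, 103: 474.005, 102: 560.925, 101: 600.328,
    100: 558.547, 99: 600.397, 98: 600.414, 97: 430.383, 96: 564.064,
    95: 600.357, 94: 432.435, 93: 596.378, 92: 39.242,
}


def classify(runtime_s: float) -> str:
    """Put one runtime into a hypothetical bucket based on the thresholds above."""
    if runtime_s >= TIMEOUT_S - TIMEOUT_SLACK_S:
        return "timed_out"
    if runtime_s <= FAST_S:
        return "fast"
    return "slow_but_finished"


def summarize(runtimes: dict) -> Counter:
    """Count how many builds fall into each bucket."""
    return Counter(classify(t) for t in runtimes.values())


if __name__ == "__main__":
    for label, runtimes in (("JDK 9", JDK9_RUNTIMES), ("JDK 10", JDK10_RUNTIMES)):
        print(label, dict(summarize(runtimes)))
```

With these thresholds, JDK 9 comes out as 8 runs at the limit, 4 slow but finished, and 5 fast; JDK 10 as 7 at the limit, 11 slow but finished, and 1 fast. The slow-but-finished runs hint at slowness rather than a deadlock, though that is only a guess until we see logs from a run with the longer timeout.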