On Fri, Jul 20, 2018 at 10:54 AM, Daniel Farrell <dfarr...@redhat.com>
wrote:

> On Fri, Jul 20, 2018 at 10:36 AM Thanh Ha <thanh...@linuxfoundation.org>
> wrote:
>
>> On Fri, Jul 20, 2018 at 10:01 AM Tom Pantelis <tompante...@gmail.com>
>> wrote:
>>
>>> On Fri, Jul 20, 2018 at 4:48 AM, Anil Belur <abe...@linuxfoundation.org>
>>> wrote:
>>>
>>>> On Fri, Jul 20, 2018 at 11:12 AM Jenkins <jenkins-dontreply@opendaylight.org> wrote:
>>>>
>>>>> Attention controller-devs,
>>>>>
>>>>> Autorelease oxygen failed to build sal-cluster-admin-impl from
>>>>> controller in build
>>>>> 359. Attached is a snippet of the error message related to the
>>>>> failure that we were able to automatically parse as well as console
>>>>> logs.
>>>>>
>>>>> Console Logs:
>>>>> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/autorelease-release-oxygen/359
>>>>>
>>>>> Jenkins Build:
>>>>> https://jenkins.opendaylight.org/releng/job/autorelease-release-oxygen/359/
>>>>>
>>>>> Please review and provide an ETA on when a fix will be available.
>>>>>
>>>>> Thanks,
>>>>> ODL releng/autorelease team
>>>>>
>>>>  Hello controller-dev:
>>>>
>>>> Please look into these failed tests.
>>>>
>>>> Failed tests:
>>>>   ClusterAdminRpcServiceTest.testFlipMemberVotingStates:976->lambda$testFlipMemberVotingStates$8:978 Expected leader member-1. Actual: member-1-shard-cars-oper_testFlipMemberVotingStates
>>>>
>>>> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0
>>>>
>>>
>>>
>>> I ran it successfully 500 times locally. But looking at the code and the
>>> test output from Jenkins, I can see why it failed: just the right timing
>>> sequence, coupled with just enough of a random thread execution delay, and
>>> a deadline timeout set by the test that was a tad too low for that delay.
>>> I'll push a patch. This is another case where a slight delay or slowdown
>>> in the Jenkins environment was enough to throw off the timing and cause a
>>> test failure.
>>>
>>
>> Hi Tom,
>>
>> I'm curious: when you said you ran it successfully 500 times locally, did
>> you perform a full build each time or run the single test case in
>> isolation?
>>
>> While troubleshooting the bgpcep issue in the bgp-bmp-mock thread [0], I
>> found that I had to run a full bgpcep build in order to reproduce the issue
>> on my own laptop. I have a script, which I'm testing now and making more
>> generic, that I will share to this list later. It will let us run builds
>> continuously, whether autorelease or project-specific, over and over
>> indefinitely, and capture the Maven output plus the surefire logs, which I
>> hope will help folks reproduce intermittent issues locally.
>>
>> I feel like blaming the infrastructure for being "slow" is too easy an
>> excuse. If the software were running in a customer's production
>> environment, I suspect telling the customer that their hardware is too slow
>> because it's not the same hardware as the developer's laptop would not be a
>> solution the customer would be happy with.
>>
>
> +1000
>
> Our code and tests need to be robust enough to handle diverse
> infrastructure. Bugs like this might be highlighted by infra variability,
> but they are still bugs in code/tests.
>
> Not picking on TomP or Controller here; this is a general ODL culture
> problem of blaming the infra first, until Thanh/Jamo/et al. prove otherwise.
>

I did run the test in isolation.

I'm really not trying to blame the infrastructure. In this case, it looks
like the test set up a deadline that was a bit too short for comfort; the
vast majority of the time it succeeds. I've seen this before with tests: it
fails on Jenkins, then I run it locally in a loop, and sometimes it fails
after 20 or 50 runs while other times it takes hundreds of runs for the stars
to align and fail. In this case I didn't see it fail locally after 500 runs;
maybe it would have after 1000. Certainly, running a full build or all the
tests is different than running just one test class. Arbitrary delays can
occur due to GC when multiple test classes run in the same JVM. I was just
noting that in this failure there appears to have been enough of a slight
delay in the Jenkins run, for whatever reason, to throw things off, and that
could very well have been an untimely GC.
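
For anyone who wants to try reproducing this kind of thing locally, a rough
Java sketch of such a loop is below. The module path and test class name are
placeholders (not the exact commands from this failure), it assumes mvn is on
the PATH, and a simple shell loop around mvn works just as well; the point is
only to keep the console output of each run so a failing one can be inspected
afterwards.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: run a single surefire test class over and over until it
// fails, saving each run's Maven console output to its own log file.
public final class TestLoopRunner {
    public static void main(final String[] args) throws IOException, InterruptedException {
        final String testClass = "ClusterAdminRpcServiceTest";  // placeholder test class
        final Path moduleDir = Paths.get("opendaylight/md-sal/sal-cluster-admin-impl");  // placeholder path
        final Path logDir = Files.createDirectories(Paths.get("loop-logs"));

        for (int run = 1; ; run++) {
            final Path log = logDir.resolve("run-" + run + ".log");
            final Process proc = new ProcessBuilder("mvn", "-Dtest=" + testClass, "test")
                    .directory(moduleDir.toFile())
                    .redirectErrorStream(true)
                    .redirectOutput(log.toFile())
                    .start();
            final int exit = proc.waitFor();
            System.out.println("Run " + run + " exited with " + exit);
            if (exit != 0) {
                // Keep the failing run's console log and target/surefire-reports for analysis.
                System.out.println("Failure reproduced; see " + log);
                break;
            }
        }
    }
}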

I agree that ultimately there is (most likely) an issue in the test or the
code. It may be that a timeout deadline is too short for comfort, or that the
test makes assumptions about the ordering of events when testing async code
that just happen to hold most of the time. For deadlines, I usually give
plenty of cushion, say 5 s, even though the operation should normally take
5 ms. In those cases where a failure isn't reproduced locally even after
hundreds of runs, hopefully analyzing the code and test output can at least
yield a theory, and adding manual sleeps to change the timing may reproduce
the failure.
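
To illustrate the kind of cushion I mean, here is a rough sketch of a polling
assertion with a generous overall deadline but a short poll interval, so the
happy path still finishes in milliseconds. The helper and its names are
illustrative only, not the actual test code.

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative helper: poll a condition every 50 ms until it holds or a
// generous deadline expires. A check that normally passes on the first poll
// still completes quickly, while a GC pause or a slow executor only delays
// the test rather than failing it.
public final class AwaitAssert {
    private AwaitAssert() {
        // utility class
    }

    public static void awaitTrue(final String message, final BooleanSupplier condition,
            final long timeout, final TimeUnit unit) throws InterruptedException {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        do {
            if (condition.getAsBoolean()) {
                return;
            }
            TimeUnit.MILLISECONDS.sleep(50);
        } while (System.nanoTime() < deadline);
        throw new AssertionError(message);
    }
}

So a check that normally completes in a few milliseconds would still be given
a 5 s deadline, e.g. AwaitAssert.awaitTrue("Expected leader member-1",
() -> "member-1".equals(raftState.getLeader()), 5, TimeUnit.SECONDS), where
raftState.getLeader() is just a stand-in for whatever the test is verifying.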


>
>
>> I'm not sure what we can do to give more confidence in the infrastructure
>> so that it's not the first thing that gets blamed every time there's a
>> build issue, but we do run on build flavors in vexxhost that provide
>> dedicated CPUs and RAM to our builders. Once I have some more validation of
>> the infinite build script, maybe I can run it for a while on every
>> autorelease-managed project, on my 2 laptops plus a few vexxhost instances,
>> and report the script output to the projects.
>>
>
> Thanks for this work, Thanh; it seems like it will be very helpful for
> ironing out intermittent failures.
>
> Daniel
>
>
>>
>> Regards,
>> Thanh
>>
>> [0] https://lists.opendaylight.org/pipermail/release/2018-July/015594.html
>>
>> _______________________________________________
>> release mailing list
>> rele...@lists.opendaylight.org
>> https://lists.opendaylight.org/mailman/listinfo/release
>>
>
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
