i think the unique port assignment (b) is more problematic than it appears. there is a race between finding a free port and actually binding it: another process or test thread can take the port in between. i think that contributes to the flakiness.
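to make the race concrete, here is a minimal sketch. it is not ZooKeeper's port assignment code; the class and method names (PortRaceSketch, findFreePortRacy, bindEphemeral, isTaken) are hypothetical and only illustrate the general pattern. the racy variant probes for a free port and releases it before the server binds; the safer variant asks the OS for an ephemeral port and keeps the socket bound, so there is no window in which someone else can take it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortRaceSketch {

    // Racy pattern: probe for a free port, release it, and return only the number.
    // Between close() and the caller's later bind, any other process or test
    // thread can take the same port.
    static int findFreePortRacy() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        }
    }

    // Safer pattern: let the OS pick the port (port 0) and keep the socket bound.
    // The caller reads the port from the live socket instead of re-binding later.
    static ServerSocket bindEphemeral() throws IOException {
        ServerSocket socket = new ServerSocket();
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return socket;
    }

    // Returns true if a fresh bind to the given loopback port fails.
    static boolean isTaken(int port) {
        try (ServerSocket s = new ServerSocket()) {
            s.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        int racy = findFreePortRacy();
        System.out.println("probed port (nothing reserves it now): " + racy);

        try (ServerSocket held = bindEphemeral()) {
            int port = held.getLocalPort();
            System.out.println("held port: " + port + ", taken=" + isTaken(port));
        }
    }
}
```

the catch with the safer variant is that the code under test has to accept an already-bound socket, or report the port it ended up on, which existing tests may not be set up for. that is an assumption about the test harness, not something i have checked.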
ben

On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <an...@apache.org> wrote:
>
> That is a completely valid point. I started to investigate flakies for
> exactly the same reason, if you remember the thread that I started a while
> ago. It was later abandoned unfortunately, because I’ve run into a few issues:
>
> - We nailed down that in order to release 3.5 stable, we have to make sure
> it’s not worse than 3.4 by comparing the builds: but these builds are not
> comparable, because 3.4 tests run single-threaded while 3.5 tests run
> multithreaded, showing problems which might also exist on 3.4,
>
> - Neither of them runs the C++ tests for some reason, but that’s not really
> an issue here,
>
> - Looks like tests on 3.5 are just as solid as on 3.4, because running them
> in a dedicated, single-threaded environment shows almost all tests
> succeeding,
>
> - I think the root cause of failing unit tests could be one (or more) of the
> following:
> a) Environmental: Jenkins slave gets overloaded with other builds and
> multithreaded test running makes things even worse: JDK threads are starved
> and ZK instances (both clients and servers) are unable to operate
> b) Conceptual: ZK unit tests were not designed to run on multiple
> threads: I investigated the unique port assignment feature which is looking
> good, but there could be other possible gaps which make them unreliable when
> running simultaneously.
> c) Bad testing: testing ZK in the wrong way, making bad assumptions
> (e.g. not syncing clients), etc.
> d) Bug in the server.
>
> I feel that finding case d) with these tests is super hard, because a test
> report doesn’t give any information on what could go wrong with ZooKeeper.
> More or less guessing is your only option.
>
> Finding c) is a little bit easier, I’m trying to submit patches on them and
> hopefully making some progress.
>
> The huge pain in the arse though are a) and b): people desperately keep
> commenting “please retest this” on github to get a green build while testing
> is going in a direction to hide real problems: I mean people started not to
> care about a failing build, because “it must be some flaky unrelated to my
> patch”. Which is bad, but the shame is it’s true in 90% of cases.
>
> I’m just trying to find some ways - besides fixing c) and d) flakies - to get
> more reliable and more informative Jenkins builds. Don’t want to make a huge
> turnaround, but I think if we can get a significantly more reliable build for
> the price of slightly longer build time running on 4 threads instead of 8, I
> say let’s do it.
>
> As always, any help from the community is more than welcome and appreciated.
>
> Thanks,
> Andor
>
> > On 2018. Oct 12., at 16:52, Patrick Hunt <ph...@apache.org> wrote:
> >
> > iirc the number of threads was increased to improve performance. Reducing
> > is fine, but do we understand why it's failing? Perhaps it's finding real
> > issues as a result of the artificial concurrency/load.
> >
> > Patrick
> >
> > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <an...@cloudera.com.invalid>
> > wrote:
> >
> >> Thanks for the feedback.
> >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on 4 threads
> >> to see what the impact on the build time is.
> >>
> >> Github PR job is hard to configure, because its settings are hard coded
> >> into a shell script in the codebase. I have to open a PR for that.
> >>
> >> Andor
> >>
> >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
> >> nkal...@cloudera.com.invalid> wrote:
> >>
> >>> +1, running the tests locally with 1 thread always passes (well, I ran it
> >>> about 5 times, but still)
> >>> On the other hand, running it on 8 threads yields similarly flaky results
> >>> as the Apache runs. (It is much faster, but not if we sometimes have to
> >>> run it 6-8-10 times to get a green run...)
> >>>
> >>> Norbert
> >>>
> >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <eolive...@gmail.com>
> >>> wrote:
> >>>
> >>>> +1
> >>>>
> >>>> Enrico
> >>>>
> >>>> On Fri, Oct 12, 2018 at 13:52, Andor Molnar <an...@apache.org> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> What do you think of changing the number of threads running unit tests
> >>>>> in Jenkins from the current 8 to 4 or even 2?
> >>>>>
> >>>>> Running unit tests inside the Cloudera environment on a single thread
> >>>>> shows the builds much more stable. That would probably be too slow, but
> >>>>> maybe running at least fewer threads would improve the situation.
> >>>>>
> >>>>> It's getting very annoying that I cannot get a green build on GitHub
> >>>>> with only a few retests.
> >>>>>
> >>>>> Regards,
> >>>>> Andor
> >>>>>
> >>>> --
> >>>> -- Enrico Olivelli