Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Bowen Li Wed, 26 Jun 2019 12:29:46 -0700

Hi,

@Dawid, I think the long test running time, which I mentioned in the first
email and which you also pointed out, belongs to "a big effort which is
much harder to accomplish in a short period of time and may deserve its own
separate discussion". That's why I didn't include it in what we can do in
the foreseeable short term.


Besides, I don't think that's the ultimate reason for the lack of build
resources. Even if the build is shortened to something like 2h, the problem
I described, where no build machine works on our queue for 6 or more hours
during PST daytime, will still happen, because during that time no machine
from ASF INFRA's pool is allocated to Flink. I have paid close attention to
the build queue over the past few weekdays, and it's a pretty clear pattern
now.

**The ultimate root cause** of that is that we don't have any **dedicated**
build resources that we can reliably depend on. I'm actually OK with
waiting a long time if build requests are running, because it means we are
at least making progress. But I'm not OK with having no build resources at
all. What I think we should aim for in the short term is to always have at
least a small central pool (say 3 or 5) of machines dedicated to building
Flink at any time, or maybe to use users' resources.

@Chesnay @Robert I synced with Jeff offline and learned that the Zeppelin
community is using a Jenkins job to automatically build on users' Travis
accounts and link the results back to the GitHub PR. I guess the Jenkins
job fetches the latest upstream master and builds the PR against it. Jeff
has filed tickets to learn about and get access to the Jenkins infra. It'd
be better to fully understand it first before judging this approach.
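
To make the guess above concrete, here is a rough sketch of what "build the
PR against the latest upstream master" could look like. This is only an
illustration and not how Zeppelin's Jenkins job is known to work. The helper
names (merge_build_commands, run_merge_build), the PR number and the Maven
build command are assumptions; the refs/pull/<N>/head ref is the one GitHub
publishes for each PR.

```python
import subprocess
from typing import List, Sequence

UPSTREAM_URL = "https://github.com/apache/flink.git"


def merge_build_commands(
    pr_number: int,
    upstream_url: str = UPSTREAM_URL,
    build_cmd: Sequence[str] = ("mvn", "clean", "verify"),  # assumed build command
) -> List[List[str]]:
    """Return the commands that emulate a PR build: master + PR, then build."""
    if pr_number <= 0:
        raise ValueError("pr_number must be positive")
    return [
        # 1. Fetch the latest upstream master into FETCH_HEAD.
        ["git", "fetch", upstream_url, "master"],
        # 2. Check it out as a detached HEAD so no local branch is touched.
        ["git", "checkout", "--detach", "FETCH_HEAD"],
        # 3. Fetch the PR's head commit (GitHub exposes refs/pull/<N>/head).
        ["git", "fetch", upstream_url, f"refs/pull/{pr_number}/head"],
        # 4. Merge the PR into master; a merge conflict makes this step fail.
        ["git", "merge", "--no-ff", "--no-edit", "FETCH_HEAD"],
        # 5. Build and test the merged tree.
        list(build_cmd),
    ]


def run_merge_build(pr_number: int, workdir: str) -> None:
    """Run the commands inside an existing clone, stopping at the first failure."""
    for cmd in merge_build_commands(pr_number):
        subprocess.run(cmd, cwd=workdir, check=True)
```

For example, run_merge_build(1234, "/path/to/flink-clone") would do this for
a hypothetical PR number 1234. A plain branch build would skip steps 1, 2
and 4, so it would not see new commits on master.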

I have also heard good things about CircleCI, and ASF INFRA seems to have a
pool of build capacity there too. It could be an alternative to consider.

On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[email protected]>
wrote:

> Sorry to jump in late, but I think Bowen missed the most important point
> from Chesnay's previous message in the summary. The ultimate reason for
> all the problems is that the tests take close to 2 hours to run already.
> I fully support this claim: "Unless people start caring about test times
> before adding them, this issue cannot be solved"
>
> This is also another reason why using user's Travis account won't help.
> Every few weeks we reach the user's time limit for a single profile.
> This makes the user's builds simply fail, until we either properly
> decrease the time the tests take (which I am not sure we ever did) or
> postpone the problem by splitting into more profiles. (Note that the ASF
> Travis account has higher time limits)
>
> Best,
>
> Dawid
>
> On 26/06/2019 09:36, Robert Metzger wrote:
> > Do we know if using "the best" available hardware would improve the build
> > times?
> > Imagine we would run the build on machines with plenty of main memory to
> > mount everything to ramdisk + the latest CPU architecture?
> >
> > Throwing hardware at the problem could help reduce the time of an
> > individual build, and using our own infrastructure would remove our
> > dependency on Apache's Travis account (with the obvious downside of
> > having to maintain the infrastructure).
> > We could use an open source Travis alternative, to have a similar
> > experience and make the migration easy.
> >
> >
> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[email protected]>
> > wrote:
> >
> >> From what I gathered, there's no special sauce that the Zeppelin
> >> project uses which actually integrates a user's Travis account into
> >> the PR.
> >>
> >> They just disabled Travis for PRs. And that's kind of it.
> >>
> >> Naturally we can do this (duh) and save the ASF a fair amount of
> >> resources, but there are downsides:
> >>
> >> The discoverability of the Travis check takes a nose-dive. Either we
> >> require every contributor to always, on every commit, also post a Travis
> >> build, or we have the reviewer sift through the contributor's account to
> >> find it.
> >>
> >> This is rather cumbersome. Additionally, it's also not equivalent to
> >> having a PR build.
> >>
> >> A normal branch build takes a branch as is and tests it. A PR build
> >> merges the branch into master, and then runs it. (Fun fact: This is why
> >> a PR with merge conflicts is not being run on Travis.)
> >>
> >> And ultimately, everyone can already make use of this approach anyway.
> >>
> >> On 25/06/2019 08:02, Jark Wu wrote:
> >>> Hi Jeff,
> >>>
> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>> leverage user's travis account.
> >>> In this way, we can have almost unlimited concurrent build jobs and
> >>> developers can restart build by themselves (currently only committers
> >>> can restart PR's build).
> >>>
> >>> But I'm still not very clear how to integrate user's travis build into
> >>> the Flink pull request's build automatically. Can you explain more in
> >>> detail?
> >>>
> >>> Another question: does Travis only build branches for user accounts?
> >>> My concern is that builds for PRs rebase the user's commits onto the
> >>> current master branch, which helps us find problems before merging.
> >>> Builds for branches will miss the impact of new commits in master.
> >>> How does Zeppelin solve this problem?
> >>>
> >>> Thanks again for sharing the idea.
> >>>
> >>> Regards,
> >>> Jark
> >>>
> >>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>>
> >>>     Hi Folks,
> >>>
> >>>     Zeppelin met this kind of issue before; we solved it by delegating
> >>>     each contributor's PR build to their own Travis account (everyone
> >>>     can have 5 free slots for Travis builds).
> >>>     The Apache account's Travis build is only triggered when a PR is
> >>>     merged.
> >>>
> >>>
> >>>
> >>>     Kurt Young <[email protected] <mailto:[email protected]>>
> >>>     wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>
> >>>     > (Forgot to cc George)
> >>>     >
> >>>     > Best,
> >>>     > Kurt
> >>>     >
> >>>     >
> >>>     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[email protected]
> >>>     <mailto:[email protected]>> wrote:
> >>>     >
> >>>     > > Hi Bowen,
> >>>     > >
> >>>     > > Thanks for bringing this up. We have actually discussed this,
> >>>     > > and I think Till and George have already spent some time
> >>>     > > investigating it. I have cc'ed both of them, and maybe they can
> >>>     > > share their findings.
> >>>     > >
> >>>     > > Best,
> >>>     > > Kurt
> >>>     > >
> >>>     > >
> >>>     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[email protected]
> >>>     <mailto:[email protected]>> wrote:
> >>>     > >
> >>>     > >> Hi Bowen,
> >>>     > >>
> >>>     > >> Thanks for bringing this up. We also suffered from the long
> >>>     > >> build time.
> >>>     > >> I agree that we should focus on solving the build capacity
> >>>     > >> problem in this thread.
> >>>     > >>
> >>>     > >> My observation is that only one build is running while all the
> >>>     > >> others (other PRs, master) are pending.
> >>>     > >> The pricing plan [1] of Travis shows it can support concurrent
> >>>     > >> build jobs.
> >>>     > >> But I don't know which plan we are using; it might be the free
> >>>     > >> plan for open source.
> >>>     > >>
> >>>     > >> I cc-ed Chesnay who may have some experience on Travis.
> >>>     > >>
> >>>     > >> Regards,
> >>>     > >> Jark
> >>>     > >>
> >>>     > >> [1]: https://travis-ci.com/plans
> >>>     > >>
> >>>     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[email protected]
> >>>     <mailto:[email protected]>> wrote:
> >>>     > >>
> >>>     > >> > Hi Steven,
> >>>     > >> >
> >>>     > >> > I think you may not have read what I wrote. The discussion is
> >>>     > >> > about "unstable build **capacity**", in other words "unstable
> >>>     > >> > / lack of build resources", not "unstable build".
> >>>     > >> >
> >>>     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu
> >>>     <[email protected] <mailto:[email protected]>>
> >>>     > wrote:
> >>>     > >> >
> >>>     > >> > > A long and sometimes unstable build is definitely a pain
> >>>     > >> > > point.
> >>>     > >> > >
> >>>     > >> > > I suspect the build failure here in flink-connector-kafka is
> >>>     > >> > > not related to my change, but there is no easy way to re-run
> >>>     > >> > > the build in the Travis UI. A Google search showed a trick:
> >>>     > >> > > closing and reopening the PR will trigger a rebuild, but that
> >>>     > >> > > could add noise to the PR activity.
> >>>     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>     > >> > >
> >>>     > >> > > travis-ci for my personal repo often failed by exceeding the
> >>>     > >> > > time limit after 4+ hours:
> >>>     > >> > > The job exceeded the maximum time limit for jobs, and has been
> >>>     > >> > > terminated.
> >>>     > >> > >
> >>>     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li
> >>>     <[email protected] <mailto:[email protected]>>
> >>>     > wrote:
> >>>     > >> > >
> >>>     > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>     > >> > > > This build request has been sitting at the **HEAD of the
> >>>     > >> > > > queue** since I first saw it at 10:30am PST (not sure how
> >>>     > >> > > > long it had been there before 10:30am). It's 4:12pm PST now
> >>>     > >> > > > and it hasn't started yet.
> >>>     > >> > > >
> >>>     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li
> >>>     <[email protected] <mailto:[email protected]>>
> >>>     > >> wrote:
> >>>     > >> > > >
> >>>     > >> > > > > Hi devs,
> >>>     > >> > > > >
> >>>     > >> > > > > I've been experiencing the pain resulting from the lack
> >>>     > >> > > > > of stable build capacity on Travis for Flink PRs [1].
> >>>     > >> > > > > Specifically, I have often noticed that no build in the
> >>>     > >> > > > > queue makes any progress for hours, and then suddenly 5
> >>>     > >> > > > > or 6 builds kick off all together after the long pause.
> >>>     > >> > > > > I'm in the PST (UTC-08) time zone, and I've seen the
> >>>     > >> > > > > pause last as long as 6 hours, from 9am to 3pm PST (let
> >>>     > >> > > > > alone the time needed to drain the queue afterwards).
> >>>     > >> > > > >
> >>>     > >> > > > > I think this has greatly impacted our productivity. I've
> >>>     > >> > > > > experienced that PRs submitted in the early morning PST
> >>>     > >> > > > > won't finish their build until late at night the same
> >>>     > >> > > > > day.
> >>>     > >> > > > >
> >>>     > >> > > > > So my questions are:
> >>>     > >> > > > >
> >>>     > >> > > > > - Has anyone else experienced the same problem or made a
> >>>     > >> > > > > similar observation on TravisCI? (I suspect it has
> >>>     > >> > > > > something to do with the time zone)
> >>>     > >> > > > >
> >>>     > >> > > > > - What pricing plan of TravisCI is Flink currently using?
> >>>     > >> > > > > Is it the free plan for open source projects? What is the
> >>>     > >> > > > > guaranteed build capacity of the current plan?
> >>>     > >> > > > >
> >>>     > >> > > > > - If the current pricing plan (either free or paid) can't
> >>>     > >> > > > > provide stable build capacity, can we upgrade to a
> >>>     > >> > > > > higher-priced plan with larger and more stable build
> >>>     > >> > > > > capacity?
> >>>     > >> > > > >
> >>>     > >> > > > > BTW, another factor that contributes to the productivity
> >>>     > >> > > > > problem is that our build is slow: we run a full build
> >>>     > >> > > > > for every PR, and a successful full build takes ~5h. We
> >>>     > >> > > > > definitely have more options to solve it, for instance
> >>>     > >> > > > > modularizing the build graph and reusing artifacts from
> >>>     > >> > > > > the previous build. But I think that can be a big effort
> >>>     > >> > > > > which is much harder to accomplish in a short period of
> >>>     > >> > > > > time and may deserve its own separate discussion.
> >>>     > >> > > > >
> >>>     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>     > >> > > > >
> >>>     > >> > > > >
> >>>     > >> > > >
> >>>     > >> > >
> >>>     > >> >
> >>>     > >>
> >>>     > >
> >>>     >
> >>>
> >>>
> >>>     --
> >>>     Best Regards
> >>>
> >>>     Jeff Zhang
> >>>
> >>
>
>
