Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Jeff Zhang Sat, 29 Jun 2019 05:57:21 -0700

Here is what the Zeppelin community did: we wrote a Python script that checks
the build status of a pull request.
Here is the script:
https://github.com/apache/zeppelin/blob/master/travis_check.py
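
The script itself is not reproduced here. As a rough sketch only, assuming a
check of this kind polls until the build reaches a final state, the control
flow could look like the following. get_build_state, wait_for_build, the state
names, and the default retry count and interval are all hypothetical and are
not taken from travis_check.py; only the meaning of exit code 2 (no build
information found) comes from the Jenkins job below.

```shell
# Hypothetical placeholder: a real version would query the Travis API for the
# given author and commit. It prints "missing", "running", "passed" or "failed".
get_build_state() {
  echo "passed"
}

# wait_for_build AUTHOR COMMIT [MAX_TRIES] [INTERVAL_SECONDS]
# Returns 0 if the build passed, 1 if it failed or we gave up waiting
# (an assumption), and 2 if no build was found for the commit.
wait_for_build() {
  local author="$1" commit="$2" max_tries="${3:-60}" interval="${4:-60}"
  local try=1 state
  while [ "$try" -le "$max_tries" ]; do
    state=$(get_build_state "$author" "$commit")
    case "$state" in
      passed)  return 0 ;;
      failed)  return 1 ;;
      missing) return 2 ;;
    esac
    sleep "$interval"   # any other state: wait before asking again
    try=$((try + 1))
  done
  return 1              # still not finished after max_tries checks
}
```

A caller would capture the result the same way the Jenkins job does, for
example: RET_CODE=0; wait_for_build "$AUTHOR" "$COMMIT" || RET_CODE=$?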


And this is the script we use in the Jenkins build job.

if [ -f "travis_check.py" ]; then
  git log -n 1
  STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
  AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
  PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
  #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
  #if [ -z $COMMIT ]; then
  #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  #fi

  # get commit hash from PR
  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  sleep 30 # wait a moment for Travis to start the build
  RET_CODE=0
  python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
    RET_CODE=0
    AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  fi

  if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in Travis
    set +x
    echo "-----------------------------------------------------"
    echo "Looks like travis-ci is not configured for your fork."
    echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
    echo "And then make sure the 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings.";
    echo ""
    echo "To trigger CI after setup, you will need to amend your last commit with"
    echo "git commit --amend"
    echo "git push your-remote HEAD --force"
    echo ""
    echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
  fi

  exit $RET_CODE
else
  set +x
  echo "travis_check.py does not exist"
  exit 1
fi
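
Two steps of the job above are easy to miss among the sed pipelines: taking
the text after the last "/" of a URL (used for both PR and AUTHOR), and
retrying the check under a second account name when the first attempt exits
with 2. A minimal sketch of just those two steps follows; the function names
and argument order are hypothetical and do not appear in the job, and the
parameter expansion is a swapped-in alternative to the sed expression, not
what the job uses.

```shell
# Print the text after the last "/" of the argument, e.g. the PR number of a
# pull request URL or the account name of a profile URL.
last_path_segment() {
  printf '%s\n' "${1##*/}"
}

# check_with_fallback CHECK_CMD COMMIT FIRST_AUTHOR SECOND_AUTHOR
# Runs CHECK_CMD with FIRST_AUTHOR; only if that exits with 2 (no build
# information found) does it run CHECK_CMD again with SECOND_AUTHOR.
# Returns the exit status of the last attempt.
check_with_fallback() {
  local check_cmd="$1" commit="$2" first="$3" second="$4"
  local status=0
  "$check_cmd" "$first" "$commit" || status=$?
  if [ "$status" -eq 2 ]; then
    status=0
    "$check_cmd" "$second" "$commit" || status=$?
  fi
  return "$status"
}
```

For a made-up URL such as https://github.com/apache/zeppelin/pull/1234,
last_path_segment prints 1234.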

On Sat, Jun 29, 2019 at 3:17 PM, Chesnay Schepler <[email protected]> wrote:

> Does this imply that a Jenkins job is active as long as the Travis build
> runs?
>
> On 26/06/2019 21:28, Bowen Li wrote:
> > Hi,
> >
> > @Dawid, I think the "long-running tests" issue I mentioned in the first
> > email, as you also said, belongs to "a big effort which is much harder to
> > accomplish in a short period of time and may deserve its own separate
> > discussion". Thus I didn't include it in what we can do in the foreseeable
> > short term.
> >
> > Besides, I don't think that's the ultimate reason for the lack of build
> > resources. Even if the build is shortened to something like 2h, the
> > problem I described, where no build machine does any work for 6 or more
> > hours during PST daytime, will still happen, because no machine from ASF
> > INFRA's pool is allocated to Flink. I have paid close attention to the
> > build queue over the past few weekdays, and the pattern is pretty clear now.
> >
> > **The ultimate root cause** is that we don't have any **dedicated**
> > build resources that we can reliably count on. I'm actually OK with waiting
> > a long time if build requests are running, since it means we are at least
> > making progress. But I'm not OK with having no build resources. What I
> > think we should aim for in the short term is to always have at least a
> > central pool (can be 3 or 5) of machines dedicated to building Flink at any
> > time, or maybe to use users' resources.
> >
> > @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
> > using a Jenkins job to automatically build on users' Travis accounts and
> > link the result back to the GitHub PR. I guess the Jenkins job would fetch
> > the latest upstream master and build the PR against it. Jeff has filed
> > tickets to learn about and get access to the Jenkins infra. It would be
> > better to fully understand it first before judging this approach.
> >
> > I also heard good things about CircleCI, and ASF INFRA seems to have a
> > pool of build capacity there too. It could be an alternative to consider.
> >
> > On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[email protected]>
> > wrote:
> >
> >> Sorry to jump in late, but I think Bowen missed the most important point
> >> from Chesnay's previous message in the summary. The ultimate reason for
> >> all the problems is that the tests take close to 2 hours to run already.
> >> I fully support this claim: "Unless people start caring about test times
> >> before adding them, this issue cannot be solved"
> >>
> >> This is also another reason why using user's Travis account won't help.
> >> Every few weeks we reach the user's time limit for a single profile.
> >> This makes the user's builds simply fail, until we either properly
> >> decrease the time the tests take (which I am not sure we ever did) or
> >> postpone the problem by splitting into more profiles. (Note that the ASF
> >> Travis account has higher time limits)
> >>
> >> Best,
> >>
> >> Dawid
> >>
> >> On 26/06/2019 09:36, Robert Metzger wrote:
> >>> Do we know if using "the best" available hardware would improve the
> >>> build times?
> >>> Imagine we ran the build on machines with plenty of main memory to
> >>> mount everything on a ramdisk, plus the latest CPU architecture.
> >>>
> >>> Throwing hardware at the problem could help reduce the time of an
> >>> individual build, and using our own infrastructure would remove our
> >>> dependency on Apache's Travis account (with the obvious downside of
> >>> having to maintain the infrastructure).
> >>> We could use an open source Travis alternative, to have a similar
> >>> experience and make the migration easy.
> >>>
> >>>
> >>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[email protected]>
> >>> wrote:
> >>>> From what I gathered, there's no special sauce that the Zeppelin
> >>>> project uses which actually integrates a user's Travis account into the
> >>>> PR. They just disabled Travis for PRs. And that's kind of it.
> >>>>
> >>>> Naturally we can do this (duh) and save the ASF a fair amount of
> >>>> resources, but there are downsides:
> >>>>
> >>>> The discoverability of the Travis check takes a nose-dive. Either we
> >>>> require every contributor to always, on every commit, also post a
> >>>> Travis build, or we have the reviewer sift through the contributor's
> >>>> account to find it.
> >>>>
> >>>> This is rather cumbersome. Additionally, it's also not equivalent to
> >>>> having a PR build.
> >>>>
> >>>> A normal branch build takes a branch as is and tests it. A PR build
> >>>> merges the branch into master, and then runs it. (Fun fact: This is
> >>>> why a PR with merge conflicts is not being run on Travis.)
> >>>>
> >>>> And ultimately, everyone can already make use of this approach anyway.
> >>>>
> >>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>>> Hi Jeff,
> >>>>>
> >>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>>>> leverage user's travis account.
> >>>>> In this way, we can have almost unlimited concurrent build jobs and
> >>>>> developers can restart build by themselves (currently only committers
> >>>>> can restart PR's build).
> >>>>>
> >>>>> But I'm still not very clear on how to integrate a user's Travis build
> >>>>> into the Flink pull request's build automatically. Can you explain in
> >>>>> more detail?
> >>>>>
> >>>>> Another question: does Travis only build branches for a user account?
> >>>>> My concern is that builds for PRs rebase the user's commits against the
> >>>>> current master branch.
> >>>>> This helps us find problems before merging. Builds for branches
> >>>>> lose the impact of new commits in master.
> >>>>> How does Zeppelin solve this problem?
> >>>>>
> >>>>> Thanks again for sharing the idea.
> >>>>>
> >>>>> Regards,
> >>>>> Jark
> >>>>>
> >>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[email protected]
> >>>>> <mailto:[email protected]>> wrote:
> >>>>>
> >>>>>      Hi Folks,
> >>>>>
> >>>>>      Zeppelin met this kind of issue before; we solved it by
> >>>>>      delegating each person's PR build to their own Travis account
> >>>>>      (everyone can have 5 free slots for Travis builds).
> >>>>>      The Apache account's Travis build is only triggered when a PR
> >>>>>      is merged.
> >>>>>
> >>>>>
> >>>>>
> >>>>>      On Tue, Jun 25, 2019 at 10:16 AM, Kurt Young <[email protected]
> >>>>>      <mailto:[email protected]>> wrote:
> >>>>>
> >>>>>      > (Forgot to cc George)
> >>>>>      >
> >>>>>      > Best,
> >>>>>      > Kurt
> >>>>>      >
> >>>>>      >
> >>>>>      > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[email protected]
> >>>>>      <mailto:[email protected]>> wrote:
> >>>>>      >
> >>>>>      > > Hi Bowen,
> >>>>>      > >
> >>>>>      > > Thanks for bringing this up. We have actually discussed this,
> >>>>>      > > and I think Till and George have already spent some time
> >>>>>      > > investigating it. I have cc'ed both of them, and maybe they
> >>>>>      > > can share their findings.
> >>>>>      > >
> >>>>>      > > Best,
> >>>>>      > > Kurt
> >>>>>      > >
> >>>>>      > >
> >>>>>      > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[email protected]
> >>>>>      <mailto:[email protected]>> wrote:
> >>>>>      > >
> >>>>>      > >> Hi Bowen,
> >>>>>      > >>
> >>>>>      > >> Thanks for bringing this up. We have also suffered from the
> >>>>>      > >> long build time.
> >>>>>      > >> I agree that we should focus on solving the build capacity
> >>>>>      > >> problem in this thread.
> >>>>>      > >>
> >>>>>      > >> My observation is that only one build is running; all the
> >>>>>      > >> others (other PRs, master) are pending.
> >>>>>      > >> The pricing plan[1] of Travis shows it can support concurrent
> >>>>>      > >> build jobs.
> >>>>>      > >> But I don't know which plan we are using; it might be the
> >>>>>      > >> free plan for open source.
> >>>>>      > >>
> >>>>>      > >> I cc-ed Chesnay who may have some experience on Travis.
> >>>>>      > >>
> >>>>>      > >> Regards,
> >>>>>      > >> Jark
> >>>>>      > >>
> >>>>>      > >> [1]: https://travis-ci.com/plans
> >>>>>      > >>
> >>>>>      > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[email protected]
> >>>>>      > >> <mailto:[email protected]>> wrote:
> >>>>>      > >>
> >>>>>      > >> > Hi Steven,
> >>>>>      > >> >
> >>>>>      > >> > I think you may not have read what I wrote. The discussion
> >>>>>      > >> > is about "unstable build **capacity**", in other words
> >>>>>      > >> > "unstable / lack of build resources", not "unstable build".
> >>>>>      > >> >
> >>>>>      > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu
> >>>>>      > >> > <[email protected] <mailto:[email protected]>> wrote:
> >>>>>      > >> >
> >>>>>      > >> > > A long and sometimes unstable build is definitely a pain
> >>>>>      > >> > > point.
> >>>>>      > >> > >
> >>>>>      > >> > > I suspect the build failure here in flink-connector-kafka
> >>>>>      > >> > > is not related to my change, but there is no easy way to
> >>>>>      > >> > > re-run the build in the Travis UI. A Google search turned
> >>>>>      > >> > > up the trick that closing and reopening the PR triggers a
> >>>>>      > >> > > rebuild, but that could add noise to the PR activity.
> >>>>>      > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>      > >> > >
> >>>>>      > >> > > travis-ci for my personal repo often failed by exceeding
> >>>>>      > >> > > the time limit after 4+ hours:
> >>>>>      > >> > > The job exceeded the maximum time limit for jobs, and has
> >>>>>      > >> > > been terminated.
> >>>>>      > >> > >
> >>>>>      > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li
> >>>>>      > >> > > <[email protected] <mailto:[email protected]>> wrote:
> >>>>>      > >> > >
> >>>>>      > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>>>      > >> > > > This build request has been sitting at the **HEAD of the
> >>>>>      > >> > > > queue** since I first saw it at PST 10:30am (not sure how
> >>>>>      > >> > > > long it had been there before 10:30am). It's PST 4:12pm
> >>>>>      > >> > > > now and it hasn't started yet.
> >>>>>      > >> > > >
> >>>>>      > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li
> >>>>>      > >> > > > <[email protected] <mailto:[email protected]>> wrote:
> >>>>>      > >> > > >
> >>>>>      > >> > > > > Hi devs,
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > I've been experiencing the pain resulting from the lack
> >>>>>      > >> > > > > of stable build capacity on Travis for Flink PRs [1].
> >>>>>      > >> > > > > Specifically, I have often noticed that no build in the
> >>>>>      > >> > > > > queue makes any progress for hours, and then suddenly 5
> >>>>>      > >> > > > > or 6 builds kick off all together after the long pause.
> >>>>>      > >> > > > > I'm in the PST (UTC-08) time zone, and I've seen the
> >>>>>      > >> > > > > pause last as long as 6 hours, from PST 9am to 3pm (let
> >>>>>      > >> > > > > alone the time needed to drain the queue afterwards).
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > I think this has greatly impacted our productivity.
> >>>>>      > >> > > > > I've experienced that PRs submitted in the early
> >>>>>      > >> > > > > morning of the PST time zone won't finish their build
> >>>>>      > >> > > > > until late at night on the same day.
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > So my questions are:
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > - Has anyone else experienced the same problem or made
> >>>>>      > >> > > > > a similar observation on TravisCI? (I suspect it has
> >>>>>      > >> > > > > to do with the time zone)
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > - What pricing plan of TravisCI is Flink currently
> >>>>>      > >> > > > > using? Is it the free plan for open source projects?
> >>>>>      > >> > > > > What is the guaranteed build capacity of the current
> >>>>>      > >> > > > > plan?
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > - If the current pricing plan (either free or paid)
> >>>>>      > >> > > > > can't provide stable build capacity, can we upgrade to
> >>>>>      > >> > > > > a higher-priced plan with larger and more stable build
> >>>>>      > >> > > > > capacity?
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > BTW, another factor that contributes to the
> >>>>>      > >> > > > > productivity problem is that our build is slow: we run
> >>>>>      > >> > > > > a full build for every PR, and a successful full build
> >>>>>      > >> > > > > takes ~5h. We definitely have more options to solve it,
> >>>>>      > >> > > > > for instance, modularizing the build graph and reusing
> >>>>>      > >> > > > > artifacts from the previous build. But I think that can
> >>>>>      > >> > > > > be a big effort which is much harder to accomplish in a
> >>>>>      > >> > > > > short period of time and may deserve its own separate
> >>>>>      > >> > > > > discussion.
> >>>>>      > >> > > > >
> >>>>>      > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>      > >> > > > >
> >>>>>      > >> > > > >
> >>>>>      > >> > > >
> >>>>>      > >> > >
> >>>>>      > >> >
> >>>>>      > >>
> >>>>>      > >
> >>>>>      >
> >>>>>
> >>>>>
> >>>>>      --
> >>>>>      Best Regards
> >>>>>
> >>>>>      Jeff Zhang
> >>>>>
> >>
>
>

-- 
Best Regards

Jeff Zhang
