Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler Wed, 03 Jul 2019 00:02:10 -0700

Are they using their own Travis CI pool, or did they switch to an entirely different CI service?

If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?


On 03/07/2019 05:55, Bowen Li wrote:

I responded in the INFRA ticket [1] that I believe they are measuring
Flink against the wrong metric: total build time is a completely different
thing from guaranteed build capacity.

My response:

"As mentioned above, since I started to pay attention to Flink's build
queue a few tens of days ago, I'm in Seattle and I saw no build was kicking
off in PST daytime in weekdays for Flink. Our teammates in China and Europe
have also reported similar observations. So we need to evaluate how the
large total build time came from - if 1) your number and 2) our
observations from three locations that cover pretty much a full day, are
all true, I **guess** one reason can be that - highly likely the extra
build time came from weekends when other Apache projects may be idle and
Flink just drains hard its congested queue.

Please be aware of that we're not complaining about the lack of resources
in general, I'm complaining about the lack of **stable, dedicated**
resources. An example for the latter one is, currently even if no build is
in Flink's queue and I submit a request to be the queue head in PST
morning, my build won't even start in 6-8+h. That is an absurd amount of
waiting time.

That's saying, if ASF INFRA decides to adopt a quota system and grants
Flink five DEDICATED servers that runs all the time only for Flink, that'll
be PERFECT and can totally solve our problem now.

I feel what's missing in the ASF INFRA's Travis resource pool is some level
of build capacity SLAs and certainty"


Again, I believe these two problems differ in nature: long build time
vs. lack of dedicated build resources. That is, shortening the build time
may or may not relieve the situation. I'm slightly negative on disabling IT
cases for PRs: the downside is that we risk merging bugs that UTs don't
catch, which may cost a lot more to fix later and may slow down or even
block others. But I am open to others' opinions on it.

AFAICT from the INFRA ticket [1], donating to ASF INFRA won't solve our
problem, since INFRA's pool is fully shared and they have neither control
over nor fine-grained insight into resource allocation for a specific Apache
project. As mentioned in [1], Apache Arrow is moving away from the ASF INFRA
Travis pool (they are actually surprised Flink hasn't planned to do so). I
know that Spark is on its own build infra. If we all agree on funding our
own build infra, I'd be glad to help investigate potential options after
releasing 1.9, since I'm super busy with 1.9 now.

[1] https://issues.apache.org/jira/browse/INFRA-18533



On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <ches...@apache.org> wrote:

As a short-term stopgap, since we can assume this issue to become much
worse in the following days/weeks, we could disable IT cases in PRs and
only run them on master.

On 02/07/2019 12:03, Chesnay Schepler wrote:

People really have to stop thinking that just because something works
for us it is also a good solution.
Also, please remember that our builds run for 2h from start to finish,
and not the 14 _minutes_ it takes for Zeppelin.
We are dealing with an entirely different scale here, both in terms of
build times and number of builds.

In this very thread people have been complaining about long queue
times for their builds. Surprise, other Apache projects have been
suffering the very same thing due to us not controlling our build
times. While switching services (be it Jenkins, CircleCI or whatever)
will possibly work for us (and these options are actually attractive,
like CircleCI's proper support for build artifacts), it will also
result in us likely negatively affecting other projects in significant
ways.

Sure, the Jenkins setup has a good user experience for us, at the cost
of blocking Jenkins workers for a _lot_ of time. Right now we have 25
PRs in our queue; that's possibly 50h we'd consume of Jenkins
resources, and the European contributors haven't even really started yet.

FYI, the latest INFRA response from INFRA-18533:

"Our rough metrics shows that Flink used over 5800 hours of build time
last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
MONTH. EIGHT. nonstop.
When we discovered this last night, we discussed it some and are going
to tune down Flink to allow only five executors maximum. We cannot
allow Flink to consume so much of a Foundation shared resource."

So yes, we either
a) have to heavily reduce our CI usage or
b) fund our own, either maintaining it ourselves or donating to Apache.
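
For reference, the two figures quoted above (5800 build hours per month, and 25 queued PRs at roughly 2h each) can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the 30-day month and the function names are assumptions for illustration, not anything INFRA published.

```python
# Rough check of the figures quoted in this thread.
# Assumption: a 30-day month; INFRA did not say how they counted.
HOURS_PER_SERVER_MONTH = 24 * 30  # one executor running 24/7 for 30 days


def server_equivalents(build_hours: float) -> float:
    """Number of always-on executors needed to supply build_hours in a month."""
    return build_hours / HOURS_PER_SERVER_MONTH


def queue_hours(pending_builds: int, hours_per_build: float) -> float:
    """Total executor time needed to drain a queue, ignoring parallelism."""
    return pending_builds * hours_per_build


print(round(server_equivalents(5800), 2))  # 8.06
print(queue_hours(25, 2))                  # 50
print(5 * HOURS_PER_SERVER_MONTH)          # 3600
```

Under that assumption, a cap of five executors supplies at most 3600 build hours a month, which is well below the 5800 hours reported for last month.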

On 02/07/2019 05:11, Bowen Li wrote:

Looking at the git history of the Jenkins script, its core part
was finished in March 2017 (with only two minor updates in 2017/2018),
so it has been running for over two years now, and it feels like the
Zeppelin community has been quite happy with it. @Jeff Zhang, can you
share your insights and user experience with the Jenkins+Travis approach?

Things like:

- has the approach completely solved the resource capacity problem
for the Zeppelin community? Is the Zeppelin community happy with the result?
- is the whole configuration chain stable enough (e.g. uptime)?
- how often do you need to maintain the Jenkins infra? How many
people are usually involved in maintenance and bug fixes?

To me, the downside of this approach is mostly the maintenance
of both the script and the Jenkins infra.

** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own
travis-ci.com account with paid dedicated resources. Note that
travis-ci.org is the free version and travis-ci.com is the commercial
version. We currently use a shared resource pool managed by the ASF INFRA
team on travis-ci.org, but we have no control over it - we can't see how
it's configured, how many resources are available, how resources are
allocated among Apache projects, etc. The nice things about having an
account on travis-ci.com are:

- relatively low cost with a much better resource guarantee than what
we currently have [1]: $249/month for 5 dedicated concurrent jobs,
$489/month for 10 concurrent jobs
- low maintenance work compared to using Jenkins
- (potentially) no migration cost according to Travis's doc [2]
(pending verification)
- full control over the build capacity/configuration compared to
using ASF INFRA's pool

I'd be surprised if we as such a vibrant community cannot find and
fund $249*12=$2988 a year in exchange for a much better developer
experience and much higher productivity.
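
As a quick check of the numbers above, the yearly cost and the cost per concurrent job of both plans can be computed as follows. The prices are the ones quoted in this mail and may have changed since; the names are just for illustration.

```python
# Prices as quoted in this thread (June 2019); treat them as assumptions.
PLANS = {5: 249, 10: 489}  # concurrent jobs -> USD per month


def yearly_cost(monthly_usd: int) -> int:
    """Cost of twelve months at the given monthly price."""
    return monthly_usd * 12


for jobs, monthly in PLANS.items():
    # concurrent jobs, USD per year, USD per concurrent job per month
    print(jobs, yearly_cost(monthly), round(monthly / jobs, 2))
```

This prints 2988 USD/year for the 5-job plan and 5868 USD/year for the 10-job plan, i.e. just under 50 USD per concurrent job per month in both cases.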

[1] https://travis-ci.com/plans
[2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration

On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:

     So yes, the Jenkins job keeps pulling the state from Travis until it
     finishes.

     Not sure I'm comfortable with the idea of using Jenkins workers just to
     idle for several hours.
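
To make the concern concrete, here is a minimal sketch of such a polling loop. It is not the Zeppelin travis_check.py script: `fetch_state` is a hypothetical callable standing in for whatever queries Travis, and the state names and timeout are assumptions.

```python
import time
from typing import Callable

# Assumed names for "the build is over" states; not taken from the thread.
TERMINAL_STATES = {"passed", "failed", "errored", "canceled"}


def wait_for_build(fetch_state: Callable[[], str],
                   poll_seconds: float = 60.0,
                   timeout_seconds: float = 3 * 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_state() until a terminal state appears or the timeout elapses."""
    waited = 0.0
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            return "timeout"
        sleep(poll_seconds)  # the worker is occupied but idle here
        waited += poll_seconds


# Usage with a fake sequence of states and a no-op sleep:
states = iter(["created", "started", "started", "passed"])
print(wait_for_build(lambda: next(states), poll_seconds=1, sleep=lambda s: None))
```

With a 2h build, nearly all of the worker's time is spent in the `sleep` call, which is exactly the idle occupation described above.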

     On 29/06/2019 14:56, Jeff Zhang wrote:
     > Here's what the Zeppelin community did: we wrote a Python script to
     > check the build status of a pull request.
     > Here's the script:
     > https://github.com/apache/zeppelin/blob/master/travis_check.py
     >
     > And this is the script we use in the Jenkins build job.
     >
     > if [ -f "travis_check.py" ]; then
     >    git log -n 1
     >    # extract the PR URL and the author URL from the Jenkins build page
     >    STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
     >      sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
     >    AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
     >    PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
     >    #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
     >    #if [ -z $COMMIT ]; then
     >    #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
     >    #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
     >    #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
     >    #    grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
     >    #fi
     >
     >    # get commit hash from PR
     >    COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
     >      grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
     >      sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
     >      grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
     >    sleep 30 # wait a few moments for travis to start the build
     >    RET_CODE=0
     >    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
     >    if [ $RET_CODE -eq 2 ]; then
     >      # try with the repository name when travis-ci is not available in the account
     >      RET_CODE=0
     >      AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
     >        grep '"full_name":' | grep -v "apache/zeppelin" |
     >        sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
     >      python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
     >    fi
     >
     >    if [ $RET_CODE -eq 2 ]; then
     >      # fail: can't find build information in travis
     >      set +x
     >      echo "-----------------------------------------------------"
     >      echo "Looks like travis-ci is not configured for your fork."
     >      echo "Please set up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
     >      echo "And then make sure the 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
     >      echo ""
     >      echo "To trigger CI after setup, you will need to amend your last commit with"
     >      echo "git commit --amend"
     >      echo "git push your-remote HEAD --force"
     >      echo ""
     >      echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
     >    fi
     >
     >    exit $RET_CODE
     > else
     >    set +x
     >    echo "travis_check.py does not exist"
     >    exit 1
     > fi
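
The two `sed 's/.*[/]\(.*\)$/\1/g'` steps in the script above simply keep whatever follows the last slash of a URL. A rough Python equivalent is sketched below, assuming STATUS holds the PR URL followed by the author's URL; the function names and the sample URLs are hypothetical.

```python
from typing import Tuple


def last_segment(url: str) -> str:
    """Return the text after the final '/', as the sed expressions in the script do."""
    return url.rstrip().rsplit("/", 1)[-1]


def parse_status(status: str) -> Tuple[str, str]:
    """Split 'PR_URL AUTHOR_URL' into (author, pr_number)."""
    pr_url, author_url = status.split()
    return last_segment(author_url), last_segment(pr_url)


# Hypothetical sample values, for illustration only:
status = "https://github.com/apache/zeppelin/pull/1234 https://github.com/someuser"
print(parse_status(status))  # ('someuser', '1234')
```

Parsing the GitHub API's JSON response instead of grepping it would be sturdier, but that goes beyond what the script shown here does.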
     >
     > On Sat, Jun 29, 2019 at 3:17 PM, Chesnay Schepler <ches...@apache.org> wrote:
     >
     >> Does this imply that a Jenkins job is active as long as the Travis build
     >> runs?
     >>
     >> On 26/06/2019 21:28, Bowen Li wrote:
     >>> Hi,
     >>>
     >>> @Dawid, I think the "long test running" that I mentioned in the first
     >>> email, as you also said, belongs to "a big effort which is much harder
     >>> to accomplish in a short period of time and may deserve its own
     >>> separate discussion". Thus I didn't include it in what we can do in the
     >>> foreseeable short term.
     >>>
     >>> Besides, I don't think that's the ultimate reason for the lack of build
     >>> resources. Even if the build is shortened to something like 2h, the
     >>> problem I described, where no build machine does any work for 6 or more
     >>> hours in PST daytime, will still happen, because no machine from ASF
     >>> INFRA's pool is allocated to Flink. As I have paid close attention to
     >>> the build queue over the past few weekdays, it's a pretty clear pattern
     >>> now.
     >>>
     >>> **The ultimate root cause** for that is that we don't have any
     >>> **dedicated** build resources that we can stably rely on. I'm actually
     >>> OK with waiting for a long time if there are build requests running; it
     >>> means at least we are making progress. But I'm not OK with no build
     >>> resources. A better place I think we should aim for in the short term
     >>> is to always have at least a central pool (can be 3 or 5) of machines
     >>> dedicated to building Flink at any time, or maybe to use users'
     >>> resources.
     >>>
     >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
     >>> using a Jenkins job to automatically build on users' Travis accounts
     >>> and link the result back to the GitHub PR. I guess the Jenkins job
     >>> would fetch the latest upstream master and build the PR against it.
     >>> Jeff has filed tickets to learn about and get access to the Jenkins
     >>> infra. It'd be better to fully understand it first before judging this
     >>> approach.
     >>>
     >>> I also heard good things about CircleCI, and ASF INFRA seems to have a
     >>> pool of build capacity there too. It can be an alternative to consider.
     >>>
     >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz
     >>> <dwysakow...@apache.org> wrote:
     >>>
     >>>> Sorry to jump in late, but I think Bowen missed the most important point
     >>>> from Chesnay's previous message in the summary. The ultimate reason for
     >>>> all the problems is that the tests already take close to 2 hours to run.
     >>>> I fully support this claim: "Unless people start caring about test times
     >>>> before adding them, this issue cannot be solved"
     >>>>
     >>>> This is also another reason why using users' Travis accounts won't help.
     >>>> Every few weeks we reach the user account's time limit for a single
     >>>> profile. This makes the user's builds simply fail, until we either
     >>>> properly decrease the time the tests take (which I am not sure we ever
     >>>> did) or postpone the problem by splitting into more profiles. (Note
     >>>> that the ASF Travis account has higher time limits.)
     >>>>
     >>>> Best,
     >>>>
     >>>> Dawid
     >>>>
     >>>> On 26/06/2019 09:36, Robert Metzger wrote:
     >>>>> Do we know if using "the best" available hardware would improve the
     >>>>> build times?
     >>>>> Imagine we ran the build on machines with plenty of main memory to
     >>>>> mount everything on a ramdisk, plus the latest CPU architecture?
     >>>>>
     >>>>> Throwing hardware at the problem could help reduce the time of an
     >>>>> individual build, and using our own infrastructure would remove our
     >>>>> dependency on Apache's Travis account (with the obvious downside of
     >>>>> having to maintain the infrastructure).
     >>>>> We could use an open source Travis alternative, to have a similar
     >>>>> experience and make the migration easy.
     >>>>>
     >>>>>
     >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <ches...@apache.org>
     >>>>> wrote:
     >>>>>> From what I gathered, there's no special sauce that the Zeppelin
     >>>>>> project uses which actually integrates a user's Travis account into
     >>>>>> the PR. They just disabled Travis for PRs. And that's kind of it.
     >>>>>>
     >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
     >>>>>> resources, but there are downsides:
     >>>>>>
     >>>>>> The discoverability of the Travis check takes a nose-dive. Either we
     >>>>>> require every contributor to always, on every commit, also post a
     >>>>>> Travis build, or we have the reviewer sift through the contributor's
     >>>>>> account to find it.
     >>>>>>
     >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to
     >>>>>> having a PR build.
     >>>>>>
     >>>>>> A normal branch build takes a branch as is and tests it. A PR build
     >>>>>> merges the branch into master, and then runs it. (Fun fact: This is
     >>>>>> why a PR with merge conflicts is not being run on Travis.)
     >>>>>>
     >>>>>> And ultimately, everyone can already make use of this approach anyway.
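
The branch-build vs. PR-build distinction above maps onto two refs that GitHub exposes for every pull request: `refs/pull/<N>/head` (the contributor's branch tip as pushed) and `refs/pull/<N>/merge` (a trial merge of that tip into the target branch, which GitHub cannot compute while the PR has conflicts). A small sketch that builds the corresponding fetch command; the helper names, the remote name and the PR number are made up for illustration.

```python
from typing import List


def build_ref(pr_number: int, kind: str) -> str:
    """Return the GitHub ref for a pull request.

    kind="head"  -> branch tip as pushed (what a branch-style build tests)
    kind="merge" -> trial merge into the target branch (what a PR build tests)
    """
    if kind not in ("head", "merge"):
        raise ValueError(kind)
    return f"refs/pull/{pr_number}/{kind}"


def fetch_command(remote: str, pr_number: int, kind: str) -> List[str]:
    """Assemble the git command that fetches the chosen ref."""
    return ["git", "fetch", remote, build_ref(pr_number, kind)]


# Hypothetical remote and PR number:
print(" ".join(fetch_command("origin", 1234, "merge")))
```

A build on a contributor's own Travis account only ever sees the equivalent of the `head` ref, which is why it is not a substitute for a PR build.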
     >>>>>>
     >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
     >>>>>>> Hi Jeff,
     >>>>>>>
     >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
     >>>>>>> leverage users' Travis accounts.
     >>>>>>> In this way, we can have almost unlimited concurrent build jobs, and
     >>>>>>> developers can restart builds by themselves (currently only committers
     >>>>>>> can restart a PR's build).
     >>>>>>>
     >>>>>>> But I'm still not very clear on how to integrate a user's Travis build
     >>>>>>> into the Flink pull request's build automatically. Can you explain in
     >>>>>>> more detail?
     >>>>>>>
     >>>>>>> Another question: does Travis only build branches for user accounts?
     >>>>>>> My concern is that builds for PRs rebase the user's commits against
     >>>>>>> the current master branch, which helps us find problems before
     >>>>>>> merging. Builds for branches lose the impact of new commits in master.
     >>>>>>> How does Zeppelin solve this problem?
     >>>>>>>
     >>>>>>> Thanks again for sharing the idea.
     >>>>>>>
     >>>>>>> Regards,
     >>>>>>> Jark
     >>>>>>>
     >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <zjf...@gmail.com> wrote:
     >>>>>>>
     >>>>>>>       Hi Folks,
     >>>>>>>
     >>>>>>>       Zeppelin met this kind of issue before; we solved it by
     >>>>>>>       delegating each contributor's PR build to their own Travis
     >>>>>>>       account (everyone can have 5 free slots for Travis builds).
     >>>>>>>       The Apache account's Travis build is only triggered when a PR
     >>>>>>>       is merged.
     >>>>>>>
     >>>>>>>
     >>>>>>>
     >>>>>>>       On Tue, Jun 25, 2019 at 10:16 AM, Kurt Young <ykt...@gmail.com> wrote:
     >>>>>>>
     >>>>>>>       > (Forgot to cc George)
     >>>>>>>       >
     >>>>>>>       > Best,
     >>>>>>>       > Kurt
     >>>>>>>       >
     >>>>>>>       >
     >>>>>>>       > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
     >>>>>>>       >
     >>>>>>>       > > Hi Bowen,
     >>>>>>>       > >
     >>>>>>>       > > Thanks for bringing this up. We have actually discussed this,
     >>>>>>>       > > and I think Till and George have already spent some time
     >>>>>>>       > > investigating it. I have cc'ed both of them, and maybe they
     >>>>>>>       > > can share their findings.
     >>>>>>>       > >
     >>>>>>>       > > Best,
     >>>>>>>       > > Kurt
     >>>>>>>       > >
     >>>>>>>       > >
     >>>>>>>       > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
     >>>>>>>       > >
     >>>>>>>       > >> Hi Bowen,
     >>>>>>>       > >>
     >>>>>>>       > >> Thanks for bringing this up. We also suffered from the long
     >>>>>>>       > >> build time.
     >>>>>>>       > >> I agree that we should focus on solving the build capacity
     >>>>>>>       > >> problem in this thread.
     >>>>>>>       > >>
     >>>>>>>       > >> My observation is that only one build is running; all the
     >>>>>>>       > >> others (other PRs, master) are pending.
     >>>>>>>       > >> The pricing plan [1] of Travis shows it can support
     >>>>>>>       > >> concurrent build jobs.
     >>>>>>>       > >> But I don't know which plan we are using; it might be the
     >>>>>>>       > >> free plan for open source.
     >>>>>>>       > >>
     >>>>>>>       > >> I cc'ed Chesnay, who may have some experience with Travis.
     >>>>>>>       > >>
     >>>>>>>       > >> Regards,
     >>>>>>>       > >> Jark
     >>>>>>>       > >>
     >>>>>>>       > >> [1]: https://travis-ci.com/plans
     >>>>>>>       > >>
     >>>>>>>       > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
     >>>>>>>       > >>
     >>>>>>>       > >> > Hi Steven,
     >>>>>>>       > >> >
     >>>>>>>       > >> > I think you may not have read what I wrote. The discussion
     >>>>>>>       > >> > is about "unstable build **capacity**", in other words
     >>>>>>>       > >> > "unstable / lack of build resources", not "unstable build".
     >>>>>>>       > >> >
     >>>>>>>       > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
     >>>>>>>       > >> >
     >>>>>>>       > >> > > Long and sometimes unstable builds are definitely a pain
     >>>>>>>       > >> > > point.
     >>>>>>>       > >> > >
     >>>>>>>       > >> > > I suspect the build failure here in flink-connector-kafka
     >>>>>>>       > >> > > is not related to my change, but there is no easy way to
     >>>>>>>       > >> > > re-run the build in the Travis UI. A Google search showed
     >>>>>>>       > >> > > a trick: closing and reopening the PR will trigger a
     >>>>>>>       > >> > > rebuild, but that could add noise to the PR activity.
     >>>>>>>       > >> > > https://travis-ci.org/apache/flink/jobs/545555519
     >>>>>>>       > >> > >
     >>>>>>>       > >> > > travis-ci for my personal repo often failed by exceeding
     >>>>>>>       > >> > > the time limit after 4+ hours:
     >>>>>>>       > >> > > "The job exceeded the maximum time limit for jobs, and
     >>>>>>>       > >> > > has been terminated."
     >>>>>>>       > >> > >
     >>>>>>>       > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
     >>>>>>>       > >> > >
     >>>>>>>       > >> > > > https://travis-ci.org/apache/flink/builds/549681530
     >>>>>>>       > >> > > > This build request has been sitting at the **HEAD of
     >>>>>>>       > >> > > > the queue** since I first saw it at PST 10:30am (not
     >>>>>>>       > >> > > > sure how long it had been there before 10:30am). It's
     >>>>>>>       > >> > > > PST 4:12pm now and it hasn't started yet.
     >>>>>>>       > >> > > >
     >>>>>>>       > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <bowenl...@gmail.com> wrote:
     >>>>>>>       > >> > > >
     >>>>>>>       > >> > > > > Hi devs,
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > I've been experiencing the pain resulting from the
     >>>>>>>       > >> > > > > lack of stable build capacity on Travis for Flink
     >>>>>>>       > >> > > > > PRs [1]. Specifically, I often noticed that no build
     >>>>>>>       > >> > > > > in the queue makes any progress for hours, and then
     >>>>>>>       > >> > > > > suddenly 5 or 6 builds kick off all together after
     >>>>>>>       > >> > > > > the long pause. I'm in the PST (UTC-08) time zone,
     >>>>>>>       > >> > > > > and I've seen the pause last as long as 6 hours, from
     >>>>>>>       > >> > > > > PST 9am to 3pm (let alone the time needed to drain
     >>>>>>>       > >> > > > > the queue afterwards).
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > I think this has greatly impacted our productivity.
     >>>>>>>       > >> > > > > I've experienced that PRs submitted in the early
     >>>>>>>       > >> > > > > morning of the PST time zone won't finish their
     >>>>>>>       > >> > > > > build until late night of the same day.
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > So my questions are:
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > - Has anyone else experienced the same problem or
     >>>>>>>       > >> > > > > made similar observations on TravisCI? (I suspect it
     >>>>>>>       > >> > > > > has something to do with the time zone)
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > - What pricing plan of TravisCI is Flink currently
     >>>>>>>       > >> > > > > using? Is it the free plan for open source projects?
     >>>>>>>       > >> > > > > What is the guaranteed build capacity of the current
     >>>>>>>       > >> > > > > plan?
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > - If the current pricing plan (either free or paid)
     >>>>>>>       > >> > > > > can't provide stable build capacity, can we upgrade
     >>>>>>>       > >> > > > > to a higher priced plan with larger and more stable
     >>>>>>>       > >> > > > > build capacity?
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > BTW, another factor that contributes to the
     >>>>>>>       > >> > > > > productivity problem is that our build is slow - we
     >>>>>>>       > >> > > > > run a full build for every PR and a successful full
     >>>>>>>       > >> > > > > build takes ~5h. We definitely have more options to
     >>>>>>>       > >> > > > > solve it, for instance, modularizing the build graph
     >>>>>>>       > >> > > > > and reusing artifacts from the previous build. But I
     >>>>>>>       > >> > > > > think that can be a big effort which is much harder
     >>>>>>>       > >> > > > > to accomplish in a short period of time and may
     >>>>>>>       > >> > > > > deserve its own separate discussion.
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > > >
     >>>>>>>       > >> > > >
     >>>>>>>       > >> > >
     >>>>>>>       > >> >
     >>>>>>>       > >>
     >>>>>>>       > >
     >>>>>>>       >
     >>>>>>>
     >>>>>>>
     >>>>>>>       --
     >>>>>>>       Best Regards
     >>>>>>>
     >>>>>>>       Jeff Zhang
     >>>>>>>
     >>
