Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Bowen Li
Re: > Are they using their own Travis CI pool, or did they switch to an
entirely different CI service?

I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house bare-metal
machines at [1], with a custom CI application at [2]. They've seen
significant improvements: much higher performance and basically no waiting
time for resources. A "night-and-day" difference, quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our project,
then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ 
[2] https://github.com/ursa-labs/ursabot
[3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com




Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Chesnay Schepler
Are they using their own Travis CI pool, or did they switch to an
entirely different CI service?


If we can just switch to our own Travis pool, just for our project, then 
this might be something we can do fairly quickly?



Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Bowen Li
I responded in the INFRA ticket [1] that I believe they are using the wrong
metric against Flink: total build time is a completely different thing from
guaranteed build capacity.

My response:

"As mentioned above, I'm in Seattle, and since I started paying attention to
Flink's build queue several weeks ago, I have seen no builds kicking off for
Flink during PST daytime on weekdays. Our teammates in China and Europe
have reported similar observations. So we need to evaluate where the
large total build time came from. If 1) your number and 2) our
observations from three locations, which together cover pretty much a full
day, are all true, I **guess** one reason is that the extra build time most
likely came from weekends, when other Apache projects may be idle and
Flink just drains its congested queue hard.

Please be aware that we're not complaining about the lack of resources
in general; I'm complaining about the lack of **stable, dedicated**
resources. An example of the latter: currently, even if no build is
in Flink's queue and I submit a request that becomes the queue head in the
PST morning, my build won't start for 6-8+ hours. That is an absurd amount
of waiting time.

That is to say, if ASF INFRA decides to adopt a quota system and grants
Flink five DEDICATED servers that run all the time only for Flink, that'll
be PERFECT and can totally solve our problem now.

I feel what's missing in ASF INFRA's Travis resource pool is some level
of build-capacity SLA and certainty."


Again, I believe these two problems differ in nature: long build time
vs. lack of dedicated build resources. That is to say, shortening build
time may relieve the situation, or it may not. I'm slightly negative on
disabling IT cases for PRs: the downside is that we risk letting through
bugs in PRs that UTs don't catch, which may cost a lot more to fix later,
especially if they slow down or even block others. But I'm open to others'
opinions on it.

AFAICT from the INFRA ticket [1], donating to ASF INFRA won't solve our
problem, since INFRA's pool is fully shared and they have no control over,
or finer insight into, resource allocation to a specific Apache project.
As mentioned in [1], Apache Arrow is moving away from the ASF INFRA
Travis pool (they are actually surprised Flink hasn't planned to do so). I
know that Spark is on its own build infra. If we all agree on funding our
own build infra, I'd be glad to help investigate potential options
after releasing 1.9, since I'm super busy with 1.9 now.

[1] https://issues.apache.org/jira/browse/INFRA-18533




Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
As a short-term stopgap, since we can assume this issue will become much
worse in the coming days/weeks, we could disable IT cases in PRs and
only run them on master.



Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
People really have to stop thinking that just because something works 
for us it is also a good solution.
Also, please remember that our builds run for 2h from start to finish,
and not the 14 _minutes_ it takes for Zeppelin.
We are dealing with an entirely different scale here, both in terms of 
build times and number of builds.


In this very thread people have been complaining about long queue times 
for their builds. Surprise, other Apache projects have been suffering 
the very same thing due to us not controlling our build times. While 
switching services (be it Jenkins, CircleCI or whatever) will possibly 
work for us (and these options are actually attractive, like CircleCI's 
proper support for build artifacts), it will also result in us likely 
negatively affecting other projects in significant ways.


Sure, the Jenkins setup has a good user experience for us, at the cost
of blocking Jenkins workers for a _lot_ of time. Right now we have 25
PRs in our queue; at 2h per build that's possibly 50h of Jenkins
resources we'd consume, and the European contributors haven't even really
started yet.


FYI, the latest INFRA response from INFRA-18533:

"Our rough metrics shows that Flink used over 5800 hours of build time 
last month. That is equal to EIGHT servers running 24/7 for the ENTIRE 
MONTH. EIGHT. nonstop.
When we discovered this last night, we discussed it some and are going 
to tune down Flink to allow only five executors maximum. We cannot allow 
Flink to consume so much of a Foundation shared resource."
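
For reference, the "EIGHT servers" figure in the INFRA quote can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes a 30-day month (INFRA's exact accounting is not stated), and the function names are made up for illustration.

```python
# Sanity check of the "EIGHT servers" figure quoted from INFRA-18533.
# Assumption: a 30-day month; INFRA's exact accounting is not stated.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumed


def equivalent_servers(build_hours: float, days: int = DAYS_PER_MONTH) -> float:
    """Servers running 24/7 needed to supply `build_hours` within `days` days."""
    return build_hours / (HOURS_PER_DAY * days)


def monthly_capacity_hours(executors: int, days: int = DAYS_PER_MONTH) -> int:
    """Upper bound on build hours that `executors` can deliver in `days` days."""
    return executors * HOURS_PER_DAY * days


if __name__ == "__main__":
    print(round(equivalent_servers(5800), 2))  # 8.06
    print(monthly_capacity_hours(5))           # 3600
```

Under the same assumption, the announced cap of five executors tops out at 3600 build hours a month, well below the 5800 hours reported.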


So yes, we either
a) have to heavily reduce our CI usage or
b) fund our own, either maintaining it ourselves or donating to Apache.


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-01 Thread Bowen Li
Looking at the git history of the Jenkins script, its core part was
finished in March 2017 (with only two minor updates in 2017/2018), so it's
been running for over two years now, and it feels like the Zeppelin
community has been quite happy with it. @Jeff Zhang, can you share your
insights and user experience with the Jenkins+Travis approach?

Things like:

- Has the approach completely solved the resource capacity problem for the
Zeppelin community? Is the Zeppelin community happy with the result?
- Is the whole configuration chain stable enough (e.g. uptime)?
- How often do you need to maintain the Jenkins infra? How many people are
usually involved in maintenance and bug fixes?

The downside of this approach seems to me to be mostly maintenance:
maintaining the script and the Jenkins infra.

** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own travis-ci.com
account with paid dedicated resources. Note that travis-ci.org is the free
version and travis-ci.com is the commercial version. We currently use a
shared resource pool managed by the ASF INFRA team on travis-ci.org, but we
have no control over it: we can't see how it's configured, how many
resources are available, how resources are allocated among Apache projects,
etc. The nice things about having an account on travis-ci.com are:

- relatively low cost with a much better resource guarantee than what we
currently have [1]: $249/month for 5 dedicated concurrent jobs, $489/month
for 10 concurrent jobs
- low maintenance work compared to using Jenkins
- (potentially) no migration cost according to Travis's doc [2] (pending
verification)
- full control over the build capacity/configuration compared to using ASF
INFRA's pool

I'd be surprised if we, as such a vibrant community, cannot find and fund
$249*12=$2988 a year in exchange for a much better developer experience and
much higher productivity.
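
A rough way to compare the two plans is sketched below. The prices are the ones quoted in this message and may have changed. The 2h build duration is the figure mentioned elsewhere in this thread, and treating one build as a single job that holds one slot for 2h is a simplifying assumption (a build that fans out into several parallel jobs would lower the real throughput). Function names are illustrative only.

```python
# Plan prices as quoted in this message ([1]); they may have changed.
PLANS = {5: 249, 10: 489}  # concurrent jobs -> USD per month

# Assumption: one build occupies exactly one slot for this many hours.
BUILD_HOURS = 2.0


def annual_cost(monthly_usd: int) -> int:
    """Yearly price of a plan billed monthly."""
    return monthly_usd * 12


def max_builds_per_day(concurrency: int, build_hours: float = BUILD_HOURS) -> float:
    """Upper bound on builds per day if every slot is busy around the clock."""
    return concurrency * 24 / build_hours


if __name__ == "__main__":
    for slots, price in PLANS.items():
        print(slots, annual_cost(price), max_builds_per_day(slots))
    # 5 2988 60.0
    # 10 5868 120.0
```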

[1] https://travis-ci.com/plans
[2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration

On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler  wrote:

> So yes, the Jenkins job keeps pulling the state from Travis until it
> finishes.
>
> Not sure I'm comfortable with the idea of using Jenkins workers just to
> idle for several hours.
>
> On 29/06/2019 14:56, Jeff Zhang wrote:
> > Here's what zeppelin community did, we make a python script to check the
> > build status of pull request.
> > Here's script:
> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >
> > And this is the script we used in Jenkins build job.
> >
> > if [ -f "travis_check.py" ]; then
> >git log -n 1
> >STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
> sed
> > 's/.*GitHub pull request  > href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >#COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >#if [ -z $COMMIT ]; then
> >#  COMMIT=$(curl -s
> https://api.github.com/repos/apache/zeppelin/pulls/$PR
> > | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed
> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:"
> |
> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >#fi
> >
> ># get commit hash from PR
> >COMMIT=$(curl -s
> https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> > grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed
> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:"
> |
> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >sleep 30 # sleep few moment to wait travis starts the build
> >RET_CODE=0
> >python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >if [ $RET_CODE -eq 2 ]; then # try with repository name when
> travis-ci is
> > not available in the account
> >  RET_CODE=0
> >  AUTHOR=$(curl -s
> https://api.github.com/repos/apache/zeppelin/pulls/$PR
> > | grep '"full_name":' | grep -v "apache/zeppelin" | sed
> > 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >fi
> >
> >if [ $RET_CODE -eq 2 ]; then # fail with can't find build information
> in
> > the travis
> >  set +x
> >  echo "-"
> >  echo "Looks like travis-ci is not configured for your fork."
> >  echo "Please setup by swich on 'zeppelin' repository at
> > https://travis-ci.org/profile and travis-ci."
> >  echo "And then make sure 'Build branch updates' option is enabled in
> > the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings.;
> >  echo ""
> >  echo "To trigger CI after setup, you will need ammend your last
> commit
> > with"
> >  echo "git commit --amend"
> >  echo "git push your-remote HEAD --force"
> >  echo ""
> >  echo "See
> >
> 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Chesnay Schepler
So yes, the Jenkins job keeps pulling the state from Travis until it 
finishes.


Not sure I'm comfortable with the idea of using Jenkins workers just to
idle for several hours.
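
To make the concern concrete, here is a minimal sketch of a poll-until-finished loop. It is not the actual travis_check.py: `fetch_state` is a hypothetical callable standing in for whatever queries Travis, and the set of terminal state names is an assumption. The point is that the caller (here, the Jenkins worker) stays occupied for the whole duration of the remote build.

```python
import time
from typing import Callable

# Assumed names for states in which a build is finished.
TERMINAL_STATES = {"passed", "failed", "errored", "canceled"}


def wait_for_build(
    fetch_state: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 4 * 3600,
) -> str:
    """Poll `fetch_state` until it reports a terminal state or the timeout hits.

    The calling process is blocked for as long as the remote build runs.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"build still in state {state!r} after timeout")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Stand-in for a real status query: a canned sequence of states.
    states = iter(["created", "started", "started", "passed"])
    print(wait_for_build(lambda: next(states), poll_seconds=0))  # passed
```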


On 29/06/2019 14:56, Jeff Zhang wrote:

Here's what the Zeppelin community did: we made a Python script to check the
build status of a pull request.
Here's the script:
https://github.com/apache/zeppelin/blob/master/travis_check.py

And this is the script we use in the Jenkins build job.

if [ -f "travis_check.py" ]; then
  git log -n 1
  # extract the PR URL and the author's repository URL from the Jenkins build page
  STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
  AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
  PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
  #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
  #if [ -z $COMMIT ]; then
  #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  #fi

  # get commit hash from PR
  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  sleep 30 # wait a moment for travis to start the build
  RET_CODE=0
  python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
    RET_CODE=0
    AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  fi

  if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in travis
    set +x
    echo "-"
    echo "Looks like travis-ci is not configured for your fork."
    echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
    echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
    echo ""
    echo "To trigger CI after setup, you will need to amend your last commit with"
    echo "git commit --amend"
    echo "git push your-remote HEAD --force"
    echo ""
    echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
  fi

  exit $RET_CODE
else
  set +x
  echo "travis_check.py does not exist"
  exit 1
fi

On Sat, Jun 29, 2019 at 3:17 PM, Chesnay Schepler wrote:


Does this imply that a Jenkins job is active as long as the Travis build
runs?

On 26/06/2019 21:28, Bowen Li wrote:

Hi,

@Dawid, I think the "long test running" as I mentioned in the first email,
also as you guys said, belongs to "a big effort which is much harder to
accomplish in a short period of time and may deserve its own separate
discussion". Thus I didn't include it in what we can do in a foreseeable
short term.

Besides, I don't think that's the ultimate reason for lack of build
resources. Even if the build is shortened to something like 2h, the
problems of no build machine works about 6 or more hours in PST daytime
that I described will still happen, because no machine from ASF INFRA's
pool is allocated to Flink. As I have paid close attention to the build
queue in the past few weekdays, it's a pretty clear pattern now.

**The ultimate root cause** for that is - we don't have any **dedicated**
build resources that we can stably rely on. I'm actually ok to wait for a
long time if there are build requests running, it means at least we are
making progress. But I'm not ok with no build resource. A better place I
think we should aim at in short term is to always have at least a central
pool (can be 3 or 5) of machines dedicated to build Flink at any time, or
maybe use users resources.

@Chesnay @Robert I synced with Jeff offline that Zeppelin community is
using a Jenkins job to automatically build on users' travis account and
link the result back to github PR. I guess the Jenkins job would fetch
latest upstream master and build the PR against it. Jeff has filed tickets
to learn and get access to the Jenkins infra. It'll better to fully
understand it first before judging this approach.

I also heard good things about CircleCI, and ASF INFRA seems to have a pool
of build capacity there too. Can be an alternative to consider.









On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:


Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Jeff Zhang
Here's what the Zeppelin community did: we wrote a Python script to check
the build status of a pull request.
Here's the script:
https://github.com/apache/zeppelin/blob/master/travis_check.py

And this is the script we use in the Jenkins build job.

if [ -f "travis_check.py" ]; then
  git log -n 1
  STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed
's/.*GitHub pull request https://api.github.com/repos/apache/zeppelin/pulls/$PR
| grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed
's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  #fi

  # get commit hash from PR
  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
    grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  sleep 30 # wait a moment for Travis to start the build
  RET_CODE=0
  python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  # retry with the owner taken from the repository name when travis-ci is
  # not available in the account
  if [ $RET_CODE -eq 2 ]; then
    RET_CODE=0
    AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
      grep '"full_name":' | grep -v "apache/zeppelin" |
      sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  fi

  # fail when the build information cannot be found in Travis
  if [ $RET_CODE -eq 2 ]; then
    set +x
    echo "-"
    echo "Looks like travis-ci is not configured for your fork."
    echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
    echo "And then make sure the 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
    echo ""
    echo "To trigger CI after setup, you will need to amend your last commit with"
    echo "git commit --amend"
    echo "git push your-remote HEAD --force"
    echo ""
    echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
  fi

  exit $RET_CODE
else
  set +x
  echo "travis_check.py does not exist"
  exit 1
fi
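
The grep/sed pipeline in the script pulls the PR's head commit and the fork
owner out of the GitHub API response by pattern-matching the raw JSON text.
As a minimal sketch, assuming the `head.sha` and `head.repo.full_name` fields
of GitHub's pull request payload, the same extraction could be done by
parsing the JSON. The function name `head_commit_and_owner` and the sample
payload (a made-up SHA and fork name) are hypothetical and are not part of
the Zeppelin script:

```python
import json
from typing import Tuple


def head_commit_and_owner(pr_json: str) -> Tuple[str, str]:
    """Return (commit SHA, fork owner) for the head of a pull request.

    pr_json is the response body of
    https://api.github.com/repos/apache/zeppelin/pulls/$PR
    """
    pr = json.loads(pr_json)
    head = pr["head"]                     # the contributor's side of the PR
    sha = head["sha"]
    owner = head["repo"]["full_name"].split("/")[0]
    return sha, owner


# Hypothetical, heavily trimmed payload used only to exercise the function.
sample = json.dumps({
    "head": {"label": "someuser:my-fix", "ref": "my-fix", "sha": "0123abcd",
             "repo": {"full_name": "someuser/zeppelin"}},
    "base": {"label": "apache:master", "ref": "master", "sha": "89efcdab",
             "repo": {"full_name": "apache/zeppelin"}},
})

commit, author = head_commit_and_owner(sample)
print(commit, author)
```

The two values correspond to what the shell script passes to
travis_check.py as ${COMMIT} and ${AUTHOR}.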

On Sat, Jun 29, 2019 at 3:17 PM, Chesnay Schepler wrote:

> Does this imply that a Jenkins job is active as long as the Travis build
> runs?
>
> On 26/06/2019 21:28, Bowen Li wrote:
> > Hi,
> >
> > @Dawid, I think the "long test running" as I mentioned in the first
> email,
> > also as you guys said, belongs to "a big effort which is much harder to
> > accomplish in a short period of time and may deserve its own separate
> > discussion". Thus I didn't include it in what we can do in a foreseeable
> > short term.
> >
> > Besides, I don't think that's the ultimate reason for lack of build
> > resources. Even if the build is shortened to something like 2h, the
> > problems of no build machine works about 6 or more hours in PST daytime
> > that I described will still happen, because no machine from ASF INFRA's
> > pool is allocated to Flink. As I have paid close attention to the build
> > queue in the past few weekdays, it's a pretty clear pattern now.
> >
> > **The ultimate root cause** for that is - we don't have any **dedicated**
> > build resources that we can stably rely on. I'm actually ok to wait for a
> > long time if there are build requests running, it means at least we are
> > making progress. But I'm not ok with no build resource. A better place I
> > think we should aim at in short term is to always have at least a central
> > pool (can be 3 or 5) of machines dedicated to build Flink at any time, or
> > maybe use users resources.
> >
> > @Chesnay @Robert I synced with Jeff offline that Zeppelin community is
> > using a Jenkins job to automatically build on users' travis account and
> > link the result back to github PR. I guess the Jenkins job would fetch
> > latest upstream master and build the PR against it. Jeff has filed
> tickets
> > to learn and get access to the Jenkins infra. It'll better to fully
> > understand it first before judging this approach.
> >
> > I also heard good things about CircleCI, and ASF INFRA seems to have a
> pool
> > of build capacity there too. Can be an alternative to consider.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <
> dwysakow...@apache.org>
> > wrote:
> >
> >> Sorry to jump in late, but I think Bowen missed the most important point
> >> from Chesnay's previous message in the summary. The ultimate reason for
> >> all the problems is that the tests take close to 2 hours to run already.
> >> I fully support this claim: "Unless people start caring about test times
> >> before adding them, this issue cannot be solved"
> >>
> >> This is also another reason why using user's Travis account won't help.
> >> Every few weeks we reach the user's time limit for a single profile.
> >> This makes the user's builds simply fail, 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-29 Thread Chesnay Schepler
Does this imply that a Jenkins job is active as long as the Travis build 
runs?


On 26/06/2019 21:28, Bowen Li wrote:

Hi,

@Dawid, I think the "long test running" as I mentioned in the first email,
also as you guys said, belongs to "a big effort which is much harder to
accomplish in a short period of time and may deserve its own separate
discussion". Thus I didn't include it in what we can do in a foreseeable
short term.

Besides, I don't think that's the ultimate reason for lack of build
resources. Even if the build is shortened to something like 2h, the
problems of no build machine works about 6 or more hours in PST daytime
that I described will still happen, because no machine from ASF INFRA's
pool is allocated to Flink. As I have paid close attention to the build
queue in the past few weekdays, it's a pretty clear pattern now.

**The ultimate root cause** for that is - we don't have any **dedicated**
build resources that we can stably rely on. I'm actually ok to wait for a
long time if there are build requests running, it means at least we are
making progress. But I'm not ok with no build resource. A better place I
think we should aim at in short term is to always have at least a central
pool (can be 3 or 5) of machines dedicated to build Flink at any time, or
maybe use users resources.

@Chesnay @Robert I synced with Jeff offline that Zeppelin community is
using a Jenkins job to automatically build on users' travis account and
link the result back to github PR. I guess the Jenkins job would fetch
latest upstream master and build the PR against it. Jeff has filed tickets
to learn and get access to the Jenkins infra. It'll better to fully
understand it first before judging this approach.

I also heard good things about CircleCI, and ASF INFRA seems to have a pool
of build capacity there too. Can be an alternative to consider.









On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz 
wrote:


Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.
This makes the user's builds simply fail, until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits)

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:

Do we know if using "the best" available hardware would improve the build
times?
Imagine we would run the build on machines with plenty of main memory to
mount everything to ramdisk + the latest CPU architecture?

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure)
We could use an open source travis alternative, to have a similar
experience and make the migration easy.


On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler 

wrote:

From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a user's Travis account into the PR.

They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and save the ASF a fair amount of
resources, but there are downsides:

The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, on every commit, also post a Travis
build, or we have the reviewer sift through the contributor's account to
find it.

This is rather cumbersome. Additionally, it's also not equivalent to
having a PR build.

A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not being run on Travis.)
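
The difference between the two build types can be illustrated with a toy
model. This is only a sketch: a "tree" is a plain dict from file name to
content, `merge` is a simplified per-file three-way merge rather than git's
actual algorithm, and all file names and contents are made up. It shows a
branch that passes on its own but fails once merged into a master that has
moved on, and a branch that conflicts with master, where no merged tree
exists to build at all.

```python
from typing import Callable, Dict, Optional

Files = Dict[str, str]  # file name -> content; a toy stand-in for a git tree


def merge(base: Files, master: Files, branch: Files) -> Optional[Files]:
    """Simplified per-file three-way merge; returns None on a conflict."""
    merged: Files = {}
    for path in set(base) | set(master) | set(branch):
        b, m, r = base.get(path), master.get(path), branch.get(path)
        if m == r or r == b:
            result = m        # both sides agree, or only master touched it
        elif m == b:
            result = r        # only the branch touched it
        else:
            return None       # both sides changed it differently
        if result is not None:
            merged[path] = result
    return merged


def branch_build(branch: Files, test: Callable[[Files], bool]) -> bool:
    """Test the branch exactly as it is."""
    return test(branch)


def pr_build(base: Files, master: Files, branch: Files,
             test: Callable[[Files], bool]) -> Optional[bool]:
    """Test the branch merged into current master; None if it cannot merge."""
    merged = merge(base, master, branch)
    return None if merged is None else test(merged)


def check(files: Files) -> bool:
    """Toy test suite: every other file must mention the name in 'api'."""
    return all(files["api"] in text for path, text in files.items() if path != "api")


base = {"api": "old_name", "caller": "uses old_name"}
master = {"api": "new_name", "caller": "uses new_name"}   # master renamed the API
branch = {**base, "feature": "uses old_name"}             # branch adds a feature
conflicting = {"api": "other_name", "caller": "uses other_name"}

print(branch_build(branch, check))                 # True
print(pr_build(base, master, branch, check))       # False
print(pr_build(base, master, conflicting, check))  # None
```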

And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:

Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage user's travis account.
In this way, we can have almost unlimited concurrent build jobs and
developers can restart build by themselves (currently only committers
can restart PR's build).

But I'm still not very clear how to integrate user's travis build into
the Flink pull request's build automatically. Can you explain more in
detail?

Another question: does travis only build branches for user account?
My concern is that builds for PRs will rebase user's commits against
current master branch.
This will help us to find problems before merge.  Builds for branches
will lose the impact of new commits in master.
How does Zeppelin solve 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-27 Thread Chesnay Schepler
See https://issues.apache.org/jira/browse/INFRA-18533 for the overall
degradation of Travis capacity.


On 26/06/2019 21:50, Bowen wrote:

just elaborate a bit more on why slow build is ok but no resource is not: Say I 
submit a build request at PST 9am, no other requests exist and mine is the 
queue head, currently it means it still cannot get built until 4 or 5pm.




On Jun 26, 2019, at 12:28, Bowen Li  wrote:

Hi,

@Dawid, I think the "long test running" as I mentioned in the first email, also as you 
guys said, belongs to "a big effort which is much harder to accomplish in a short period of 
time and may deserve its own separate discussion". Thus I didn't include it in what we can do 
in a foreseeable short term.

Besides, I don't think that's the ultimate reason for lack of build resources. 
Even if the build is shortened to something like 2h, the problems of no build 
machine works about 6 or more hours in PST daytime that I described will still 
happen, because no machine from ASF INFRA's pool is allocated to Flink. As I 
have paid close attention to the build queue in the past few weekdays, it's a 
pretty clear pattern now.

**The ultimate root cause** for that is - we don't have any **dedicated** build 
resources that we can stably rely on. I'm actually ok to wait for a long time 
if there are build requests running, it means at least we are making progress. 
But I'm not ok with no build resource. A better place I think we should aim at 
in short term is to always have at least a central pool (can be 3 or 5) of 
machines dedicated to build Flink at any time, or maybe use users resources.

@Chesnay @Robert I synced with Jeff offline that Zeppelin community is using a 
Jenkins job to automatically build on users' travis account and link the result 
back to github PR. I guess the Jenkins job would fetch latest upstream master 
and build the PR against it. Jeff has filed tickets to learn and get access to 
the Jenkins infra. It'll better to fully understand it first before judging 
this approach.

I also heard good things about CircleCI, and ASF INFRA seems to have a pool of 
build capacity there too. Can be an alternative to consider.










On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz  
wrote:
Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.
This makes the user's builds simply fail, until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits)

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:

Do we know if using "the best" available hardware would improve the build
times?
Imagine we would run the build on machines with plenty of main memory to
mount everything to ramdisk + the latest CPU architecture?

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure)
We could use an open source travis alternative, to have a similar
experience and make the migration easy.


On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler  wrote:


  From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a users Travis account into the PR.

They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and safe the ASF a fair amount of
resources, but there are downsides:

The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, an every commit, also post a Travis
build, or we have the reviewer sift through the contributors account to
find it.

This is rather cumbersome. Additionally, it's also not equivalent to
having a PR build.

A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not being run on Travis.)

And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:

Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage user's travis account.
In this way, we can have almost unlimited concurrent build jobs and
developers can restart build by themselves (currently only committers
can restart PR's build).

But I'm still not very clear how to integrate user's travis build into
the Flink pull request's build automatically. Can you explain 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Bowen
To elaborate a bit more on why a slow build is OK but having no resources is
not: say I submit a build request at 9am PST, no other requests exist, and
mine is at the head of the queue. Currently, it still cannot get built until
4 or 5pm.
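
The arithmetic behind this can be sketched in a few lines. The 9am
submission, the 4pm start, and the roughly 2-hour build are the figures
mentioned in this thread; the function name and the 1-hour "faster build"
are assumptions for illustration only:

```python
def turnaround_hours(submitted: float, started: float, build_hours: float) -> float:
    """Hours from submitting a build request until the build finishes."""
    wait = started - submitted      # time spent queued with no machine assigned
    return wait + build_hours


# Submitted at 9am, picked up at 4pm (hours on a 24h clock).
for build_hours in (2.0, 1.0):
    total = turnaround_hours(submitted=9.0, started=16.0, build_hours=build_hours)
    print(f"build {build_hours}h -> turnaround {total}h")
```

Under these numbers, halving the build time shortens the turnaround only
from 9 hours to 8, because the 7 hours of waiting for a machine dominate.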



> On Jun 26, 2019, at 12:28, Bowen Li  wrote:
> 
> Hi,
> 
> @Dawid, I think the "long test running" as I mentioned in the first email, 
> also as you guys said, belongs to "a big effort which is much harder to 
> accomplish in a short period of time and may deserve its own separate 
> discussion". Thus I didn't include it in what we can do in a foreseeable 
> short term.
> 
> Besides, I don't think that's the ultimate reason for lack of build 
> resources. Even if the build is shortened to something like 2h, the problems 
> of no build machine works about 6 or more hours in PST daytime that I 
> described will still happen, because no machine from ASF INFRA's pool is 
> allocated to Flink. As I have paid close attention to the build queue in the 
> past few weekdays, it's a pretty clear pattern now. 
> 
> **The ultimate root cause** for that is - we don't have any **dedicated** 
> build resources that we can stably rely on. I'm actually ok to wait for a 
> long time if there are build requests running, it means at least we are 
> making progress. But I'm not ok with no build resource. A better place I 
> think we should aim at in short term is to always have at least a central 
> pool (can be 3 or 5) of machines dedicated to build Flink at any time, or 
> maybe use users resources.
> 
> @Chesnay @Robert I synced with Jeff offline that Zeppelin community is using 
> a Jenkins job to automatically build on users' travis account and link the 
> result back to github PR. I guess the Jenkins job would fetch latest upstream 
> master and build the PR against it. Jeff has filed tickets to learn and get 
> access to the Jenkins infra. It'll better to fully understand it first before 
> judging this approach.
> 
> I also heard good things about CircleCI, and ASF INFRA seems to have a pool 
> of build capacity there too. Can be an alternative to consider.
> 
> 
> 
> 
> 
> 
> 
> 
> 
>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz  
>> wrote:
>> Sorry to jump in late, but I think Bowen missed the most important point
>> from Chesnay's previous message in the summary. The ultimate reason for
>> all the problems is that the tests take close to 2 hours to run already.
>> I fully support this claim: "Unless people start caring about test times
>> before adding them, this issue cannot be solved"
>> 
>> This is also another reason why using user's Travis account won't help.
>> Every few weeks we reach the user's time limit for a single profile.
>> This makes the user's builds simply fail, until we either properly
>> decrease the time the tests take (which I am not sure we ever did) or
>> postpone the problem by splitting into more profiles. (Note that the ASF
>> Travis account has higher time limits)
>> 
>> Best,
>> 
>> Dawid
>> 
>> On 26/06/2019 09:36, Robert Metzger wrote:
>> > Do we know if using "the best" available hardware would improve the build
>> > times?
>> > Imagine we would run the build on machines with plenty of main memory to
>> > mount everything to ramdisk + the latest CPU architecture?
>> >
>> > Throwing hardware at the problem could help reduce the time of an
>> > individual build, and using our own infrastructure would remove our
>> > dependency on Apache's Travis account (with the obvious downside of having
>> > to maintain the infrastructure)
>> > We could use an open source travis alternative, to have a similar
>> > experience and make the migration easy.
>> >
>> >
>> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler  
>> > wrote:
>> >
>> >>  From what I gathered, there's no special sauce that the Zeppelin
>> >> project uses which actually integrates a users Travis account into the PR.
>> >>
>> >> They just disabled Travis for PRs. And that's kind of it.
>> >>
>> >> Naturally we can do this (duh) and safe the ASF a fair amount of
>> >> resources, but there are downsides:
>> >>
>> >> The discoverability of the Travis check takes a nose-dive. Either we
>> >> require every contributor to always, an every commit, also post a Travis
>> >> build, or we have the reviewer sift through the contributors account to
>> >> find it.
>> >>
>> >> This is rather cumbersome. Additionally, it's also not equivalent to
>> >> having a PR build.
>> >>
>> >> A normal branch build takes a branch as is and tests it. A PR build
>> >> merges the branch into master, and then runs it. (Fun fact: This is why
>> >> a PR with merge conflicts is not being run on Travis.)
>> >>
>> >> And ultimately, everyone can already make use of this approach anyway.
>> >>
>> >> On 25/06/2019 08:02, Jark Wu wrote:
>> >>> Hi Jeff,
>> >>>
>> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>> >>> leverage user's travis account.
>> >>> In this way, we can have almost unlimited concurrent 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Bowen Li
Hi,

@Dawid, I think the "long-running tests" problem that I mentioned in the
first email, and as you guys also said, belongs to "a big effort which is
much harder to accomplish in a short period of time and may deserve its own
separate discussion". Thus I didn't include it in what we can do in the
foreseeable short term.

Besides, I don't think that's the ultimate reason for the lack of build
resources. Even if the build is shortened to something like 2h, the problem
I described, of no build machine working for about 6 or more hours in PST
daytime, will still happen, because no machine from ASF INFRA's pool is
allocated to Flink. I have paid close attention to the build queue over the
past few weekdays, and it's a pretty clear pattern now.

**The ultimate root cause** for that is that we don't have any **dedicated**
build resources that we can reliably depend on. I'm actually OK with waiting
a long time if build requests are running, since it means at least we are
making progress. But I'm not OK with having no build resources. In the short
term, I think we should aim to always have at least a central pool (say 3 or
5) of machines dedicated to building Flink at any time, or maybe use users'
resources.

@Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
using a Jenkins job to automatically build on users' Travis accounts and
link the result back to the GitHub PR. I guess the Jenkins job would fetch
the latest upstream master and build the PR against it. Jeff has filed
tickets to learn about and get access to the Jenkins infra. It would be
better to fully understand it first before judging this approach.

I have also heard good things about CircleCI, and ASF INFRA seems to have a
pool of build capacity there too. It can be an alternative to consider.









On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz 
wrote:

> Sorry to jump in late, but I think Bowen missed the most important point
> from Chesnay's previous message in the summary. The ultimate reason for
> all the problems is that the tests take close to 2 hours to run already.
> I fully support this claim: "Unless people start caring about test times
> before adding them, this issue cannot be solved"
>
> This is also another reason why using user's Travis account won't help.
> Every few weeks we reach the user's time limit for a single profile.
> This makes the user's builds simply fail, until we either properly
> decrease the time the tests take (which I am not sure we ever did) or
> postpone the problem by splitting into more profiles. (Note that the ASF
> Travis account has higher time limits)
>
> Best,
>
> Dawid
>
> On 26/06/2019 09:36, Robert Metzger wrote:
> > Do we know if using "the best" available hardware would improve the build
> > times?
> > Imagine we would run the build on machines with plenty of main memory to
> > mount everything to ramdisk + the latest CPU architecture?
> >
> > Throwing hardware at the problem could help reduce the time of an
> > individual build, and using our own infrastructure would remove our
> > dependency on Apache's Travis account (with the obvious downside of
> having
> > to maintain the infrastructure)
> > We could use an open source travis alternative, to have a similar
> > experience and make the migration easy.
> >
> >
> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler 
> wrote:
> >
> >>  From what I gathered, there's no special sauce that the Zeppelin
> >> project uses which actually integrates a users Travis account into the
> PR.
> >>
> >> They just disabled Travis for PRs. And that's kind of it.
> >>
> >> Naturally we can do this (duh) and safe the ASF a fair amount of
> >> resources, but there are downsides:
> >>
> >> The discoverability of the Travis check takes a nose-dive. Either we
> >> require every contributor to always, an every commit, also post a Travis
> >> build, or we have the reviewer sift through the contributors account to
> >> find it.
> >>
> >> This is rather cumbersome. Additionally, it's also not equivalent to
> >> having a PR build.
> >>
> >> A normal branch build takes a branch as is and tests it. A PR build
> >> merges the branch into master, and then runs it. (Fun fact: This is why
> >> a PR with merge conflicts is not being run on Travis.)
> >>
> >> And ultimately, everyone can already make use of this approach anyway.
> >>
> >> On 25/06/2019 08:02, Jark Wu wrote:
> >>> Hi Jeff,
> >>>
> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>> leverage user's travis account.
> >>> In this way, we can have almost unlimited concurrent build jobs and
> >>> developers can restart build by themselves (currently only committers
> >>> can restart PR's build).
> >>>
> >>> But I'm still not very clear how to integrate user's travis build into
> >>> the Flink pull request's build automatically. Can you explain more in
> >>> detail?
> >>>
> >>> Another question: does travis only build branches for user account?
> >>> My concern is that builds for PRs will rebase user's commits against
> >>> 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Dawid Wysakowicz
Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using a user's Travis account won't help.
Every few weeks we reach the user's time limit for a single profile.
This makes the user's builds simply fail, until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits.)
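
Splitting into more profiles amounts to packing test modules into jobs that
each stay under the per-job time limit. A minimal sketch of that idea,
assuming hypothetical module names, durations in minutes, and a 50-minute
limit picked only for illustration (the thread does not state the actual
limit or how the profiles are defined):

```python
from typing import Dict, List


def split_into_profiles(durations: Dict[str, int], limit: int) -> List[List[str]]:
    """First-fit decreasing: put each module into the first profile with room."""
    profiles: List[List[str]] = []
    remaining: List[int] = []           # minutes still free in each profile
    # Longest modules first; ties broken by name to keep the result stable.
    for module, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        if minutes > limit:
            raise ValueError(f"{module} alone exceeds the {limit}-minute limit")
        for i, room in enumerate(remaining):
            if minutes <= room:
                profiles[i].append(module)
                remaining[i] -= minutes
                break
        else:
            profiles.append([module])   # no existing profile has room
            remaining.append(limit - minutes)
    return profiles


# Hypothetical test durations in minutes.
durations = {"core": 40, "connectors": 35, "table": 30, "libraries": 15, "misc": 10}
profiles = split_into_profiles(durations, limit=50)
print(profiles)
```

As tests are added, the same limit forces more and more profiles, which is
why this only postpones the problem rather than reducing the total test time.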

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:
> Do we know if using "the best" available hardware would improve the build
> times?
> Imagine we would run the build on machines with plenty of main memory to
> mount everything to ramdisk + the latest CPU architecture?
>
> Throwing hardware at the problem could help reduce the time of an
> individual build, and using our own infrastructure would remove our
> dependency on Apache's Travis account (with the obvious downside of having
> to maintain the infrastructure)
> We could use an open source travis alternative, to have a similar
> experience and make the migration easy.
>
>
> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler  wrote:
>
>>  From what I gathered, there's no special sauce that the Zeppelin
>> project uses which actually integrates a users Travis account into the PR.
>>
>> They just disabled Travis for PRs. And that's kind of it.
>>
>> Naturally we can do this (duh) and safe the ASF a fair amount of
>> resources, but there are downsides:
>>
>> The discoverability of the Travis check takes a nose-dive. Either we
>> require every contributor to always, an every commit, also post a Travis
>> build, or we have the reviewer sift through the contributors account to
>> find it.
>>
>> This is rather cumbersome. Additionally, it's also not equivalent to
>> having a PR build.
>>
>> A normal branch build takes a branch as is and tests it. A PR build
>> merges the branch into master, and then runs it. (Fun fact: This is why
>> a PR with merge conflicts is not being run on Travis.)
>>
>> And ultimately, everyone can already make use of this approach anyway.
>>
>> On 25/06/2019 08:02, Jark Wu wrote:
>>> Hi Jeff,
>>>
>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>>> leverage user's travis account.
>>> In this way, we can have almost unlimited concurrent build jobs and
>>> developers can restart build by themselves (currently only committers
>>> can restart PR's build).
>>>
>>> But I'm still not very clear how to integrate user's travis build into
>>> the Flink pull request's build automatically. Can you explain more in
>>> detail?
>>>
>>> Another question: does travis only build branches for user account?
>>> My concern is that builds for PRs will rebase user's commits against
>>> current master branch.
>>> This will help us to find problems before merge.  Builds for branches
>>> will lose the impact of new commits in master.
>>> How does Zeppelin solve this problem?
>>>
>>> Thanks again for sharing the idea.
>>>
>>> Regards,
>>> Jark
>>>
>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang wrote:
>>>
>>> Hi Folks,
>>>
>>> Zeppelin meet this kind of issue before, we solve it by delegating each
>>> one's PR build to his travis account (Everyone can have 5 free slot for
>>> travis build).
>>> Apache account travis build is only triggered when PR is merged.
>>>
>>>
>>>
>>> On Tue, Jun 25, 2019 at 10:16 AM, Kurt Young <ykt...@gmail.com> wrote:
>>>
>>> > (Forgot to cc George)
>>> >
>>> > Best,
>>> > Kurt
>>> >
>>> >
>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young >> > wrote:
>>> >
>>> > > Hi Bowen,
>>> > >
>>> > > Thanks for bringing this up. We actually have discussed about
>>> this, and I
>>> > > think Till and George have
>>> > > already spend sometime investigating it. I have cced both of
>>> them, and
>>> > > maybe they can share
>>> > > their findings.
>>> > >
>>> > > Best,
>>> > > Kurt
>>> > >
>>> > >
>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu wrote:
>>> > >
>>> > >> Hi Bowen,
>>> > >>
>>> > >> Thanks for bringing this. We also suffered from the long
>>> build time.
>>> > >> I agree that we should focus on solving build capacity
>>> problem in the
>>> > >> thread.
>>> > >>
>>> > >> My observation is there is only one build is running, all the
>>> others
>>> > >> (other
>>> > >> PRs, master) are pending.
>>> > >> The pricing plan[1] of travis shows it can support concurrent
>>> build
>>> > jobs.
>>> > >> But I 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Robert Metzger
Do we know if using "the best" available hardware would improve the build
times?
Imagine running the build on machines with the latest CPU architecture and
enough main memory to mount everything on a ramdisk.

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure).
We could use an open-source Travis alternative to get a similar
experience and make the migration easy.


On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler  wrote:

> From what I gathered, there's no special sauce that the Zeppelin
> project uses which actually integrates a user's Travis account into the PR.
>
> They just disabled Travis for PRs. And that's kind of it.
>
> Naturally we can do this (duh) and save the ASF a fair amount of
> resources, but there are downsides:
>
> The discoverability of the Travis check takes a nose-dive. Either we
> require every contributor to always, on every commit, also post a Travis
> build, or we have the reviewer sift through the contributor's account to
> find it.
>
> This is rather cumbersome. Additionally, it's also not equivalent to
> having a PR build.
>
> A normal branch build takes a branch as is and tests it. A PR build
> merges the branch into master, and then runs it. (Fun fact: This is why
> a PR with merge conflicts is not run on Travis.)
>
> And ultimately, everyone can already make use of this approach anyway.
>
> On 25/06/2019 08:02, Jark Wu wrote:
> > Hi Jeff,
> >
> > Thanks for sharing the Zeppelin approach. I think it's a good idea to
> > leverage user's travis account.
> > In this way, we can have almost unlimited concurrent build jobs and
> > developers can restart build by themselves (currently only committers
> > can restart PR's build).
> >
> > But I'm still not very clear how to integrate user's travis build into
> > the Flink pull request's build automatically. Can you explain more in
> > detail?
> >
> > Another question: does travis only build branches for user account?
> > My concern is that builds for PRs will rebase user's commits against
> > current master branch.
> > This will help us to find problems before merge.  Builds for branches
> > will lose the impact of new commits in master.
> > How does Zeppelin solve this problem?
> >
> > Thanks again for sharing the idea.
> >
> > Regards,
> > Jark
> >
> > On Tue, 25 Jun 2019 at 11:01, Jeff Zhang wrote:
> >
> > Hi Folks,
> >
> > Zeppelin meet this kind of issue before, we solve it by delegating
> > each
> > one's PR build to his travis account (Everyone can have 5 free
> > slot for
> > travis build).
> > Apache account travis build is only triggered when PR is merged.
> >
> >
> >
> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
> >
> > > (Forgot to cc George)
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young wrote:
> > >
> > > > Hi Bowen,
> > > >
> > > > Thanks for bringing this up. We actually have discussed about
> > this, and I
> > > > think Till and George have
> > > > already spend sometime investigating it. I have cced both of
> > them, and
> > > > maybe they can share
> > > > their findings.
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu wrote:
> > > >
> > > >> Hi Bowen,
> > > >>
> > > >> Thanks for bringing this. We also suffered from the long
> > build time.
> > > >> I agree that we should focus on solving build capacity
> > problem in the
> > > >> thread.
> > > >>
> > > >> My observation is there is only one build is running, all the
> > others
> > > >> (other
> > > >> PRs, master) are pending.
> > > >> The pricing plan[1] of travis shows it can support concurrent
> > build
> > > jobs.
> > > >> But I don't know which plan we are using, might be the free
> > plan for
> > > open
> > > >> source.
> > > >>
> > > >> I cc-ed Chesnay who may have some experience on Travis.
> > > >>
> > > >> Regards,
> > > >> Jark
> > > >>
> > > >> [1]: https://travis-ci.com/plans
> > > >>
> > > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li wrote:
> > > >>
> > > >> > Hi Steven,
> > > >> >
> > > >> > I think you may not read what I wrote. The discussion is about
> > > "unstable
> > > >> > build **capacity**", in another word "unstable / lack of build
> > > >> resources",
> > > >> > not "unstable build".
> > > >> >
> > > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
> > > 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-26 Thread Chesnay Schepler
From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a user's Travis account into the PR.


They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and save the ASF a fair amount of
resources, but there are downsides:


The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, on every commit, also post a Travis
build, or we have the reviewer sift through the contributor's account to
find it.


This is rather cumbersome. Additionally, it's also not equivalent to 
having a PR build.


A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not run on Travis.)
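
To make the difference concrete, here is a minimal sketch. It is not what
Travis literally runs; it only lists git commands that approximate each
build type in a local clone. The function names, the `origin` remote and
the `my-feature` branch are assumptions made for illustration.

```python
"""Approximate a branch build vs. a PR build with plain git commands.

Hypothetical helper names; assumes a remote called `origin` whose
base branch is `master`.
"""


def branch_build_commands(branch):
    # A branch build tests the branch exactly as it was pushed.
    return [["git", "checkout", branch]]


def pr_build_commands(branch, base="master"):
    # A PR build tests the result of merging the branch into the
    # current base, so new commits on the base are part of the test.
    return [
        ["git", "fetch", "origin", base],
        ["git", "checkout", "--detach", f"origin/{base}"],
        # `git merge` exits non-zero if the merge has conflicts.
        ["git", "merge", "--no-edit", branch],
    ]


for command in pr_build_commands("my-feature"):
    print(" ".join(command))
```

Running the three PR-build commands locally before pushing gives a rough
preview of what a PR build would test, without using shared CI capacity.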


And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:

Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to 
leverage user's travis account.
In this way, we can have almost unlimited concurrent build jobs and 
developers can restart build by themselves (currently only committers 
can restart PR's build).


But I'm still not very clear how to integrate user's travis build into 
the Flink pull request's build automatically. Can you explain more in 
detail?


Another question: does travis only build branches for user account?
My concern is that builds for PRs will rebase user's commits against 
current master branch.
This will help us to find problems before merge.  Builds for branches 
will lose the impact of new commits in master.

How does Zeppelin solve this problem?

Thanks again for sharing the idea.

Regards,
Jark

On Tue, 25 Jun 2019 at 11:01, Jeff Zhang wrote:


Hi Folks,

Zeppelin meet this kind of issue before, we solve it by delegating
each
one's PR build to his travis account (Everyone can have 5 free
slot for
travis build).
Apache account travis build is only triggered when PR is merged.



On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:

> (Forgot to cc George)
>
> Best,
> Kurt
>
>
> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <ykt...@gmail.com> wrote:
>
> > Hi Bowen,
> >
> > Thanks for bringing this up. We actually have discussed about
this, and I
> > think Till and George have
> > already spend sometime investigating it. I have cced both of
them, and
> > maybe they can share
> > their findings.
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> >
> >> Hi Bowen,
> >>
> >> Thanks for bringing this. We also suffered from the long
build time.
> >> I agree that we should focus on solving build capacity
problem in the
> >> thread.
> >>
> >> My observation is there is only one build is running, all the
others
> >> (other
> >> PRs, master) are pending.
> >> The pricing plan[1] of travis shows it can support concurrent
build
> jobs.
> >> But I don't know which plan we are using, might be the free
plan for
> open
> >> source.
> >>
> >> I cc-ed Chesnay who may have some experience on Travis.
> >>
> >> Regards,
> >> Jark
> >>
> >> [1]: https://travis-ci.com/plans
> >>
> >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <bowenl...@gmail.com> wrote:
> >>
> >> > Hi Steven,
> >> >
> >> > I think you may not read what I wrote. The discussion is about
> "unstable
> >> > build **capacity**", in another word "unstable / lack of build
> >> resources",
> >> > not "unstable build".
> >> >
> >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <stevenz...@gmail.com> wrote:
> >> >
> >> > > long and sometimes unstable build is definitely a pain point.
> >> > >
> >> > > I suspect the build failure here in flink-connector-kafka
is not
> >> related
> >> > to
> >> > > my change. but there is no easy re-run the build on
travis UI.
> Google
> >> > > search showed a trick of close-and-open the PR will
trigger rebuild.
> >> but
> >> > > that could add noises to the PR activities.
> >> > > https://travis-ci.org/apache/flink/jobs/54519
> >> > >
> >> > > travis-ci for my personal repo often failed with
exceeding time
> limit
> >> > after
> >> > > 4+ hours.
> >> > > The job exceeded the maximum time limit for jobs, and has
been
> >> > terminated.
> >> > >
> >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <bowenl...@gmail.com> wrote:
> >> > >
> >> > > > https://travis-ci.org/apache/flink/builds/549681530
This build
> >> > request
> >> > > > has
> >> > > > been sitting at **HEAD of the queue** since 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-25 Thread Bowen Li
I want to summarize Chesnay's points for everyone reading this thread: 1) the
build resources Flink is currently using belong to ASF INFRA, and 2) we are
waiting on ASF INFRA's response on whether we can donate/sponsor extra
build resources for Flink.

I think it would be super helpful to pay for and secure dedicated build
resources for Flink. If that doesn't work, I agree with Jark that the
Zeppelin approach Jeff described sounds promising.

Jeff, can you answer Jark's questions above and share what the Zeppelin
community's practices look like?

Cheers,
Bowen

On Tue, Jun 25, 2019 at 12:50 AM Chesnay Schepler 
wrote:

>
> On 24/06/2019 23:48, Bowen Li wrote:
> > - Has anyone else experienced the same problem or have similar
> observation
> > on TravisCI? (I suspect it has things to do with time zone)
> In Europe we have the same problem.
> >
> > - What pricing plan of TravisCI is Flink currently using? Is it the free
> > plan for open source projects? What are the guaranteed build capacity of
> > the current plan?
> Flink is using the Travis resources provided by the ASF, which afaik the
> ASF is paying for.
>
> Note that in the past the Flink project was already approached  by INFRA
> since we were using too many Travis resources,
> so this is _not_ as simple as asking for more.
> >
> > - If the current pricing plan (either free or paid) can't provide stable
> > build capacity, can we upgrade to a higher priced plan with larger and
> more
> > stable build capacity?
> We are currently investigating whether companies could donate/sponsor
> Travis CI resources to the ASF for increasing the build capacity;
> currently waiting for an answer from INFRA.
> >
> > BTW, another factor that contribute to the productivity problem is that
> our
> > build is slow - we run full build for every PR and a successful full
> build
> > takes ~5h. We definitely have more options to solve it, for instance,
> > modularize the build graphs and reuse artifacts from the previous build.
> > But I think that can be a big effort which is much harder to accomplish
> in
> > a short period of time and may deserve its own separate discussion.
> We already are doing that. The vast majority of the build times is
> simply due to tests taking way too long, not compilation.
> The tests for the kafka connector alone exceed a single profile, as does
> the table API.
> Unless people start caring about test times before adding them, this
> issue cannot be solved.
>
> Of course, this discussion isn't new, I already raised it the last 2
> times we approach the Travis limits, with little to no effect to be seen.
>
> At this point I'm sure someone out there is thinking "well, just don't
> run kafka tests for every PR. Like, check the diff or something",
> and yes, sure, that _might_ work. But to this day, despite numerous
> people suggesting it, I still haven't seen a single person actually try
> implementing it.
>
> The problem with these kind of approaches is that they tend to be
> brittle as hell, result in subtle behaviors if they have bugs, and
> overall make the CI significantly more complicated by introducing
> various edge cases.
>
> Our current CI is, relatively speaking, straightforward and consistent.
> And as it stands we can't afford elaborate schemes because I just don't
> have the time capacity for maintaining that.
>


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-25 Thread Chesnay Schepler



On 24/06/2019 23:48, Bowen Li wrote:

> - Has anyone else experienced the same problem or have similar observation
> on TravisCI? (I suspect it has things to do with time zone)

In Europe we have the same problem.


> - What pricing plan of TravisCI is Flink currently using? Is it the free
> plan for open source projects? What are the guaranteed build capacity of
> the current plan?
Flink is using the Travis resources provided by the ASF, which afaik the 
ASF is paying for.


Note that in the past the Flink project was already approached by INFRA
since we were using too many Travis resources,
so this is _not_ as simple as asking for more.


> - If the current pricing plan (either free or paid) can't provide stable
> build capacity, can we upgrade to a higher priced plan with larger and more
> stable build capacity?
We are currently investigating whether companies could donate/sponsor 
Travis CI resources to the ASF for increasing the build capacity; 
currently waiting for an answer from INFRA.


> BTW, another factor that contribute to the productivity problem is that our
> build is slow - we run full build for every PR and a successful full build
> takes ~5h. We definitely have more options to solve it, for instance,
> modularize the build graphs and reuse artifacts from the previous build.
> But I think that can be a big effort which is much harder to accomplish in
> a short period of time and may deserve its own separate discussion.
We are already doing that. The vast majority of the build time is
simply due to tests taking way too long, not compilation.
The tests for the Kafka connector alone exceed a single profile, as do
those for the Table API.
Unless people start caring about test times before adding tests, this
issue cannot be solved.


Of course, this discussion isn't new; I already raised it the last 2
times we approached the Travis limits, with little to no effect.


At this point I'm sure someone out there is thinking "well, just don't 
run kafka tests for every PR. Like, check the diff or something",
and yes, sure, that _might_ work. But to this day, despite numerous 
people suggesting it, I still haven't seen a single person actually try 
implementing it.


The problem with these kinds of approaches is that they tend to be
brittle as hell, result in subtle behaviors if they have bugs, and
overall make the CI significantly more complicated by introducing
various edge cases.
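
For reference, the "check the diff" idea can be sketched as follows. This
is a hypothetical illustration, not something that exists in Flink's CI:
the module-to-profile mapping, the profile names and the function name are
assumptions, and the list of changed files is assumed to come from
something like `git diff --name-only master...HEAD`.

```python
"""Pick CI profiles from the list of files a PR changes (sketch only)."""

# Hypothetical mapping from a module directory to the CI profile
# that runs its tests.
MODULE_TO_PROFILE = {
    "flink-connectors/flink-connector-kafka": "kafka",
    "flink-table": "table",
}

# Hypothetical full set of profiles, used as the safe fallback.
ALL_PROFILES = {"core", "kafka", "table"}


def profiles_for_diff(changed_files):
    """Return the set of profiles to run for the given changed paths."""
    selected = set()
    for path in changed_files:
        for prefix, profile in MODULE_TO_PROFILE.items():
            if path == prefix or path.startswith(prefix + "/"):
                selected.add(profile)
                break
        else:
            # Unknown path (root pom.xml, shared module, ...): we cannot
            # tell what depends on it, so run everything.
            return set(ALL_PROFILES)
    # An empty diff also falls back to the full run.
    return selected or set(ALL_PROFILES)


print(sorted(profiles_for_diff(["flink-table/pom.xml"])))
```

Even this small version shows where the edge cases come from: any path
outside the mapping forces a full run, and the mapping knows nothing about
dependencies between modules, so a change in one module can still break
tests that were skipped.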


Our current CI is, relatively speaking, straightforward and consistent.
And as it stands we can't afford elaborate schemes because I just don't
have the time to maintain them.


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-25 Thread Jark Wu
Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage users' Travis accounts.
This way, we can have almost unlimited concurrent build jobs, and
developers can restart builds themselves (currently only committers can
restart a PR's build).

But it's still not clear to me how to integrate a user's Travis build into
the Flink pull request's build automatically. Can you explain in more detail?

Another question: does Travis only build branches for user accounts?
My concern is that builds for PRs test the user's commits on top of the
current master branch, which helps us find problems before merging.
Builds for branches will miss the impact of new commits in master.
How does Zeppelin solve this problem?

Thanks again for sharing the idea.

Regards,
Jark

On Tue, 25 Jun 2019 at 11:01, Jeff Zhang  wrote:

> Hi Folks,
>
> Zeppelin meet this kind of issue before, we solve it by delegating each
> one's PR build to his travis account (Everyone can have 5 free slot for
> travis build).
> Apache account travis build is only triggered when PR is merged.
>
>
>
> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young wrote:
>
> > (Forgot to cc George)
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young  wrote:
> >
> > > Hi Bowen,
> > >
> > > Thanks for bringing this up. We actually have discussed about this,
> and I
> > > think Till and George have
> > > already spend sometime investigating it. I have cced both of them, and
> > > maybe they can share
> > > their findings.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu  wrote:
> > >
> > >> Hi Bowen,
> > >>
> > >> Thanks for bringing this. We also suffered from the long build time.
> > >> I agree that we should focus on solving build capacity problem in the
> > >> thread.
> > >>
> > >> My observation is there is only one build is running, all the others
> > >> (other
> > >> PRs, master) are pending.
> > >> The pricing plan[1] of travis shows it can support concurrent build
> > jobs.
> > >> But I don't know which plan we are using, might be the free plan for
> > open
> > >> source.
> > >>
> > >> I cc-ed Chesnay who may have some experience on Travis.
> > >>
> > >> Regards,
> > >> Jark
> > >>
> > >> [1]: https://travis-ci.com/plans
> > >>
> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li  wrote:
> > >>
> > >> > Hi Steven,
> > >> >
> > >> > I think you may not read what I wrote. The discussion is about
> > "unstable
> > >> > build **capacity**", in another word "unstable / lack of build
> > >> resources",
> > >> > not "unstable build".
> > >> >
> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu 
> > wrote:
> > >> >
> > >> > > long and sometimes unstable build is definitely a pain point.
> > >> > >
> > >> > > I suspect the build failure here in flink-connector-kafka is not
> > >> related
> > >> > to
> > >> > > my change. but there is no easy re-run the build on travis UI.
> > Google
> > >> > > search showed a trick of close-and-open the PR will trigger
> rebuild.
> > >> but
> > >> > > that could add noises to the PR activities.
> > >> > > https://travis-ci.org/apache/flink/jobs/54519
> > >> > >
> > >> > > travis-ci for my personal repo often failed with exceeding time
> > limit
> > >> > after
> > >> > > 4+ hours.
> > >> > > The job exceeded the maximum time limit for jobs, and has been
> > >> > terminated.
> > >> > >
> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li 
> > wrote:
> > >> > >
> > >> > > > https://travis-ci.org/apache/flink/builds/549681530  This build
> > >> > request
> > >> > > > has
> > >> > > > been sitting at **HEAD of the queue** since I first saw it at
> PST
> > >> > 10:30am
> > >> > > > (not sure how long it's been there before 10:30am). It's PST
> > 4:12pm
> > >> now
> > >> > > and
> > >> > > > it hasn't started yet.
> > >> > > >
> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li 
> > >> wrote:
> > >> > > >
> > >> > > > > Hi devs,
> > >> > > > >
> > >> > > > > I've been experiencing the pain resulting from lack of stable
> > >> build
> > >> > > > > capacity on Travis for Flink PRs [1]. Specifically, I noticed
> > >> often
> > >> > > that
> > >> > > > no
> > >> > > > > build in the queue is making any progress for hours, and
> > suddenly
> > >> 5
> > >> > or
> > >> > > 6
> > >> > > > > builds kick off all together after the long pause. I'm at PST
> > >> > (UTC-08)
> > >> > > > time
> > >> > > > > zone, and I've seen pause can be as long as 6 hours from PST
> 9am
> > >> to
> > >> > 3pm
> > >> > > > > (let alone the time needed to drain the queue afterwards).
> > >> > > > >
> > >> > > > > I think this has greatly impacted our productivity. I've
> > >> experienced
> > >> > > that
> > >> > > > > PRs submitted in the early morning of PST time zone won't
> finish
> > >> > their
> > >> > > > > build until late night of the same day.
> > >> > > > >
> > >> > > > > So my questions are:
> > >> > > > >
> > >> > > > > - Has anyone else experienced the same problem or have similar
> > >> > > 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Jeff Zhang
Hi Folks,

Zeppelin ran into this kind of issue before. We solved it by delegating each
contributor's PR build to their own Travis account (everyone can have 5 free
slots for Travis builds).
The Apache account's Travis build is only triggered when a PR is merged.



On Tue, Jun 25, 2019 at 10:16 AM Kurt Young wrote:

> (Forgot to cc George)
>
> Best,
> Kurt
>
>
> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young  wrote:
>
> > Hi Bowen,
> >
> > Thanks for bringing this up. We actually have discussed about this, and I
> > think Till and George have
> > already spend sometime investigating it. I have cced both of them, and
> > maybe they can share
> > their findings.
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu  wrote:
> >
> >> Hi Bowen,
> >>
> >> Thanks for bringing this. We also suffered from the long build time.
> >> I agree that we should focus on solving build capacity problem in the
> >> thread.
> >>
> >> My observation is there is only one build is running, all the others
> >> (other
> >> PRs, master) are pending.
> >> The pricing plan[1] of travis shows it can support concurrent build
> jobs.
> >> But I don't know which plan we are using, might be the free plan for
> open
> >> source.
> >>
> >> I cc-ed Chesnay who may have some experience on Travis.
> >>
> >> Regards,
> >> Jark
> >>
> >> [1]: https://travis-ci.com/plans
> >>
> >> On Tue, 25 Jun 2019 at 08:11, Bowen Li  wrote:
> >>
> >> > Hi Steven,
> >> >
> >> > I think you may not read what I wrote. The discussion is about
> "unstable
> >> > build **capacity**", in another word "unstable / lack of build
> >> resources",
> >> > not "unstable build".
> >> >
> >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu 
> wrote:
> >> >
> >> > > long and sometimes unstable build is definitely a pain point.
> >> > >
> >> > > I suspect the build failure here in flink-connector-kafka is not
> >> related
> >> > to
> >> > > my change. but there is no easy re-run the build on travis UI.
> Google
> >> > > search showed a trick of close-and-open the PR will trigger rebuild.
> >> but
> >> > > that could add noises to the PR activities.
> >> > > https://travis-ci.org/apache/flink/jobs/54519
> >> > >
> >> > > travis-ci for my personal repo often failed with exceeding time
> limit
> >> > after
> >> > > 4+ hours.
> >> > > The job exceeded the maximum time limit for jobs, and has been
> >> > terminated.
> >> > >
> >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li 
> wrote:
> >> > >
> >> > > > https://travis-ci.org/apache/flink/builds/549681530  This build
> >> > request
> >> > > > has
> >> > > > been sitting at **HEAD of the queue** since I first saw it at PST
> >> > 10:30am
> >> > > > (not sure how long it's been there before 10:30am). It's PST
> 4:12pm
> >> now
> >> > > and
> >> > > > it hasn't started yet.
> >> > > >
> >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li 
> >> wrote:
> >> > > >
> >> > > > > Hi devs,
> >> > > > >
> >> > > > > I've been experiencing the pain resulting from lack of stable
> >> build
> >> > > > > capacity on Travis for Flink PRs [1]. Specifically, I noticed
> >> often
> >> > > that
> >> > > > no
> >> > > > > build in the queue is making any progress for hours, and
> suddenly
> >> 5
> >> > or
> >> > > 6
> >> > > > > builds kick off all together after the long pause. I'm at PST
> >> > (UTC-08)
> >> > > > time
> >> > > > > zone, and I've seen pause can be as long as 6 hours from PST 9am
> >> to
> >> > 3pm
> >> > > > > (let alone the time needed to drain the queue afterwards).
> >> > > > >
> >> > > > > I think this has greatly impacted our productivity. I've
> >> experienced
> >> > > that
> >> > > > > PRs submitted in the early morning of PST time zone won't finish
> >> > their
> >> > > > > build until late night of the same day.
> >> > > > >
> >> > > > > So my questions are:
> >> > > > >
> >> > > > > - Has anyone else experienced the same problem or have similar
> >> > > > observation
> >> > > > > on TravisCI? (I suspect it has things to do with time zone)
> >> > > > >
> >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it
> >> the
> >> > > free
> >> > > > > plan for open source projects? What are the guaranteed build
> >> capacity
> >> > > of
> >> > > > > the current plan?
> >> > > > >
> >> > > > > - If the current pricing plan (either free or paid) can't
> provide
> >> > > stable
> >> > > > > build capacity, can we upgrade to a higher priced plan with
> larger
> >> > and
> >> > > > more
> >> > > > > stable build capacity?
> >> > > > >
> >> > > > > BTW, another factor that contribute to the productivity problem
> is
> >> > that
> >> > > > > our build is slow - we run full build for every PR and a
> >> successful
> >> > > full
> >> > > > > build takes ~5h. We definitely have more options to solve it,
> for
> >> > > > instance,
> >> > > > > modularize the build graphs and reuse artifacts from the
> previous
> >> > > build.
> >> > > > > But I think that can be a big effort which is much harder to
> >> > accomplish
> >> > > > in
> >> > > > > a short 

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Kurt Young
(Forgot to cc George)

Best,
Kurt


On Tue, Jun 25, 2019 at 10:16 AM Kurt Young  wrote:

> Hi Bowen,
>
> Thanks for bringing this up. We actually have discussed about this, and I
> think Till and George have
> already spend sometime investigating it. I have cced both of them, and
> maybe they can share
> their findings.
>
> Best,
> Kurt
>
>
> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu  wrote:
>
>> Hi Bowen,
>>
>> Thanks for bringing this. We also suffered from the long build time.
>> I agree that we should focus on solving build capacity problem in the
>> thread.
>>
>> My observation is there is only one build is running, all the others
>> (other
>> PRs, master) are pending.
>> The pricing plan[1] of travis shows it can support concurrent build jobs.
>> But I don't know which plan we are using, might be the free plan for open
>> source.
>>
>> I cc-ed Chesnay who may have some experience on Travis.
>>
>> Regards,
>> Jark
>>
>> [1]: https://travis-ci.com/plans
>>
>> On Tue, 25 Jun 2019 at 08:11, Bowen Li  wrote:
>>
>> > Hi Steven,
>> >
>> > I think you may not read what I wrote. The discussion is about "unstable
>> > build **capacity**", in another word "unstable / lack of build
>> resources",
>> > not "unstable build".
>> >
>> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu  wrote:
>> >
>> > > long and sometimes unstable build is definitely a pain point.
>> > >
>> > > I suspect the build failure here in flink-connector-kafka is not
>> related
>> > to
>> > > my change. but there is no easy re-run the build on travis UI. Google
>> > > search showed a trick of close-and-open the PR will trigger rebuild.
>> but
>> > > that could add noises to the PR activities.
>> > > https://travis-ci.org/apache/flink/jobs/54519
>> > >
>> > > travis-ci for my personal repo often failed with exceeding time limit
>> > after
>> > > 4+ hours.
>> > > The job exceeded the maximum time limit for jobs, and has been
>> > terminated.
>> > >
>> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li  wrote:
>> > >
>> > > > https://travis-ci.org/apache/flink/builds/549681530  This build
>> > request
>> > > > has
>> > > > been sitting at **HEAD of the queue** since I first saw it at PST
>> > 10:30am
>> > > > (not sure how long it's been there before 10:30am). It's PST 4:12pm
>> now
>> > > and
>> > > > it hasn't started yet.
>> > > >
>> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li 
>> wrote:
>> > > >
>> > > > > Hi devs,
>> > > > >
>> > > > > I've been experiencing the pain resulting from lack of stable
>> build
>> > > > > capacity on Travis for Flink PRs [1]. Specifically, I noticed
>> often
>> > > that
>> > > > no
>> > > > > build in the queue is making any progress for hours, and suddenly
>> 5
>> > or
>> > > 6
>> > > > > builds kick off all together after the long pause. I'm at PST
>> > (UTC-08)
>> > > > time
>> > > > > zone, and I've seen pause can be as long as 6 hours from PST 9am
>> to
>> > 3pm
>> > > > > (let alone the time needed to drain the queue afterwards).
>> > > > >
>> > > > > I think this has greatly impacted our productivity. I've
>> experienced
>> > > that
>> > > > > PRs submitted in the early morning of PST time zone won't finish
>> > their
>> > > > > build until late night of the same day.
>> > > > >
>> > > > > So my questions are:
>> > > > >
>> > > > > - Has anyone else experienced the same problem or have similar
>> > > > observation
>> > > > > on TravisCI? (I suspect it has things to do with time zone)
>> > > > >
>> > > > > - What pricing plan of TravisCI is Flink currently using? Is it
>> the
>> > > free
>> > > > > plan for open source projects? What are the guaranteed build
>> capacity
>> > > of
>> > > > > the current plan?
>> > > > >
>> > > > > - If the current pricing plan (either free or paid) can't provide
>> > > stable
>> > > > > build capacity, can we upgrade to a higher priced plan with larger
>> > and
>> > > > more
>> > > > > stable build capacity?
>> > > > >
>> > > > > BTW, another factor that contribute to the productivity problem is
>> > that
>> > > > > our build is slow - we run full build for every PR and a
>> successful
>> > > full
>> > > > > build takes ~5h. We definitely have more options to solve it, for
>> > > > instance,
>> > > > > modularize the build graphs and reuse artifacts from the previous
>> > > build.
>> > > > > But I think that can be a big effort which is much harder to
>> > accomplish
>> > > > in
>> > > > > a short period of time and may deserve its own separate
>> discussion.
>> > > > >
>> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Kurt Young
Hi Bowen,

Thanks for bringing this up. We have actually discussed this, and I
think Till and George have already spent some time investigating it. I have
cc'ed both of them, and maybe they can share their findings.

Best,
Kurt




Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Jark Wu
Hi Bowen,

Thanks for bringing this up. We have also suffered from the long build times.
I agree that we should focus on solving the build capacity problem in this
thread.

My observation is that only one build is running, and all the others (other
PRs, master) are pending.
The Travis pricing plans [1] show that it can support concurrent build jobs,
but I don't know which plan we are using; it might be the free plan for open
source.

I cc'ed Chesnay, who may have some experience with Travis.

Regards,
Jark

[1]: https://travis-ci.com/plans

On Tue, 25 Jun 2019 at 08:11, Bowen Li  wrote:

> Hi Steven,
>
> I think you may not have read what I wrote. The discussion is about "unstable
> build **capacity**", in other words "unstable / lack of build resources",
> not "unstable build".


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Steven Wu
Long and sometimes unstable builds are definitely a pain point.

I suspect the build failure here in flink-connector-kafka is not related to
my change, but there is no easy way to re-run the build from the Travis UI. A
Google search turned up a trick: closing and reopening the PR triggers a
rebuild. But that could add noise to the PR activity.
https://travis-ci.org/apache/flink/jobs/54519

Travis CI builds for my personal repo often failed after 4+ hours with a
time-limit error:
"The job exceeded the maximum time limit for jobs, and has been terminated."



Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Bowen Li
https://travis-ci.org/apache/flink/builds/549681530  This build request has
been sitting at **HEAD of the queue** since I first saw it at PST 10:30am
(not sure how long it's been there before 10:30am). It's PST 4:12pm now and
it hasn't started yet.



[DISCUSS] solve unstable build capacity problem on TravisCI

2019-06-24 Thread Bowen Li
Hi devs,

I've been experiencing the pain resulting from the lack of stable build
capacity on Travis for Flink PRs [1]. Specifically, I have often noticed that
no build in the queue makes any progress for hours, and then suddenly 5 or 6
builds kick off all together after the long pause. I'm in the PST (UTC-08)
time zone, and I've seen pauses as long as 6 hours, from 9am to 3pm PST
(let alone the time needed to drain the queue afterwards).

I think this has greatly impacted our productivity. In my experience, PRs
submitted in the early morning PST don't finish their builds until late at
night the same day.

So my questions are:

- Has anyone else experienced the same problem or made similar observations
on TravisCI? (I suspect it has something to do with time zones.)

- What TravisCI pricing plan is Flink currently using? Is it the free
plan for open source projects? What is the guaranteed build capacity of
the current plan?

- If the current pricing plan (either free or paid) can't provide stable
build capacity, can we upgrade to a higher-priced plan with larger and more
stable build capacity?

BTW, another factor that contributes to the productivity problem is that our
build is slow: we run a full build for every PR, and a successful full build
takes ~5h. We definitely have more options to solve that, for instance,
modularizing the build graph and reusing artifacts from the previous build.
But I think that would be a big effort that is much harder to accomplish in
a short period of time and may deserve its own separate discussion.
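
To make the "modularize the build graph" option a bit more concrete, here is
a minimal sketch in Python of selecting only the modules affected by a PR.
Everything in it is an assumption for illustration: the module names, the
DEPENDENTS map, and the helper functions are hypothetical and do not reflect
Flink's actual module layout or any Travis feature.

```python
from collections import deque

# Hypothetical map: module -> modules that directly depend on it.
# Illustrative names only; this is not Flink's real build graph.
DEPENDENTS = {
    "core": ["runtime", "connectors"],
    "runtime": ["table"],
    "connectors": [],
    "table": [],
    "docs": [],
}


def module_of(path):
    """Map a changed file path to its top-level module (first path segment)."""
    return path.split("/", 1)[0]


def affected_modules(changed_paths):
    """Return the changed modules plus every module downstream of them."""
    queue = deque(sorted({module_of(p) for p in changed_paths}))
    seen = set(queue)
    # Breadth-first walk over the dependents graph.
    while queue:
        module = queue.popleft()
        for dependent in DEPENDENTS.get(module, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)
```

With this made-up map, a PR that only touches files under runtime/ would need
to rebuild runtime and table; the other modules could in principle reuse
artifacts from an earlier build.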

[1] https://travis-ci.org/apache/flink/pull_requests