Hi all,

as a follow-up to our discussion on reducing the build time [1], I would
like to propose migrating our build infrastructure to Azure Pipelines (away
from Travis).

I believe that we have reached the limits of what Travis can provide the
Flink community, and I don't want the build system to limit or influence
the project's growth.

*Benefits:*
1. Free Travis accounts are limited to 5 parallel builds, with a timeout
of 50 minutes. Azure offers *10 parallel builds with 300-minute timeouts* for
free for open source projects.
2. Azure Pipelines allows us to *add custom build machines* to the pool of
10 free parallel builders.
This will allow the Flink community to scale the available build capacity
as the project grows. We are dependent on donations from supporting
companies, but I believe that it is easier for companies to donate machines
than money.
Alibaba is willing to provide 10 machines with 32 cores each to the Flink
project for this purpose.
In addition, Xiyuan, who's working on adding ARM support for Flink, provided
me with 2 ARM machines (16 cores each).
I want to use the custom, more efficient build machines for building
Flink's pull requests and master-pushes.
3. *Azure Pipelines is a more feature-rich tool*, allowing us, for example,
to transfer intermediate build artifacts between pipeline stages. This will
allow us to make the build more reliable (we are currently abusing the
caching mechanism in Travis for this).
It also has some basic analytics on test results, flaky tests, etc.

*Known problems:*
- Initially, we might see different build instabilities than before
- There's a higher maintenance overhead for the custom build machines
(keeping them up to date etc.)
- We cannot use the build status integration of AZP, because it requires
write access to the repository's source. The foundation does not allow that
[2].
I propose to extend flinkbot / the flink-ci repository for this instead.

*Current Status:*
- I'm able [3] to execute [4] the current custom build scripts on Azure
Pipelines. This means that we will have one compile stage, and N testing
jobs in the 2nd stage. Currently, we have N=10 testing jobs.
The time from the start of a build until all tests have completed is
1 hour 22 minutes.
- I'm working on getting the nightly end to end tests to run on the new
infrastructure.
- I'm working on getting the build to work on our pool of custom machines
as well.
- I'm working on setting up the full matrix of builds (different Scala,
Hadoop, etc. versions) for the nightlies.
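
To make the effect of the two-stage layout (one compile stage, then N
testing jobs) and of the number of parallel builders more concrete, here is
a minimal sketch of a wall-clock estimate. The function name and all
durations are hypothetical and chosen only for illustration; they are not
measurements from the actual Flink build, and the scheduling model (each
test job goes to whichever builder frees up first) is an assumption, not a
description of how Azure Pipelines assigns jobs.

```python
import heapq


def pipeline_wall_clock(compile_minutes, test_job_minutes, parallel_agents):
    """Estimate wall-clock minutes for a compile stage followed by test jobs.

    Assumption: the test stage starts only after the compile stage finishes,
    and each test job is handed to the agent that becomes free first.
    """
    if parallel_agents < 1:
        raise ValueError("need at least one build agent")

    # Each heap entry is the minute (relative to the start of the test
    # stage) at which one agent becomes free again.
    agents = [0] * parallel_agents
    heapq.heapify(agents)

    for duration in test_job_minutes:
        free_at = heapq.heappop(agents)
        heapq.heappush(agents, free_at + duration)

    # The test stage ends when the busiest agent finishes its last job.
    return compile_minutes + max(agents)


# Hypothetical durations: a 20-minute compile stage and 10 test jobs of
# 60 minutes each.
jobs = [60] * 10
with_ten_agents = pipeline_wall_clock(20, jobs, parallel_agents=10)
with_five_agents = pipeline_wall_clock(20, jobs, parallel_agents=5)
```

Under these made-up numbers, all ten test jobs run side by side with 10
builders, while with 5 builders each builder has to run two jobs back to
back, so the test stage takes twice as long.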

*Next Steps:*
- I propose to document the entire build system in the Flink Wiki
- Once Azure can cover the same pull request tests as Travis, I would set
it up to run in parallel (including Flinkbot posting links to Azure). I
hope that this phase lasts for 1-2 weeks only, so that we do not have to
maintain things concurrently. I will monitor the build stability closely,
but would expect some support with debugging potential issues from the
contributors.
- Once there are no problems with the new setup, we will remove the Travis
setup.
- Independently, I will work on triggering builds from master /
release-branch pushes, as well as cron builds from the master branch. All
this will be described in the Wiki.


*Timeline:*
- Once I have the feeling that people are supportive of the
idea, I will start documenting in the Wiki. The first pull requests should
show up after a few more days.
I will be on a one-month parental leave starting some time later in December,
which will probably delay things a bit. I hope to have everything finished
by the end of January.

I'm happy to hear your thoughts on this work.
If nobody objects, I will start documenting the system and prepare
everything for the migration.

Best,
Robert



[1]
https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-17030
[3] https://github.com/rmetzger/flink/tree/azure_playground
[4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
