Thanks Robert for the updates! And thanks a lot for all the effort to
investigate, experiment with, and tune Azure Pipelines for building Flink.
Big +1 for it.
It would be great if the community build capacity could be extended with
custom machines, so that tests are not queued for long as the number of
PRs grows daily.
The increased timeout would also be very helpful.
The 50-minute timeout for free Travis accounts is currently a pain,
especially when we'd like to run e2e tests in our own Travis. I had to
manually split the jobs to make them pass.
Thanks,
Zhu Zhu
On Wed, Dec 4, 2019 at 6:36 PM, Robert Metzger <rmetz...@apache.org> wrote:
Hi all,
as a follow-up to our discussion on reducing the build time [1], I would
like to propose migrating our build infrastructure to Azure Pipelines
(away from Travis).
I believe that we have reached the limits of what Travis can provide the
Flink community, and I don't want the build system to limit or influence
the project's growth.
*Benefits:*
1. Free Travis accounts are limited to 5 parallel builds, with a timeout
of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts*
for free for open source projects.
2. Azure Pipelines allows us to *add custom build machines* to the pool
of 10 free parallel builders.
This will allow the Flink community to scale the available build capacity
as the project grows. We are dependent on donations from supporting
companies, but I believe that it is easier for companies to donate
machines than money.
Alibaba is willing to provide 10 machines, with 32 cores each, to the
Flink project for this purpose.
In addition, Xiyuan, who's working on adding ARM support for Flink,
provided me with 2 ARM machines (16 cores each).
I want to use the custom, more efficient build machines for building
Flink's pull requests and master-pushes.
3. *Azure Pipelines is a more feature-rich tool*, allowing us, for
example, to transfer intermediate build artifacts between pipeline stages.
This will allow us to make the build more reliable (we are currently
abusing the caching mechanism in Travis for this).
It also has some basic analytics on test results / flaky tests etc.
*Known problems:*
- Initially, we might see different build instabilities than before
- There's a higher maintenance overhead for the custom build machines
(keeping them up to date etc.)
- We cannot use the build status integration of AZP, because it requires
write access to the repository's source. The foundation does not allow
that [2].
I propose to extend flinkbot / the flink-ci repository instead.
*Current Status:*
- I'm able [3] to execute [4] the current custom build scripts on Azure
Pipelines: This means that we will have one compile stage, and N testing
jobs in the 2nd stage. Currently, we have N=10 testing jobs.
The time from the start of a build until all tests have completed is
1 hour 22 minutes.
- I'm working on getting the nightly end-to-end tests to run on the new
infrastructure.
- I'm working on getting the build to work on our pool of custom machines
as well
- I'm working on setting up the full matrix of builds (different Scala,
Hadoop etc. versions) for the nightlies
*Next Steps:*
- I propose to document the entire build system in the Flink Wiki
- Once Azure can cover the same pull request tests as Travis, I would set
it up to run in parallel (including Flinkbot posting links to Azure). I
hope that this phase lasts for 1-2 weeks only, so that we do not have to
maintain things concurrently. I will monitor the build stability closely,
but would expect some support with debugging potential issues from the
contributors.
- Once there are no problems with the new setup, we remove the Travis
setup.
- Independently, I will work on triggering builds from master /
release-branch pushes, as well as cron builds from the master branch.
All this will be described in the Wiki.
*Timeline:*
- Once I have the feeling that people are supportive of the idea, I will
start documenting in the Wiki. The first pull requests should show up
after a few more days.
I will be taking one month of parental leave starting some time later in
December, which will probably delay things a bit. I hope to have
everything finished by the end of January.
I'm happy to hear your thoughts on this work.
If nobody objects, I will start documenting the system and prepare
everything for the migration.
Best,
Robert
[1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-17030
[3] https://github.com/rmetzger/flink/tree/azure_playground
[4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary