Hi all,

As a follow-up to our discussion on reducing the build time [1], I would like to propose migrating our build infrastructure from Travis to Azure Pipelines.
I believe that we have reached the limits of what Travis can provide the Flink community, and I don't want the build system to limit or influence the project's growth.

*Benefits:*

1. Free Travis accounts are limited to 5 parallel builds, with a timeout of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts* for free for open source projects.
2. Azure Pipelines allows us to *add custom build machines* to the pool of 10 free parallel builders. This will allow the Flink community to scale the available build capacity as the project grows. We are dependent on donations from supporting companies for this, but I believe that it is easier for companies to donate machines than money. Alibaba is willing to provide 10 machines with 32 cores each to the Flink project for this purpose. In addition, Xiyuan, who is working on adding ARM support for Flink, provided me with 2 ARM machines (16 cores each). I want to use the custom, more efficient build machines for building Flink's pull requests and master pushes.
3. *Azure Pipelines is a more feature-rich tool*. For example, it allows transferring intermediate build artifacts between pipeline stages. This will allow us to make the build more reliable (we are currently abusing the caching mechanism in Travis for this). It also has some basic analytics on test results, flaky tests etc.

*Known problems:*

- Initially, we might see different build instabilities than before.
- There's a higher maintenance overhead for the custom build machines (keeping them up to date etc.).
- We cannot use the build status integration of AZP, because it requires write access to the repository's source. The foundation does not allow that [2]. Instead, I propose to extend flinkbot / the flink-ci repository.

*Current Status:*

- I'm able [3] to execute [4] the current custom build scripts on Azure Pipelines. This means that we will have one compile stage, and N testing jobs in the 2nd stage. Currently, we have N=10 testing jobs.
The time from the start of a build until all tests have completed is 1 hour 22 minutes.
- I'm working on getting the nightly end-to-end tests to run on the new infrastructure.
- I'm working on getting the build to work on our pool of custom machines as well.
- I'm working on setting up the full matrix of builds (different Scala, Hadoop etc. versions) for the nightlies.

*Next Steps:*

- I propose to document the entire build system in the Flink Wiki.
- Once Azure can cover the same pull request tests as Travis, I would set it up to run in parallel (including Flinkbot posting links to Azure). I hope that this phase lasts for only 1-2 weeks, so that we do not have to maintain both systems concurrently for long. I will monitor the build stability closely, but would expect some support from the contributors with debugging potential issues.
- Once there are no problems with the new setup, we remove the Travis setup.
- Independently, I will work on triggering builds from master / release-branch pushes, as well as cron builds from the master branch. All of this will be described in the Wiki.

*Timeline:*

- Once I have the feeling that people are supportive of the idea, I will start documenting in the Wiki. The first pull requests should show up a few days after that. I will take one month of parental leave starting some time later in December, which will probably delay things a bit. I hope to have everything finished by the end of January.

I'm happy to hear your thoughts on this work. If nobody objects, I will start documenting the system and prepare everything for the migration.

Best,
Robert

[1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-17030
[3] https://github.com/rmetzger/flink/tree/azure_playground
[4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary