@robert Can you expand on how the Azure setup interacts with CiBot? Do we have to continue mirroring builds into flink-ci? How will the cronjob configuration work? We should have a general idea of how to implement this before proceeding. Additionally, moving all jobs into flink-ci requires setting up the environment variables we have; can we set these up via files, or will we have to give all committers permissions for flink-ci/flink?

On 04/12/2019 12:55, Chesnay Schepler wrote:
From what I've seen so far, Azure will provide us with a better experience, so I'd say +1 for the transition as a whole.

I'd delay the merge at least until the feature branch is cut.
Given the parental leave, it may even make sense to start merging only in January, after the leave, to reduce the total time taken for the transition.

Reviews could perhaps be done earlier, but I'm wondering whether anyone would even have the time to do so at the moment.

On 04/12/2019 12:35, Kurt Young wrote:
Thanks Robert for driving this. Another big pain point of the current
Travis setup is that its cache mechanism fails from time to time. Around
50% of the build failures are caused by cache problems. I reported this
issue to Travis but have not received a response yet. So a big +1 from my
side.

Just one comment: we are close to the 1.10 feature freeze, and we will
spend some time making the tests stable before the release. I would prefer
that this replacement happen after the 1.10 release; otherwise it will be
an unstable factor during release testing.

Best,
Kurt


On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <reed...@gmail.com> wrote:

Thanks Robert for the updates! And thanks a lot for all the efforts to
investigate, experiment and tune Azure Pipelines for Flink building.
Big +1 for it.

It would be great if the community build capacity could be extended with
custom machines, so that tests are not queued for long as the number of
PRs grows daily.

The increased timeout would also be very helpful.
The 50-minute timeout for free Travis accounts is currently a pain,
especially when we'd like to run e2e tests in our own Travis accounts. I
had to manually split the jobs to make them pass.

Thanks,
Zhu Zhu

On Wed, Dec 4, 2019 at 6:36 PM Robert Metzger <rmetz...@apache.org> wrote:

Hi all,

as a follow-up to our discussion on reducing the build time [1], I would
like to propose migrating our build infrastructure to Azure Pipelines
(away from Travis).

I believe that we have reached the limits of what Travis can provide the Flink community, and I don't want the build system to limit or influence
the project's growth.

*Benefits:*
1. Free Travis accounts are limited to 5 parallel builds, with a timeout
of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts*
for free for open source projects.
2. Azure Pipelines allows us to *add custom build machines* to the pool of
10 free parallel builders.
This will allow the Flink community to scale the available build capacity
as the project grows. We are dependent on donations from supporting
companies, but I believe that it is easier for companies to donate
machines than money.
Alibaba is willing to provide 10 machines, with 32 cores each, to the
Flink project for this purpose.
In addition, Xiyuan, who's working on adding ARM support for Flink,
provided me with 2 ARM machines (16 cores each).
I want to use the custom, more efficient build machines for building
Flink's pull requests and master-pushes.
3. *Azure Pipelines is a more feature-rich tool*, allowing us, for
example, to transfer intermediate build artifacts between pipeline stages.
This will allow us to make the build more reliable (we are currently
abusing the caching mechanism in Travis for this).
It also has some basic analytics on test results, flaky tests, etc.
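
As a rough tally of the figures above, here is a back-of-the-envelope sketch in Python. It is not part of any Flink or Azure tooling, and the function and variable names are made up for illustration. The "job-minutes" bound simply multiplies the parallel slots by the per-build timeout; it says nothing about actual throughput.

```python
# Figures quoted in the benefits list; names are illustrative only.
TRAVIS = {"parallel_builds": 5, "timeout_min": 50}
AZURE_FREE = {"parallel_builds": 10, "timeout_min": 300}

# Donated machines as (machine count, cores per machine).
DONATED = {"Alibaba": (10, 32), "ARM": (2, 16)}


def max_job_minutes(tier):
    """Upper bound on build minutes in flight if every slot runs to its timeout."""
    return tier["parallel_builds"] * tier["timeout_min"]


def donated_cores(machines):
    """Total cores across all donated machine groups."""
    return sum(count * cores for count, cores in machines.values())


travis_bound = max_job_minutes(TRAVIS)
azure_bound = max_job_minutes(AZURE_FREE)
total_cores = donated_cores(DONATED)

print(f"Travis free tier: {travis_bound} job-minutes")
print(f"Azure free tier:  {azure_bound} job-minutes")
print(f"Donated cores:    {total_cores}")
```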

*Known problems:*
- Initially, we might see different build instabilities than before
- There's a higher maintenance overhead for the custom build machines
(keeping them up to date etc.)
- We cannot use the build status integration of AZP, because it requires
write access to the repository's source. The foundation does not allow
that [2].
I propose to extend flinkbot / the flink-ci repository instead.

*Current Status:*
- I'm able [3] to execute [4] the current custom build scripts on Azure
Pipelines: this means that we will have one compile stage and N testing
jobs in the 2nd stage. Currently, we have N=10 testing jobs.
The time from the start of a build until all tests have completed is
1 hour 22 minutes.
- I'm working on getting the nightly end to end tests to run on the new
infrastructure.
- I'm working on getting the build to work on our pool of custom machines
as well.
- I'm working on setting up the full matrix of builds (different Scala,
Hadoop etc. versions) for the nightlies.

*Next Steps:*
- I propose to document the entire build system in the Flink Wiki
- Once Azure can cover the same pull request tests as Travis, I would set
it up to run in parallel (including Flinkbot posting links to Azure). I
hope that this phase lasts for 1-2 weeks only, so that we do not have to
maintain things concurrently. I will monitor the build stability closely,
but would expect some support from the contributors with debugging
potential issues.
- Once there are no problems with the new setup, we remove the Travis
setup.
- Independently, I will work on triggering builds from master /
release-branch pushes, as well as cron builds from the master branch. All
this will be described in the Wiki.


*Timeline:*
- Once I have the feeling that people are supportive of the idea, I will
start documenting in the Wiki. The first pull requests should show up
after a few more days.
I will take one month of parental leave starting some time later in
December, which will probably delay things a bit. I hope to have
everything finished by the end of January.

I'm happy to hear your thoughts on this work.
If nobody objects, I will start documenting the system and prepare
everything for the migration.

Best,
Robert



[1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-17030
[3] https://github.com/rmetzger/flink/tree/azure_playground
[4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary



