I've created a first draft of my plans in the wiki:
https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines.
I'm looking forward to your comments.

On Thu, Dec 5, 2019 at 12:37 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thank you all for the positive feedback. I will start putting together a
> page in the wiki.
>
> @Jark: Azure Pipelines provides a free service that is even better than
> what Travis provides for free: 10 parallel builds with 6-hour timeouts.
>
> @Chesnay: I will answer your questions in the yet-to-be-written
> documentation in the wiki.
>
>
> On Thu, Dec 5, 2019 at 11:58 AM Arvid Heise <ar...@ververica.com> wrote:
>
>> +1 I had good experiences with Azure pipelines in the past.
>>
>> On Thu, Dec 5, 2019 at 11:35 AM Aljoscha Krettek <aljos...@apache.org>
>> wrote:
>>
>> > +1
>> >
>> > Thanks for the effort! The tooling seems to be quite a bit nicer and I
>> > like that we can grow by adding more machines.
>> >
>> > Best,
>> > Aljoscha
>> >
>> > > On 5. Dec 2019, at 03:18, Jark Wu <imj...@gmail.com> wrote:
>> > >
>> > > +1 for Azure pipeline because it promises better performance.
>> > >
>> > > However, I have 2 concerns:
>> > >
>> > > 1) Travis provides a free personal service for testing personal branches.
>> > > Usually, contributors use this feature to test PoCs or run cron jobs for
>> > > pull requests.
>> > >    Using a local machine would cost a lot of time. Does AZP provide the
>> > > same free service?
>> > > 2) Currently, we have deployed a webhook [1] that receives Travis CI build
>> > > notifications [2] and sends them to the bui...@flink.apache.org mailing list.
>> > >    We need to figure out how to send Azure build results to the
>> > > mailing list. This [3] might be the way to go.
>> > >
>> > > builds@f.a.o mailing list
>> > >
>> > > Best,
>> > > Jark
>> > >
>> > > [1]: https://github.com/wuchong/flink-notification-bot
>> > > [2]: https://docs.travis-ci.com/user/notifications/#configuring-webhook-notifications
>> > > [3]: https://docs.microsoft.com/en-us/azure/devops/service-hooks/overview?view=azure-devops
>> > >
>> > >
>> > >
>> > > On Wed, 4 Dec 2019 at 22:48, Jeff Zhang <zjf...@gmail.com> wrote:
>> > >
>> > >> +1
>> > >>
>> > >> Till Rohrmann <trohrm...@apache.org> wrote on Wed, Dec 4, 2019 at 10:43 PM:
>> > >>
>> > >>> +1 for moving to Azure Pipelines as it promises better scalability and
>> > >>> tooling. Looking forward to having faster builds and hence shorter
>> > >>> feedback cycles :-)
>> > >>>
>> > >>> Cheers,
>> > >>> Till
>> > >>>
>> > >>> On Wed, Dec 4, 2019 at 1:24 PM Chesnay Schepler <ches...@apache.org> wrote:
>> > >>>
>> > >>>> @Robert Can you expand on how the Azure setup interacts with CiBot? Do we
>> > >>>> have to continue mirroring builds into flink-ci? How will the cronjob
>> > >>>> configuration work? We should have a general idea of how to implement
>> > >>>> this before proceeding.
>> > >>>> Additionally, moving /all/ jobs into flink-ci requires setting up the
>> > >>>> environment variables we have; can we set these up via files or will we
>> > >>>> have to give all committers permissions for flink-ci/flink?
>> > >>>>
>> > >>>> On 04/12/2019 12:55, Chesnay Schepler wrote:
>> > >>>>> From what I've seen so far, Azure will provide us a better experience,
>> > >>>>> so I'd say +1 for the transition as a whole.
>> > >>>>>
>> > >>>>> I'd delay the merge at least until the feature branch is cut.
>> > >>>>> Given the parental leave, it may even make sense to only start merging
>> > >>>>> in January afterwards, to reduce the total time taken for the transition.
>> > >>>>>
>> > >>>>> Reviews could maybe be done earlier, but I'm wondering whether anyone
>> > >>>>> would even have the time at the moment to do so.
>> > >>>>>
>> > >>>>> On 04/12/2019 12:35, Kurt Young wrote:
>> > >>>>>> Thanks Robert for driving this. Another big pain point of the current
>> > >>>>>> Travis setup is that its cache mechanism fails from time to time. Around
>> > >>>>>> 50% of the build failures are caused by cache problems. I reported this
>> > >>>>>> issue to Travis but have received no response yet. So big +1 from my side.
>> > >>>>>>
>> > >>>>>> Just one comment: we are close to the 1.10 feature freeze and will spend
>> > >>>>>> some time making the tests stable before the release. I would prefer this
>> > >>>>>> replacement to happen after the 1.10 release; otherwise it will be an
>> > >>>>>> unstable factor during release testing.
>> > >>>>>>
>> > >>>>>> Best,
>> > >>>>>> Kurt
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <reed...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> Thanks Robert for the updates! And thanks a lot for all the efforts to
>> > >>>>>>> investigate, experiment with and tune Azure Pipelines for building Flink.
>> > >>>>>>> Big +1 for it.
>> > >>>>>>>
>> > >>>>>>> It would be great if the community build capacity could be extended with
>> > >>>>>>> custom machines, so that tests are not queued for long as the number of
>> > >>>>>>> PRs grows daily.
>> > >>>>>>>
>> > >>>>>>> The increased timeout would also be very helpful.
>> > >>>>>>> The 50-minute timeout for free Travis accounts is currently a pain,
>> > >>>>>>> especially when we'd like to run e2e tests in our own Travis accounts. I
>> > >>>>>>> had to manually split the jobs to make it possible for them to pass.
>> > >>>>>>>
>> > >>>>>>> Thanks,
>> > >>>>>>> Zhu Zhu
>> > >>>>>>>
>> > >>>>>>> Robert Metzger <rmetz...@apache.org> wrote on Wed, Dec 4, 2019 at 6:36 PM:
>> > >>>>>>>
>> > >>>>>>>> Hi all,
>> > >>>>>>>>
>> > >>>>>>>> as a follow-up from our discussion on reducing the build time [1], I
>> > >>>>>>>> would like to propose migrating our build infrastructure to Azure
>> > >>>>>>>> Pipelines (away from Travis).
>> > >>>>>>>>
>> > >>>>>>>> I believe that we have reached the limits of what Travis can provide the
>> > >>>>>>>> Flink community, and I don't want the build system to limit or influence
>> > >>>>>>>> the project's growth.
>> > >>>>>>>>
>> > >>>>>>>> *Benefits:*
>> > >>>>>>>> 1. Free Travis accounts are limited to 5 parallel builds, with a timeout
>> > >>>>>>>> of 50 minutes. Azure offers *10 parallel builds with 300-minute timeouts*
>> > >>>>>>>> for free for open source projects.
>> > >>>>>>>> 2. Azure Pipelines allows us to *add custom build machines* to the pool of
>> > >>>>>>>> 10 free parallel builders.
>> > >>>>>>>> This will allow the Flink community to scale the available build capacity
>> > >>>>>>>> as the project grows. We are dependent on donations from supporting
>> > >>>>>>>> companies, but I believe that it is easier for companies to donate
>> > >>>>>>>> machines than money.
>> > >>>>>>>> Alibaba is willing to provide 10 machines, with 32 cores each, to the
>> > >>>>>>>> Flink project for this purpose.
>> > >>>>>>>> In addition, Xiyuan, who's working on adding ARM support for Flink,
>> > >>>>>>>> provided me with 2 ARM machines (16 cores each).
>> > >>>>>>>> I want to use the custom, more efficient build machines for building
>> > >>>>>>>> Flink's pull requests and master-pushes.
>> > >>>>>>>> 3. *Azure Pipelines is a more feature-rich tool*, allowing, for example,
>> > >>>>>>>> the transfer of intermediate build artifacts between pipeline stages. This
>> > >>>>>>>> will allow us to make the build more reliable (we are currently abusing
>> > >>>>>>>> the caching mechanism in Travis for this).
>> > >>>>>>>> It also has some basic analytics on test results / flaky tests etc.
>> > >>>>>>>>
>> > >>>>>>>> *Known problems:*
>> > >>>>>>>> - Initially, we might see different build instabilities than before
>> > >>>>>>>> - There's a higher maintenance overhead for the custom build machines
>> > >>>>>>>> (keeping them up to date etc.)
>> > >>>>>>>> - We cannot use the build status integration of AZP, because it requires
>> > >>>>>>>> write access to the repository's source. The foundation does not allow
>> > >>>>>>>> that [2].
>> > >>>>>>>> I propose to extend flinkbot / the flink-ci repository.
>> > >>>>>>>>
>> > >>>>>>>> *Current Status:*
>> > >>>>>>>> - I'm able [3] to execute [4] the current custom build scripts on Azure
>> > >>>>>>>> Pipelines: This means that we will have one compile stage, and N testing
>> > >>>>>>>> jobs in the 2nd stage. Currently, we have N=10 testing jobs.
>> > >>>>>>>> The time from the start of a build until all tests have completed is
>> > >>>>>>>> 1 hour 22 minutes.
>> > >>>>>>>> - I'm working on getting the nightly end-to-end tests to run on the new
>> > >>>>>>>> infrastructure.
>> > >>>>>>>> - I'm working on getting the build to work on our pool of custom machines
>> > >>>>>>>> as well
>> > >>>>>>>> - I'm working on setting up the full matrix of builds (different Scala,
>> > >>>>>>>> Hadoop etc. versions) for the nightlies
>> > >>>>>>>>
>> > >>>>>>>> *Next Steps:*
>> > >>>>>>>> - I propose to document the entire build system in the Flink Wiki
>> > >>>>>>>> - Once Azure can cover the same pull request tests as Travis, I would set
>> > >>>>>>>> it up to run in parallel (including Flinkbot posting links to Azure). I
>> > >>>>>>>> hope that this phase lasts for 1-2 weeks only, so that we do not have to
>> > >>>>>>>> maintain things concurrently. I will monitor the build stability closely,
>> > >>>>>>>> but would expect some support with debugging potential issues from the
>> > >>>>>>>> contributors.
>> > >>>>>>>> - Once there are no problems with the new setup, we remove the Travis
>> > >>>>>>>> setup.
>> > >>>>>>>> - Independently, I will work on triggering builds from master /
>> > >>>>>>>> release-branch pushes, as well as cron builds from the master branch ...
>> > >>>>>>>> all this will be described in the Wiki.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> *Timeline:*
>> > >>>>>>>> - Once I have the feeling that people are supportive of the idea, I will
>> > >>>>>>>> start documenting in the Wiki. The first pull requests should show up
>> > >>>>>>>> after a few more days.
>> > >>>>>>>> I will take a one-month parental leave starting some time later in
>> > >>>>>>>> December, which will probably delay things a bit. I hope to have
>> > >>>>>>>> everything finished by the end of January.
>> > >>>>>>>>
>> > >>>>>>>> I'm happy to hear your thoughts on this work.
>> > >>>>>>>> If nobody objects, I will start documenting the system and prepare
>> > >>>>>>>> everything for the migration.
>> > >>>>>>>>
>> > >>>>>>>> Best,
>> > >>>>>>>> Robert
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> [1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
>> > >>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17030
>> > >>>>>>>> [3] https://github.com/rmetzger/flink/tree/azure_playground
>> > >>>>>>>> [4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
>> > >>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>
>> > >>
>> > >> --
>> > >> Best Regards
>> > >>
>> > >> Jeff Zhang
>> > >>
>> >
>> >
>>
>
