Thanks for your feedback, Alex. I responded to your comments below:

> This is mentioned in the "Limitations of GitHub Actions in the past"
> section of the FLIP. Does this also apply to the Apache INFRA setup or can
> we expect contributors' runs executed there too?


Workflow runs on Flink forks (independent of PRs that would merge to Apache
Flink's core repo) will be executed with runners provided by GitHub with
their own limitations. Secrets are not set in these runs (similar to what
we have right now with PR runs).

If we allow the PR CI to run on Apache INFRA-hosted ephemeral runners, we
might have the same freedom because of their ephemeral nature (the VMs are
discarded after each run).

We only have to start thinking about self-hosted customized runners if we
decide/need to have dedicated VMs for Flink's CI (similar to what we have
right now with Azure CI and Alibaba's VMs). This might happen if the
waiting times for acquiring a runner are too long. In that case, we might
give a certain group of people (e.g. committers) or certain types of events
(e.g. PRs, nightly builds, PR merges) the ability to use the self-hosted
runners.
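
To make the idea of restricting self-hosted runners more concrete, here is a
minimal sketch of such a routing policy. It is purely illustrative: the group
names, event names and the select_runner helper are assumptions made up for
this example and are not part of the FLIP or of Apache INFRA's setup.

```python
from dataclasses import dataclass

# Runner labels used for illustration only (assumption, not from the FLIP).
GITHUB_HOSTED = "ubuntu-latest"
SELF_HOSTED = "self-hosted"

# Hypothetical policy values, mirroring the examples mentioned above.
TRUSTED_GROUPS = {"committer"}
TRUSTED_EVENTS = {"nightly", "pr_merge"}


@dataclass(frozen=True)
class WorkflowRun:
    actor_group: str  # e.g. "committer" or "contributor"
    event: str        # e.g. "pull_request", "nightly", "pr_merge"


def select_runner(run: WorkflowRun) -> str:
    """Return the runner label a workflow run would be scheduled on."""
    # Trusted people or trusted event types may use the dedicated machines.
    if run.actor_group in TRUSTED_GROUPS or run.event in TRUSTED_EVENTS:
        return SELF_HOSTED
    # Everything else falls back to the GitHub-provided runners.
    return GITHUB_HOSTED
```

With this policy, a PR opened by a non-committer would stay on the
GitHub-provided runners, while nightly builds would use the dedicated VMs.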

> As you mentioned in the FLIP, there are some timeout-related test
> discrepancies between different setups. Similar discrepancies could
> manifest themselves between the GitHub runners and the Apache INFRA
> runners. It would be great if we could have a uniform setup, where if
> tests pass in the individual CI, they also pass in the main runner and vice
> versa.


I agree. So far, what we've seen is that the timeout instability comes
from overly optimistic timeout configurations in some tests (they eventually
also fail in Azure CI, but the GitHub-provided runners seem to be more
sensitive in this regard). Fixing the tests whenever such flakiness is
observed should get us to a point where test behavior matches across the
different runners.

We had a similar issue in the Azure CI setup: Certain tests were more
stable on the Alibaba machines than on Azure VMs. That is why we introduced
a dedicated stage for Azure CI VMs as part of the nightly runs (see
FLINK-18370 [1]). We could do the same for GitHub Actions if necessary.

> Currently we have such memory limits-related issues in individual vs main
> Azure CI pipelines.


I'm not sure I understand what you mean by memory limit-related issues. The
GitHub-provided runners do not seem to run into memory-related issues. We
have to see whether this also applies to Apache INFRA-provided runners. My
hope is that they have even better hardware than what GitHub offers. But
GitHub-provided runners seem to be a good fallback to rely on (see the
workflows I shared in my previous response to Xintong's message).

[1] https://issues.apache.org/jira/browse/FLINK-18370

On Wed, Nov 29, 2023 at 3:17 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

> Thanks for your comments, Xintong. See my answers below.
>
>
>> I think it would be helpful if we can at the end migrate the CI to an
>> ASF-managed Github Action, as long as it provides us a similar
>> computation capacity and stability.
>
>
> The current test runs in my Flink fork (using the GitHub-provided runners)
> suggest that even with generic GitHub runners we get decent performance
> and stability. In that sense, I'm confident that we wouldn't lose much.
>
> Here's a comparison of the pipelines once more:
> * Nightly workflow: GitHub Actions [1] vs Azure CI [2]
> * PR workflow: GitHub Actions [3] vs Azure CI [4]
>
> [1] https://github.com/XComp/flink/actions/workflows/flink-ci-extended.yml
> [2]
> https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=1&_a=summary
> [3] https://github.com/XComp/flink/actions/workflows/flink-ci-basic.yml
> [4] https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2
>
>> Regarding the migration plan, I wonder if we should not disable the CI bot
>> until we fully decide to migrate to Github Actions? In case the nightly
>> runs don't really work well, it might be debatable whether we should
>> maintain the CI in two places (i.e. PRs on Github Actions and cron builds
>> on Azure).
>
>
> The CI bot handles the PR CI. Disabling it would mean that users would
> fully rely on the GitHub Actions workflow right away. I like the fact that
> for PRs we actually have both. That makes it more obvious if CI is not on
> par.
> For the nightly builds, I'm not too worried because they are not exposed
> to the contributors that much. How to handle them is more a question for
> the release managers who monitor the nightly runs. But even there I see
> benefits of having both CIs running for some time to see how much they
> differ from each other in terms of stability.
>
>> - What exactly are the changes that would affect contributors during the
>> trial period? Is it only an additional CI report that you can potentially
>> just ignore? Or would there be some larger impacts, e.g. you cannot merge a
>> PR if the Github Action CI is not passed (I don't know, I just made this
>> up)?
>
>
> My plan would be to enable the GitHub Actions PR workflow for PRs as well,
> so that we have the comparison. For contributors, this would mean an
> additional CI report (essentially two CI runs per PR). If that's not what
> we want, we could disable it for PRs and only allow the basic CI run for
> pushes to master.
>
> On Wed, Nov 29, 2023 at 2:31 PM Alexander Fedulov <
> alexander.fedu...@gmail.com> wrote:
>
>> Thanks for driving this, Matthias! +1 for joining the INFRA trial.
>>
>>
>> > Apache Infra did some experimenting on self-hosted runners in collaboration
>> > with Apache Airflow (see ashb/runner with releases/pr-security-options branch)
>> > where they only allow certain groups of users (e.g. committers) to run their
>> > workflows on self-hosted machines. Any other group would have to rely on
>> > GitHub’s runners.
>>
>> This is mentioned in the "Limitations of GitHub Actions in the past" section
>> of the FLIP. Does this also apply to the Apache INFRA setup or can we expect
>> contributors' runs executed there too? As you mentioned in the FLIP, there are
>> some timeout-related test discrepancies between different setups. Similar
>> discrepancies could manifest themselves between the GitHub runners and the
>> Apache INFRA runners. It would be great if we could have a uniform setup,
>> where if tests pass in the individual CI, they also pass in the main runner
>> and vice versa. Currently we have such memory limits-related issues in
>> individual vs main Azure CI pipelines.
>>
>> >2. Disable Flink’s CI bot for PRs if step #1 is considered successful
>> >3. Join trial program for ephemeral GHA runners
>>
>> Due to potential new kinds of instabilities manifesting themselves in the
>> new setup, can we keep both CIs running in parallel and keep relying on the
>> existing one until we are confident in the test stability on the new
>> ephemeral GHA infra (skip 2.)?
>>
>> Best,
>> Alex
>>
>> On Wed, 29 Nov 2023 at 13:42, Xintong Song <tonysong...@gmail.com> wrote:
>>
>> > Thanks for the efforts, Matthias.
>> >
>> >
>> > I think it would be helpful if we can at the end migrate the CI to an
>> > ASF-managed Github Action, as long as it provides us a similar computation
>> > capacity and stability. Given that the proposal is only to start a trial
>> > and investigate whether the migration is feasible, I don't see much concern
>> > in this.
>> >
>> >
>> > I have only one suggestion and one question.
>> >
>> > - Regarding the migration plan, I wonder if we should not disable the CI
>> > bot until we fully decide to migrate to Github Actions? In case the nightly
>> > runs don't really work well, it might be debatable whether we should
>> > maintain the CI in two places (i.e. PRs on Github Actions and cron builds
>> > on Azure).
>> >
>> > - What exactly are the changes that would affect contributors during the
>> > trial period? Is it only an additional CI report that you can potentially
>> > just ignore? Or would there be some larger impacts, e.g. you cannot merge a
>> > PR if the Github Action CI is not passed (I don't know, I just made this
>> > up)?
>> >
>> >
>> > Best,
>> >
>> > Xintong
>> >
>> >
>> >
>> > On Wed, Nov 29, 2023 at 8:07 PM Yuxin Tan <tanyuxinw...@gmail.com>
>> wrote:
>> >
>> > > Ok, Thanks for the update and the explanations.
>> > >
>> > > Best,
>> > > Yuxin
>> > >
>> > >
>> > > On Wed, Nov 29, 2023 at 15:43, Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
>> > >
>> > > > >
>> > > > > According to the Flip, the new tests will support arm env.
>> > > > > I believe that's good news for arm users. I have a minor
>> > > > > question here. Will it be a blocker before migrating the new
>> > > > > tests? If not, when can we expect arm environment
>> > > > > support to be implemented? Thanks.
>> > > >
>> > > >
>> > > > Thanks for your feedback, everyone.
>> > > >
>> > > > About the ARM support: I want to underline that this FLIP is not about
>> > > > migrating to GitHub Actions but about starting a trial run in the Apache
>> > > > Flink repository. That would allow us to come up with a proper decision
>> > > > whether GitHub Actions is what we want. I admit that the title is a bit
>> > > > "clickbaity". I updated the FLIP's title and its Motivation to make things
>> > > > clear.
>> > > >
>> > > > The FLIP suggests starting a trial period until 1.19 is released to try
>> > > > things out. A proper decision on whether we want to migrate would be made
>> > > > at the end of the 1.19 release cycle.
>> > > >
>> > > > About the ARM support: The related content of the FLIP is entirely based
>> > > > on documentation from Apache INFRA's side. INFRA seems to offer this ARM
>> > > > support for their ephemeral runners. The ephemeral runners are in the
>> > > > testing stage, i.e. these runners are still experimental. Apache INFRA asks
>> > > > Apache projects to join this test.
>> > > >
>> > > > Whether the ARM support is actually possible to achieve within Flink is
>> > > > something we have to figure out as part of the trial run. One conclusion of
>> > > > the trial run could be that we still move ahead with GHA but don't use ARM
>> > > > machines due to some blocking issues.
>> > > >
>> > > > Matthias
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Nov 29, 2023 at 4:46 AM Yuxin Tan <tanyuxinw...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Hi, Matthias,
>> > > > >
>> > > > > Thanks for driving this.
>> > > > > +1 from my side.
>> > > > >
>> > > > > According to the Flip, the new tests will support arm env.
>> > > > > I believe that's good news for arm users. I have a minor
>> > > > > question here. Will it be a blocker before migrating the new
>> > > > > tests? If not, when can we expect arm environment
>> > > > > support to be implemented? Thanks.
>> > > > >
>> > > > > Best,
>> > > > > Yuxin
>> > > > >
>> > > > >
>> > > > > On Wed, Nov 29, 2023 at 03:09, Márton Balassi <balassi.mar...@gmail.com> wrote:
>> > > > >
>> > > > > > Thanks, Matthias. Big +1 from me.
>> > > > > >
>> > > > > > On Tue, Nov 28, 2023 at 5:30 PM Matthias Pohl
>> > > > > > <matthias.p...@aiven.io.invalid> wrote:
>> > > > > >
>> > > > > > > Thanks for the pointer. I'm planning to join that meeting.
>> > > > > > >
>> > > > > > > On Tue, Nov 28, 2023 at 4:16 PM Etienne Chauchot <echauc...@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > FYI there is the ASF infra roundtable soon. One of the subjects for this
>> > > > > > > > session is GitHub Actions. It could be worth passing by:
>> > > > > > > >
>> > > > > > > > December 6th, 2023 at 1700 UTC on the #Roundtable channel on Slack.
>> > > > > > > >
>> > > > > > > > For information about the roundtables, and about how to join,
>> > > > > > > > see: https://infra.apache.org/roundtable.html
>> > > > > > > >
>> > > > > > > > Best
>> > > > > > > >
>> > > > > > > > Etienne
>> > > > > > > >
>> > > > > > > > On 24/11/2023 at 14:16, Maximilian Michels wrote:
>> > > > > > > > > Thanks for reviving the efforts here Matthias! +1 for the transition
>> > > > > > > > > to GitHub Actions.
>> > > > > > > > >
>> > > > > > > > > As for ASF Infra Jenkins, it works fine. Jenkins is extremely
>> > > > > > > > > feature-rich. Not sure about the spare capacity though. I know that
>> > > > > > > > > for Apache Beam, Google donated a bunch of servers to get additional
>> > > > > > > > > build capacity.
>> > > > > > > > >
>> > > > > > > > > -Max
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 23, 2023 at 10:30 AM Matthias Pohl
>> > > > > > > > > <matthias.p...@aiven.io.invalid>  wrote:
>> > > > > > > > >> Btw. even though we've been focusing on GitHub Actions with this FLIP, I'm
>> > > > > > > > >> curious whether somebody has experience with Apache Infra's Jenkins
>> > > > > > > > >> deployment. The discussion I found about Jenkins [1] is quite outdated
>> > > > > > > > >> (2014). I haven't worked with it myself but could imagine that there are
>> > > > > > > > >> some features provided through plugins which are missing in GitHub Actions.
>> > > > > > > > >>
>> > > > > > > > >> [1] https://lists.apache.org/thread/vs81xdhn3q777r7x9k7wd4dyl9kvoqn4
>> > > > > > > > >>
>> > > > > > > > >> On Tue, Nov 21, 2023 at 4:19 PM Matthias Pohl <matthias.p...@aiven.io>
>> > > > > > > > >> wrote:
>> > > > > > > > >>
>> > > > > > > > >>> That's a valid point. I updated the FLIP accordingly:
>> > > > > > > > >>>
>> > > > > > > > >>>> Currently, the secrets (e.g. for S3 access tokens) are maintained by
>> > > > > > > > >>>> certain PMC members with access to the corresponding configuration in the
>> > > > > > > > >>>> Azure CI project. This responsibility will be moved to Apache Infra. They
>> > > > > > > > >>>> are in charge of handling secrets in the Apache organization. As a
>> > > > > > > > >>>> consequence, updating secrets is becoming a bit more complicated. This can
>> > > > > > > > >>>> be still considered an improvement from a legal standpoint because the
>> > > > > > > > >>>> responsibility is transferred from an individual company (i.e. Ververica
>> > > > > > > > >>>> who's the maintainer of the Azure CI project) to the Apache Foundation.
>> > > > > > > > >>>
>> > > > > > > > >>> On Tue, Nov 21, 2023 at 3:37 PM Martijn Visser <martijnvis...@apache.org>
>> > > > > > > > >>> wrote:
>> > > > > > > > >>>
>> > > > > > > > >>>> Hi Matthias,
>> > > > > > > > >>>>
>> > > > > > > > >>>> Thanks for the write-up and for the efforts on this. I really hope
>> > > > > > > > >>>> that we can move away from Azure towards GHA for a better integration
>> > > > > > > > >>>> as well (directly seeing if a PR can be merged due to CI passing for
>> > > > > > > > >>>> example).
>> > > > > > > > >>>>
>> > > > > > > > >>>> The one thing I'm missing in the FLIP is how we would set up the
>> > > > > > > > >>>> secrets for the nightly runs (for the S3 tests, potential tests with
>> > > > > > > > >>>> external services etc). My guess is we need to provide the secret to
>> > > > > > > > >>>> ASF Infra and then we would be able to refer to them in a pipeline?
>> > > > > > > > >>>>
>> > > > > > > > >>>> Best regards,
>> > > > > > > > >>>>
>> > > > > > > > >>>> Martijn
>> > > > > > > > >>>>
>> > > > > > > > >>>> On Tue, Nov 21, 2023 at 3:05 PM Matthias Pohl
>> > > > > > > > >>>> <matthias.p...@aiven.io.invalid>  wrote:
>> > > > > > > > >>>>> I realized that I mixed up FLIP IDs. FLIP-395 is already reserved [1]. I
>> > > > > > > > >>>>> switched to FLIP-396 [2] for the sake of consistency. 8)
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> [1] https://lists.apache.org/thread/wjd3nbvg6nt93lb0sd52f0lzls6559tv
>> > > > > > > > >>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Migration+to+GitHub+Actions
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> On Tue, Nov 21, 2023 at 2:58 PM Matthias Pohl <matthias.p...@aiven.io>
>> > > > > > > > >>>>> wrote:
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>> Hi everyone,
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> The Flink community discussed migrating from Azure CI to GitHub Actions
>> > > > > > > > >>>>>> quite some time ago [1]. The efforts around that stalled due to limitations
>> > > > > > > > >>>>>> around self-hosted runner support from Apache Infra’s side. There were some
>> > > > > > > > >>>>>> recent developments on that topic. Apache Infra is experimenting with
>> > > > > > > > >>>>>> ephemeral runners now which might enable us to move ahead with GitHub
>> > > > > > > > >>>>>> Actions.
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> The goal is to join the trial phase for ephemeral runners and experiment
>> > > > > > > > >>>>>> with our CI workflows in terms of stability and performance. At the end we
>> > > > > > > > >>>>>> can decide whether we want to abandon Azure CI and move to GitHub Actions
>> > > > > > > > >>>>>> or stick to the former one.
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> Nico Weidner and Chesnay laid the groundwork on this topic in the past. I
>> > > > > > > > >>>>>> picked up the work they did and continued experimenting with it in my own
>> > > > > > > > >>>>>> fork XComp/flink [2] the past few weeks. The workflows are in a state where
>> > > > > > > > >>>>>> I think that we can start moving the relevant code into Flink’s repository.
>> > > > > > > > >>>>>> Example runs for the basic workflow [3] and the extended (nightly) workflow
>> > > > > > > > >>>>>> [4] are provided.
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> This will bring a few more changes to the Flink contributors. That is why
>> > > > > > > > >>>>>> I wanted to bring this discussion to the mailing list first. I did a write
>> > > > > > > > >>>>>> up on (hopefully) all related topics in FLIP-395 [5].
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> I’m looking forward to your feedback.
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> Matthias
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> [1] https://lists.apache.org/thread/vcyx2nx0mhklqwm827vgykv8pc54gg3k
>> > > > > > > > >>>>>> [2] https://github.com/XComp/flink/actions
>> > > > > > > > >>>>>> [3] https://github.com/XComp/flink/actions/runs/6926309782
>> > > > > > > > >>>>>> [4] https://github.com/XComp/flink/actions/runs/6927443941
>> > > > > > > > >>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Migration+to+GitHub+Actions
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
