Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

Hussein Awala Fri, 05 Apr 2024 04:17:07 -0700

Although 900 runners seem like a lot, they are shared among the Apache
organization's 2.2k repositories. Of course, only a few of them are active
(let's say 50), and some of them use an external CI tool for big jobs (e.g.
Kafka uses Jenkins, Hudi uses Azure Pipelines), but other very active
repositories rely entirely on GHA, for example Iceberg, Spark,
Superset, ...


I haven't found the ASF runners metrics dashboard to check the max
concurrency and the max queue time during peak hours, but I'm sure that
moving Airflow committers' CI jobs to public runners will put some pressure
on these runners, especially since these committers are the most active
contributors to Airflow. The 35 self-hosted runners (with 8 CPUs and 64
GB RAM) are used almost all the time, and the public runners have half as
many CPUs (4), so we can say that we will need around 70 ASF runners to run
the same jobs.
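
As a rough sanity check of the 70-runner estimate above, here is a minimal
sketch. It assumes CPU count is the only bottleneck, so it ignores the 64 GB
vs 16 GB memory gap, spot-instance failures and peak-hour bursts. The function
name is hypothetical; the numbers are the ones quoted in this thread.

```python
import math


def equivalent_runner_count(src_runners: int, src_cpus: int, dst_cpus: int) -> int:
    """Runners of the destination size needed to match the source fleet's total CPUs.

    Assumption: jobs scale with CPU count only (memory and I/O are ignored).
    """
    total_cpus = src_runners * src_cpus
    return math.ceil(total_cpus / dst_cpus)


# Figures taken from the thread.
SELF_HOSTED_RUNNERS = 35   # cap on Airflow self-hosted runners
SELF_HOSTED_CPUS = 8       # CPUs per self-hosted runner
PUBLIC_CPUS = 4            # CPUs per GitHub public runner for open source
ASF_PUBLIC_RUNNERS = 900   # public runners shared by all ASF projects

needed = equivalent_runner_count(SELF_HOSTED_RUNNERS, SELF_HOSTED_CPUS, PUBLIC_CPUS)
share = needed / ASF_PUBLIC_RUNNERS

print(needed)            # 70
print(f"{share:.1%}")    # 7.8%
```

Under these assumptions the worst case is a bit under 8% of the shared pool,
which is only an upper bound for the moments when all 35 self-hosted runners
are busy.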

There is no harm in testing and deciding after 2-3 weeks.

We also need to find a way to let the infra team help us solve the
connectivity problem with the ARC runners
<https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme>.

+1 for testing what you propose.

On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai <[email protected]>
wrote:

> +1 I like the idea.
> Looking forward to seeing the difference.
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis
> <[email protected]>
> wrote:
>
> > Interested in seeing the difference, +1
> >
> >
> >  - ferruzzi
> >
> >
> > ________________________________
> > From: Oliveira, Niko <[email protected]>
> > Sent: Thursday, April 4, 2024 2:00 PM
> > To: [email protected]
> > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > self-hosted runners for commiter PRs
> >
> > +1 I'd love to see this as well.
> >
> > In the past, stability and long queue times of PR builds have been very
> > frustrating. I'm not 100% sure this is due to using self hosted runners,
> > since 35 queue depth (to my mind) should be plenty. But something about
> > that setup has never seemed quite right to me with queuing. Switching to
> > public runners for a while to experiment would be great to see if it
> > improves.
> >
> > ________________________________
> > From: Pankaj Koti <[email protected]>
> > Sent: Thursday, April 4, 2024 12:41:02 PM
> > To: [email protected]
> > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > self-hosted runners for commiter PRs
> >
> > +1 from me to this idea.
> >
> > Sounds very reasonable to me.
> > At times, my experience has been better with public runners instead of
> > self-hosted runners :)
> >
> > And as already mentioned in the discussion, I think the ability to apply
> > the "use-self-hosted-runners" label at critical times would be nice to
> > have too.
> >
> >
> > On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <[email protected]> wrote:
> >
> > > Hello everyone,
> > >
> > > TL;DR With some recent changes in GitHub Actions and the fact that ASF
> > > has a lot of runners available, donated for all the builds, I think we
> > > could experiment with disabling "self-hosted" runners for committer
> > > builds.
> > >
> > > Our self-hosted runners have been extremely helpful (and we should
> > > again thank Amazon and Astronomer for donating credits / money for
> > > those) at a time when the GitHub public runners were far less powerful
> > > and fewer of them were available for ASF projects. This saved us a LOT
> > > of trouble when there was contention between ASF projects.
> > >
> > > But as of recently both limitations have been largely removed:
> > >
> > > * ASF has 900 public runners donated by GitHub to all projects
> > > * Those public runners (as of January) now have 4 CPUs and 16GB of
> > >   memory for open-source projects -
> > >   https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
> > >
> > >
> > > While they are not as powerful as our self-hosted runners, the
> > > parallelism we utilise brings those builds into not-that-bad shape
> > > compared to self-hosted runners. The typical difference for the
> > > complete set of tests is now ~20m on public runners vs. ~14m on
> > > self-hosted ones.
> > >
> > > But this is not the only factor - I think committers experience "Job
> > > failed" on self-hosted runners generally much more often than
> > > non-committers (the stability of our solution is not the best, and we
> > > are also using cheaper spot instances). Plus - we limit the total
> > > number of self-hosted runners (35) - so if several committers submit a
> > > few PRs and we have a canary build running, the jobs will wait until
> > > runners are available.
> > >
> > > And of course it costs the credits/money of sponsors which we could use
> > > for other things.
> > >
> > > I have - as of recently - access to GitHub Actions metrics - and while
> > > ASF is keeping an eye on usage and has started limiting the number of
> > > parallel jobs that workflows in projects can run, it looks like even if
> > > all committer runs are added to the public runners, our usage will
> > > still be far below the limits and far lower than that of some other
> > > projects (which I will not name here). I have access to the metrics so
> > > I can monitor our usage and react.
> > >
> > > I think possibly - if we switch committers to "public" runners by
> > > default - the experience will not be much worse for them (and sometimes
> > > even better - because of stability/limited queue).
> > >
> > > I was planning this carefully - I made a number of refactors/changes to
> > > our workflows recently that make it way easier to manipulate the
> > > configuration and get various conditions applied to various jobs - so
> > > changing/experimenting with those settings should be - well - a breeze
> > > :). A few recent changes have proven that this change and the workflow
> > > refactor were definitely worth the effort. I feel like I finally got
> > > control over it, where previously it was a bit like herding a pack of
> > > cats (which I brought to life myself, but that's another story).
> > >
> > > I would like to propose that we run an experiment and see how it works
> > > if we switch committer PRs back to the public runners - leaving the
> > > self-hosted runners only for canary builds (which makes perfect sense
> > > because those builds run a full set of tests and we need as much speed
> > > and power there as we can get).
> > >
> > > This is pretty safe. We should be able to switch back very easily if we
> > > see problems. I will also monitor it and see if our usage is within the
> > > limits of the ASF. I can also add a feature that lets committers use
> > > self-hosted runners by applying the "use self-hosted runners" label to
> > > a PR.
> > >
> > > Running it for 2-3 weeks should be enough to gather experience from
> > > committers - whether things will seem better or worse for them - or
> > > maybe they won't really notice a big difference.
> > >
> > > Later we could consider some next steps - disabling the self-hosted
> > > runners for canary builds if we see that our usage is low and builds
> > > are fast enough, eventually possibly removing the current self-hosted
> > > runners and switching to a better k8s-based infrastructure (which we
> > > are close to doing, but it is a bit difficult while the current
> > > self-hosted solution is so critical to keep running - like rebuilding
> > > the plane while it is flying). I'd love to do it gradually in the
> > > "change slowly and observe" mode - especially now that I have access to
> > > "proper" metrics.
> > >
> > > WDYT?
> > >
> > > J.
> > >
> >
>
