+1, I'd love to see this as well.

In the past, the instability and long queue times of PR builds have been very
frustrating. I'm not 100% sure this is due to using self-hosted runners, since
a pool of 35 runners (to my mind) should be plenty. But something about the
queuing in that setup has never seemed quite right to me. Switching to public
runners for a while as an experiment would be a great way to see if it improves.

________________________________
From: Pankaj Koti <pankaj.k...@astronomer.io.INVALID>
Sent: Thursday, April 4, 2024 12:41:02 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling 
self-hosted runners for commiter PRs

+1 from me to this idea.

Sounds very reasonable to me.
At times, my experience has been better with public runners instead of
self-hosted runners :)

And as already mentioned in the discussion, I think having the ability to
apply the "use-self-hosted-runners" label at critical times would be nice
to have too.


On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <ja...@potiuk.com> wrote:

> Hello everyone,
>
> TL;DR With some recent changes in GitHub Actions and the fact that ASF has
> a lot of runners available donated for all the builds, I think we could
> experiment with disabling "self-hosted" runners for committer builds.
>
> The self-hosted runners of ours have been extremely helpful (and we should
> again thank Amazon and Astronomer for donating credits / money for those)
> when the GitHub public runners were far less powerful and we had fewer of
> them available for ASF projects. This saved us a LOT of trouble when there
> was contention between ASF projects.
>
> But as of recently both limitations have been largely removed:
>
> * ASF has 900 public runners donated by GitHub to all projects
> * Those public runners (as of January) now have 4 CPUs and 16 GB of memory
>   for open-source projects:
>
> https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>
>
> While they are not as powerful as our self-hosted runners, the parallelism
> we use brings those builds into not-that-bad shape compared to
> self-hosted runners. Typical times for the complete set of tests are now
> ~20 min on public runners and ~14 min on self-hosted ones.
>
> But this is not the only factor - I think committers experience "Job
> failed" on self-hosted runners generally much more often than
> non-committers (the stability of our solution is not the best, and we are
> also using cheaper spot instances). Plus, we limit the total number of
> self-hosted runners (35), so if several committers submit a few PRs while a
> canary build is running, the jobs will wait until runners are available.
>
> And of course it costs the credits/money of sponsors which we could use for
> other things.
>
> I have - as of recently - access to GitHub Actions metrics, and while ASF
> is keeping an eye on usage and has started limiting the number of parallel
> jobs that workflows in projects can run, it looks like even if all committer
> runs are added to the public runners, our usage will still be far below the
> limits and far lower than that of some other projects (which I will not
> name here). I have access to the metrics so I can monitor our usage and
> react.
>
> I think possibly - if we switch committers to "public" runners by default
> - the experience will not be much worse for them (and sometimes even better
> - because of stability/limited queue).
>
> I was planning this carefully - I made a number of refactors/changes to our
> workflows recently that make it way easier to manipulate the configuration
> and get various conditions applied to various jobs - so
> changing/experimenting with those settings should be - well - a breeze :).
> A few recent changes have proven that this workflow refactor was
> definitely worth the effort. I feel like I finally got control over it,
> where previously it was a bit like herding a pack of cats (which I
> brought to life by myself, but that's another story).
>
> I would like to propose running an experiment to see how it works if we
> switch committer PRs back to the public runners - leaving the self-hosted
> runners only for canary builds (which makes perfect sense because those
> builds run a full set of tests and we need as much speed and power there as
> we can get).
>
> This is pretty safe. We should be able to switch back very easily if we see
> problems. I will also monitor it and see if our usage is within the limits
> of the ASF. I can also add a feature so that committers are able to
> use self-hosted runners by applying the "use self-hosted runners" label to
> a PR.
>
> Running it for 2-3 weeks should be enough to gather experience from
> committers - whether things will seem better or worse for them - or maybe
> they won't really notice a big difference.
>
> Later we could consider some next steps - disabling the self-hosted runners
> for canary builds if we see that our usage is low and builds are fast
> enough, and eventually possibly removing the current self-hosted runners and
> switching to a better k8s-based infrastructure (which we are close to doing,
> but it is a bit difficult while the current self-hosted solution is so
> critical to keep running - like rebuilding the plane while it is flying).
> I'd love to do it gradually in the "change slowly and observe" mode -
> especially now that I have access to "proper" metrics.
>
> WDYT?
>
> J.
>
