+1 from me to this idea.

Sounds very reasonable to me.
At times, my experience has been better with public runners than with
self-hosted runners :)

And as already mentioned in the discussion, I think the ability to apply
the "use-self-hosted-runners" label at critical times would be nice to
have too.
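
To make the rule concrete, here is a rough sketch of how I read the proposed
selection logic. It is only an illustration: `Build`, `choose_runner` and the
exact label string are hypothetical names, not the actual workflow code.

```python
from dataclasses import dataclass, field

# Label text as quoted in the proposal; the real label name may differ.
SELF_HOSTED_LABEL = "use self-hosted runners"


@dataclass
class Build:
    """Hypothetical description of a CI run."""
    is_canary: bool = False
    is_committer: bool = False
    labels: set = field(default_factory=set)


def choose_runner(build: Build) -> str:
    """Return "self-hosted" or "public" for a build, per the proposed experiment."""
    # Canary builds run the full test suite, so they keep the faster runners.
    if build.is_canary:
        return "self-hosted"
    # Committers can opt back in to self-hosted runners with the label.
    if build.is_committer and SELF_HOSTED_LABEL in build.labels:
        return "self-hosted"
    # Everything else, including committer PRs by default, uses public runners.
    return "public"
```

The point of the label branch is that the default stays "public" and the
opt-in is explicit and per PR, so switching back is cheap.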


On Fri, 5 Apr 2024, 00:50 Jarek Potiuk, <[email protected]> wrote:

> Hello everyone,
>
> TL;DR With some recent changes in GitHub Actions and the fact that ASF has
> a lot of runners available donated for all the builds, I think we could
> experiment with disabling "self-hosted" runners for committer builds.
>
> The self-hosted runners of ours have been extremely helpful (and we should
> again thank Amazon and Astronomer for donating credits / money for those) -
> back when the GitHub public runners were far less powerful and fewer of
> them were available for ASF projects. This saved us a LOT of trouble when
> there was contention between ASF projects.
>
> But as of recently both limitations have been largely removed:
>
> * ASF has 900 public runners donated by GitHub to all projects
> * Those public runners (as of January) now have 4 CPUs and 16GB of memory
> for open-source projects -
>
> https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
>
>
> While they are not as powerful as our self-hosted runners, the parallelism
> we utilise keeps those builds in not-that-bad shape compared to
> self-hosted runners. Typical times for the complete set of tests are now
> ~20m on public runners and ~14m on self-hosted ones.
>
> But this is not the only factor - I think committers experience the "Job
> failed" for self-hosted runners generally much more often than
> non-committers (the stability of our solution is not the best, and we are
> using cheaper spot instances). Plus - we limit the total number of
> self-hosted runners (35) - so if several committers submit a few PRs and we
> have a canary build running, the jobs will wait until runners are available.
>
> And of course it costs the credits/money of sponsors which we could use for
> other things.
>
> I have - as of recently - access to GitHub Actions metrics - and while ASF
> is keeping an eye on usage and has started limiting the number of parallel
> jobs that workflows in projects can run, it looks like even if all committer
> runs are added to the public runners, we will still cause far lower usage
> than the limits allow and far lower than some other projects (which I will
> not name here). I have access to the metrics so I can monitor our usage and
> react.
>
> I think possibly - if we switch committers to "public" runners by default
> - the experience will not be much worse for them (and sometimes even better
> - because of stability/limited queue).
>
> I was planning this carefully - I made a number of refactors/changes to our
> workflows recently that make it way easier to manipulate the configuration
> and get various conditions applied to various jobs - so
> changing/experimenting with those settings should be - well - a breeze :).
> A few recent changes have proven that the workflow refactor was
> definitely worth the effort. I feel like I finally got control over it
> where previously it was a bit like herding a pack of cats (which I
> brought to life by myself, but that's another story).
>
> I would like to propose running an experiment to see how it works if we
> switch committer PRs back to the public runners - leaving the self-hosted
> runners only for canary builds (which makes perfect sense because those
> builds run a full set of tests and we need as much speed and power there as
> we can get).
>
> This is pretty safe. We should be able to switch back very easily if we see
> problems. I will also monitor it and see if our usage is within the limits
> of the ASF. I can also add a feature so that committers are able to
> use self-hosted runners by applying the "use self-hosted runners" label to
> a PR.
>
> Running it for 2-3 weeks should be enough to gather experience from
> committers - whether things will seem better or worse for them - or maybe
> they won't really notice a big difference.
>
> Later we could consider some next steps - disabling the self-hosted runners
> for canary builds if we see that our usage is low and builds are fast
> enough, eventually possibly removing the current self-hosted runners and
> switching to a better k8s-based infrastructure (which we are close to
> doing, but it is a bit difficult while the current self-hosted solution is
> so critical to keep running - like rebuilding the plane while it is flying).
> I'd love to do it gradually in the "change slowly and observe" mode -
> especially now that I have access to "proper" metrics.
>
> WDYT?
>
> J.
>
