Subscribe to builds@a.o and users@infra.a.o. There is also an #asfinfra Slack channel you can follow, and you can take part in the monthly Infra roundtables: https://infra.apache.org/roundtable.html
J.

On Tue, Apr 18, 2023 at 5:22 AM Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:

> Thanks, Gavin,
>
> How can I be informed or follow the status of this initiative?
>
> Sai
>
> On Fri, 14 Apr 2023 at 08:02, Gavin McDonald <gmcdon...@apache.org> wrote:
>
> > Hi All,
> >
> > Infra is working on self-hosted GitHub runners, provided by Infra to
> > projects and hosted in Azure, and we hope to provide a few varieties of
> > arch/CPU/memory.
> >
> > Will keep this list updated as it progresses.
> >
> > Gav...
> >
> > On Fri, Apr 14, 2023 at 9:23 AM Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:
> >
> > > Thanks for all the feedback. At this time there are no sponsors for
> > > Geode, so we cannot have self-hosted runners. I have already split the
> > > job running tests into multiple jobs by Gradle module, and even then
> > > one particular Gradle module takes more than 6 hours. Will try to
> > > parallelize and see if that works.
> > >
> > > Sai
> > >
> > > On Thu, 13 Apr 2023 at 23:54, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > In many cases it can be done by choosing a bigger machine with more
> > > > CPUs and parallelising, as others mentioned. This is easy if your
> > > > tests are pure unit tests and you can just add a flag like `-n auto`
> > > > from pytest-xdist (a pytest extension that runs your tests in
> > > > parallel across as many CPUs as you have). However, there are cases
> > > > where the limitation is I/O, or your tests simply cannot run in
> > > > parallel because a lot of them rely on shared resources (say, a
> > > > database). But even then you can attempt to do something about it.
> > > >
> > > > In Airflow we solved those problems by custom-parallelising our
> > > > jobs, choosing huge self-hosted runners, and running everything
> > > > in-memory.
> > > > Even though our tests could not be parallelised "per test" (mostly
> > > > for historical reasons, a lot of our tests are not pure unit tests
> > > > and depend on a database), we split the tests into "test types" (8
> > > > of them, but soon more) and run them in parallel - with as many
> > > > parallel types running as we have CPUs. Each test type uses its own
> > > > database instance - this is all orchestrated with docker-compose.
> > > >
> > > > To avoid the inevitable I/O contention with this setup, it all runs
> > > > on a huge tmpfs storage (50 GB or so) - including a Docker instance
> > > > running the databases with tmpfs backing storage, so those databases
> > > > are backed by an in-memory filesystem and are thus super-stable and
> > > > super-fast. Thanks to that, our thousands of tests run really fast
> > > > even though some of them are not pure unit tests. We run it all on a
> > > > large self-hosted runner with 8 CPUs and 64 GB RAM, and thanks to
> > > > that our complete test suite runs in 15 minutes instead of 1.5 hours.
> > > >
> > > > Such a setup achieves two optimisation goals: cheap and fast. Yes,
> > > > we need much more costly, bigger machines, but we need them for a
> > > > shorter time and we use them at 80%-90% utilisation, which is pretty
> > > > high for such cases (we keep optimising it regularly, and I try to
> > > > keep pushing it closer to 100%). As a result, if your hosted runners
> > > > in the cloud are on-demand/ephemeral (usually an 80%-90% cost
> > > > reduction) and you have a fast setup, you can bring them up for 10
> > > > minutes and shut them down when finished, so they cost a fraction of
> > > > what small machines running all the time cost, especially if your
> > > > project has periods when no PRs are running.
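The tmpfs-backed database idea described above can be sketched as a docker-compose fragment. The service name and image version here are illustrative, not Airflow's actual configuration:

```yaml
# Sketch: a throwaway Postgres whose data directory lives on tmpfs,
# so all database I/O happens in RAM (names/versions are illustrative).
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: test
    # Mount the data directory as tmpfs: nothing touches disk, and the
    # database starts fresh on every run - ideal for ephemeral CI.
    tmpfs:
      - /var/lib/postgresql/data
```

Running one such service per parallel "test type" gives each group its own isolated, in-memory database.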
> > > > Also, optimising the speed of tests is even more important than
> > > > optimising their cost, because getting feedback faster is good for
> > > > your contributors - but with this setup we can eat our cake and have
> > > > it too: the cost is low and the tests are fast.
> > > >
> > > > J.
> > > >
> > > > On Fri, Apr 14, 2023 at 1:37 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > > >
> > > > > Just dropping a comment. Apache Spark solved it by splitting the
> > > > > job.
> > > > >
> > > > > As for the number of parallel jobs, Apache Spark added custom
> > > > > logic to the PR builder to link to the GitHub workflow runs in
> > > > > forked repositories - so we reuse the GitHub resources in the PR
> > > > > author's forked repository instead of the ones allocated to the
> > > > > ASF itself.
> > > > >
> > > > > On Fri, Apr 14, 2023 at 8:00 AM sebb <seb...@gmail.com> wrote:
> > > > >
> > > > > > On Thu, 13 Apr 2023 at 20:58, Martin Grigorov <mgrigo...@apache.org> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, Apr 13, 2023 at 7:17 PM Sai Boorlagadda <sai_boorlaga...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hey All! I am part of the Apache Geode project. We have been
> > > > > > > > migrating our pipelines to GitHub Actions and hit a
> > > > > > > > roadblock: the maximum job execution time on
> > > > > > > > non-self-hosted GitHub workers is set to a hard limit
> > > > > > > > <https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration>
> > > > > > > > of 6 hours, and one of our jobs
> > > > > > > > <https://github.com/apache/geode/actions/runs/4639012912> is
> > > > > > > > taking more than 6 hours. Are there any pointers on how
> > > > > > > > someone solved this?
> > > > > > > > Or does GitHub provide any increases for Apache Foundation
> > > > > > > > projects?
> > > > > > >
> > > > > > > The only way to "increase the resources" is to use a
> > > > > > > self-hosted runner. But instead of looking at how to use more
> > > > > > > of the free pool, you should try to optimize your build to
> > > > > > > need less! These free resources are shared with all other
> > > > > > > Apache projects, so when your project uses more, another
> > > > > > > project has to wait.
> > > > > > >
> > > > > > > You can start by using a parallel build -
> > > > > > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L254
> > > > > > > Also tune the maxWorkers -
> > > > > > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L256
> > > > > > > The Linux VMs have 2 vCPUs. You can try the macos-latest VM;
> > > > > > > it has 3 vCPUs.
> > > > > > > Another option is to split this job into a few smaller ones.
> > > > > > > Each job has its own 6 hours.
> > > > > >
> > > > > > Also, maybe run some of the jobs manually, rather than on every
> > > > > > commit. At present there are two instances running at the same
> > > > > > time from subsequent commits. At least one of these is a waste
> > > > > > of resources.
> > > > > >
> > > > > > > Good luck!
> > > > > > >
> > > > > > > Martin
>
> --
> *Gavin McDonald*
> Systems Administrator
> ASF Infrastructure Team
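Two of the suggestions in the thread - splitting tests into separate jobs, each with its own 6-hour limit, and not letting runs from superseded commits waste shared resources - can be sketched in one GitHub Actions workflow fragment. The module names and Gradle invocation here are hypothetical, not Geode's actual setup:

```yaml
# Sketch: matrix-split test jobs plus cancellation of superseded runs
# (module names and commands are illustrative, not Geode's real config).
name: tests
on: [push, pull_request]

# Cancel a still-running workflow from a previous commit on the same ref,
# so only the latest commit consumes shared runner capacity.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Each matrix entry becomes its own job with its own 6-hour limit.
        module: [module-a, module-b, module-c]
    steps:
      - uses: actions/checkout@v3
      - name: Run one module's tests
        run: ./gradlew :${{ matrix.module }}:test --parallel
```

The matrix split is the same idea Spark and Geode apply; `cancel-in-progress` addresses the duplicate concurrent runs sebb points out above.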