Hi Jarek,

Thanks for the detailed context. I'm looking forward to the new solution
and infra.

Thanks,

Ping


On Sat, May 14, 2022 at 7:50 AM Jarek Potiuk <[email protected]> wrote:

> Yeah. Would be great to figure it out. I also noticed quite a number of
> those, and they are related to our GitHub Runner infrastructure. For some
> reason our runners are killed and evicted more often than before, so we
> will likely need to take a closer look at it. Analysing this is a bit
> time-consuming, so it tends to wait until it becomes REALLY annoying.
> Usually Ash and I looked at it when we had a bit of spare time, but
> maybe someone from the committers team would like to take a look at it from
> the "devopsy" point of view?
>
> I think it would be great if someone looked at it with a fresh eye, as
> having just me and Ash looking at it when we have time to spare is not
> nearly good enough: we are two, but still "just two" Points of Failure.
> The current solution is kinda complex-ish. It uses a combination of a
> GitHub Runner modified by Ash, AWS-specific infrastructure, DynamoDB to
> keep shared authentication information, Auto Scaling groups, a webhook
> from GitHub Actions triggering the scaling in/out, and Spot Instances
> started as needed (which can get evicted at any time but are 8x cheaper
> to run), so some fine-tuning might be needed (preceded by an analysis of
> the root causes of the failures). So it requires quite an open mind about
> the tools and technologies used, as well as some cloud
> management/monitoring/infrastructure devopsing experience.
>
> Eventually we might want to migrate to a K8S-managed infrastructure, as
> the Apache Beam team together with ASF Infra (with some of our help and
> guidance) is working on building a solution that is supposed to be more
> portable and easier. So, similarly to the Python Breeze and CI Actions
> rewrite (which we are finishing), one of the goals for the infra should
> be that we have more people who are involved and know how to fix and run
> things, and that we make it more "standard".
>
> Any volunteers to take a look at the current setup are most welcome. I
> think we need a committer, due to the sensitivity of the infrastructure
> access.
>
> Anyone? Who would like to help here?
>
> J.
>
>
> On Sat, May 14, 2022 at 2:09 AM Ping Zhang <[email protected]> wrote:
>
>> Hi friends,
>>
>> Recently, I noticed my PRs got lots of errors like this:
>>
>> Some checks were not successful
>> 58 successful, 4 skipped, and 1 cancelled checks
>>
>> Tests / Helm Chart Executor Upgrade (pull_request) Cancelled after 104m
>> — Helm Chart Executor Upgrade
>>
>> For example https://github.com/apache/airflow/pull/23655 and
>> https://github.com/apache/airflow/pull/23684, and I had to force push
>> many times.
>>
>> I am wondering what causes this and how I can avoid this error.
>>
>> Thanks,
>>
>> Ping
>>
>
