Re: Add /ready-to-test bot to airflow CI

Jarek Potiuk Sat, 24 Oct 2020 12:14:28 -0700

Some short update - seems like we can get 50% 60% saving in job usage by
the "unapproved PRs". We are progressing with implementation :D.


On Sat, Oct 24, 2020 at 10:55 AM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> FYI. We found out with Tobiasz, that it will be a bit better and if we add
> "*Approved*" label in the PR that the workflow will automatically set
> when the issue gets approved.
>
> This way we have state of the PR approval (we know when it changes) and
> know that we should re-run last "small matrix" successful run when the
> label changes to "Approved".
>
> This will also be an additional indication to committers in case of
> queues and delays we se. It might be that the "small" matrix run is already
> successful, the PR gets approved but the "full matrix build" is delayed due
> to queuing. Such PR will have green "merge" button and might get merged by
> mistake - but it will not have the "Approved" label yet. Setting the label
> and re-running the build will happen at the same time.
>
> But I start thinking this label should be named differently  - how about 
> "*Ready
> to merge*" maybe? Or maybe other ideas?
>
> J.
>
>
>
>
> On Sat, Oct 24, 2020 at 1:10 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> +1000!
>>
>> via Newton Mail
>> <https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.51&pv=10.15.6&source=email_footer_2>
>>
>> On Fri, Oct 23, 2020 at 8:22 AM, Jarek Potiuk <jarek.pot...@polidea.com>
>> wrote:
>>
>> Thanks Tobiasz :). fantastic.
>>
>> I prepared a very short and simple design doc
>> https://docs.google.com/document/d/16rwyCfyDpKWN-DrLYbhjU0B1D58T1RFYan5ltmw4DQg/edit#
>> where we can collaborate.
>>
>> I also added you as collaborator to
>> https://github.com/potiuk/get-workflow-origin that we already use, and I
>> think you can update the "get workflow origin" plugin to include status of
>> the PR in the output of the action (ore maybe we find out that we already
>> have what we need in GitHub context).
>>
>> I will take a look at finding out how/if we can trigger the "full build"
>> automatically when approval status changes from "Not approved" to
>> "Approved".
>>
>> J.
>>
>>
>>
>>
>> On Thu, Oct 22, 2020 at 7:20 PM Tobiasz Kędzierski <
>> tobiasz.kedzier...@polidea.com> wrote:
>>
>>> Hi Jarek.
>>>
>>> sounds good to me. I am happy to help you as much as I can with it.
>>>
>>> BR
>>> Tobiasz
>>>
>>> On Thu, Oct 22, 2020 at 9:06 AM Jarek Potiuk <jarek.pot...@polidea.com>
>>> wrote:
>>>
>>>> *TLDR; I thought about it a bit and I have a proposal on how to solve
>>>> it even better - one that can be implemented over the weekend (I volunteer
>>>> :) ) and that can be very easily shared and adopted by the other ASF
>>>> projects so that we all collectively decrease the strain on Github Actions.
>>>> *
>>>>
>>>> This is in parallel to our efforts on having self-hosted workers of
>>>> course, but I think it will be needed anyway. Let me put it in a bit of
>>>> context
>>>>
>>>> *Problem statement: *
>>>>
>>>> * the root cause of the problem is that we are competing with many
>>>> other projects of ASF for the 180 jobs. I have started the discussion in
>>>> bui...@apache.org about this:
>>>> https://lists.apache.org/thread.html/r1708881f52adbdae722afb8fea16b23325b739b254b60890e72375e1%40%3Cbuilds.apache.org%3E
>>>> and it's clear all ASF projects using GA have the same problem and compete
>>>> against each other for the jobs.
>>>> * if we decrease the strain on our side, this is not solving the
>>>> problem long term. We keep on doing it already, and we already decrease a
>>>> lot of strain, but other projects from ASF increase their strain in the
>>>> meantime (Apache Beam, Skywalking, and few other projects are becoming
>>>> heavy GitHub Actions users).
>>>> * in all the projects that I looked at, we have the same root cause.
>>>> Matrix strategy of builds causes enormous strain on Github Actions if the
>>>> whole matrix is run for all PRs. We are going to make it works sustainably
>>>> only if we come up with an easy solution, that can be applied to all those
>>>> projects.
>>>> * I think the comment-based PR triggering process is complex and
>>>> cumbersome to follow. It puts a LOT more effort on the committers because
>>>> they not only have to review and comment on the PRs but also make decisions
>>>> that those PRs are ready for "full build". This is a lot of unnecessary
>>>> effort and complicated process that many of the ASF projects will not like
>>>> to adopt
>>>>
>>>> *Proposed Solution:*
>>>>
>>>> *Add an easy way to limit the matrix strategy to one "default" combo
>>>> for PRs THAT HAVE NOT BEEN APPROVED BY COMMITTERS YET*
>>>>
>>>> This can be easily done with Github Actions workflows - no need to
>>>> write a bot for this.
>>>>
>>>> *Some details:*
>>>>
>>>> * Custom GitHub Action (generic) that checks if the PRs are approved by
>>>> Committers (and no "disapprovals"). The action would produce an output ->
>>>> "Approved", "Not approved". The output could be used to determine the
>>>> matrix strategy scope (in our case we already have support for dynamic
>>>> matrix strategy that I added a few weeks ago - so it's just a matter of
>>>> wiring the output in).
>>>> * Very small workflow with the same GithubAction run on
>>>> "pull_request_target" event. That workflow would effective "observe" the
>>>> PR, and when the status changes from "not approved" to "approved", it
>>>> triggers a PR build (with the "full matrix strategy" this time because the
>>>> PR will be already approved). This seems to be entirely possible. This
>>>> "pull_request_target" workflow - similarly to "workflow_run" runs with a
>>>> "write" access token and uses a "main branch" workflow version and it could
>>>> easily trigger a rerun of the last PR build in such case.
>>>>
>>>> *Benefits:*
>>>>
>>>> - I think I could write such an action over the coming weekend (happy
>>>> to collaborate with anyone on that). I will first search if someone has
>>>> done something similar of course because maybe it can be done faster this
>>>> way, but I am quite confident after writing my
>>>> https://github.com/potiuk/cancel-workflow-runs which we already use to
>>>> limit the strain that it is doable in a day/two
>>>> - no need to change the process we have - we continue working as we did
>>>> and simply "approved" PRs will be the full matrix strategy ones but the
>>>> "not-approved-ones" will run a limited version of the checks.
>>>> - no way to accidentally submit a breaking PR - when the committer
>>>> approves the PR that has not been approved before, the PR build will be
>>>> re-run with the "full matrix strategy" and not mergeable until it finishes
>>>> - last-but-not-least: we can propose (and help) other ASF projects to
>>>> use the action in their own GitHub Actions. It will not be changing
>>>> anyone's process - which makes it super-easy to adopt and I can even turn
>>>> it into a "recommended solution" by Apache Infra - similarly as Airflow's
>>>> CI architecture is a recommended solution already for the integration of GA
>>>> with DockerHub
>>>> https://cwiki.apache.org/confluence/display/INFRA/Github+Actions+to+DockerHub
>>>>
>>>> WDYT?
>>>>
>>>> J
>>>>
>>>>
>>>>
>>>> On Fri, Oct 2, 2020 at 7:59 PM Jarek Potiuk <jarek.pot...@polidea.com>
>>>> wrote:
>>>>
>>>>> I think it's a good idea, but I'd augment it a bit. A better option
>>>>> will be to run all test types but
>>>>> for only one chosen combination of "Python + DB type + DB version".
>>>>>
>>>>> I often don't even look at the PR until the tests pass and this would
>>>>> be much better this way.
>>>>> And often people have slower/small machines so they submit the PR to
>>>>> see if they have not
>>>>> broken any other tests. This is much, much easier than doing it
>>>>> locally - because then in one
>>>>> "fire&forget" you can run static, doc, unit tests, integration, and
>>>>> Kubernetes ones,. And it's a valid
>>>>> thing.
>>>>>
>>>>> Also, we have to make sure that such PR does not become "Green" before
>>>>> all the tests are run.
>>>>> This might be rather problematic as Github does not yet have a "
>>>>> manual" Approval step in
>>>>> Github Actions (it's coming in Q4:
>>>>> https://github.com/github/roadmap/issues/99).
>>>>>
>>>>> We have many tests and already we hit a bug a few times, where not all
>>>>> tests have yet started
>>>>> and we've merged such PR. I can imagine it will happen more and more
>>>>> often if all PRs will
>>>>> only run a subset of tests. It will be very easy to make that mistake
>>>>> because even if we run a subset of
>>>>> those tests, we have so many jobs that you cannot see them all in the
>>>>> GitHub UI.
>>>>>
>>>>> So we will have to have a check that fails the PR but marks it somehow
>>>>> as "Ready for review" for example adding
>>>>> a label "Ready for review" when the subset of tests succeeds.
>>>>>
>>>>> Also, this might not be needed (or less important) if we implement:
>>>>> https://github.com/apache/airflow/issues/10507 "Selective Tests"
>>>>> for which I have an open PR. They will give much bigger improvements -
>>>>> because, in the vast majority of cases, the tests will take
>>>>> very little time - giving feedback
>>>>> about relevant tests in a few minutes rather than half-an-hour. We can
>>>>> also combine those two.
>>>>>
>>>>> It seems that I managed to finish some of the stuff that I thought
>>>>> will take more time, so I might come back to it next week
>>>>> if it goes as well as I planned.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 2, 2020 at 3:57 AM Daniel Imberman <
>>>>> daniel.imber...@gmail.com> wrote:
>>>>>
>>>>>> I’m not too worried about that. I think people would learn pretty
>>>>>> quickly. It hasn’t been an issue for the kubernetes community so I can’t
>>>>>> imagine it being an issue for us. End-of-day, we only have a limited 
>>>>>> amount
>>>>>> of compute power and this will increase the speed we merge the PR’s that
>>>>>> have passed basic code quality checks.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 1, 2020 at 2:19 PM, Tomasz Urbaszek <turbas...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> I think I can agree. Especially with flaky tests, some contributors
>>>>>> may be confused that some of the tests don't work on CI but work 
>>>>>> locally...
>>>>>>
>>>>>> Checking the code quality is good first step. Once there's a review
>>>>>> we can start tests on CI.
>>>>>>
>>>>>> On the other hand, I can see people asking for starting the tests or
>>>>>> being even more confused why some PRs have more CI builds than others...
>>>>>>
>>>>>> Cheers,
>>>>>> Tomek
>>>>>>
>>>>>> On Thu, Oct 1, 2020 at 10:29 PM Daniel Imberman <
>>>>>> daniel.imber...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> With the recent uptick in airflow contribution and pull requests, I
>>>>>>> have a proposal that I hope will ensure that we do not find ourselves 
>>>>>>> in a
>>>>>>> CI backlog hell. I noticed that on the Kubernetes project, pull 
>>>>>>> requests do
>>>>>>> not run integration test until a committer submits a "ready to test"
>>>>>>> command to the CI bot. This step can prevent draft PRs or un-reviewed 
>>>>>>> PRs
>>>>>>> from taking github CI resources. It is worth noting that with breeze's
>>>>>>> docker based testing system, users have the exact same testing 
>>>>>>> capabilities
>>>>>>> locally as they would on our CI.
>>>>>>>
>>>>>>> I propose that we allow unverified PR's to run basic and static
>>>>>>> tests, but not perform the full test suite or integration test without
>>>>>>> first being reviewed.
>>>>>>>
>>>>>>> What does everyone think?
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Jarek Potiuk
>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>>
>>>>> M: +48 660 796 129 <+48660796129>
>>>>> [image: Polidea] <https://www.polidea.com/>
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Jarek Potiuk
>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>
>>>> M: +48 660 796 129 <+48660796129>
>>>> [image: Polidea] <https://www.polidea.com/>
>>>>
>>>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> [image: Polidea] <https://www.polidea.com/>
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Add /ready-to-test bot to airflow CI

Reply via email to