Some short update - seems like we can get 50% 60% saving in job usage by the "unapproved PRs". We are progressing with implementation :D.
On Sat, Oct 24, 2020 at 10:55 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote: > FYI. We found out with Tobiasz, that it will be a bit better and if we add > "*Approved*" label in the PR that the workflow will automatically set > when the issue gets approved. > > This way we have state of the PR approval (we know when it changes) and > know that we should re-run last "small matrix" successful run when the > label changes to "Approved". > > This will also be an additional indication to committers in case of > queues and delays we se. It might be that the "small" matrix run is already > successful, the PR gets approved but the "full matrix build" is delayed due > to queuing. Such PR will have green "merge" button and might get merged by > mistake - but it will not have the "Approved" label yet. Setting the label > and re-running the build will happen at the same time. > > But I start thinking this label should be named differently - how about > "*Ready > to merge*" maybe? Or maybe other ideas? > > J. > > > > > On Sat, Oct 24, 2020 at 1:10 AM Daniel Imberman <daniel.imber...@gmail.com> > wrote: > >> +1000! >> >> via Newton Mail >> <https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.51&pv=10.15.6&source=email_footer_2> >> >> On Fri, Oct 23, 2020 at 8:22 AM, Jarek Potiuk <jarek.pot...@polidea.com> >> wrote: >> >> Thanks Tobiasz :). fantastic. >> >> I prepared a very short and simple design doc >> https://docs.google.com/document/d/16rwyCfyDpKWN-DrLYbhjU0B1D58T1RFYan5ltmw4DQg/edit# >> where we can collaborate. >> >> I also added you as collaborator to >> https://github.com/potiuk/get-workflow-origin that we already use, and I >> think you can update the "get workflow origin" plugin to include status of >> the PR in the output of the action (ore maybe we find out that we already >> have what we need in GitHub context). >> >> I will take a look at finding out how/if we can trigger the "full build" >> automatically when approval status changes from "Not approved" to >> "Approved". >> >> J. >> >> >> >> >> On Thu, Oct 22, 2020 at 7:20 PM Tobiasz Kędzierski < >> tobiasz.kedzier...@polidea.com> wrote: >> >>> Hi Jarek. >>> >>> sounds good to me. I am happy to help you as much as I can with it. >>> >>> BR >>> Tobiasz >>> >>> On Thu, Oct 22, 2020 at 9:06 AM Jarek Potiuk <jarek.pot...@polidea.com> >>> wrote: >>> >>>> *TLDR; I thought about it a bit and I have a proposal on how to solve >>>> it even better - one that can be implemented over the weekend (I volunteer >>>> :) ) and that can be very easily shared and adopted by the other ASF >>>> projects so that we all collectively decrease the strain on Github Actions. >>>> * >>>> >>>> This is in parallel to our efforts on having self-hosted workers of >>>> course, but I think it will be needed anyway. Let me put it in a bit of >>>> context >>>> >>>> *Problem statement: * >>>> >>>> * the root cause of the problem is that we are competing with many >>>> other projects of ASF for the 180 jobs. I have started the discussion in >>>> bui...@apache.org about this: >>>> https://lists.apache.org/thread.html/r1708881f52adbdae722afb8fea16b23325b739b254b60890e72375e1%40%3Cbuilds.apache.org%3E >>>> and it's clear all ASF projects using GA have the same problem and compete >>>> against each other for the jobs. >>>> * if we decrease the strain on our side, this is not solving the >>>> problem long term. We keep on doing it already, and we already decrease a >>>> lot of strain, but other projects from ASF increase their strain in the >>>> meantime (Apache Beam, Skywalking, and few other projects are becoming >>>> heavy GitHub Actions users). >>>> * in all the projects that I looked at, we have the same root cause. >>>> Matrix strategy of builds causes enormous strain on Github Actions if the >>>> whole matrix is run for all PRs. We are going to make it works sustainably >>>> only if we come up with an easy solution, that can be applied to all those >>>> projects. >>>> * I think the comment-based PR triggering process is complex and >>>> cumbersome to follow. It puts a LOT more effort on the committers because >>>> they not only have to review and comment on the PRs but also make decisions >>>> that those PRs are ready for "full build". This is a lot of unnecessary >>>> effort and complicated process that many of the ASF projects will not like >>>> to adopt >>>> >>>> *Proposed Solution:* >>>> >>>> *Add an easy way to limit the matrix strategy to one "default" combo >>>> for PRs THAT HAVE NOT BEEN APPROVED BY COMMITTERS YET* >>>> >>>> This can be easily done with Github Actions workflows - no need to >>>> write a bot for this. >>>> >>>> *Some details:* >>>> >>>> * Custom GitHub Action (generic) that checks if the PRs are approved by >>>> Committers (and no "disapprovals"). The action would produce an output -> >>>> "Approved", "Not approved". The output could be used to determine the >>>> matrix strategy scope (in our case we already have support for dynamic >>>> matrix strategy that I added a few weeks ago - so it's just a matter of >>>> wiring the output in). >>>> * Very small workflow with the same GithubAction run on >>>> "pull_request_target" event. That workflow would effective "observe" the >>>> PR, and when the status changes from "not approved" to "approved", it >>>> triggers a PR build (with the "full matrix strategy" this time because the >>>> PR will be already approved). This seems to be entirely possible. This >>>> "pull_request_target" workflow - similarly to "workflow_run" runs with a >>>> "write" access token and uses a "main branch" workflow version and it could >>>> easily trigger a rerun of the last PR build in such case. >>>> >>>> *Benefits:* >>>> >>>> - I think I could write such an action over the coming weekend (happy >>>> to collaborate with anyone on that). I will first search if someone has >>>> done something similar of course because maybe it can be done faster this >>>> way, but I am quite confident after writing my >>>> https://github.com/potiuk/cancel-workflow-runs which we already use to >>>> limit the strain that it is doable in a day/two >>>> - no need to change the process we have - we continue working as we did >>>> and simply "approved" PRs will be the full matrix strategy ones but the >>>> "not-approved-ones" will run a limited version of the checks. >>>> - no way to accidentally submit a breaking PR - when the committer >>>> approves the PR that has not been approved before, the PR build will be >>>> re-run with the "full matrix strategy" and not mergeable until it finishes >>>> - last-but-not-least: we can propose (and help) other ASF projects to >>>> use the action in their own GitHub Actions. It will not be changing >>>> anyone's process - which makes it super-easy to adopt and I can even turn >>>> it into a "recommended solution" by Apache Infra - similarly as Airflow's >>>> CI architecture is a recommended solution already for the integration of GA >>>> with DockerHub >>>> https://cwiki.apache.org/confluence/display/INFRA/Github+Actions+to+DockerHub >>>> >>>> WDYT? >>>> >>>> J >>>> >>>> >>>> >>>> On Fri, Oct 2, 2020 at 7:59 PM Jarek Potiuk <jarek.pot...@polidea.com> >>>> wrote: >>>> >>>>> I think it's a good idea, but I'd augment it a bit. A better option >>>>> will be to run all test types but >>>>> for only one chosen combination of "Python + DB type + DB version". >>>>> >>>>> I often don't even look at the PR until the tests pass and this would >>>>> be much better this way. >>>>> And often people have slower/small machines so they submit the PR to >>>>> see if they have not >>>>> broken any other tests. This is much, much easier than doing it >>>>> locally - because then in one >>>>> "fire&forget" you can run static, doc, unit tests, integration, and >>>>> Kubernetes ones,. And it's a valid >>>>> thing. >>>>> >>>>> Also, we have to make sure that such PR does not become "Green" before >>>>> all the tests are run. >>>>> This might be rather problematic as Github does not yet have a " >>>>> manual" Approval step in >>>>> Github Actions (it's coming in Q4: >>>>> https://github.com/github/roadmap/issues/99). >>>>> >>>>> We have many tests and already we hit a bug a few times, where not all >>>>> tests have yet started >>>>> and we've merged such PR. I can imagine it will happen more and more >>>>> often if all PRs will >>>>> only run a subset of tests. It will be very easy to make that mistake >>>>> because even if we run a subset of >>>>> those tests, we have so many jobs that you cannot see them all in the >>>>> GitHub UI. >>>>> >>>>> So we will have to have a check that fails the PR but marks it somehow >>>>> as "Ready for review" for example adding >>>>> a label "Ready for review" when the subset of tests succeeds. >>>>> >>>>> Also, this might not be needed (or less important) if we implement: >>>>> https://github.com/apache/airflow/issues/10507 "Selective Tests" >>>>> for which I have an open PR. They will give much bigger improvements - >>>>> because, in the vast majority of cases, the tests will take >>>>> very little time - giving feedback >>>>> about relevant tests in a few minutes rather than half-an-hour. We can >>>>> also combine those two. >>>>> >>>>> It seems that I managed to finish some of the stuff that I thought >>>>> will take more time, so I might come back to it next week >>>>> if it goes as well as I planned. >>>>> >>>>> J. >>>>> >>>>> >>>>> >>>>> On Fri, Oct 2, 2020 at 3:57 AM Daniel Imberman < >>>>> daniel.imber...@gmail.com> wrote: >>>>> >>>>>> I’m not too worried about that. I think people would learn pretty >>>>>> quickly. It hasn’t been an issue for the kubernetes community so I can’t >>>>>> imagine it being an issue for us. End-of-day, we only have a limited >>>>>> amount >>>>>> of compute power and this will increase the speed we merge the PR’s that >>>>>> have passed basic code quality checks. >>>>>> >>>>>> >>>>>> On Thu, Oct 1, 2020 at 2:19 PM, Tomasz Urbaszek <turbas...@apache.org> >>>>>> wrote: >>>>>> >>>>>> I think I can agree. Especially with flaky tests, some contributors >>>>>> may be confused that some of the tests don't work on CI but work >>>>>> locally... >>>>>> >>>>>> Checking the code quality is good first step. Once there's a review >>>>>> we can start tests on CI. >>>>>> >>>>>> On the other hand, I can see people asking for starting the tests or >>>>>> being even more confused why some PRs have more CI builds than others... >>>>>> >>>>>> Cheers, >>>>>> Tomek >>>>>> >>>>>> On Thu, Oct 1, 2020 at 10:29 PM Daniel Imberman < >>>>>> daniel.imber...@gmail.com> wrote: >>>>>> >>>>>>> Hello all, >>>>>>> >>>>>>> With the recent uptick in airflow contribution and pull requests, I >>>>>>> have a proposal that I hope will ensure that we do not find ourselves >>>>>>> in a >>>>>>> CI backlog hell. I noticed that on the Kubernetes project, pull >>>>>>> requests do >>>>>>> not run integration test until a committer submits a "ready to test" >>>>>>> command to the CI bot. This step can prevent draft PRs or un-reviewed >>>>>>> PRs >>>>>>> from taking github CI resources. It is worth noting that with breeze's >>>>>>> docker based testing system, users have the exact same testing >>>>>>> capabilities >>>>>>> locally as they would on our CI. >>>>>>> >>>>>>> I propose that we allow unverified PR's to run basic and static >>>>>>> tests, but not perform the full test suite or integration test without >>>>>>> first being reviewed. >>>>>>> >>>>>>> What does everyone think? >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> >>>>> Jarek Potiuk >>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer >>>>> >>>>> M: +48 660 796 129 <+48660796129> >>>>> [image: Polidea] <https://www.polidea.com/> >>>>> >>>>> >>>> >>>> -- >>>> >>>> Jarek Potiuk >>>> Polidea <https://www.polidea.com/> | Principal Software Engineer >>>> >>>> M: +48 660 796 129 <+48660796129> >>>> [image: Polidea] <https://www.polidea.com/> >>>> >>>> >> >> -- >> >> Jarek Potiuk >> Polidea <https://www.polidea.com/> | Principal Software Engineer >> >> M: +48 660 796 129 <+48660796129> >> [image: Polidea] <https://www.polidea.com/> >> >> > > -- > > Jarek Potiuk > Polidea <https://www.polidea.com/> | Principal Software Engineer > > M: +48 660 796 129 <+48660796129> > [image: Polidea] <https://www.polidea.com/> > > -- Jarek Potiuk Polidea <https://www.polidea.com/> | Principal Software Engineer M: +48 660 796 129 <+48660796129> [image: Polidea] <https://www.polidea.com/>