Hello here, I have seen only positive comments and a number of improvements and ideas already added - I am starting a [LAZY CONSENSUS] on attempting to gradually introduce the approach.
J.

On Wed, Mar 4, 2026 at 10:13 PM Jarek Potiuk <[email protected]> wrote:
>
> > I just fear that (soon) if AI costs are put to realistic price levels we need to check if contributors still have and get free AI bot access, else the idea is melting fast. (Low risk though, let's see if this happens we need to just change the approach... or look for funding)
>
> If that happens, we will not have to deal with the problem in the first place, because it will also be costly for those who create the slop, not only for us.
>
> Also - I assess (I will know more when I start doing it, and this is one of the things I am going to track over time) that ~90% of the filter for now is purely deterministic and FAST - I think the crux of the solution is not to employ the AI, but to assess as quickly as possible whether we should look at the PR at all.
>
> So this change is mostly a change to our process:
>
> a) maintainers won't look at drafts (firmly)
> b) clearly communicate to contributors that this will happen and specify what they need to do
> c) relentlessly and without hesitation (but with oversight) convert PRs to drafts when we quickly assess they are bad - and tell authors how to fix them
>
> The LLM there is just one of the checks - and the LLM check is fired only when all other easily and deterministically verifiable criteria are met. And I do hope that when we reach the LLM check it will mostly say "fine" - because it's very likely that those PRs are **actually** worth looking at. I think most of our future work as maintainers will be deciding what we want to accept (or work on) - rather than spending time assessing code quality nitpicks. For me this is a natural consequence of what we've always been doing with static code checks. I do remember times when (even in Airflow) our reviews included comments about bad formatting and missing licences.
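[Editor's note] The check ordering described above - cheap deterministic filters first, the LLM fired only when everything else passes - can be sketched roughly as below. All names are illustrative assumptions, not the actual `breeze pr auto-triage` code:

```python
# Sketch: run fast deterministic checks first; only fall through to the
# (slow, costly) LLM check when every deterministic criterion is met.
from dataclasses import dataclass


@dataclass
class PR:
    ci_green: bool = False
    has_description: bool = False
    has_tests: bool = False


def deterministic_checks(pr: PR) -> list[str]:
    """Fast filters - expected to catch ~90% of problematic PRs."""
    problems = []
    if not pr.ci_green:
        problems.append("CI is not green - please fix failing checks")
    if not pr.has_description:
        problems.append("PR description is missing")
    if not pr.has_tests:
        problems.append("No tests found for the change")
    return problems


def triage(pr: PR, llm_check) -> list[str]:
    """Only call the LLM when all deterministic checks pass."""
    problems = deterministic_checks(pr)
    if problems:
        return problems  # short-circuit: no LLM call needed
    return llm_check(pr)  # most PRs reaching this point should be fine
```

The short-circuit is the point: the LLM is just the last of the checks, so most bad PRs never cost a token.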
> Yes, that was the case - up until we introduced pre-commit (one reason I introduced it, and one of the first rules was to add licence headers automatically). This grew to over 170 checks that we don't even have to think about. I see what we are doing here as the natural next step.
>
> I am of course exaggerating a bit. I still review AI generated code and check its quality, asking agents to correct it when it doesn't meet my standards. In fact, I review it in detail because I learn something new every time. But I am exaggerating only slightly when describing the focus I think we as maintainers will need to prioritize in the future.
>
> Another thing - the ASF is already looking for a sponsor to cover AI usage for ASF maintainers. I also know at least one company considering giving free access (under certain conditions - not sponsoring, but related to what goal the tokens will be used for) to all OSS maintainers in general in case this will be needed in the future.
>
> J.
>
> On Wed, Mar 4, 2026 at 9:42 PM Jens Scheffler <[email protected]> wrote:
> >
> > I like the idea and also assume that we can adjust and improve rules and expectations over time.
> >
> > I just fear that (soon) if AI costs are put to realistic price levels we need to check if contributors still have and get free AI bot access, else the idea is melting fast. (Low risk though, let's see if this happens we need to just change the approach... or look for funding)
> >
> > On 04.03.26 08:13, Jarek Potiuk wrote:
> > >> Another manual step (and bottleneck) in triaging PRs is that maintainers will still need to approve CI runs on GitHub.
> > >
> > > Great point ... and ... it's already handled :) - look at my PR.
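[Editor's note] The licence-header example above boils down to the kind of deterministic check pre-commit runs. A minimal sketch - the real Airflow hooks insert full ASF headers rather than just flagging their absence, and the function name here is hypothetical:

```python
# Minimal sketch of a licence-header check of the kind pre-commit runs.
# The real Airflow pre-commit hooks can insert the header automatically;
# this only flags files where the ASF marker is missing near the top.

ASF_MARKER = "Licensed to the Apache Software Foundation"


def files_missing_licence(files: dict[str, str]) -> list[str]:
    """Return names of files lacking the ASF marker in their first 500 chars."""
    return [
        name
        for name, content in files.items()
        if ASF_MARKER not in content[:500]
    ]
```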
> > >
> > > When - during triage - the triager sees that workflow approval is needed, my nice little tool will print the diff of the incoming PR on the terminal and ask the triager to confirm that there is nothing suspicious; after saying "y", the workflow run will be approved.
> > >
> > > J.
> > >
> > > On Wed, Mar 4, 2026 at 3:35 AM Zhe-You Liu <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Thanks Jarek for bringing up the auto-triage idea!
> > >> Big +1 from me on the “let’s try” decision.
> > >>
> > >> I really like this feature; it can help avoid copy‑pasting or repeatedly writing similar instructions for contributors to fix baseline test failures.
> > >>
> > >> I had the same thoughts as Wei regarding flaky tests. Having deterministic checks or automated comments should be enough to handle flaky test issues, and contributors can still reach out on Slack to get their PRs reviewed, so this should not be a problem.
> > >>
> > >> Another manual step (and bottleneck) in triaging PRs is that maintainers will still need to approve CI runs on GitHub. It doesn’t seem safe to fully automate CI approval, as there could still be rare cases where an attacker creates a vulnerable PR that logs environment variables during tests. Even though we could use an LLM to check for these kinds of vulnerabilities before approving a CI run, it is still not as safe as a manual review in most cases (e.g. prompt injection attacks). I’m not sure whether anyone has a good idea for fully automated PR triaging -- for example, automatically approving CI, periodically checking test baselines for quality (via `breeze pr auto-triage`), re‑approving CI as needed, and continuing this loop until all CI checks are green.
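[Editor's note] The confirm-then-approve step described at the top of this message could look roughly like the sketch below. The approval endpoint (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/approve`) is the real GitHub REST API for approving fork workflow runs; the function names and the use of the `gh` CLI are illustrative assumptions, not the actual tool:

```python
# Sketch: show the incoming diff, ask the triager for explicit confirmation,
# and only then approve the pending workflow run via the GitHub API.
import subprocess


def confirmed(answer: str) -> bool:
    """Treat only an explicit 'y' / 'yes' as approval."""
    return answer.strip().lower() in ("y", "yes")


def review_and_approve(repo: str, pr_number: int, run_id: int) -> bool:
    # Print the PR diff so the triager can look for anything suspicious
    # (e.g. workflow changes or tests that log environment variables).
    subprocess.run(["gh", "pr", "diff", str(pr_number), "--repo", repo], check=True)
    answer = input("Nothing suspicious - approve the workflow run? [y/N] ")
    if not confirmed(answer):
        return False
    subprocess.run(
        ["gh", "api", "--method", "POST",
         f"repos/{repo}/actions/runs/{run_id}/approve"],
        check=True,
    )
    return True
```

Defaulting to "no" (anything but an explicit "y"/"yes" declines) keeps the human firmly in the loop, which matches the concern about attacker-controlled CI runs.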
> > >>
> > >> Best regards,
> > >> Jason
> > >>
> > >> On Tue, Mar 3, 2026 at 10:48 PM Vincent Beck <[email protected]> wrote:
> > >>
> > >>> I like the overall strategy; for sure the tool will need continuous iterations to handle all the different scenarios. But this is definitely needed - the number of open PRs just skyrocketed in the last few months, and it is very hard/impossible to keep track of everything.
> > >>>
> > >>> On 2026/03/03 14:39:41 Jarek Potiuk wrote:
> > >>>>>
> > >>>>> Thanks for bringing this up! Overall, I like this idea, but it's worth testing it for a bit before we enforce it, especially the LLM-verify part.
> > >>>>
> > >>>> Oh absolutely. My plan to introduce it is (after the community hopefully makes an overall "let's try" decision):
> > >>>>
> > >>>> * The human triager is always in the loop, quickly reviewing comments just before they are posted to the user (until we achieve high confidence)
> > >>>> * I plan to run it myself as the sole triager for some time to perfect it and to pay much more attention initially. I will start with smaller groups/areas of code and expand as we go - possibly adding more maintainers willing to participate in triaging and testing/improving the tool
> > >>>> * See how quickly we can do it on a regular basis - whether we need several triagers or perhaps one rotational triager handling all PRs from all areas at a time.
> > >>>> * Possibly further automate it. My assessment is that we will have 90% of deterministic "fails" - those we can easily automate without hesitation once the process and expectations are in place. The LLM part is a bit more nuanced and we can decide after we try.
> > >>>>
> > >>>>> * The author ensures the PR passes ALL the checks and tests (i.e. green).
> > >>>>>> It might sometimes mean we have to react even more quickly to `main` breakages, and probably provide some "status" info and exceptions when we know main is broken.
> > >>>>>
> > >>>>> Probably, we should exempt some checks that might be flaky?
> > >>>>
> > >>>> Yeah - this part is a bit problematic - but we can likely also add an easy, automated, deterministic check to see if the failure is happening for others. Sending an automated comment like, "Please rebase now, the issue is fixed," to the authors would be super useful when they see unrelated failures. This is something we **should** figure out during testing. There will be plenty of opportunities :D
> > >>>>
> > >>>>>> * All PRs that do not meet this requirement will be converted to Drafts with automated suggestions (reviewed quickly and efficiently by a triager) provided to the author on the next steps.
> > >>>>>
> > >>>>> This will be super helpful! I also do it manually from time to time.
> > >>>>
> > >>>> Yes. I believe converting to Draft is an extremely strong (but fair) signal to the author: "Hey, you have work to do."
> > >>>>
> > >>>> Also, when this is accompanied by an actionable comment like, "Here is what you should do and here is the link describing it," it immediately filters out people who submit PRs without much work.
> > >>>>
> > >>>> Surely - they might feed the comment into their agent anyway (or it can read it automatically and act). But if our tool is faster and cheaper and more accurate (because of the smart human in the driver's seat) than their tools, we gain the upper hand. And it should be faster - because we only check the expectations rather than figuring out what to do.
> > >>>>
> > >>>> Then in the worst case we will have continuous ping-pong (Draft -> Undraft -> Draft), but we will control how fast this loop runs. Generally, our goal should be to slow it down rather than respond immediately; for example, running it daily or every two days is a good idea.
> > >>>>
> > >>>> Effectively, if the PR is in the "ready for maintainer review" state, the maintainer should be quite certain that the code quality, tests, etc., are all good. Only then should they take a look (and they can immediately say, "No, this is not what we want") - and this is absolutely fine as well. We should not optimize for contributors spending time on work we might not accept. This is deliberately not a goal for me. This will automatically mean that new contributors who want to contribute significant changes will mostly waste a lot of time and their PRs will be rejected.
> > >>>>
> > >>>> This is largely what we are already doing, mostly because those PRs do not follow our "tribal knowledge," which the agent cannot easily derive. Naturally, new contributors should start with small, easy-to-complete tasks that can be easily discarded if reviewers reject them. This is what we always asked people to start with. So this approach with the triage tool also largely supports this: someone new rewriting the proverbial scheduler will have to spend significant time ensuring "auto-triage" passes, only to have the idea completely rejected by the reviewer or be asked for a complete rewrite. And this is perfectly fine. We always encouraged newcomers to start with small tasks, learn the basics, and "grow" until they were ready to propose bigger changes or split them into much smaller chunks.
> > >>>> With "auto-triage" this will be natural and expected, requiring authors to invest more time and effort to reach the "ready for review" status.
> > >>>>
> > >>>> And I think it's absolutely fair and restores the balance we so badly need now.
> > >>>>
> > >>>>>
> > >>>>> Best,
> > >>>>> Wei
> > >>>>>
> > >>>>>> On Mar 3, 2026, at 9:34 PM, Jarek Potiuk <[email protected]> wrote:
> > >>>>>>
> > >>>>>> *TL;DR: I propose a stricter (automation-assisted) approach for the "ready for review" state and clearer expectations for contributors regarding when maintainers review PRs of non-collaborators.*
> > >>>>>>
> > >>>>>> Following the https://lists.apache.org/thread/8tzwwwd7jmtmfo4j9pzg27704g10vpr4 thread, where I showcased a tool that I claude-coded, I would like to have a (possibly short) discussion on this subject and reach a stage where I can attempt to try the tool out.
> > >>>>>>
> > >>>>>> *Why?*
> > >>>>>>
> > >>>>>> Because we maintainers are overwhelmed and burning out, we no longer see how our time invested in Airflow can bring significant returns to us (personally) and the community.
> > >>>>>>
> > >>>>>> While some of us spend a lot of time reviewing, commenting on, and merging code, with the current rate of AI-generated PRs and other things we do, this is not sustainable. Also, there is a mismatch - or lack of clarity - regarding the quality expectations for the PRs we want to review.
> > >>>>>> *Social Contract Issue*
> > >>>>>>
> > >>>>>> We are a good (I think) open source project with a thriving community and a great group of maintainers who are also friends and like to work with each other, but also are very open to bringing new community members in. As maintainers, we are willing to help new contributors grow and generally willing to spend some of our time doing so. This is the social contract we signed up for as OSS maintainers and as committers for the Apache Software Foundation PMC. Community Over Code.
> > >>>>>>
> > >>>>>> However, this social contract - this community-building aspect - is currently heavily imbalanced, because AI-generated content takes away time, focus and energy from the maintainers. Instead of having meaningful discussions in PRs about whether changes are needed and communicating with people, we start losing time talking to - effectively - AI agents about hundreds of smaller and bigger things that should not be there in the first place. Currently, collaboration and community building suffer. Even if real people submit code generated by agents (which is becoming really good, fast and cheap to produce), we simply lack the time as maintainers to have meaningful conversations with the people behind those agents.
> > >>>>>>
> > >>>>>> Sometimes we lose time talking to agents. Sometimes we lose time talking to people who have zero understanding of what they are doing and submit continuous crap, and we should not be having that conversation at all.
> > >>>>>> Sometimes, we just look at the number of PRs opened in a given day in despair, dreading even trying to bring order to them.
> > >>>>>>
> > >>>>>> And many of us also have some "work" to do or a "feature" to work on top of that.
> > >>>>>>
> > >>>>>> I think we need to reclaim the maintainers' collective time to focus on what matters: delegating more responsibility to authors so they meet our expected quality bar (and efficiently verifying it with tools, without losing time and focus).
> > >>>>>>
> > >>>>>> *What do we have now?*
> > >>>>>>
> > >>>>>> We have already done a lot to help with it - AGENTS.md. The PR guidelines, overhauled by Kaxil and updated by others, will certainly help clarify expectations for agents in the future. I know Kaxil is also exploring a way to enable automated Copilot code reviews in a manner that will not be too "dehumanizing" and will work well. This is all good. The better the agents people use and the more closely they follow those instructions, the higher the quality of incoming PRs will be. But we also need to help maintainers easily identify what to focus on - distinguishing work-in-progress and unfinished PRs that need work from those truly "Ready for (human) review."
> > >>>>>>
> > >>>>>> *How?*
> > >>>>>>
> > >>>>>> My proposal has two parts:
> > >>>>>>
> > >>>>>> * Define and communicate expectations for PRs that maintainers can manage.
> > >>>>>> * Relentlessly automate it to ensure expectations are met and that maintainers can easily focus on those PRs that are "Ready for review."
> > >>>>>>
> > >>>>>> My tool (needs a bit more fine-tuning and refinement): https://github.com/apache/airflow/pull/62682 `*breeze pr auto-triage*` is designed to do exactly this: automate those expectations by auto-triaging the PRs. It not only converts them to Draft when they are not yet "Ready For Review," but also provides actionable, automated (deterministic + LLM) comments to the authors. A concrete maintainer (the current triager) is using the tool very efficiently.
> > >>>>>>
> > >>>>>> *Proposed expectations (for non-collaborators):*
> > >>>>>>
> > >>>>>> Those are not "new" expectations. Really, I'm proposing we completely delegate the responsibility for fulfilling those expectations to the author (with helpful, automated comments - reviewed and confirmed by a human triager for now). And simply be very clear that generally no maintainer will look at a PR until:
> > >>>>>>
> > >>>>>> * The author ensures the PR passes ALL the checks and tests (i.e. green). It might sometimes mean we have to react even more quickly to `main` breakages, and probably provide some "status" info and exceptions when we know main is broken.
> > >>>>>>
> > >>>>>> * The author follows all PR guidelines (LLM-verified) regarding description, content, quality, and presence of tests.
> > >>>>>>
> > >>>>>> * All PRs that do not meet this requirement will be converted to Drafts with automated suggestions (reviewed quickly and efficiently by a triager) provided to the author on the next steps.
> > >>>>>>
> > >>>>>> * Drafts with no activity will be more aggressively pruned by our stalebot.
> > >>>>>> The triager is there mostly to quickly assess and generate comments - with tool/AI assistance. The triager won't be the one who actually reviews those PRs when they are "ready for review."
> > >>>>>>
> > >>>>>> * Only after that do we mark the PR as "*ready for maintainer review*" (label)
> > >>>>>>
> > >>>>>> * Only such PRs should be reviewed, and it is entirely up to the author to make them ready.
> > >>>>>>
> > >>>>>> Note: This approach is only for non-collaborators. For collaborators, we might have just one expectation - mark your PR with "ready for maintainer review" when you think it's ready. We accept people as committers and collaborators because we already know they generally know and follow the rules; automating this step isn't necessary.
> > >>>>>>
> > >>>>>> This is nothing new; we've already been doing this with humans handling all the heavy lifting, without much strictness or organization, but this is no longer sustainable.
> > >>>>>>
> > >>>>>> I propose we make the expectations explicit, communicate them clearly, and relentlessly automate their execution.
> > >>>>>>
> > >>>>>> I would love to hear what y'all think.
> > >>>>>>
> > >>>>>> J.
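[Editor's note] Taken together, the expectations in the proposal above reduce to a small per-PR decision for non-collaborators. A sketch under the assumptions of this thread - the action strings and parameter names are illustrative, not the tool's actual labels:

```python
# Sketch of the proposed triage decision for a non-collaborator PR.

def triage_action(is_draft: bool, ci_green: bool, guidelines_ok: bool) -> str:
    if is_draft:
        return "skip"  # maintainers firmly do not look at drafts
    if not (ci_green and guidelines_ok):
        # converted back to Draft, with an actionable comment on next steps
        return "convert-to-draft"
    return "label-ready-for-maintainer-review"
```

The worst case is the Draft -> Undraft -> Draft ping-pong mentioned earlier, which stays manageable because the community controls how often this function is run (e.g. daily).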
> > >>>>>
> > >>>>> ---------------------------------------------------------------------
> > >>>>> To unsubscribe, e-mail: [email protected]
> > >>>>> For additional commands, e-mail: [email protected]
