Hi Jarek,

I really like the idea. The three-bucket framing (accept-as-is / heuristics-need-work / genuine human judgment) is a clean way to make the automation roadmap data-driven instead of opinion-driven, and the skill-to-deterministic-CI path closes the loop nicely with PRINCIPLES.md §5 <https://github.com/apache/airflow-steward/pull/147> (probabilistic outputs, deterministic gates).

Two small things worth pinning down early, mostly to keep the loop honest as it grows:

1. *§10 (no default telemetry).* As long as the stats stay an audit-log artefact each adopter gathers from their own sessions, with no central aggregation on by default, we are fine. Worth saying out loud in the Mode-D design so it does not drift into a phone-home pipeline later.

2. *§7 (sentiment gates graduation).* Accept-rate is throughput, and §7 explicitly says throughput alone does not qualify. To promote a skill from "experimental" to "default" or eventually into CI, Mode-D probably needs a sentiment signal alongside the bucket stats, not in place of them.

If both are baked in from the start, the self-improvement loop you described becomes one of the strongest evidence-generation paths we have for the eval and graduation story.

Happy to help shape the stats schema if that would be useful.

Thanks,
André Ahlert

On Wed, May 27, 2026 at 17:47, Jarek Potiuk <[email protected]> wrote:

> Hello here,
>
> I have run a few PR triage sessions with Airflow and started to gather
> stats and analysis of how well triage sessions are "just accept what the
> agent proposes" vs. "need improvement in heuristics" and "genuine human
> judgment needed".
>
> You can see the first result here [1], which also includes some
> proposals for heuristic improvements that I am applying now.
>
> Two PRs came out of this: [2] updates the SKILL to self-analyse
> sessions and to propose heuristic improvements, and [3] is the PR with
> the improvements.
>
> Once we gather more data, we might start proposing that some of those
> triage skills be converted into fully automated triage actions - for
> example by generating deterministic Python scripts reflecting the SKILLs'
> actions.
>
> This way those triage actions could simply be performed as part of CI. We
> can keep them updated by having humans run the triage sessions, and when
> any of the SKILLs is updated, the deterministic Python scripts might also
> be updated.
>
> This might create a nice self-improvement loop where after every triage
> session, you can identify improvements, and areas suitable for full
> automation - making triage better with every loop.
>
> Almost every day I find new ways this whole process feeds itself, with
> humans who use it to teach the SKILLs to be better.
>
> Jarek
>
> [1] Mode-D stats Gist
> https://gist.github.com/potiuk/c419315f2ac318f74a3e63134757723a
> [2] PR triage stats persistence
> https://github.com/apache/airflow-steward/pull/343
> [3] Improvements to PR triage heuristics
> https://github.com/apache/airflow-steward/pull/344
>
> J.
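
As a starting point for the stats schema mentioned above, here is a minimal sketch in Python of how a graduation gate could combine the three bucket counts with a sentiment signal. Everything in it is an assumption for illustration: the `SkillStats` and `can_graduate` names, the 0-to-1 sentiment scale, and the thresholds are hypothetical and not taken from Mode-D, PRINCIPLES.md, or the linked PRs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SkillStats:
    """Locally gathered counts for one triage skill (hypothetical schema)."""

    accept_as_is: int = 0
    heuristics_need_work: int = 0
    human_judgment: int = 0
    # Hypothetical sentiment signal: maintainer ratings, each in [0.0, 1.0].
    sentiment_scores: tuple[float, ...] = ()

    def total(self) -> int:
        return self.accept_as_is + self.heuristics_need_work + self.human_judgment

    def accept_rate(self) -> float:
        total = self.total()
        return self.accept_as_is / total if total else 0.0

    def mean_sentiment(self) -> float | None:
        if not self.sentiment_scores:
            return None  # no sentiment collected yet
        return sum(self.sentiment_scores) / len(self.sentiment_scores)


def can_graduate(
    stats: SkillStats,
    min_decisions: int = 20,       # assumed threshold
    min_accept_rate: float = 0.9,  # assumed threshold
    min_sentiment: float = 0.7,    # assumed threshold
) -> bool:
    """Require throughput AND sentiment; accept-rate alone never qualifies."""
    sentiment = stats.mean_sentiment()
    if sentiment is None:
        return False
    return (
        stats.total() >= min_decisions
        and stats.accept_rate() >= min_accept_rate
        and sentiment >= min_sentiment
    )
```

The one design choice worth noting is that a missing sentiment signal blocks graduation outright rather than defaulting to a pass, so a high accept-rate on its own can never promote a skill. Nothing here sends data anywhere; the record is meant to live in each adopter's own audit log.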
