While I think it's a good idea, I also think it would be great to have a third party run such evaluations on your site. You could ask people to contribute there - rather than having an official AIP and something "in-airflow" - simply because we do not yet know what value it can bring to the community or what kind of burden it will introduce. Dev-iL's questions about how to run it, whether it should block CI, and the value and "official status" of certificates issued officially by the Airflow community are probably the most important questions that need answering.
I think that, since this is pretty orthogonal to Airflow's regular work and features, I would rather (in your case) focus on running such evaluations and demonstrating their value outside of the core Airflow efforts. For years, we have been preaching "what we can remove from Airflow and have run by the community, rather than what we can add." This immediately looks like something you can run outside for a while - and demonstrate how it adds value - both to Airflow and to your `ai-evals.io` site and schema. There is no clear "standard" in evaluation frameworks. Many exist, and perhaps more established standards will arise eventually. That would be a great moment for us to adopt one, but until then I think it's more of a "let's wait and see" from the community standpoint. You could keep working on the standard and demonstrating how it might be good for Airflow; then your solution might be a good candidate to build that standard around. I think it can easily be part of our ecosystem page https://airflow.apache.org/ecosystem/ and you are absolutely welcome to share any progress here and ask people to participate, but I think it's quite a bit too early for it to be "adopted" by the community.

J.

On Mon, Feb 23, 2026 at 9:09 AM Ephraim Anierobi <[email protected]> wrote:

> Hi Alex,
>
> I share the same concerns as Dev-iL regarding running it in the CI (which is the best place to run this kind of test/exam) and the cost.
>
> For the AIP, the two other big risks I see:
>
> 1. You mentioned that pass/fail won’t be stable as one of the downsides of the AIP, and that's true because LLM answers can change from run to run, and also change when the model or version changes. So a “cert/conformance” label could be misleading and likely cause arguments in AIPs. Debates could shift from implementation to exam semantics.
>
> 2. It seems like the system can be gamed. Scores can shift because the grader changes, and people may end up optimizing for the grader instead of quality. What do you think?
>
> Best,
> Ephraim
>
> On Mon, 23 Feb 2026 at 05:47, Dev-iL <[email protected]> wrote:
>
> > Hi Alex,
> > Thank you for the interesting suggestion!
> >
> > I have several questions about practical aspects of these evaluations:
> > 1. Who is supposed to run them, and at what stage? It sounds to me that for maximum benefit this should run as part of CI, at least when LLM-facing features are modified. If so, where are we going to get API credits to run this?
> > 2. As you mentioned, this is a rapidly developing space with new models popping up on a regular basis. What is the benefit of knowing that evals passed at a given point in time with a given model? Not all users have access to all LLMs, and results obtained on one model don't tell us how another model would behave. Say an AIP was drafted and evaluated on ChatGPT 5.3, and by the time it reaches the user, the latest version might be 6.1. Do we expect users to use an older model just for interacting with Airflow? Do we expect users to submit certs if they tried the code on new models?
> >
> > Would appreciate your clarifications!
> > Dev-iL
> >
> > P.S.
> > The spec link is broken.
> >
> > On Mon, 23 Feb 2026, 4:47 Alex, <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to propose an AIP [1] to establish a shared benchmark and conformance standard for AI capabilities in Airflow. Sharing this to gather feedback and rough consensus.
> > >
> > > The core idea is to give AI capabilities an exam.
Define what "qualified" > > > looks like for a given role - DAG Operator, DAG Code Generator, DAG > > Fixer, > > > Migration Agent - and let anyone reproduce that result. Once the exam > > > exists, the conversation about whether a feature is ready becomes > > concrete. > > > Each AI-related AIP can declare which roles it targets and ship a > > > corresponding exam suite, without depending on another AIP's roadmap. > > The > > > same goes for providers or anyone experimenting in this fast moving > > space. > > > It provides a common language for AI evaluation (often referred to as AI > > > Evals). > > > > > > *A useful example is already running against a real Airflow localization > > > skill,* with a viewable cert here [2]. A simpler non-Airflow example is > > > also available [3]. The exam showcases the pattern that allows us to > > > produce machine readable and human comparable outputs for easy > > > collaboration, regardless of the aspects of the black box (Agent, Model, > > > Skill, MCP). > > > > > > I gave a lightning talk on this topic at Airflow Summit 2025 [4] and have > > > been building toward it since: evals-playground [5] holds working > > examples > > > against Airflow AI capabilities, ai-evals.io [6] explains the pattern to > > > people from different backgrounds, eval-ception [7] demonstrates it > > > hands-on, and the exam spec [8] is taking shape at ai-evals.io/spec. > > > > > > Let me know if you have any questions. > > > > > > - [1] Draft proposal - > > > https://docs.google.com/document/d/1KvEX9zdq9-NMfSnl_qvET5SgSeujF-Zz/ > > > - [2] Cert viewer - Airflow localizer exam: > > > > > > > > https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/airflow-localizer-es/airflow-es-localizer-exam-pydantic-agent.cert.json > > > - [3] > > > > > > > > https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/simple-exam/simple-exam.ollama-agent.cert.json > > > - [4] Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem > > - > > > Airflow Summit 2025 - Lightning Talk (5 min) - > > > > > > > > https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html > > > - [5] https://github.com/Alexhans/evals-playground > > > - [6] https://ai-evals.io/ > > > - [7] https://github.com/Alexhans/eval-ception > > > - [8] https://ai-evals.io/spec/cert/v0.1.0/schema.json > > > > > > Alex Guglielmone Nemi > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
