Hi all,

I'd like to propose an AIP [1] to establish a shared benchmark and
conformance standard for AI capabilities in Airflow. Sharing this to gather
feedback and rough consensus.

The core idea is to give AI capabilities an exam. Define what "qualified"
looks like for a given role - DAG Operator, DAG Code Generator, DAG Fixer,
Migration Agent - and let anyone reproduce that result. Once the exam
exists, the conversation about whether a feature is ready becomes concrete.
Each AI-related AIP can declare which roles it targets and ship a
corresponding exam suite, without depending on another AIP's roadmap. The
same goes for providers or anyone experimenting in this fast-moving space.
It provides a common language for AI evaluation (often referred to as AI
Evals).

*A useful example is already running against a real Airflow localization
skill,* with a viewable cert here [2]. A simpler non-Airflow example is
also available [3]. The exam showcases the pattern that lets us produce
machine-readable, human-comparable outputs for easy collaboration,
regardless of what sits inside the black box (Agent, Model, Skill, MCP).
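To make "machine-readable and human-comparable" a bit more concrete, here is
a purely illustrative Python sketch of the idea: graded exam tasks scored
into a cert-like summary dict. The field names and pass threshold are my own
assumptions for illustration, not the actual cert schema from the spec [8].

```python
import json

def run_exam(role: str, tasks: list[tuple[str, bool]], pass_rate: float = 0.8) -> dict:
    """Score graded task results into a cert-like summary dict.

    This is a sketch, not the real ai-evals.io cert format: field names
    and the pass_rate threshold are illustrative assumptions.
    """
    passed = sum(1 for _, ok in tasks if ok)
    score = passed / len(tasks)
    return {
        "role": role,                     # e.g. one of the roles from the proposal
        "tasks_total": len(tasks),
        "tasks_passed": passed,
        "score": round(score, 3),
        "qualified": score >= pass_rate,  # assumed threshold, a tunable knob
    }

# A hypothetical "DAG Fixer" exam with three graded tasks:
cert = run_exam("DAG Fixer", [
    ("fix-import-error", True),
    ("fix-cycle", True),
    ("fix-schedule", False),
])
print(json.dumps(cert, indent=2))
```

The point is that the same JSON can be diffed by machines and eyeballed by
humans, whatever produced it.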

I gave a lightning talk on this topic at Airflow Summit 2025 [4] and have
been building toward it since: evals-playground [5] holds working examples
against Airflow AI capabilities, ai-evals.io [6] explains the pattern to
people from different backgrounds, eval-ception [7] demonstrates it
hands-on, and the exam spec [8] is taking shape at ai-evals.io/spec.

Let me know if you have any questions.

- [1] Draft proposal -
https://docs.google.com/document/d/1KvEX9zdq9-NMfSnl_qvET5SgSeujF-Zz/
- [2] Cert viewer - Airflow localizer exam:
https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/airflow-localizer-es/airflow-es-localizer-exam-pydantic-agent.cert.json
- [3]
https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/simple-exam/simple-exam.ollama-agent.cert.json
- [4] Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem -
Airflow Summit 2025 - Lightning Talk (5 min) -
https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html
- [5] https://github.com/Alexhans/evals-playground
- [6] https://ai-evals.io/
- [7] https://github.com/Alexhans/eval-ception
- [8] https://ai-evals.io/spec/cert/v0.1.0/schema.json

Alex Guglielmone Nemi
