justinmclean opened a new pull request, #158:
URL: https://github.com/apache/airflow-steward/pull/158

   ## What this adds
   
   A small, model-agnostic eval harness under `tools/skill-evals/` for
   verifying that skill steps produce the expected structured output.
   
   The runner prints the exact system prompt and user prompt for each fixture
   case so you can paste them into any model and compare the response against
   `expected.json`. Because nothing calls a specific API, the workflow stays
   vendor-neutral.
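
   To make the paste-and-compare loop concrete, here is a minimal sketch of what the runner could look like. The `case.json` file name, its `system_prompt`/`user_prompt` keys, and the per-case directory layout are illustrative assumptions, not the harness's actual structure; only `expected.json` comes from the description above.

```python
import json
from pathlib import Path


def print_case(case_dir: Path) -> None:
    """Print the exact prompts for one fixture case, ready to paste into any model."""
    case = json.loads((case_dir / "case.json").read_text())
    print("=== system prompt ===")
    print(case["system_prompt"])
    print("=== user prompt ===")
    print(case["user_prompt"])


def compare(case_dir: Path, response_text: str) -> bool:
    """Deep-compare a pasted model response against the case's expected.json."""
    expected = json.loads((case_dir / "expected.json").read_text())
    try:
        actual = json.loads(response_text)
    except json.JSONDecodeError:
        # The model did not return valid JSON, so it cannot match.
        return False
    return actual == expected
```

   The comparison is deliberately an exact structural equality check on parsed JSON, so key order and whitespace in the model's response do not matter.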
   
   This is an early idea; I'm sharing it for feedback before going further. A few open questions:
   
   - Should fixture cases live next to their skill rather than under `tools/skill-evals/evals/`?
   - Is manual paste-and-compare enough, or do we want automated scoring via the API? If so, how do we keep that vendor-neutral?
   - Is there a simpler approach entirely?
   
   Happy to rework or close in favour of a better idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to