justinmclean opened a new pull request, #252:
URL: https://github.com/apache/airflow-steward/pull/252
> **Generated by the spec-driven build loop.** This eval suite and the docs
> update were produced by an autonomous run of `tools/spec-loop`
(`./loop.sh` —
> one work item, one branch, one PR). Authored by Claude (see the
`Generated-by`
> commit trailer) and reviewed + tested by a human before submission.
## What
Adds the missing `intervention` eval suite (8 cases) to the existing
`pr-management-mentor` skill's eval tree, covering the intervention-selection
decision: the out-of-scope and maintainer-engaged checks, the four
intervention
templates, the multi-trigger (`ask`) path, the no-trigger (`silent`) path,
and
the hand-off triggers.
Also syncs `docs/modes.md`: the Mentoring row moves from `proposed / 0
skills`
to `experimental / 1 skill`, and the section points at the shipped skill
rather
than a forward reference.
## Why
`pr-management-mentor` shipped without a matching eval suite for its
intervention-selection step, and the framework treats a skill without evals
as
incomplete. This back-fills that coverage so the skill's decision logic is
pinned by fixtures.
## Changes
- `tools/skill-evals/evals/pr-management-mentor/intervention/` — 8-case eval
suite (system prompt, user-prompt template, case fixtures).
- `docs/modes.md` — Mentoring row + skill table.
## Testing — and an issue the loop did not detect
The suite assembles cleanly and, after the fix below, an independent agent
run
matches all 8 cases against their ground truth.
On the **first** test pass, case-4 (`why-pushback`) failed. Its
`expected.json`
is `handoff` — which is correct per the skill: the contributor argues *after*
the agent already answered the "why" once, firing the skill's **hand-off
trigger
2** ("answer the why once, don't argue", defined in `hand-off.md`). But the
eval's own `system-prompt.md` never encoded the hand-off triggers, so a model
following the prompt returned `draft / 4` instead. The build loop generated
an
eval whose ground truth assumed skill logic its own prompt left out — a
self-inconsistency **the loop did not catch**. Fixed here by adding the four
hand-off triggers (documented `4 → 3 → 1 → 2` order) to `system-prompt.md`;
re-running the independent check then passed 8/8.
## Notes
- Eval + docs only; no skill behaviour (`SKILL.md`) is changed.
- The loop-detection gap is called out deliberately: it's a concrete data
point
that the build loop needs a self-consistency check between an eval's
fixtures
and the prompt it ships.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]