justinmclean opened a new pull request, #512:
URL: https://github.com/apache/airflow-steward/pull/512
## Summary
### What
Adds the `release-announce-draft` skill (Steps 0–3: pre-flight check, draft
the `[ANNOUNCE]` email, propose the site-bump PR) plus a 9-case behavioral
eval suite, and extends the skill-eval runner so the prose-producing steps
grade automatically instead of falling to manual review.
### Why
Steps 2 and 3 emit free-form prose (the announce body, the PR body), so their
`expected.json` files assert *properties* (`has_*` / `mention_*`) rather than
exact text. The runner previously reported those four cases as `MANUAL`,
leaving the suite unable to fail on a regression. They're now graded.
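As a rough illustration of the property-versus-exact split described above, the sketch below separates `has_*` / `mention_*` keys from fields that are compared exactly. The function name `split_expected`, the sample keys, and the sample subject line are all hypothetical; the actual `expected.json` schema is defined by the fixtures in this PR.

```python
# Hypothetical sketch: not the runner's actual code.
STRUCTURAL_PREFIXES = ("has_", "mention_")


def split_expected(expected: dict) -> tuple[dict, dict]:
    """Separate property assertions from fields that are compared exactly."""
    structural = {
        key: value
        for key, value in expected.items()
        if key.startswith(STRUCTURAL_PREFIXES)
    }
    exact = {key: value for key, value in expected.items() if key not in structural}
    return structural, exact


# Illustrative expected.json content (keys and values are assumptions).
expected = {
    "subject": "[ANNOUNCE] Example 1.2.3 released",
    "proposed": True,
    "has_download_link": True,
    "mention_version": True,
}

structural, exact = split_expected(expected)
```

Here `structural` holds the two property assertions, while `subject` and `proposed` stay in `exact` and would still be compared literally.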
### How
- **Skill + evals**: `skills/release-announce-draft/SKILL.md` and
`tools/skill-evals/evals/release-announce-draft/` (step-0 / step-2 / step-3
fixtures, output specs, READMEs).
- **Structural assertions** (`runner.py`): each `has_*` / `mention_*` key maps
to a predicate declared in the fixtures dir's `assertions.json`.
Deterministic types (`regex`, `contains`, `contains_all`, `empty`,
`non_empty`, `field_true`) run locally; `judge` pipes a one-line yes/no
rubric to the grader CLI. Decision fields (subject, backend, `proposed`, …)
are still compared exactly. A structural fixtures dir with no
`assertions.json` still falls back to `MANUAL`, so existing suites are
unaffected.
- **Security**: for the prompt-injection case, the load-bearing checks are
deterministic (`proposed` is `true`, `scope_violations` is empty), so the
guarantee doesn't depend on a probabilistic judge. A judge error or
disagreement fails the case — it never silently passes.
- **Tests**: `test_runner.py` coverage for `load_assertions`, the
deterministic predicates, `batch_judge_assertions`, and `compare_structural`,
with `_judge_yes.py` / `_judge_no.py` mock graders.
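The deterministic predicate types and the fail-closed judge behaviour listed above could be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in `runner.py`: the spec keys (`type`, `field`, `pattern`, `value`, `values`), the helper names `check` and `judge_verdict`, and the sample output are all hypothetical.

```python
import re
from typing import Any, Optional


def check(spec: dict[str, Any], output: dict[str, Any]) -> bool:
    """Evaluate one deterministic assertion spec against a skill's output."""
    kind = spec["type"]
    value = output.get(spec.get("field", "body"))
    if kind == "regex":
        return isinstance(value, str) and re.search(spec["pattern"], value) is not None
    if kind == "contains":
        return isinstance(value, str) and spec["value"] in value
    if kind == "contains_all":
        return isinstance(value, str) and all(v in value for v in spec["values"])
    if kind == "empty":
        return not value
    if kind == "non_empty":
        return bool(value)
    if kind == "field_true":
        return value is True
    raise ValueError(f"unknown assertion type: {kind}")


def judge_verdict(raw: Optional[str]) -> bool:
    """Fail closed: only an unambiguous 'yes' passes; errors and noise fail."""
    return raw is not None and raw.strip().lower() == "yes"


# Illustrative assertions.json content and skill output (both assumptions).
assertions = {
    "has_version": {"type": "regex", "field": "body", "pattern": r"\b\d+\.\d+\.\d+\b"},
    "mention_download": {"type": "contains_all", "field": "body", "values": ["Download", "https://"]},
    "has_proposal": {"type": "field_true", "field": "proposed"},
    "has_no_scope_violations": {"type": "empty", "field": "scope_violations"},
}
output = {
    "body": "Example 1.2.3 is released. Download: https://example.invalid/dl",
    "proposed": True,
    "scope_violations": [],
}

results = {name: check(spec, output) for name, spec in assertions.items()}
```

Treating anything other than a clean "yes" as a failure mirrors the stated rule that a judge error or disagreement fails the case, and keeping the `proposed` and `scope_violations` checks deterministic means they never depend on the judge at all.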
## Type of change
- [X] Skill change (`.claude/skills/<name>/`) — eval fixtures updated below
- [ ] Tool / bridge contract (`tools/<system>/*.md`)
- [ ] Python package (`tools/*/` with `pyproject.toml`)
- [ ] Groovy reference impl
- [ ] Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
- [ ] Documentation (`docs/`, `README.md`, `CONTRIBUTING.md`)
- [ ] Project template (`projects/_template/`)
- [ ] CI / dev loop (`prek`, workflows, validators)
- [ ] Other:
## Test plan
- [X] `prek run --all-files` passes
- [ ] For Python packages touched: `uv run pytest` / `ruff check` / `mypy` passes
- [ ] For Groovy bridges touched: command-line invocation tested end-to-end
- [X] For skill changes: eval suite passes for the affected skill
(`PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner
tools/skill-evals/evals/<skill>/`)
- [ ] For skill *behaviour* changes: a new or updated eval fixture is included
in this PR (a regression test for the bug fixed / the behaviour added; see
CONTRIBUTING.md)
- [ ] Other:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.