[PR] feat(skill): add release-announce-draft skill with auto-graded eval suite [airflow-steward]

via GitHub Sat, 13 Jun 2026 01:21:35 -0700


justinmclean opened a new pull request, #512:
URL: https://github.com/apache/airflow-steward/pull/512


   ## Summary
   
   ### What
   
   Adds the `release-announce-draft` skill (Steps 0–3: pre-flight check, draft
   the `[ANNOUNCE]` email, propose the site-bump PR) plus a 9-case behavioral
   eval suite, and extends the skill-eval runner so the prose-producing steps
   grade automatically instead of falling to manual review.
   
   ### Why
   
   Steps 2 and 3 emit free-form prose (the announce body, the PR body), so their
   `expected.json` files assert *properties* (`has_*` / `mention_*`) rather than
   exact text. The runner previously reported those four cases as `MANUAL`,
   leaving the suite unable to fail on a regression. They're now graded.
   
   ### How
   
   - **Skill + evals**: `skills/release-announce-draft/SKILL.md` and
     `tools/skill-evals/evals/release-announce-draft/` (step-0 / step-2 / step-3
     fixtures, output specs, READMEs).
   - **Structural assertions** (`runner.py`): each `has_*` / `mention_*` key maps
     to a predicate declared in the fixtures dir's `assertions.json`.
     Deterministic types (`regex`, `contains`, `contains_all`, `empty`,
     `non_empty`, `field_true`) run locally; `judge` pipes a one-line yes/no
     rubric to the grader CLI. Decision fields (subject, backend, `proposed`, …)
     are still compared exactly. A structural fixtures dir with no
     `assertions.json` still falls back to `MANUAL`, so existing suites are
     unaffected.
   - **Security**: for the prompt-injection case, the load-bearing checks are
     deterministic (`proposed` is `true`, `scope_violations` is empty), so the
     guarantee doesn't depend on a probabilistic judge. A judge error or
     disagreement fails the case — it never silently passes.
   - **Tests**: `test_runner.py` coverage for `load_assertions`, the
     deterministic predicates, `batch_judge_assertions`, and `compare_structural`,
     with `_judge_yes.py` / `_judge_no.py` mock graders.
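
For readers unfamiliar with the approach, the grading flow described above might look roughly like the sketch below. It is illustrative only and is not the code in `runner.py`: the assertion schema (`type`, `field`, `pattern`, `needle`, `needles`, `rubric`), the function names `evaluate_predicate` and `grade_case`, and the sample data are all assumptions made for this example. It shows keys with a declared assertion going through a predicate, all other keys being compared exactly, and any judge error failing the case.

```python
import re
from typing import Any, Callable, Optional

# Hypothetical assertion shape; the real assertions.json schema may differ.
Assertion = dict[str, Any]
Judge = Callable[[str, dict[str, Any]], bool]


def evaluate_predicate(assertion: Assertion, output: dict[str, Any]) -> bool:
    """Evaluate one deterministic predicate against a skill's output dict."""
    kind = assertion["type"]
    value = output.get(assertion["field"])
    if kind == "regex":
        return isinstance(value, str) and re.search(assertion["pattern"], value) is not None
    if kind == "contains":
        return isinstance(value, str) and assertion["needle"] in value
    if kind == "contains_all":
        return isinstance(value, str) and all(n in value for n in assertion["needles"])
    if kind == "empty":
        return value is None or (hasattr(value, "__len__") and len(value) == 0)
    if kind == "non_empty":
        return hasattr(value, "__len__") and len(value) > 0
    if kind == "field_true":
        return value is True
    raise ValueError(f"unknown predicate type: {kind}")


def grade_case(
    expected: dict[str, Any],
    output: dict[str, Any],
    assertions: dict[str, Assertion],
    judge: Optional[Judge] = None,
) -> list[str]:
    """Return failure messages; an empty list means the case passes."""
    failures: list[str] = []
    for key, want in expected.items():
        assertion = assertions.get(key)
        if assertion is None:
            # No predicate declared: treat as a decision field, compared exactly.
            if output.get(key) != want:
                failures.append(f"{key}: expected {want!r}, got {output.get(key)!r}")
            continue
        try:
            if assertion["type"] == "judge":
                if judge is None:
                    raise RuntimeError("no judge configured")
                got = judge(assertion["rubric"], output)
            else:
                got = evaluate_predicate(assertion, output)
        except Exception as exc:
            # Fail closed: an error is recorded as a failure, never a pass.
            failures.append(f"{key}: error ({exc})")
            continue
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    return failures


# Made-up sample data for illustration.
OUTPUT = {
    "subject": "[ANNOUNCE] Example 1.2.3 released",
    "body": "Download: https://example.org/download\nChecksums and signatures are linked.",
    "proposed": True,
    "scope_violations": [],
}
ASSERTIONS = {
    "has_download_link": {"type": "regex", "field": "body", "pattern": r"https://\S+/download"},
    "mention_signatures": {"type": "contains_all", "field": "body",
                           "needles": ["Checksums", "signatures"]},
    "has_no_scope_violations": {"type": "empty", "field": "scope_violations"},
    "has_proposal": {"type": "field_true", "field": "proposed"},
}
EXPECTED = {
    "subject": "[ANNOUNCE] Example 1.2.3 released",
    "has_download_link": True,
    "mention_signatures": True,
    "has_no_scope_violations": True,
    "has_proposal": True,
}

print(grade_case(EXPECTED, OUTPUT, ASSERTIONS))
```

In this sketch, a judge that raises (for example because a grader process exits non-zero) produces a failure entry for that key rather than being skipped, which mirrors the "never silently passes" behaviour described in the Security bullet.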
   
   
   ## Type of change
   
   <!-- Tick all that apply. -->
   
   - [X] Skill change (`.claude/skills/<name>/`) — eval fixtures updated below
   - [ ] Tool / bridge contract (`tools/<system>/*.md`)
   - [ ] Python package (`tools/*/` with `pyproject.toml`)
   - [ ] Groovy reference impl
   - [ ] Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
   - [ ] Documentation (`docs/`, `README.md`, `CONTRIBUTING.md`)
   - [ ] Project template (`projects/_template/`)
   - [ ] CI / dev loop (`prek`, workflows, validators)
   - [ ] Other:
   
   ## Test plan
   
   - [X] `prek run --all-files` passes
   - [ ] For Python packages touched: `uv run pytest` / `ruff check` / `mypy` passes
   - [ ] For Groovy bridges touched: command-line invocation tested end-to-end
   - [X] For skill changes: eval suite passes for the affected skill
         (`PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/`)
   - [ ] For skill *behaviour* changes: a new or updated eval fixture is included in this PR
         (a regression test for the bug fixed / the behaviour added; see CONTRIBUTING.md)
   - [ ] Other:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.