potiuk opened a new pull request, #281:
URL: https://github.com/apache/airflow-steward/pull/281
## Summary
Two paired changes that make eval and cost-table maintenance an
expected part of the skill / tool-adapter change workflow:
1. **`tools/skill-evals/` harness — new `--cli` automated mode.**
The runner can now pipe each case's prompt through a configured
shell command (`--cli "claude -p"`, `--cli "llm -m gpt-4o-mini"`,
any LLM CLI that reads stdin and writes stdout), extract the
JSON the model produced, and compare against `expected.json`
automatically. Reports `PASS` / `FAIL` / `MANUAL` / `ERROR` per
case with a summary line; exits non-zero on any `FAIL` / `ERROR`.
2. **`AGENTS.md` — new "Keeping evals and mode-economics in sync"
section** plus a cross-reference bullet on the *Before submitting*
checklist. Codifies the contributor expectation that a skill or
tool-adapter change pairs with (a) re-running the affected eval
suite, (b) updating [`docs/mode-economics.md`](docs/mode-economics.md)
when the per-invocation token shape moves materially.
### Why now
- The eval harness has grown to nineteen suites without a documented
expectation that contributors re-run them when modifying the
underlying skill.
- `docs/mode-economics.md` is new (#253) and not yet wired into the
contributor flow — a skill change that doubles its context load
would silently drift the page out of date.
- The existing harness only printed prompts for human paste-and-diff,
which raised the cost of running evals enough that "just re-run it
before commit" was not a realistic ask. `--cli` mode makes the
smoke-test pass cheap enough that contributors actually will.
### Harness changes — `tools/skill-evals/src/skill_evals/runner.py`
- New `--cli "<shell command>"` argument. Per case: builds
`<system_prompt>\n\n<user_prompt>`, pipes to the command on stdin,
captures stdout.
- New `extract_json_from_output()` tries three strategies in order
(whole-stdout JSON, first ```` ```json ```` fenced block, longest
balanced `{...}` / `[...]` substring) so models wrapping output in
prose or markdown fences still work.
- New `compare_outputs()` does JSON equality with a unified-diff
fallback for FAIL output.
- New `is_structural_expected()` detects composition-style fixtures
(top-level `has_*` flags or `mention_*` lists); those cases skip
the CLI call and report `MANUAL` for human review — they cannot
be auto-compared meaningfully.
- New `--verbose` prints prompts + model stdout alongside the
PASS/FAIL line for failure investigation.
- New `--timeout` (default 120s) for the per-case CLI call.
- Print mode (no `--cli`) preserved as the default — existing
invocations work unchanged.
### Tests — `tools/skill-evals/tests/test_runner.py`
18 new tests (45 → 63 total, all passing):
- `is_structural_expected` — plain k/v, `has_*`, `mention_*`,
non-dict, empty.
- `extract_json_from_output` — whole-output JSON, whitespace
tolerance, ```` ```json ```` and bare ```` ``` ```` fenced blocks,
brace block in prose, largest-block tie-break, top-level arrays,
empty/malformed/no-JSON error reporting.
- `compare_outputs` — equality, key-order indifference, diff
formatting for FAIL cases, added/removed key markers.
- `--cli` mode end-to-end (using `echo` / `printf` / `false` as
the configured CLI for hermetic, network-free tests) — PASS,
FAIL with wrong JSON, MANUAL for structural expected.json,
ERROR on non-JSON output, ERROR on non-zero exit,
fenced-JSON extraction from a multi-line response, summary
counts across mixed pass/fail.
### Docs
- `tools/skill-evals/README.md` — new "Automated mode (`--cli`)"
subsection covering invocation shape, the three JSON-extraction
strategies, the MANUAL fallback for structural cases, and the
self-eval caveat (same-model self-grading is a smoke test, not a
regression check).
- `AGENTS.md` — new top-level section
[Keeping evals and mode-economics in
sync](AGENTS.md#keeping-evals-and-mode-economics-in-sync)
with four subsections:
| Subsection | Covers |
|---|---|
| When the rule fires | Skill / tool-adapter / pure-prose rows |
| Running evals | Print mode + `--cli` mode invocations, budget guidance
per change size |
| Agent-run self-eval | Self-eval via `--cli` vs print mode, smoke-test
framing, cross-model recommendation for substantive changes |
| Updating mode economics | Where to find the row, how to re-estimate
using the anchors, high-end / low-end adjustment heuristic, brand-new-skill row
pattern |
Plus a single bullet on *Before submitting* pointing back at the
section.
### Scope
Scoped to **skill changes and tool-adapter changes**. Process docs
(`docs/<x>/`) and generator code
(`tools/vulnogram/generate-cve-json/` etc.) are out of scope — they
have no eval coverage and don't move per-skill token shape.
### Drive-by
A pre-existing ruff RUF059 (unused-var) in `test_runner.py` was
fixed since the file was already in the diff. The `skill-evals`
project is not currently in prek's per-project ruff matrix, which
is why the violation slipped — that wiring is a separate change.
## Test plan
- [x] `uv run pytest tests/` in `tools/skill-evals/` — 63 passed.
- [x] `uv run ruff check src tests` + `ruff format --check` — clean.
- [x] `uv run --with mypy mypy src tests` — clean.
- [x] `prek run --files AGENTS.md tools/skill-evals/README.md` —
markdownlint / doctoc / typos pass (doctoc auto-added the 5 new
AGENTS.md section anchors).
- [x] Manual smoke: `python3 -m skill_evals.runner --cli "echo '{...}'" …`
exercises PASS and FAIL paths end-to-end (covered by the test
suite using `echo`, `printf`, `false`).
- [ ] First real `--cli "claude -p"` run against a live skill
suite — to be exercised in a follow-up session and any rough
edges (rate limits, JSON-extraction misses, MANUAL false
positives) reported in a follow-up.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]