potiuk opened a new pull request, #281:
URL: https://github.com/apache/airflow-steward/pull/281

   ## Summary
   
   Two paired changes that make eval and cost-table maintenance an
   expected part of the skill / tool-adapter change workflow:
   
   1. **`tools/skill-evals/` harness — new `--cli` automated mode.**
      The runner can now pipe each case's prompt through a configured
      shell command (`--cli "claude -p"`, `--cli "llm -m gpt-4o-mini"`,
      any LLM CLI that reads stdin and writes stdout), extract the
      JSON the model produced, and compare against `expected.json`
      automatically. Reports `PASS` / `FAIL` / `MANUAL` / `ERROR` per
      case with a summary line; exits non-zero on any `FAIL` / `ERROR`.
   
   2. **`AGENTS.md` — new "Keeping evals and mode-economics in sync"
      section** plus a cross-reference bullet on the *Before submitting*
      checklist. Codifies the contributor expectation that a skill or
      tool-adapter change pairs with (a) re-running the affected eval
      suite, (b) updating [`docs/mode-economics.md`](docs/mode-economics.md)
      when the per-invocation token shape moves materially.
   
   ### Why now
   
   - The eval harness has grown to nineteen suites without a documented
     expectation that contributors re-run them when modifying the
     underlying skill.
   - `docs/mode-economics.md` is new (#253) and not yet wired into the
     contributor flow — a skill change that doubles its context load
     would silently drift the page out of date.
   - The existing harness only printed prompts for human paste-and-diff,
     which raised the cost of running evals enough that "just re-run it
     before commit" was not a realistic ask. `--cli` mode makes the
     smoke-test pass cheap enough that contributors actually will.
   
   ### Harness changes — `tools/skill-evals/src/skill_evals/runner.py`
   
   - New `--cli "<shell command>"` argument. Per case: builds
     `<system_prompt>\n\n<user_prompt>`, pipes to the command on stdin,
     captures stdout.
   - New `extract_json_from_output()` tries three strategies in order
     (whole-stdout JSON, first ```` ```json ```` fenced block, longest
     balanced `{...}` / `[...]` substring) so models wrapping output in
     prose or markdown fences still work.
   - New `compare_outputs()` does JSON equality with a unified-diff
     fallback for FAIL output.
   - New `is_structural_expected()` detects composition-style fixtures
     (top-level `has_*` flags or `mention_*` lists); those cases skip
     the CLI call and report `MANUAL` for human review — they cannot
     be auto-compared meaningfully.
   - New `--verbose` prints prompts + model stdout alongside the
     PASS/FAIL line for failure investigation.
   - New `--timeout` (default 120s) for the per-case CLI call.
   - Print mode (no `--cli`) preserved as the default — existing
     invocations work unchanged.
   
   ### Tests — `tools/skill-evals/tests/test_runner.py`
   
   18 new tests (45 → 63 total, all passing):
   
   - `is_structural_expected` — plain k/v, `has_*`, `mention_*`,
     non-dict, empty.
   - `extract_json_from_output` — whole-output JSON, whitespace
     tolerance, ```` ```json ```` and bare ```` ``` ```` fenced blocks,
     brace block in prose, largest-block tie-break, top-level arrays,
     empty/malformed/no-JSON error reporting.
   - `compare_outputs` — equality, key-order indifference, diff
     formatting for FAIL cases, added/removed key markers.
   - `--cli` mode end-to-end (using `echo` / `printf` / `false` as
     the configured CLI for hermetic, network-free tests) — PASS,
     FAIL with wrong JSON, MANUAL for structural expected.json,
     ERROR on non-JSON output, ERROR on non-zero exit,
     fenced-JSON extraction from a multi-line response, summary
     counts across mixed pass/fail.
   
   ### Docs
   
   - `tools/skill-evals/README.md` — new "Automated mode (`--cli`)"
     subsection covering invocation shape, the three JSON-extraction
     strategies, the MANUAL fallback for structural cases, and the
     self-eval caveat (same-model self-grading is a smoke test, not a
     regression check).
   - `AGENTS.md` — new top-level section
     [Keeping evals and mode-economics in 
sync](AGENTS.md#keeping-evals-and-mode-economics-in-sync)
     with four subsections:
   
     | Subsection | Covers |
     |---|---|
     | When the rule fires | Skill / tool-adapter / pure-prose rows |
     | Running evals | Print mode + `--cli` mode invocations, budget guidance 
per change size |
     | Agent-run self-eval | Self-eval via `--cli` vs print mode, smoke-test 
framing, cross-model recommendation for substantive changes |
     | Updating mode economics | Where to find the row, how to re-estimate 
using the anchors, high-end / low-end adjustment heuristic, brand-new-skill row 
pattern |
   
     Plus a single bullet on *Before submitting* pointing back at the
     section.
   
   ### Scope
   
   Scoped to **skill changes and tool-adapter changes**. Process docs
   (`docs/<x>/`) and generator code
   (`tools/vulnogram/generate-cve-json/` etc.) are out of scope — they
   have no eval coverage and don't move per-skill token shape.
   
   ### Drive-by
   
   A pre-existing ruff RUF059 (unused-var) in `test_runner.py` was
   fixed since the file was already in the diff. The `skill-evals`
   project is not currently in prek's per-project ruff matrix, which
   is why the violation slipped — that wiring is a separate change.
   
   ## Test plan
   
   - [x] `uv run pytest tests/` in `tools/skill-evals/` — 63 passed.
   - [x] `uv run ruff check src tests` + `ruff format --check` — clean.
   - [x] `uv run --with mypy mypy src tests` — clean.
   - [x] `prek run --files AGENTS.md tools/skill-evals/README.md` —
     markdownlint / doctoc / typos pass (doctoc auto-added the 5 new
     AGENTS.md section anchors).
   - [x] Manual smoke: `python3 -m skill_evals.runner --cli "echo '{...}'" …`
     exercises PASS and FAIL paths end-to-end (covered by the test
     suite using `echo`, `printf`, `false`).
   - [ ] First real `--cli "claude -p"` run against a live skill
     suite — to be exercised in a follow-up session and any rough
     edges (rate limits, JSON-extraction misses, MANUAL false
     positives) reported in a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to