This is an automated email from the ASF dual-hosted git repository.
choo121600 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow-steward.git
The following commit(s) were added to refs/heads/main by this push:
new 779db6a feat(skill-evals): --cli mode for auto-compare; docs(agents)
require evals + mode-economics on skill changes (#281)
779db6a is described below
commit 779db6a9d1aad4c9f95803d90431d5898f0c9795
Author: Jarek Potiuk <[email protected]>
AuthorDate: Mon May 25 20:51:49 2026 +0200
feat(skill-evals): --cli mode for auto-compare; docs(agents) require evals
+ mode-economics on skill changes (#281)
When a contributor modifies a skill or a tool adapter the skills
load, two follow-up actions should be part of the change rather
than optional polish: (1) run the affected skill's eval suite,
(2) update docs/mode-economics.md if the per-invocation token
shape moved.
AGENTS.md previously had no instruction to keep either signal in
sync with skill or adapter changes. The mode-economics page is
new (#253) and not yet referenced from the contributor flow;
the eval harness has had nineteen suites land without a
documented expectation to run them on related changes.
To make running the suites cheap enough that contributors
actually do it, this commit also extends the runner with an
automated comparison mode.
skill-evals harness:
- New --cli "<shell command>" arg pipes <system_prompt>\n\n
<user_prompt> to the configured command on stdin, captures
stdout, extracts the model's JSON, and compares against
expected.json automatically.
- JSON extraction tries three strategies (whole-stdout JSON,
```json fenced block, largest balanced brace block) so models
wrapping output in prose or markdown fences still work.
- Per-case status PASS / FAIL / MANUAL / ERROR plus a summary
count. Non-zero exit on any FAIL or ERROR.
- MANUAL is reserved for "structural" expected.json files (top-
level has_* flags or mention_* lists) where JSON-equality
comparison is meaningless; those cases skip the CLI call and
surface as MANUAL for human review.
- --verbose prints prompts + model stdout alongside the
PASS/FAIL line for debugging a failure.
- --timeout (default 120s) for the per-case CLI call.
- 18 new tests cover is_structural_expected,
extract_json_from_output, compare_outputs, run_cli, and
end-to-end --cli mode behaviour (PASS / FAIL / MANUAL /
ERROR / summary counts / fenced-JSON extraction) using
echo / printf / false as the configured CLI for hermetic
cases.
- tools/skill-evals/README.md gains an "Automated mode" section
walking through --cli usage, JSON extraction strategies, the
MANUAL fallback, and the self-eval caveat.
AGENTS.md:
- New "Keeping evals and mode-economics in sync" section between
Reusable skills and Before submitting. Four subsections plus
a when-the-rule-fires table:
- When the rule fires (skill / adapter / pure-prose rows).
- Running evals (print mode and --cli mode invocations,
budget guidance by change size).
- Agent-run self-eval (self-eval via --cli vs print mode, the
smoke-test-vs-regression-check framing, the cross-model
pass recommendation for substantive changes).
- Updating mode economics (where to find the row, how to
re-estimate using the anchors in docs/mode-economics.md, the
high-end/low-end adjustment heuristic, the brand-new-skill
row pattern).
- Before submitting gains a single cross-reference bullet
pointing at the new section, noting agent self-eval as the
default and cross-model as the substantive-change expectation.
Drive-by: ruff RUF059 fix on a pre-existing unused-var in
test_runner.py (the skill-evals project is not in prek's per-
project ruff matrix, so this slipped -- fixed since the file is
already in this diff).
Generated-by: Claude Code (Opus 4.7)
---
AGENTS.md | 169 ++++++++++++++++++
tools/skill-evals/README.md | 42 +++++
tools/skill-evals/src/skill_evals/runner.py | 263 ++++++++++++++++++++++++++--
tools/skill-evals/tests/test_runner.py | 245 +++++++++++++++++++++++++-
4 files changed, 704 insertions(+), 15 deletions(-)
diff --git a/AGENTS.md b/AGENTS.md
index d79efb1..8fede15 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,6 +31,11 @@
- [Mentioning project maintainers and security-team
members](#mentioning-project-maintainers-and-security-team-members)
- [Other editorial guidelines](#other-editorial-guidelines)
- [Reusable skills](#reusable-skills)
+ - [Keeping evals and mode-economics in
sync](#keeping-evals-and-mode-economics-in-sync)
+ - [When the rule fires](#when-the-rule-fires)
+ - [Running evals](#running-evals)
+ - [Agent-run self-eval](#agent-run-self-eval)
+ - [Updating mode economics](#updating-mode-economics)
- [Before submitting](#before-submitting)
- [References](#references)
@@ -1386,6 +1391,164 @@ When adding a new skill:
confirmation before it runs;
- avoid agent-specific syntax so the skill remains portable across tools.
+## Keeping evals and mode-economics in sync
+
+When you change a skill or a tool adapter the skills load, two
+follow-up actions are part of the change, not optional polish:
+
+1. **Run the affected skill's eval suite** to confirm the prompts the
+ harness extracts from `SKILL.md` still produce the expected
+ structured output.
+2. **Update [`docs/mode-economics.md`](docs/mode-economics.md)** if
+ the change materially shifts the per-invocation token shape — a
+ new step that loads substantial context, a removed read path, a
+ new skill entirely.
+
+Both signals catch the same class of regression: a skill that
+silently starts producing different output (eval failure) or a skill
+that silently became materially more expensive to run (cost-table
+drift). The eval suite is the unit test; the mode-economics table is
+the performance budget.
+
+### When the rule fires
+
+| You touched | Run evals for | Update mode-economics if |
+|---|---|---|
+| `.claude/skills/<skill>/SKILL.md`, an extracted step subdoc, or any prompt
material a step's `step-config.json` extracts | That skill's suite under
`tools/skill-evals/evals/<skill>/` | The change adds or removes a step, alters
a context-heavy read, or restructures the call catalogue |
+| `tools/<adapter>/` docs or operation catalogues that skills load (e.g.
`tools/github/operations.md`, `tools/gmail/operations.md`,
`tools/ponymail/operations.md`) | Every skill that names this adapter in its
prerequisites or step bodies — `grep -l <adapter-path>
.claude/skills/*/SKILL.md` to enumerate | A new operation enlarges a typical
skill's loaded context, or a removed one shrinks it |
+| Pure prose edits (typo / clarification / link fix) with no behavioural
impact on the model's output | No eval rerun required | No update required |
+
+If you are unsure whether your change is "behavioural" or
+"prose-only", re-run the affected eval suite anyway — it is cheap
+and protects against the false-negative case where a "clarification"
+actually changes how the model responds.
+
+### Running evals
+
+The harness lives in [`tools/skill-evals/`](tools/skill-evals/)
+(full README at
+[`tools/skill-evals/README.md`](tools/skill-evals/README.md)).
+It is pure-stdlib Python ≥ 3.10 — no third-party dependencies, no
+API key required, no build step. Two modes:
+
+- **Print mode (default)** — emits the system prompt, user prompt,
+ and expected JSON per case. The operator pastes into the model
+ under test and diffs the response against `expected.json`
+ manually.
+- **`--cli` mode** — pipes the constructed prompt through a shell
+ command (any LLM CLI that reads stdin and writes stdout works),
+ extracts the JSON the model produced, and reports
+ `PASS` / `FAIL` / `MANUAL` / `ERROR` per case. Exits non-zero on
+ any `FAIL` or `ERROR`. `MANUAL` is reserved for structural
+ expected.json files (top-level `has_*` flags / `mention_*`
+ lists); those still print prompts for manual review.
+
+```bash
+# Print mode — paste prompts into the model under test
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner \
+ tools/skill-evals/evals/<skill-name>/
+
+# Single case (print mode)
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner \
+ tools/skill-evals/evals/<skill-name>/<step-name>/fixtures/case-N-<name>
+
+# Automated mode — run against Claude Code (or any LLM CLI)
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner --cli "claude
-p" \
+ tools/skill-evals/evals/<skill-name>/
+
+# Add --verbose to also print prompts + model stdout per case
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner --cli "claude
-p" -v \
+ tools/skill-evals/evals/<skill-name>/<step-name>/fixtures/
+```
+
+Budget guidance:
+
+- **Single step rewrite** — run that step's cases plus one
+ representative case from any other step whose prompt the change
+ touched indirectly.
+- **Whole-skill restructure** — run the full suite.
+- **Tool-adapter doc edit** — for each affected skill, run the step
+ whose prompt body changed (per the `step-config.json` heading)
+ plus the step that triggers the adapter call.
+
+When the change adds behaviour worth covering — a new step, a new
+edge case the existing fixtures do not hit, a new prompt-injection
+shape the skill should defend against — add a fixture under the
+relevant step's `fixtures/` directory: `case-N-<name>/` containing
+`report.md` (mock tool outputs) + `expected.json` (ground-truth
+model output). The README's *Structure* and *Adversarial cases*
+sections walk through the layout.
+
+### Agent-run self-eval
+
+The agent making the change can run the suite in-session in either
+mode:
+
+- **Self-eval via `--cli`** — point the runner at a CLI that invokes
+ the same model class authoring the change (e.g.
+ `--cli "claude -p"`). The runner handles prompt construction,
+ invocation, JSON extraction, and comparison automatically; the
+ agent reads the per-case `PASS` / `FAIL` / `MANUAL` / `ERROR`
+ summary and the non-zero exit code.
+- **Self-eval via print mode** — run the runner without `--cli` to
+ print prompts, then act as the model under test using **only**
+ the printed system + user prompts as input. Do not re-read the
+ `SKILL.md`, source files, or any other context the runner did not
+ include — the eval's value is in catching prompt-vs-output
+ mismatches, which only works when the model under test sees only
+ the eval-constructed input. Diff the produced JSON against
+ `expected.json`: JSON equality for exact-match cases; per-flag
+ check for composition cases that use boolean flags
+ (`has_security_model_quote`, `has_bare_issue_numbers`) or
+ membership lists (`mention_handles`).
+
+**Self-eval is a smoke test, not a regression check.** The same
+model that just authored the change is now grading whether the
+change is correct, so subtle biases the change introduced may go
+undetected. Self-eval catches the cheap failure class — invalid
+JSON, missing fields, off-spec output shape, fixture / prompt
+drift — and is worth running on every skill or adapter change.
+
+For substantive changes (new steps, prompt restructures, behaviour
+changes that cross a class boundary like Triage classification
+thresholds or injection-defence wording), also run a **cross-model
+pass**: a human pastes the prompts into a different model class than
+the one that authored the change, and diffs that output against
+`expected.json` independently. This catches the self-confirmation
+bias that self-eval cannot.
+
+### Updating mode economics
+
+[`docs/mode-economics.md`](docs/mode-economics.md) is hand-maintained
+— it is the indicative cost budget per skill per invocation, not a
+generated artefact. After a change that moves the per-invocation
+envelope:
+
+1. Find the row for the modified skill under
+ [§ Per-mode token shape](docs/mode-economics.md#per-mode-token-shape).
+2. Re-estimate using the anchors at the top of the doc
+ ([§ What "tokens" means
here](docs/mode-economics.md#what-tokens-means-here)):
+ ~530 tokens per 400-word report body, ~5 000 per medium PR diff,
+ 3 000–8 000 per typical mail thread, plus the `SKILL.md`
+ overhead. For an exact `SKILL.md` token count, run
+ `cl100k_base` tokenisation against the file; or apply the
+ doc's existing small / typical / large bands as an
+ order-of-magnitude estimate.
+3. Adjust the **high end** of the existing range when the change
+ adds context; adjust the **low end** when the change removes one
+ (rare). Update the *Primary cost driver* / *Notes* column if the
+ new driver is qualitatively different from the old one.
+
+For a brand-new skill, add a row under the correct mode section
+(Triage / Mentoring / Drafting / Pairing) with a token range, a
+one-line invocation description, and the primary cost driver. Use
+a neighbouring skill of similar shape as the anchor for the range.
+
+For documentation-only changes that do not touch the skill's reads
+or prompt body — a typo fix, a link update, a clarifying paragraph
+that does not change what the model is asked to produce —
+`mode-economics.md` does not need an update.
+
## Before submitting
- Re-read the diff and check that every change is intentional.
@@ -1394,6 +1557,12 @@ When adding a new skill:
exists on the current stable version (adopting project's anchors:
[`<project-config>/security-model.md`](<project-config>/security-model.md)).
- Self-review the tone of any modified canned response against the "polite but
firm" guidance above.
+- If the change touched a skill or a tool adapter the skills load,
+ follow the
+ [Keeping evals and mode-economics in
sync](#keeping-evals-and-mode-economics-in-sync)
+ rules above — run the affected eval suite(s) (agent self-eval on
+ every change, cross-model on substantive changes) and update
+ `docs/mode-economics.md` if the per-invocation token shape moved.
## References
diff --git a/tools/skill-evals/README.md b/tools/skill-evals/README.md
index 0a06c84..119771e 100644
--- a/tools/skill-evals/README.md
+++ b/tools/skill-evals/README.md
@@ -45,6 +45,48 @@ PYTHONPATH=tools/skill-evals/src python3 -m
skill_evals.runner \
The runner prints the system prompt, user prompt, and expected output for each
case. Paste into any model and compare the response against the expected JSON.
The harness is intentionally model-agnostic — no API key or CLI dependency
required.
+### Automated mode (`--cli`)
+
+For unattended runs, pass `--cli "<shell command>"`. The runner pipes
+`<system_prompt>\n\n<user_prompt>` to the command on stdin, captures
+stdout, extracts the model's JSON, and compares against `expected.json`
+automatically. Per-case status is `PASS`, `FAIL`, `MANUAL` (structural
+expected.json — see below), or `ERROR` (CLI failure / non-JSON output);
+the runner exits non-zero on any `FAIL` or `ERROR`.
+
+```bash
+# Run every case for a skill against Claude Code's print mode
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner --cli "claude
-p" \
+ tools/skill-evals/evals/issue-triage/
+
+# Any LLM CLI that reads a prompt on stdin and writes the response to
+# stdout works — `llm`, `gpt`, a thin curl wrapper, etc.
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner --cli "llm -m
gpt-4o-mini" \
+ tools/skill-evals/evals/issue-triage/step-3-classify/fixtures/
+
+# Add --verbose to also print prompts and the model's raw stdout per case.
+PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner --cli "claude
-p" --verbose \
+
tools/skill-evals/evals/issue-triage/step-3-classify/fixtures/case-1-clear-bug
+```
+
+**JSON extraction** tries three strategies in order: parse the whole
+stdout as JSON, look for the first ```` ```json ```` fenced block, then
+the largest balanced `{...}` (or `[...]`) substring. Models that wrap
+output in prose or markdown fences still work.
+
+**Structural cases (composition steps).** When `expected.json` describes
+prose properties via boolean flags (`has_security_model_quote`,
+`has_bare_issue_numbers`) or membership lists (`mention_handles`),
+automatic JSON-equality comparison is meaningless. Those cases report
+`MANUAL` and the runner skips the CLI call; review them by re-running
+without `--cli` (or with `--verbose`).
+
+**Self-eval caveat.** When the model invoked by `--cli` is the same
+model (or model class) that just authored the skill change, the
+comparison is a self-eval pass — useful as a smoke test for prompt /
+output-shape regressions, but weaker than a cross-model run. For
+substantive changes, also run against a different model class.
+
## Structure
```text
diff --git a/tools/skill-evals/src/skill_evals/runner.py
b/tools/skill-evals/src/skill_evals/runner.py
index 17259b9..286077a 100644
--- a/tools/skill-evals/src/skill_evals/runner.py
+++ b/tools/skill-evals/src/skill_evals/runner.py
@@ -18,9 +18,20 @@
"""
Skill eval runner.
-Loads fixture data from an eval case directory and prints the system prompt
-and user prompt for each case so you can paste them into any model.
-Compare the model's response against expected.json to verify correctness.
+Two modes:
+
+1. **Print mode (default)** — emit the system prompt, user prompt, and
+ expected JSON for each case. The operator (or the agent making the
+ change, in self-eval mode) pastes the prompts into the model under
+ test and diffs the response against expected.json manually.
+
+2. **`--cli` mode** — pipe ``<system_prompt>\\n\\n<user_prompt>`` on stdin
+ to the configured shell command, capture stdout, extract the JSON
+ the model produced, and compare against expected.json automatically.
+ Reports PASS / FAIL / MANUAL per case and exits non-zero on any FAIL.
+ MANUAL is reserved for "structural" expected.json files (top-level
+ ``has_*`` flags or ``mention_*`` lists) where automatic comparison
+ is not meaningful; those still print prompts for manual review.
Usage:
# Print prompts for all cases under a fixtures directory
@@ -30,6 +41,11 @@ Usage:
# Print prompt for a single case
uv run --project tools/skill-evals skill-eval \\
evals/security-issue-import/step-2a-semantic-sweep/fixtures/case-1-clear-duplicate
+
+ # Automated comparison against a CLI (Claude Code shown; any LLM CLI
+ # that reads a prompt on stdin and writes the response to stdout works)
+ uv run --project tools/skill-evals skill-eval --cli "claude -p" \\
+ evals/security-issue-import/step-2a-semantic-sweep/fixtures/
"""
from __future__ import annotations
@@ -37,10 +53,10 @@ from __future__ import annotations
import argparse
import json
import re
+import subprocess
import sys
from pathlib import Path
-
# ---------------------------------------------------------------------------
# Prompt construction
# ---------------------------------------------------------------------------
@@ -186,6 +202,124 @@ def load_case(case_dir: Path) -> tuple[list[dict], dict,
str, dict]:
return corpus, roster, report, expected
+# ---------------------------------------------------------------------------
+# Automated comparison (--cli mode)
+# ---------------------------------------------------------------------------
+
+
+def is_structural_expected(expected: dict) -> bool:
+ """Return True if expected.json describes prose properties, not exact
output.
+
+ Composition steps (e.g. compose-comment, compose-verdict) assert structural
+ properties of the model's prose via boolean ``has_*`` flags and membership
+ lists like ``mention_handles``. Those cannot be JSON-equality-compared
+ against a model's actual prose output, so --cli mode falls back to manual
+ review for them.
+ """
+ if not isinstance(expected, dict):
+ return False
+ return any(key.startswith(("has_", "mention_")) for key in expected)
+
+
+def extract_json_from_output(text: str) -> tuple[object | None, str | None]:
+ """Return (parsed_json, error) extracted from a model's stdout.
+
+ Tries three strategies in order:
+ 1. The whole output is valid JSON.
+ 2. The output contains a ```json ... ``` fenced block.
+ 3. The output contains a balanced ``{...}`` (or ``[...]``) block — the
+ longest such block is tried.
+
+ Returns ``(value, None)`` on success or ``(None, error_message)`` if no
+ JSON could be extracted.
+ """
+ stripped = text.strip()
+ if stripped:
+ try:
+ return json.loads(stripped), None
+ except json.JSONDecodeError:
+ pass
+
+ fence = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
+ if fence:
+ try:
+ return json.loads(fence.group(1).strip()), None
+ except json.JSONDecodeError:
+ pass
+
+ candidate = _find_largest_brace_block(text)
+ if candidate is not None:
+ try:
+ return json.loads(candidate), None
+ except json.JSONDecodeError:
+ pass
+
+ return None, "no JSON object or array found in model output"
+
+
+def _find_largest_brace_block(text: str) -> str | None:
+ """Return the largest balanced ``{...}`` or ``[...]`` substring, or
None."""
+ best: str | None = None
+ for opener, closer in (("{", "}"), ("[", "]")):
+ depth = 0
+ start = -1
+ for i, ch in enumerate(text):
+ if ch == opener:
+ if depth == 0:
+ start = i
+ depth += 1
+ elif ch == closer and depth > 0:
+ depth -= 1
+ if depth == 0 and start >= 0:
+ candidate = text[start : i + 1]
+ if best is None or len(candidate) > len(best):
+ best = candidate
+ start = -1
+ return best
+
+
+def compare_outputs(actual: object, expected: object) -> tuple[bool, str]:
+ """Return (passed, diff_text). Diff is empty when passed=True."""
+ if actual == expected:
+ return True, ""
+ return False, _format_diff(actual, expected)
+
+
+def _format_diff(actual: object, expected: object) -> str:
+ actual_text = json.dumps(actual, indent=2, sort_keys=True)
+ expected_text = json.dumps(expected, indent=2, sort_keys=True)
+ a_lines = actual_text.splitlines()
+ e_lines = expected_text.splitlines()
+ lines = ["--- expected", "+++ actual"]
+ for line in e_lines:
+ if line not in a_lines:
+ lines.append(f"- {line}")
+ for line in a_lines:
+ if line not in e_lines:
+ lines.append(f"+ {line}")
+ return "\n".join(lines)
+
+
+def run_cli(cli: str, prompt: str, timeout: int = 120) -> tuple[str, str, int]:
+ """Run ``cli`` (shell command) with ``prompt`` on stdin. Return (stdout,
stderr, rc).
+
+ The command is run with ``shell=True`` so quoting and arguments work as a
+ developer would type them at a shell prompt. The runner is a local
+ developer tool — the operator supplies the command, so shell semantics are
+ the ergonomic choice rather than a security concern.
+ """
+ proc = subprocess.run(
+ cli,
+ input=prompt,
+ capture_output=True,
+ text=True,
+ shell=True,
+ timeout=timeout,
+ check=False,
+ )
+ return proc.stdout, proc.stderr, proc.returncode
+
+
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
@@ -225,7 +359,11 @@ def find_cases(path: Path) -> list[tuple[Path, Path]]:
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(
- description="Print eval prompts for skill cases. Paste into any model
and compare against expected.json."
+ description=(
+ "Run skill eval cases. Default mode prints prompts for manual
review. "
+ "--cli mode pipes prompts through a shell command and
auto-compares "
+ "against expected.json."
+ )
)
parser.add_argument(
"path",
@@ -237,6 +375,30 @@ def main(argv: list[str] | None = None) -> int:
action="store_true",
help="Suppress prompt content; print only case names and expected
JSON.",
)
+ parser.add_argument(
+ "--cli",
+ type=str,
+ default=None,
+ help=(
+ "Shell command that reads a prompt on stdin and writes the model "
+ "response to stdout (e.g. 'claude -p'). When set, the runner sends
"
+ "<system_prompt>\\n\\n<user_prompt> to the command for each case, "
+ "extracts JSON from stdout, and compares against expected.json. "
+ "Exits non-zero on any FAIL."
+ ),
+ )
+ parser.add_argument(
+ "--timeout",
+ type=int,
+ default=120,
+ help="Timeout in seconds for each --cli invocation (default: 120).",
+ )
+ parser.add_argument(
+ "--verbose",
+ "-v",
+ action="store_true",
+ help="In --cli mode, also print the prompts and the model's raw stdout
per case.",
+ )
args = parser.parse_args(argv)
cases = find_cases(args.path)
@@ -248,6 +410,8 @@ def main(argv: list[str] | None = None) -> int:
# the same fixtures dir (common when running a whole skill at once).
_step_config_cache: dict[Path, tuple[str, str]] = {}
+ passed = failed = manual = errored = 0
+
for case_dir, fixtures_dir in cases:
if fixtures_dir not in _step_config_cache:
_step_config_cache[fixtures_dir] = load_step_config(fixtures_dir)
@@ -267,20 +431,97 @@ def main(argv: list[str] | None = None) -> int:
"Literal braces that are not slots must be doubled ({{ and
}})."
) from exc
step_label = fixtures_dir.parent.name
- print(f"{'=' * 60}")
- print(f"CASE: {step_label}/{case_dir.name}")
- print(f"{'=' * 60}")
- if not args.quiet:
+ case_label = f"{step_label}/{case_dir.name}"
+
+ if args.cli is None:
+ # Print mode (existing behaviour)
+ print(f"{'=' * 60}")
+ print(f"CASE: {case_label}")
+ print(f"{'=' * 60}")
+ if not args.quiet:
+ print("--- SYSTEM PROMPT ---")
+ print(system_prompt)
+ print("--- USER PROMPT ---")
+ print(user_prompt)
+ print("--- EXPECTED ---")
+ print(json.dumps(expected, indent=2))
+ print()
+ continue
+
+ # --cli mode: run the configured command and auto-compare.
+ if isinstance(expected, dict) and is_structural_expected(expected):
+ print(f"MANUAL {case_label} (structural expected.json — review
actual output by hand)")
+ if args.verbose:
+ _print_prompts_and_run(args, system_prompt, user_prompt)
+ manual += 1
+ continue
+
+ full_prompt = f"{system_prompt}\n\n{user_prompt}"
+ try:
+ stdout, stderr, rc = run_cli(args.cli, full_prompt,
timeout=args.timeout)
+ except subprocess.TimeoutExpired:
+ print(f"ERROR {case_label} (CLI timed out after
{args.timeout}s)")
+ errored += 1
+ continue
+ except OSError as exc:
+ print(f"ERROR {case_label} (CLI invocation failed: {exc})")
+ errored += 1
+ continue
+
+ if rc != 0:
+ print(f"ERROR {case_label} (CLI exited {rc}; stderr:
{stderr.strip()[:200]})")
+ errored += 1
+ if args.verbose:
+ print("--- STDOUT ---")
+ print(stdout)
+ continue
+
+ actual, parse_err = extract_json_from_output(stdout)
+ if parse_err is not None:
+ print(f"ERROR {case_label} ({parse_err})")
+ errored += 1
+ if args.verbose:
+ print("--- STDOUT ---")
+ print(stdout)
+ continue
+
+ ok, diff = compare_outputs(actual, expected)
+ if ok:
+ print(f"PASS {case_label}")
+ passed += 1
+ else:
+ print(f"FAIL {case_label}")
+ print(diff)
+ failed += 1
+
+ if args.verbose:
print("--- SYSTEM PROMPT ---")
print(system_prompt)
print("--- USER PROMPT ---")
print(user_prompt)
- print("--- EXPECTED ---")
- print(json.dumps(expected, indent=2))
+ print("--- STDOUT ---")
+ print(stdout)
+ print()
+
+ if args.cli is not None:
+ total = passed + failed + manual + errored
print()
+ print(f"{'=' * 60}")
+ print(f"Ran {total} cases: {passed} passed, {failed} failed, {manual}
manual, {errored} errored")
+ print(f"{'=' * 60}")
+ if failed or errored:
+ return 1
return 0
+def _print_prompts_and_run(args: argparse.Namespace, system_prompt: str,
user_prompt: str) -> None:
+ """Verbose helper for MANUAL-mode cases: show what would have been sent."""
+ print("--- SYSTEM PROMPT ---")
+ print(system_prompt)
+ print("--- USER PROMPT ---")
+ print(user_prompt)
+
+
if __name__ == "__main__":
sys.exit(main())
diff --git a/tools/skill-evals/tests/test_runner.py
b/tools/skill-evals/tests/test_runner.py
index 956e4b1..3b9c65a 100644
--- a/tools/skill-evals/tests/test_runner.py
+++ b/tools/skill-evals/tests/test_runner.py
@@ -27,15 +27,17 @@ import pytest
from skill_evals.runner import (
build_corpus_text,
build_roster_text,
+ compare_outputs,
+ extract_json_from_output,
extract_skill_section,
find_cases,
find_repo_root,
+ is_structural_expected,
load_case,
load_step_config,
main,
)
-
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
@@ -68,7 +70,9 @@ def _make_fixtures_dir(
return fixtures_dir
-def _make_case(fixtures_dir: Path, name: str, *, report: str = "report text",
expected: dict | None = None) -> Path:
+def _make_case(
+ fixtures_dir: Path, name: str, *, report: str = "report text", expected:
dict | None = None
+) -> Path:
case_dir = fixtures_dir / name
case_dir.mkdir(parents=True, exist_ok=True)
(case_dir / "report.md").write_text(report)
@@ -110,7 +114,7 @@ def test_build_corpus_text_multiple_items():
def test_build_corpus_text_repr_escapes_quotes():
- result = build_corpus_text([{"number": 1, "title": "She said \"hi\"",
"body": "x"}])
+ result = build_corpus_text([{"number": 1, "title": 'She said "hi"',
"body": "x"}])
# title is repr'd so quotes are escaped
assert "'She said" in result or '"She said' in result
@@ -278,7 +282,7 @@ def test_load_step_config_uses_step_config_json(tmp_path:
Path):
repo_root / "step-dir",
step_config={"skill_md": "skills/my-skill/SKILL.md", "step_heading":
"## Target Step"},
)
- system_prompt, user_prompt_template = load_step_config(fixtures_dir)
+ system_prompt, _user_prompt_template = load_step_config(fixtures_dir)
assert "Prompt content here" in system_prompt
assert "Other Step" not in system_prompt
@@ -532,3 +536,236 @@ def test_main_bad_user_prompt_template_raises(tmp_path:
Path):
with pytest.raises(KeyError):
main([str(fixtures_dir)])
+
+
+# ---------------------------------------------------------------------------
+# is_structural_expected
+# ---------------------------------------------------------------------------
+
+
+def test_is_structural_expected_plain_kvs():
+ assert is_structural_expected({"class": "BUG", "confidence": "high"}) is
False
+
+
+def test_is_structural_expected_has_flag():
+ assert is_structural_expected({"has_security_model_quote": True}) is True
+
+
+def test_is_structural_expected_mention_list():
+ assert is_structural_expected({"mention_handles": ["@alice"]}) is True
+
+
+def test_is_structural_expected_non_dict():
+ assert is_structural_expected(["a", "b"]) is False # type:
ignore[arg-type]
+
+
+def test_is_structural_expected_empty_dict():
+ assert is_structural_expected({}) is False
+
+
+# ---------------------------------------------------------------------------
+# extract_json_from_output
+# ---------------------------------------------------------------------------
+
+
+def test_extract_json_whole_output():
+ value, err = extract_json_from_output('{"verdict": "ok"}')
+ assert err is None
+ assert value == {"verdict": "ok"}
+
+
+def test_extract_json_whole_output_with_whitespace():
+ value, err = extract_json_from_output('\n {"verdict": "ok"}\n\n')
+ assert err is None
+ assert value == {"verdict": "ok"}
+
+
+def test_extract_json_fenced_block():
+ text = 'Here is the output:\n\n```json\n{"verdict": "ok"}\n```\n\nThank
you.'
+ value, err = extract_json_from_output(text)
+ assert err is None
+ assert value == {"verdict": "ok"}
+
+
+def test_extract_json_bare_fenced_block():
+ text = 'Result:\n```\n{"verdict": "ok"}\n```'
+ value, err = extract_json_from_output(text)
+ assert err is None
+ assert value == {"verdict": "ok"}
+
+
+def test_extract_json_largest_brace_block():
+ text = 'I think the answer is {"verdict": "ok"} based on the data.'
+ value, err = extract_json_from_output(text)
+ assert err is None
+ assert value == {"verdict": "ok"}
+
+
+def test_extract_json_picks_largest_brace_block():
+ text = 'Small: {"a": 1}. Larger and correct: {"verdict": "ok",
"rationale": "longer text here"}.'
+ value, err = extract_json_from_output(text)
+ assert err is None
+ assert value == {"verdict": "ok", "rationale": "longer text here"}
+
+
+def test_extract_json_array_top_level():
+ value, err = extract_json_from_output("[1, 2, 3]")
+ assert err is None
+ assert value == [1, 2, 3]
+
+
+def test_extract_json_no_json_returns_error():
+ value, err = extract_json_from_output("Just prose, no JSON anywhere.")
+ assert value is None
+ assert err is not None and "no JSON" in err
+
+
+def test_extract_json_empty_output_returns_error():
+ value, err = extract_json_from_output("")
+ assert value is None
+ assert err is not None
+
+
+def test_extract_json_malformed_braces_returns_error():
+ value, err = extract_json_from_output('{"verdict": missing-quotes}')
+ assert value is None
+ assert err is not None
+
+
+# ---------------------------------------------------------------------------
+# compare_outputs
+# ---------------------------------------------------------------------------
+
+
+def test_compare_outputs_equal_dicts():
+ ok, diff = compare_outputs({"a": 1, "b": 2}, {"a": 1, "b": 2})
+ assert ok is True
+ assert diff == ""
+
+
+def test_compare_outputs_dict_key_order_irrelevant():
+ ok, _ = compare_outputs({"b": 2, "a": 1}, {"a": 1, "b": 2})
+ assert ok is True
+
+
+def test_compare_outputs_unequal_dicts_show_diff():
+ ok, diff = compare_outputs({"class": "BUG"}, {"class": "INVALID"})
+ assert ok is False
+ assert "BUG" in diff
+ assert "INVALID" in diff
+
+
+def test_compare_outputs_unequal_marks_added_and_removed():
+ ok, diff = compare_outputs({"a": 1}, {"a": 1, "b": 2})
+ assert ok is False
+ assert "+" in diff or "-" in diff
+
+
+# ---------------------------------------------------------------------------
+# --cli mode integration
+# ---------------------------------------------------------------------------
+
+
+def _make_cli_case(tmp_path: Path, *, expected: dict, report: str = "the
report") -> tuple[Path, Path]:
+ """Build a minimal fixtures dir + one case dir; return (fixtures_dir,
case_dir)."""
+ repo_root = _make_repo(tmp_path)
+ skill_md = repo_root / "SKILL.md"
+ skill_md.write_text("## Step\n\nSystem instructions.\n")
+ fixtures_dir = _make_fixtures_dir(
+ repo_root / "step-dir",
+ step_config={"skill_md": "SKILL.md", "step_heading": "## Step"},
+ user_prompt_template="Report: {report}\n",
+ )
+ case_dir = _make_case(fixtures_dir, "case-1", report=report,
expected=expected)
+ return fixtures_dir, case_dir
+
+
+def test_cli_mode_pass_with_echo(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """A CLI that echoes the expected JSON should PASS."""
+ expected = {"verdict": "ok"}
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected=expected)
+ rc, stdout, _ = _run_main(
+ capsys,
+ ["--cli", f"echo '{json.dumps(expected)}'", str(fixtures_dir)],
+ )
+ assert rc == 0
+ assert "PASS" in stdout
+ assert "1 passed" in stdout
+
+
+def test_cli_mode_fail_with_wrong_json(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """A CLI that returns the wrong JSON should FAIL and exit non-zero."""
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected={"verdict": "ok"})
+ rc, stdout, _ = _run_main(
+ capsys,
+ ["--cli", 'echo \'{"verdict": "wrong"}\'', str(fixtures_dir)],
+ )
+ assert rc == 1
+ assert "FAIL" in stdout
+ assert "1 failed" in stdout
+
+
+def test_cli_mode_manual_skips_structural(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """Structural expected.json (has_* / mention_*) is reported MANUAL, not
auto-compared."""
+ fixtures_dir, _ = _make_cli_case(
+ tmp_path, expected={"has_security_model_quote": True,
"mention_handles": []}
+ )
+ rc, stdout, _ = _run_main(
+ capsys,
+ # CLI would return junk; runner should not even invoke it for MANUAL
cases.
+ ["--cli", "exit 1", str(fixtures_dir)],
+ )
+ assert rc == 0
+ assert "MANUAL" in stdout
+ assert "1 manual" in stdout
+
+
+def test_cli_mode_error_on_non_json_output(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """A CLI that returns prose without any JSON should ERROR and exit
non-zero."""
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected={"verdict": "ok"})
+ rc, stdout, _ = _run_main(
+ capsys,
+ ["--cli", "echo 'just prose, no JSON here'", str(fixtures_dir)],
+ )
+ assert rc == 1
+ assert "ERROR" in stdout
+ assert "1 errored" in stdout
+
+
+def test_cli_mode_error_on_non_zero_exit(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """A CLI that exits non-zero should ERROR and exit non-zero."""
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected={"verdict": "ok"})
+ rc, stdout, _ = _run_main(capsys, ["--cli", "false", str(fixtures_dir)])
+ assert rc == 1
+ assert "ERROR" in stdout
+
+
+def test_cli_mode_extracts_json_from_fenced_response(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """The runner should find JSON inside a ```json fence in the CLI's
stdout."""
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected={"verdict": "ok"})
+ # printf interprets escapes so we get a real fenced block on stdout.
+ rc, stdout, _ = _run_main(
+ capsys,
+ [
+ "--cli",
+ 'printf \'Sure!\\n\\n```json\\n{"verdict": "ok"}\\n```\\n\'',
+ str(fixtures_dir),
+ ],
+ )
+ assert rc == 0
+ assert "PASS" in stdout
+
+
+def test_cli_mode_summary_counts(tmp_path: Path, capsys:
pytest.CaptureFixture[str]):
+ """Multiple cases should be summarised correctly."""
+ expected = {"verdict": "ok"}
+ fixtures_dir, _ = _make_cli_case(tmp_path, expected=expected)
+ # Add a second case that will FAIL by reusing the same fixtures dir.
+ _make_case(fixtures_dir, "case-2-fail", report="x", expected={"verdict":
"different"})
+ rc, stdout, _ = _run_main(
+ capsys,
+ ["--cli", f"echo '{json.dumps(expected)}'", str(fixtures_dir)],
+ )
+ assert rc == 1 # because one case FAILs
+ assert "1 passed" in stdout
+ assert "1 failed" in stdout