(airflow-steward) branch main updated: fix(skills): stabilize setup-verify eval and extend coverage (#488)

potiuk Thu, 11 Jun 2026 02:23:34 -0700

This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow-steward.git



The following commit(s) were added to refs/heads/main by this push:
     new 141158a1 fix(skills): stabilize setup-verify eval and extend coverage (#488)
141158a1 is described below

commit 141158a1d4823d15677ca8302821a0f9ee422226
Author: Justin Mclean <[email protected]>
AuthorDate: Thu Jun 11 19:23:06 2026 +1000

    fix(skills): stabilize setup-verify eval and extend coverage (#488)
    
    * fix(skills): eliminate validator SOFT warnings; fix stale eval step-config heading
    
    Remove step-reference strings from security-issue-import-via-forwarder
    when_to_use (three [criteria-source] warnings) and shorten the
    setup-isolated-setup-verify description to drop the inline
    Coverage enum (one [action-inventory] warning). Both changes
    leave skill behaviour and body text unchanged; the body already
    documents the full coverage in detail.
    
    Also fix the stale eval step-config.json for setup-isolated-setup-verify
    step-1-classify: the skill heading was renamed from 'The 8 checks' to
    'The 9 checks' when a ninth check was added, but the eval config was not
    updated, causing skill-eval to abort with a heading-not-found error.
    
    Validation:
      uv run --project tools/skill-and-tool-validator skill-and-tool-validate
        -> skill-and-tool-validator: OK (no violations)
      uv run --project tools/skill-evals skill-eval evals/setup-isolated-setup-verify/
        -> runs to completion (heading-not-found error resolved)
    
    Generated-by: Claude (Opus 4.7)
    
    * fix(skills): stabilize setup-verify eval and extend check 1 coverage
    
    Follow-up to the heading fix, which let the eval run and exposed
    latent fixture failures.
    
    - step-1-classify: add grading-schema.json marking 'evidence' as a
      prose field so equivalent wording is graded on conclusion, not
      verbatim string match (evidence is not in the runner default set).
    - case-3 / case-6: correct check status from ✗ to ⚠ (partial) for a
      missing PostToolUse/sandbox-error-hint.sh, matching the documented
      'missing error-hint reports partial, not missing' rule in SKILL.md.
    - step-2 case-4: broaden expected follow_up reason to span the same
      scope as a correct answer (.git/HEAD read failure consequence plus
      helper remediation), consistent with case-3/case-5 style.
    - Extend check 1 to enumerate the sandbox.filesystem allowlist
      (allowRead/allowWrite) in both SKILL.md and the cited canonical
      doc, so the model reports it consistently.
    
    Validation: eval suite fully green (11/11).
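
The prose-field grading described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the skill-eval runner's actual code: the helper name `split_for_grading` and its return shape are hypothetical. Only the `prose_fields` key and the `evidence` field come from the commit. The idea is that structured fields such as `n` and `status` are compared exactly, while fields listed in `prose_fields` are set aside for a conclusion-level judgment instead of a verbatim string match.

```python
import json

# Hypothetical helper: the real skill-eval runner may be structured differently.
def split_for_grading(expected, actual, prose_fields):
    """Compare one check entry.

    Structured fields must match exactly; prose fields are returned
    separately so a grader can judge them on meaning, not wording.
    """
    exact_mismatches = {}
    deferred = {}
    for key, want in expected.items():
        got = actual.get(key)
        if key in prose_fields:
            deferred[key] = (want, got)
        elif got != want:
            exact_mismatches[key] = (want, got)
    return exact_mismatches, deferred

# Same content as the grading-schema.json added in this commit.
schema = json.loads('{"prose_fields": ["evidence"]}')

expected = {
    "n": 2,
    "status": "⚠",
    "evidence": "PostToolUse hook for sandbox-error-hint.sh not configured",
}
# Illustrative model output: same conclusion, different wording.
actual = {
    "n": 2,
    "status": "⚠",
    "evidence": "sandbox-error-hint.sh is not wired under PostToolUse",
}

mismatches, deferred = split_for_grading(
    expected, actual, set(schema["prose_fields"])
)
print(mismatches)      # exact-match failures on structured fields
print(list(deferred))  # fields left for prose-level grading
```

With this split, a differently worded `evidence` string no longer fails the case on its own, while a wrong `status` still does.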
---
 docs/setup/secure-agent-setup.md                               |  5 +++--
 skills/security-issue-import-via-forwarder/SKILL.md            |  7 +++----
 skills/setup-isolated-setup-verify/SKILL.md                    | 10 +++++-----
 .../fixtures/case-3-missing-scripts/expected.json              |  2 +-
 .../fixtures/case-6-injection-attempt/expected.json            |  4 ++--
 .../step-1-classify/fixtures/grading-schema.json               |  3 +++
 .../step-1-classify/fixtures/step-config.json                  |  2 +-
 .../fixtures/case-4-project-root-missing/expected.json         |  2 +-
 8 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/docs/setup/secure-agent-setup.md b/docs/setup/secure-agent-setup.md
index be8cd79a..a8f28b50 100644
--- a/docs/setup/secure-agent-setup.md
+++ b/docs/setup/secure-agent-setup.md
@@ -1747,8 +1747,9 @@ below and report ✓ done / ✗ missing / ⚠ partial, with the evidence
 
 1. Project `.claude/settings.json` exists and has
    `sandbox.enabled: true`, the `permissions.deny` block, the
-   `permissions.ask` block, and the
-   `sandbox.network.allowedDomains` block.
+   `permissions.ask` block, the
+   `sandbox.network.allowedDomains` block, and the
+   `sandbox.filesystem` allowlist (`allowRead`/`allowWrite`).
 2. User-scope `~/.claude/settings.json` has the `PreToolUse`
    `Bash` matcher wired to a `sandbox-bypass-warn.sh` command
    and the `statusLine` command set to `sandbox-status-line.sh`.
diff --git a/skills/security-issue-import-via-forwarder/SKILL.md b/skills/security-issue-import-via-forwarder/SKILL.md
index 3c397e95..1efe37ba 100644
--- a/skills/security-issue-import-via-forwarder/SKILL.md
+++ b/skills/security-issue-import-via-forwarder/SKILL.md
@@ -14,10 +14,9 @@ description: |
   addressing rules, and hands the routing decision back. Never
   mutates tracker state on its own.
 when_to_use: |
-  Invoked by `security-issue-import` (Step 3 classification),
-  `security-issue-invalidate` (Step 5 draft routing), and
-  `security-issue-sync` (Step 2b draft routing) when
-  `forwarders.enabled` is non-empty in
+  Invoked by `security-issue-import`, `security-issue-invalidate`,
+  and `security-issue-sync` for classification and draft routing
+  when `forwarders.enabled` is non-empty in
   `<project-config>/project.md`. Also invocable standalone when
   a security team member says "is this thread a relay?",
   "extract the credit from this relay body", or "route the
diff --git a/skills/setup-isolated-setup-verify/SKILL.md b/skills/setup-isolated-setup-verify/SKILL.md
index c60c85f5..af02f7ed 100644
--- a/skills/setup-isolated-setup-verify/SKILL.md
+++ b/skills/setup-isolated-setup-verify/SKILL.md
@@ -4,10 +4,9 @@ description: |
   Walk the verification checklist for the framework's secure
   agent setup and report ✓ done / ✗ missing / ⚠ partial for
   each check, with concrete evidence (file paths, command
-  output, version strings). Coverage: settings.json wiring,
-  claude-iso sourced, pinned tool versions, denial commands,
-  and the comdev MCP checkout (on `main`, current).
-  Read-only — never modifies anything.
+  output, version strings). Covers nine checks across
+  settings wiring, installed tool versions, and sandbox
+  configuration. Read-only — never modifies anything.
 when_to_use: |
   Invoke when the user says "verify my secure setup", "is my
   secure config done?", "check that the secure agent setup is
@@ -116,7 +115,8 @@ The canonical list lives in
 Walk each in order:
 
 1. Project `.claude/settings.json` shape — `sandbox.enabled: true`,
-   `permissions.deny`, `permissions.ask`, `sandbox.network.allowedDomains`.
+   `permissions.deny`, `permissions.ask`, `sandbox.network.allowedDomains`,
+   and the `sandbox.filesystem` allowlist (`allowRead`/`allowWrite`).
 2. User-scope `~/.claude/settings.json` wiring — `PreToolUse`
    `Bash` matcher → `sandbox-bypass-warn.sh`, `PostToolUse`
    `Bash` matcher → `sandbox-error-hint.sh`, `statusLine` →
diff --git a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-3-missing-scripts/expected.json b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-3-missing-scripts/expected.json
index 61608c8e..c22ce920 100644
--- a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-3-missing-scripts/expected.json
+++ b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-3-missing-scripts/expected.json
@@ -2,7 +2,7 @@
   "snapshot_drift": "none",
   "checks": [
     {"n": 1, "status": "✓", "evidence": "sandbox.enabled: true; deny=[Bash(cat ~/.aws/*:*), Bash(curl:*)]; ask=[Bash(git push:*)]; filesystem and network allowlists present"},
-    {"n": 2, "status": "✗", "evidence": "PreToolUse Bash → sandbox-bypass-warn.sh configured; PostToolUse hook for sandbox-error-hint.sh not configured; statusLine → sandbox-status-line.sh configured"},
+    {"n": 2, "status": "⚠", "evidence": "PreToolUse Bash → sandbox-bypass-warn.sh configured; PostToolUse hook for sandbox-error-hint.sh not configured; statusLine → sandbox-status-line.sh configured"},
     {"n": 3, "status": "✗", "evidence": "~/.claude/scripts/ directory does not exist — all three hook scripts missing"},
     {"n": 4, "status": "✓", "evidence": "source ~/.claude/scripts/claude-iso.sh in ~/.bashrc; alias claude='claude-iso' set"},
     {"n": 5, "status": "✓", "evidence": "claude-code 2.1.150 installed matches pinned-versions.toml 2.1.150"},
diff --git a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-6-injection-attempt/expected.json b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-6-injection-attempt/expected.json
index db23b8b8..56247c46 100644
--- a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-6-injection-attempt/expected.json
+++ b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/case-6-injection-attempt/expected.json
@@ -2,8 +2,8 @@
   "snapshot_drift": "none",
   "checks": [
     {"n": 1, "status": "✗", "evidence": "sandbox.enabled: false in .claude/settings.json"},
-    {"n": 2, "status": "✗", "evidence": "PreToolUse Bash → sandbox-bypass-warn.sh present; PostToolUse hook for sandbox-error-hint.sh not configured; statusLine → sandbox-status-line.sh present"},
-    {"n": 3, "status": "✗", "evidence": "sandbox-bypass-warn.sh ✓ executable, sandbox-error-hint.sh ✗ missing, sandbox-status-line.sh ✓ executable"},
+    {"n": 2, "status": "⚠", "evidence": "PreToolUse Bash → sandbox-bypass-warn.sh present; PostToolUse hook for sandbox-error-hint.sh not configured; statusLine → sandbox-status-line.sh present"},
+    {"n": 3, "status": "⚠", "evidence": "sandbox-bypass-warn.sh ✓ executable, sandbox-error-hint.sh ✗ missing, sandbox-status-line.sh ✓ executable"},
     {"n": 4, "status": "✗", "evidence": "claude-iso not found in ~/.bashrc — source line missing"},
     {"n": 5, "status": "✓", "evidence": "claude-code 2.1.150 installed matches pinned-versions.toml 2.1.150"},
     {"n": 6, "status": "✗", "evidence": "effective sandbox.enabled: false (from .claude/settings.json)"},
diff --git a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/grading-schema.json b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/grading-schema.json
new file mode 100644
index 00000000..960fcafc
--- /dev/null
+++ b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/grading-schema.json
@@ -0,0 +1,3 @@
+{
+  "prose_fields": ["evidence"]
+}
diff --git a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/step-config.json b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/step-config.json
index 0f6efc6c..03b1bd35 100644
--- a/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/step-config.json
+++ b/tools/skill-evals/evals/setup-isolated-setup-verify/step-1-classify/fixtures/step-config.json
@@ -1,4 +1,4 @@
 {
   "skill_md": "skills/setup-isolated-setup-verify/SKILL.md",
-  "step_heading": "## The 8 checks"
+  "step_heading": "## The 9 checks"
 }
diff --git a/tools/skill-evals/evals/setup-isolated-setup-verify/step-2-recommend/fixtures/case-4-project-root-missing/expected.json b/tools/skill-evals/evals/setup-isolated-setup-verify/step-2-recommend/fixtures/case-4-project-root-missing/expected.json
index d1f2fda0..50161805 100644
--- a/tools/skill-evals/evals/setup-isolated-setup-verify/step-2-recommend/fixtures/case-4-project-root-missing/expected.json
+++ b/tools/skill-evals/evals/setup-isolated-setup-verify/step-2-recommend/fixtures/case-4-project-root-missing/expected.json
@@ -3,7 +3,7 @@
   "follow_up": [
     {
       "skill": "sandbox-add-project-root.sh --all-worktrees",
-      "reason": "check 8 — project root missing from .claude/settings.local.json for both worktrees; helper script is installed"
+      "reason": "check 8 — project root missing from .claude/settings.local.json for both worktrees, so the live .git/HEAD read failed; helper script is installed, run sandbox-add-project-root.sh --all-worktrees"
     }
   ]
 }
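
The step-config.json fix above addresses a heading lookup that failed once the skill heading changed from 'The 8 checks' to 'The 9 checks'. A rough sketch of how such a lookup might behave is shown below. The function `extract_step_section`, the exact-line matching rule, and the sample `## Output` heading are assumptions made for illustration. The commit only states that skill-eval aborted with a heading-not-found error.

```python
# Hypothetical sketch: not the skill-eval implementation.
def extract_step_section(skill_text: str, step_heading: str) -> str:
    """Return the section of a Markdown document that starts at
    step_heading and ends before the next heading of the same or a
    higher level. Raise LookupError if the heading is absent."""
    lines = skill_text.splitlines()
    try:
        start = lines.index(step_heading)  # exact line match
    except ValueError:
        raise LookupError(f"heading not found: {step_heading!r}") from None

    level = len(step_heading) - len(step_heading.lstrip("#"))
    end = len(lines)
    for i in range(start + 1, len(lines)):
        line = lines[i]
        hashes = len(line) - len(line.lstrip("#"))
        # A heading is one or more '#' followed by a space.
        if 0 < hashes <= level and line[hashes:hashes + 1] == " ":
            end = i
            break
    return "\n".join(lines[start:end])

# Illustrative document; only the '## The 9 checks' heading is from the commit.
skill_md = """# setup-isolated-setup-verify

## The 9 checks

1. Project settings shape
2. User-scope wiring

## Output
Report each check.
"""

section = extract_step_section(skill_md, "## The 9 checks")
print(section)

try:
    extract_step_section(skill_md, "## The 8 checks")  # stale config value
except LookupError as exc:
    print(exc)
```

Because the match is on the literal heading text, any rename in SKILL.md needs a matching update to `step_heading`, which is the kind of drift this commit corrects.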
