Hi,

I’ve found two issues when running evals through the eval CLI rather than directly in Claude Desktop:
- Comparator is too strict. runner.compare_outputs uses exact ==, so free-text fields (rationale, reason, drop_reason, blockers) flip cases to FAIL on wording alone.
- Invocation context changes answers. --cli "claude -p" pipes a clean <system>\n\n<user> and passes. On the same case, a general-purpose sub-agent can return something different because it adds its own system prompt.

Kind regards,
Justin
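P.S. For the first issue, here is a rough sketch of the kind of looser comparison I have in mind. It assumes the outputs are JSON-like dicts and lists. The names strip_free_text and compare_structured are hypothetical (mine, not anything in runner today), and I have not checked how compare_outputs is actually called, so treat this as an illustration rather than a patch.

```python
# Hypothetical sketch: compare structured fields exactly, ignore free-text ones.
# The field names come from the report above; everything else is an assumption.
FREE_TEXT_FIELDS = frozenset({"rationale", "reason", "drop_reason", "blockers"})


def strip_free_text(value, ignored=FREE_TEXT_FIELDS):
    """Recursively drop ignored keys from dicts, descending into lists."""
    if isinstance(value, dict):
        return {
            key: strip_free_text(item, ignored)
            for key, item in value.items()
            if key not in ignored
        }
    if isinstance(value, list):
        return [strip_free_text(item, ignored) for item in value]
    return value


def compare_structured(expected, actual, ignored=FREE_TEXT_FIELDS):
    """Return True when the two outputs match on every non-free-text field."""
    return strip_free_text(expected, ignored) == strip_free_text(actual, ignored)


if __name__ == "__main__":
    expected = {"decision": "drop", "drop_reason": "duplicate of an earlier item"}
    actual = {"decision": "drop", "drop_reason": "Same as a previous item."}
    print(compare_structured(expected, actual))
```

One caveat with this approach: because the free-text keys are removed before comparing, a case where the field is missing entirely would also pass. If presence matters, a separate check that the key exists and is non-empty would be needed.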
