Eval CLI matching too strict / perhaps missing context

Justin Mclean Thu, 28 May 2026 20:32:30 -0700

Hi,

I’ve found two issues when running evals through the eval CLI rather than 
directly via Claude Desktop.


- Comparator is too strict. runner.compare_outputs uses exact ==, so free-text 
fields (rationale, reason, drop_reason, blockers) flip cases to FAIL on wording 
alone.
- Invocation context changes answers. --cli "claude -p" pipes a clean 
<system>\n\n<user> prompt and the case passes. For the same case, a 
general-purpose sub-agent can return something different because it adds its 
own system prompt.
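
To illustrate the first point, here is a minimal sketch of a more lenient comparison. It is not the actual runner.compare_outputs (I am not assuming anything about its signature); the function name compare_outputs_lenient and the FREE_TEXT_FIELDS set are hypothetical, and the field names are just the ones listed above. It assumes outputs are JSON-like dicts and lists, compares structured fields exactly, and for free-text fields only checks that both sides are present or both are empty.

```python
# Hypothetical sketch, not the project's real comparator.
# Assumption: case outputs are JSON-like structures (dicts, lists, scalars).
FREE_TEXT_FIELDS = frozenset({"rationale", "reason", "drop_reason", "blockers"})


def compare_outputs_lenient(expected, actual, free_text_fields=FREE_TEXT_FIELDS):
    """Return True if expected and actual agree on all non-free-text fields."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        # Both sides must expose the same set of keys.
        if expected.keys() != actual.keys():
            return False
        for key, exp_value in expected.items():
            act_value = actual[key]
            if key in free_text_fields:
                # Wording is ignored; only require that both sides are
                # filled in, or both are empty.
                if bool(exp_value) != bool(act_value):
                    return False
                continue
            if not compare_outputs_lenient(exp_value, act_value, free_text_fields):
                return False
        return True
    if isinstance(expected, list) and isinstance(actual, list):
        # Lists are compared element by element, in order.
        return len(expected) == len(actual) and all(
            compare_outputs_lenient(e, a, free_text_fields)
            for e, a in zip(expected, actual)
        )
    # Scalars (and mismatched types) fall back to exact equality.
    return expected == actual
```

With this, two outputs that differ only in the wording of drop_reason compare as equal, while a change in a structured field such as a decision value still fails. Whether free-text fields should be skipped entirely, checked for presence as here, or scored some other way is a design choice for the project.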

Kind regards,
Justin
