Hi,

Two issues I’ve found when running evals through the eval CLI rather than 
directly in Claude Desktop.

- Comparator is too strict. runner.compare_outputs uses exact ==, so free-text 
fields (rationale, reason, drop_reason, blockers) flip cases to FAIL on wording 
alone.
- Invocation context changes answers. --cli "claude -p" pipes a clean 
<system>\n\n<user> prompt and the case passes. For the same case, a 
general-purpose sub-agent can return a different answer because it adds its 
own system prompt.
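
To illustrate the first issue, here is a minimal sketch of a looser comparison 
that ignores the free-text fields and compares only the structured ones. It 
assumes the outputs are already parsed into dicts/lists; I have not checked the 
real signature of runner.compare_outputs, so compare_outputs_lenient and 
strip_free_text are hypothetical names, not existing functions.

```python
# Hypothetical helper, not part of the existing runner module.
# Field names are the ones listed above; the real schema may differ.
FREE_TEXT_FIELDS = frozenset({"rationale", "reason", "drop_reason", "blockers"})


def strip_free_text(value, ignored=FREE_TEXT_FIELDS):
    """Recursively drop ignored keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_free_text(item, ignored)
            for key, item in value.items()
            if key not in ignored
        }
    if isinstance(value, list):
        return [strip_free_text(item, ignored) for item in value]
    return value


def compare_outputs_lenient(expected, actual, ignored=FREE_TEXT_FIELDS):
    """Exact match on structured fields only; free-text fields are skipped."""
    return strip_free_text(expected, ignored) == strip_free_text(actual, ignored)
```

With this, two outputs that differ only in the wording of drop_reason compare 
as equal, while a different value in any other field still fails. One 
trade-off: a free-text field that is missing entirely on one side is also 
treated as a match.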

Kind regards,
Justin
