Skill evals: added llama smoke-test tagging

Justin Mclean Wed, 27 May 2026 02:55:35 -0700

Hi all,

I’ve added lightweight tag support to the skill eval runner so we can mark a 
small subset of fixtures as runnable with local Ollama models, for example:


uv run --project tools/skill-evals skill-eval \
  --tag llama \
  --cli "ollama run llama3.1:8b --nowordwrap --format json" \
  tools/skill-evals/evals/

This adds optional per-case metadata via case-meta.json, currently using:

{
  "tags": ["llama", "smoke"]
}

The intent is not to treat llama3.1:8b as a replacement for the main eval 
model. It is not reliable on nuanced security judgment, exact prose, 
prompt-injection handling, or absence-of-findings review gates. After testing 
and pruning, the llama tag now covers a conservative smoke suite of cases that 
appear useful for quick local checks.
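
For anyone curious how tag selection on case-meta.json could work, here is a rough Python sketch. It is not the runner's actual code: the function names load_case_tags and select_cases are made up for illustration, and it assumes each case is a directory that may contain a case-meta.json, that a case with no metadata file has no tags, and that omitting the tag selects everything.

```python
import json
from pathlib import Path


def load_case_tags(case_dir: Path) -> set[str]:
    """Return the tags declared in case_dir/case-meta.json (hypothetical helper)."""
    meta_path = case_dir / "case-meta.json"
    if not meta_path.is_file():
        # Assumption: metadata is optional, so a missing file means "no tags".
        return set()
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if not isinstance(meta, dict):
        raise ValueError(f"{meta_path}: expected a JSON object")
    tags = meta.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError(f"{meta_path}: 'tags' must be a list of strings")
    return set(tags)


def select_cases(case_dirs, tag=None):
    """Keep only the cases carrying `tag`; with no tag, keep every case."""
    if tag is None:
        return list(case_dirs)
    return [d for d in case_dirs if tag in load_case_tags(d)]
```

Under those assumptions, select_cases(dirs, tag="llama") would return only the fixtures whose metadata lists "llama", and untagged fixtures would be skipped rather than treated as errors.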

I did try a bigger GPU model, but it did very bad things to my screen. As they 
say, your mileage may vary. I believe open source models will improve over 
time, and in perhaps 6 months most of these tests will run with Llama.

I also added runner tests for tag parsing and filtering, and updated the skill 
eval README with usage docs.

Thanks,

Justin
