Hi all,
I’ve added lightweight tag support to the skill eval runner so we can mark a
small subset of fixtures as runnable with local Ollama models, for example:

    uv run --project tools/skill-evals skill-eval \
        --tag llama \
        --cli "ollama run llama3.1:8b --nowordwrap --format json" \
        tools/skill-evals/evals/

This adds optional per-case metadata via case-meta.json, currently using:

    {
      "tags": ["llama", "smoke"]
    }

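As a rough illustration of how tag filtering over case-meta.json could work,
here is a minimal sketch. It is not the runner's actual code: the function
names load_tags and select_cases are hypothetical, and it assumes each case is
a direct subdirectory of the evals directory and that a case without
case-meta.json counts as untagged.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_tags(case_dir: Path) -> set[str]:
    """Return the tags in case_dir/case-meta.json, or an empty set if absent."""
    meta_path = case_dir / "case-meta.json"
    if not meta_path.is_file():
        # Metadata is optional, so a missing file is treated as "no tags".
        return set()
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if not isinstance(meta, dict):
        raise ValueError(f"{meta_path}: expected a JSON object")
    tags = meta.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError(f"{meta_path}: 'tags' must be a list of strings")
    return set(tags)


def select_cases(evals_root: Path, tag: str | None) -> list[Path]:
    """List case directories, keeping only those carrying tag when one is given."""
    # Assumption: every direct subdirectory of evals_root is one eval case.
    case_dirs = sorted(p for p in evals_root.iterdir() if p.is_dir())
    if tag is None:
        return case_dirs
    return [d for d in case_dirs if tag in load_tags(d)]
```

Under these assumptions, select_cases(Path("tools/skill-evals/evals"), "llama")
would keep a case with the metadata shown above and skip any untagged case.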
The intent is not to treat llama3.1:8b as a replacement for the main eval
model. It is not reliable on nuanced security judgment, exact prose,
prompt-injection handling, or absence-of-findings review gates. After testing
and pruning, the llama tag now covers a conservative smoke suite of cases that
appear useful for quick local checks.
I did try a larger GPU model, but it did very bad things to my screen, so your
mileage may vary. I expect open-source models to keep improving, and in
perhaps six months most of these cases may be runnable with a local Llama model.
I also added runner tests for tag parsing and filtering, and updated the skill
eval README with usage docs.
Thanks,
Justin