Re: Skill evals: added llama smoke-test tagging

Jarek Potiuk Wed, 27 May 2026 03:31:03 -0700

Oh nice :)

On Wed, May 27, 2026 at 11:55 AM Justin Mclean <[email protected]>
wrote:


> Hi all,
>
> I’ve added lightweight tag support to the skill eval runner so we can mark
> a small subset of fixtures as runnable with local Ollama models, for
> example:
>
> uv run --project tools/skill-evals skill-eval \
>   --tag llama \
>   --cli "ollama run llama3.1:8b --nowordwrap --format json" \
>   tools/skill-evals/evals/
>
> This adds optional per-case metadata via case-meta.json, currently using:
>
> {
>   "tags": ["llama", "smoke"]
> }
>
> The intent is not to treat llama3.1:8b as a replacement for the main eval
> model. It is not reliable on nuanced security judgment, exact prose,
> prompt-injection handling, or absence-of-findings review gates. After
> testing and pruning, the llama tag now covers a conservative smoke suite of
> cases that appear useful for quick local checks.
>
> I did try a bigger GPU model, but it did very bad things to my screen. As
> they say, your mileage may vary. I believe open source models will improve
> over time, and in perhaps 6 months most of these tests will run with Llama.
>
> I also added runner tests for tag parsing and filtering, and updated the
> skill eval README with usage docs.
>
> Thanks,
>
> Justin
