Re: [DISCUSS] Proposal - Agentic Eval (Meta-)Skill for Extensibility and Maintainability

Yufei Gu Wed, 27 May 2026 18:36:17 -0700

Hi Dennis

Thanks for raising it. It looks like a cool idea, and +1 on experimenting
in the Polaris repo to make Polaris more agent-friendly.


My main concerns are benchmark coverage and long-term maintenance cost.

For coverage, a small static task corpus may overfit to a few known
workflows or repository conventions. A model could appear to improve simply
because the benchmark captures patterns already encoded in AGENTS.md, while
missing broader extensibility or maintainability issues elsewhere in the
codebase. The task synthesizer direction may help, but generating
representative and non-gameable tasks seems challenging on its own.

For maintenance cost, I suspect the benchmark corpus and verifiers could
gradually become another subsystem we need to maintain alongside the
codebase itself. As Polaris evolves, tasks, fixtures, assertions, and
expected outcomes will drift too. Keeping evals deterministic, stable, and
still representative over time could become expensive.

That said, I still think the direction is interesting, especially as a
lightweight signal for agent friendliness. I would probably start with a
very small and highly deterministic scope first.

Ideally the evaluation could run in CI, but getting an LLM sponsor may be
difficult. In practice, contributors may need to run the evals themselves.
With that, I suggest integrating with Gradle commands or creating commands
to make local execution easier.

Yufei


On Thu, May 21, 2026 at 2:13 AM Dennis Huo <[email protected]> wrote:

> Hi all,
>
> Now that agentic development is evolving to be a more fundamental and
> pervasive tool, I wanted to explore ways to address both a "need" and an
> "opportunity" under one umbrella - adding an agentic (meta-)skill to start
> codifying a way for us to bake in quantifiable metrics to the impact of
> "non-functional" changes on repository "health" (in terms of extensibility
> and maintainability).
>
> Basically, if we extrapolate from getting into the habit of formalizing our
> AGENTS.md files towards likely adding well-defined agent "skills" for
> repeatable agentic workflows, and those becoming more ingrained in the
> development process over time, the basic "need" is to standardize our evals
> against the addition of new skills and mdfile documentation, but also to
> recognize the opportunity of addressing three related types of
> nonfunctional changes:
>
> 1. Refactoring code - sometimes subjective, sometimes partially objective
> (consolidating duplicate code), but the *effects* are rarely quantifiable
> 2. Adding documentation/code comments - Generally regarded as being good,
> but sometimes verbosity can hurt, and certainly "incorrect" documentation
> can hurt
> 3. Addition of agent skills or rules - possibly manually tested to some
> extent when added, but usually not consistently and rarely with
> reproducible evals
>
> To that end I put together this proposal doc with some lightweight design
> elements for this agentic skill:
>
>
> https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
>
> Would love to discuss folks' thoughts here or in comments in the doc.
> Recapping the core concept from the doc:
>
> *Treat any candidate change as an intervention in a measurable A/B. Take a
> baseline ref and a candidate ref, run a fixed set of agent-driven sample
> tasks against both refs, collect a small number of metrics (success vs. an
> oracle, wall-clock, tokens, agent rounds, crash count, etc), and emit a
> delta report a reviewer can actually interpret.*
>
> And the three component carveouts:
>
>    - Static task corpus - hand-curated set of initial development tasks
>    (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
>    signal
>    - Task synthesizer - More advanced meta-evolution step - the agentic
>    driver of the harness can intelligently synthesize tasks that exercise
>    newly identified segments of coding complexity
>    - Eval harness - the overall framework that isolates subagents, sets up
>    the task experiments, collects metrics, etc.
>
> I have an initial v1 available for review:
> https://github.com/apache/polaris/pull/4519
>
> This includes the end-to-end working v1 eval harness and a prospective
> initial set of static tasks, but no codified task synthesizer yet. I ran an
> initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus
> 4.7, and Codex GPT 5.4) and just the "add new privilege" task; more
> detailed results posted in the PR, abridged here - we should iterate a bit
> more on the task corpus, but at least it's a proof-of-concept of the
> end-to-end flow.
>
> ## Task & fixture
>
> - **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
> `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
> ensure compile + `*PolarisAuthorizer*` tests pass without modifying
> any test file. The task is a *probe* of the authorizer SPI: a naive
> one-file edit (enum only) trips the static initializer in
> `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
> change (enum + register call) passes.
> - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines —
> "Recipes for Common Extension Tasks" section that explicitly tells
> agents to also edit `RbacOperationSemantics.register(...)`). The
> fixture only changes `AGENTS.md`; no source code differs between BEFORE
> and AFTER.
>
> The task's deterministic verifier runs out-of-band from the worker
> agent (separate `bash` subprocess after the worker's transcript is
> captured) so worker self-reports cannot fake a PASS.
>
> ## Headline results
>
> | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> |------|---------|---------:|-----------:|-----------:|------:|---------------|
> | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
>
> Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
>
> | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> |--------|-------:|--------:|--------:|-----------|
> | haiku | -42% | -38% | -39% | PASS → PASS (soft-improvement) |
> | opus | -39% | -42% | -38% | PASS → PASS (soft-improvement) |
> | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
>
> Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
> verdict-flip + two consistent ~40% cost reductions on the same
> task — clear, replicable signal that the AGENTS.md recipe addition is
> agent-load-bearing.
>
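
For anyone who wants to sanity-check the per-arm deltas quoted above, here is a minimal sketch that recomputes them from the headline table. The names `RESULTS`, `pct_delta`, and `arm_deltas` are hypothetical and not taken from the PR; the numbers are copied from the table, with the Codex cost and turn counts (reported as n/a) represented as `None`. It assumes the report rounds each percentage to the nearest integer, which may not be how the harness actually does it.

```python
# Headline metrics copied from the quoted table: (wall seconds, cost USD, turns).
# Codex cost and turns were reported as n/a, so they are None here.
RESULTS = {
    "haiku": {"before": (270, 0.362, 59), "after": (157, 0.226, 36)},
    "opus": {"before": (204, 1.481, 24), "after": (124, 0.854, 15)},
    "codex": {"before": (37, None, None), "after": (39, None, None)},
}

METRICS = ("wall", "cost", "turns")


def pct_delta(before, after):
    """Relative change from before to after, as a rounded integer percent."""
    if before is None or after is None:
        return None  # metric was not reported for this arm
    return round(100 * (after - before) / before)


def arm_deltas(results):
    """Map each model to its per-metric percentage delta (BEFORE -> AFTER)."""
    deltas = {}
    for model, refs in results.items():
        deltas[model] = {
            name: pct_delta(b, a)
            for name, b, a in zip(METRICS, refs["before"], refs["after"])
        }
    return deltas


if __name__ == "__main__":
    for model, row in arm_deltas(RESULTS).items():
        print(model, row)
```

Under that rounding assumption the output agrees with the quoted delta table for all three models. A real harness would presumably read these values from its own run artifacts rather than from a hard-coded dictionary.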
