[I] docs/mode-economics.md: provenance, tokenizer disclosure, and measurement methodology [airflow-steward]

via GitHub Tue, 26 May 2026 14:30:13 -0700


andreahlert opened a new issue, #327:
URL: https://github.com/apache/airflow-steward/issues/327

# `docs/mode-economics.md`: provenance, tokenizer disclosure, and
measurement methodology

## Context

`docs/mode-economics.md` (added in #253) is positioned by [MISSION.md §
Affordability](../MISSION.md#affordability-and-vendor-neutrality--the-public-good-commitment)
as quantitative input for adoption decisions and for the long-term ASF
inference-endpoint capacity plan. #281 then wired the contributor expectation
to keep the page in sync with skill changes (`AGENTS.md § Keeping evals and
mode-economics in sync`).

That sequence resolves the **future** sync gap. It does not address the
**shipped** state of the page, which still presents bot-generated ranges as if
they were measured, with no provenance signal. #326 fixed three concrete bugs
in passing; this issue collects the structural items I deliberately kept out of
that PR so the project can decide direction before any further edit lands.

The framing I'd suggest: every concern below is a low-cost annotation or
table-shape change. None of them require running new evals; all of them become
easier to populate honestly once #281's `--cli` harness has run against a real
skill suite.

## Findings

### 1. No provenance / freshness signal on the page itself

The page reports per-skill / per-mode token counts but never states:

- Which tokenizer the counts are measured with. The corrected SKILL.md
anchor (lines 57–59) says `measured with cl100k_base`, but the per-mode /
per-skill tables further down never restate that assumption, and the
cross-class recommendations (Small ~7B–13B, Mid-tier ~70B, Large frontier) are
vendor-agnostic — `cl100k_base` is OpenAI-specific and diverges from Claude's
tokenizer / Llama BPE by ~10–40 % depending on content (code-heavy and
non-English content drift the most).
- When the counts were last measured.
- Which subset of skills is covered. The catalogue is currently ~26 skills;
the page names ~13.

Suggested fix: a single banner / front-matter block at the top of the page:

> _Measured with `cl100k_base` on `<commit>` / `<date>`. ±~30 % drift
expected across Claude / Llama / local quantised tokenizers. Coverage: N of M
skills; remainder use estimated ranges flagged in the tables._

### 2. Measured rows and hypothetical rows share one table

Two rows in the Pairing section give concrete numeric ranges for skills that
don't exist on `main`:

- `pairing-self-review` — labeled "Estimated; skill experimental". Per
#253's own description, this is _"a name in a table, not a link"_ — an unmerged
sibling branch.
- `Multi-agent review pipeline` — labeled "Estimated; future skill."

The labeling is visible in the Notes column, but the numeric `Token range`
column reads as commensurable with the measured Triage / Drafting rows above. A
reader scanning the tables for budget planning has to spot the "Estimated;
future" tag to discount the row.

Suggested fix: split the Pairing table into a "Measured" subsection and a
"Planned / hypothetical (estimates pending implementation)" subsection. Same
numbers, different headings. Or: replace concrete bounds on hypothetical rows
with `TBD (expected O(10k–100k))`.

### 3. Ranges have shape but no anchor

The Drafting ranges span 5–7× (e.g. `30 000–150 000` for
`security-issue-fix` code-fix; `30 000–200 000` for the multi-agent pipeline).
The Affordability framing implies a maintainer can use these to plan; in
practice a 5× spread is ceiling-budgeting only — it doesn't help rank modes or
estimate typical cost.

Suggested fix: add a `p50` / `typical` column alongside `Token range`, and
one worked example per mode (e.g. "1-file fix on the issue-fix-workflow corpus
≈ 35 000; multi-package fix ≈ 130 000"). Keep the range; add the anchor inside
it. #281's `--cli` runs across the eval suites are the natural data source for
both columns once they've been exercised.

### 4. Cross-class multiplier table needs a date and named comparators

`Small-class models are typically 10–50× cheaper per token than Large-class
models at hosted-API rates. Mid-tier sits at roughly 3–10× cheaper than Large.`
The structural claim (small cheaper than large by ~1 OoM) is durable; the
specific ratio drifts as vendors reprice. No date, no named comparators
(Haiku/Opus? Flash/Pro? GPT-mini/full?).

Suggested fix: one-line caption — `As of YYYY-MM. Indicative ratios across
hosted-API rates from <vendors named>. Check current pricing pages.`

### 5. Pairing payer ambiguity

The Pairing section opens with _"runs in the developer's own development
cycle, not on project infrastructure — cost is per-developer-session"_ but then
provides a token rule-of-thumb for adoption planning. The page targets _"a
maintainer evaluating adoption"_, who has to know whether this line hits the
project budget, the contributor's personal budget, or a hybrid reimbursement
model.

Suggested fix: one sentence after the "per-developer-session" framing —
`Whether the project reimburses contributors or treats Pairing as personal
tooling is a project-level policy decision, outside the scope of this page.`

### 6. Auto-merge subsection is empty scaffolding

The Auto-merge subsection contains exactly two sentences: `**Status: off.**
Auto-merge is not implemented; it has no token cost.` plus a cross-ref. A full
H3 with a "Status: off" bolded line mimics the visual weight of the substantive
Triage / Drafting / Mentoring / Pairing sections.

Suggested fix: collapse to a one-line entry under a `Modes not yet covered`
subsection at the end, preserving the `#auto-merge` anchor for inbound
deep-links from `modes.md`.

## What this issue is **not**

- Not a request to retract the page. The page provides a useful shape and an
Affordability-mandate scaffolding that should stay.
- Not a request to wait for perfect data before annotating. Items 1, 4, 5, 6
are all editorial; only items 2 and 3 benefit from #281's data.
- Not adversarial to #253 or #281. #281's harness is the right answer for
sync; this issue is about the in-page presentation that #281 doesn't touch.

## Suggested sequencing

1. Editorial pass (items 1, 4, 5, 6) — one PR, no new measurements.
2. Once `--cli` has run against the real skill suites (the open follow-up in
#281's test plan), backfill items 2 and 3 with measured `p50` + split
measured/planned tables.

Happy to take the editorial PR if useful; flagging here first so direction
can be discussed before I touch the file again.

cc @justinmclean @potiuk @choo121600

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] docs/mode-economics.md: provenance, tokenizer disclosure, and measurement methodology [airflow-steward]

Reply via email to