andreahlert opened a new issue, #327: URL: https://github.com/apache/airflow-steward/issues/327
# `docs/mode-economics.md`: provenance, tokenizer disclosure, and measurement methodology ## Context `docs/mode-economics.md` (added in #253) is positioned by [MISSION.md § Affordability](../MISSION.md#affordability-and-vendor-neutrality--the-public-good-commitment) as quantitative input for adoption decisions and for the long-term ASF inference-endpoint capacity plan. #281 then wired the contributor expectation to keep the page in sync with skill changes (`AGENTS.md § Keeping evals and mode-economics in sync`). That sequence resolves the **future** sync gap. It does not address the **shipped** state of the page, which still presents bot-generated ranges as if they were measured, with no provenance signal. #326 fixed three concrete bugs in passing; this issue collects the structural items I deliberately kept out of that PR so the project can decide direction before any further edit lands. The framing I'd suggest: every concern below is a low-cost annotation or table-shape change. None of them require running new evals; all of them become easier to populate honestly once #281's `--cli` harness has run against a real skill suite. ## Findings ### 1. No provenance / freshness signal on the page itself The page reports per-skill / per-mode token counts but never states: - Which tokenizer the counts are measured with. The corrected SKILL.md anchor (lines 57–59) says `measured with cl100k_base`, but the per-mode / per-skill tables further down never restate that assumption, and the cross-class recommendations (Small ~7B–13B, Mid-tier ~70B, Large frontier) are vendor-agnostic — `cl100k_base` is OpenAI-specific and diverges from Claude's tokenizer / Llama BPE by ~10–40 % depending on content (code-heavy and non-English content drift the most). - When the counts were last measured. - Which subset of skills is covered. The catalogue is currently ~26 skills; the page names ~13. Suggested fix: a single banner / front-matter block at the top of the page: > _Measured with `cl100k_base` on `<commit>` / `<date>`. ±~30 % drift expected across Claude / Llama / local quantised tokenizers. Coverage: N of M skills; remainder use estimated ranges flagged in the tables._ ### 2. Measured rows and hypothetical rows share one table Two rows in the Pairing section give concrete numeric ranges for skills that don't exist on `main`: - `pairing-self-review` — labeled "Estimated; skill experimental". Per #253's own description, this is _"a name in a table, not a link"_ — an unmerged sibling branch. - `Multi-agent review pipeline` — labeled "Estimated; future skill." The labeling is visible in the Notes column, but the numeric `Token range` column reads as commensurable with the measured Triage / Drafting rows above. A reader scanning the tables for budget planning has to spot the "Estimated; future" tag to discount the row. Suggested fix: split the Pairing table into a "Measured" subsection and a "Planned / hypothetical (estimates pending implementation)" subsection. Same numbers, different headings. Or: replace concrete bounds on hypothetical rows with `TBD (expected O(10k–100k))`. ### 3. Ranges have shape but no anchor The Drafting ranges span 5–7× (e.g. `30 000–150 000` for `security-issue-fix` code-fix; `30 000–200 000` for the multi-agent pipeline). The Affordability framing implies a maintainer can use these to plan; in practice a 5× spread is ceiling-budgeting only — it doesn't help rank modes or estimate typical cost. Suggested fix: add a `p50` / `typical` column alongside `Token range`, and one worked example per mode (e.g. "1-file fix on the issue-fix-workflow corpus ≈ 35 000; multi-package fix ≈ 130 000"). Keep the range; add the anchor inside it. #281's `--cli` runs across the eval suites are the natural data source for both columns once they've been exercised. ### 4. Cross-class multiplier table needs a date and named comparators `Small-class models are typically 10–50× cheaper per token than Large-class models at hosted-API rates. Mid-tier sits at roughly 3–10× cheaper than Large.` The structural claim (small cheaper than large by ~1 OoM) is durable; the specific ratio drifts as vendors reprice. No date, no named comparators (Haiku/Opus? Flash/Pro? GPT-mini/full?). Suggested fix: one-line caption — `As of YYYY-MM. Indicative ratios across hosted-API rates from <vendors named>. Check current pricing pages.` ### 5. Pairing payer ambiguity The Pairing section opens with _"runs in the developer's own development cycle, not on project infrastructure — cost is per-developer-session"_ but then provides a token rule-of-thumb for adoption planning. The page targets _"a maintainer evaluating adoption"_, who has to know whether this line hits the project budget, the contributor's personal budget, or a hybrid reimbursement model. Suggested fix: one sentence after the "per-developer-session" framing — `Whether the project reimburses contributors or treats Pairing as personal tooling is a project-level policy decision, outside the scope of this page.` ### 6. Auto-merge subsection is empty scaffolding The Auto-merge subsection contains exactly two sentences: `**Status: off.** Auto-merge is not implemented; it has no token cost.` plus a cross-ref. A full H3 with a "Status: off" bolded line mimics the visual weight of the substantive Triage / Drafting / Mentoring / Pairing sections. Suggested fix: collapse to a one-line entry under a `Modes not yet covered` subsection at the end, preserving the `#auto-merge` anchor for inbound deep-links from `modes.md`. ## What this issue is **not** - Not a request to retract the page. The page provides a useful shape and an Affordability-mandate scaffolding that should stay. - Not a request to wait for perfect data before annotating. Items 1, 4, 5, 6 are all editorial; only items 2 and 3 benefit from #281's data. - Not adversarial to #253 or #281. #281's harness is the right answer for sync; this issue is about the in-page presentation that #281 doesn't touch. ## Suggested sequencing 1. Editorial pass (items 1, 4, 5, 6) — one PR, no new measurements. 2. Once `--cli` has run against the real skill suites (the open follow-up in #281's test plan), backfill items 2 and 3 with measured `p50` + split measured/planned tables. Happy to take the editorial PR if useful; flagging here first so direction can be discussed before I touch the file again. cc @justinmclean @potiuk @choo121600 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
