andreahlert commented on PR #87: URL: https://github.com/apache/airflow-steward/pull/87#issuecomment-4398095052
> Hey Everyone here: WDYT ? (@andreahlert @johnslavik @choo121600 ? especially). Since the mission of the project is to be affordable, I think this makes a lot of sense even if prose gets a little bit more difficult to read by human. > > Let me know what you think -> if this will be acceptable, I plan to create an issue with tasks and we could attempt to convert all of the SKILLS/prose this way? > > Particularly when testing wiht @shahar1 - the tokens on smaller subscription got very quickly eaten. hey! Thanks for that. i took some time to fully test with Claude, and measured this end to end before commenting. scripts and raw numbers: [[gist](https://gist.github.com/andreahlert/f76cd049b7418aa39b34f73a4c5df8d7)](https://gist.github.com/andreahlert/f76cd049b7418aa39b34f73a4c5df8d7) i'm against merging this as a pilot for the 80 file umbrella. two reasons. the `-24%` in the body is a word count. on the actual claude tokenizer the file drops `-9.4%` in isolation. in a realistic mid-session turn (system prompt, tools, CLAUDE.md, prior conversation, skill body) the saving is `-2.10%`. on long sessions (100k+ history) it sits between `-0.4%` and `-1%`. the pitch and the runtime saving are off by 12x to 60x depending on session length. that's not a margin worth committing review budget on across 80 files. second, the body claims frontmatter stayed verbatim. it didn't. the diff rewrites `description` and `when_to_use`, which is the trigger and recall surface. that needs to either revert or land with a recall test, not slip in under a "body only" framing. i'm aligned with the affordability mission, that's not in dispute. but the sequencing here, imho, is wrong. A/B/C/D modes aren't all shipped. we don't have telemetry on which sections are actually hot. optimizing prose density on cold paths before the platform is at full potential locks in style overhead for marginal real-world saving. two cheap moves that dominate this on ROI right now and i'd rather see land first: haiku 4.5 for skill routing. 20 probe A/B across haiku/sonnet/opus, haiku at 90% accuracy, 15x cheaper than opus, 5x faster. one failure mode (hallucinated skill name), gateable with a name validator. that's a `-15x` cost cut on the routing layer with one PR. boilerplate dedup. `Adopter overrides`, `Snapshot drift`, `Prerequisites` repeat across nearly every SKILL.md. extracting them saves around 5,900 tokens one shot, no maintenance debt, no style lock in. one PR, three sections. my position: is to hold this PR and the umbrella. ship A/B/C/D, get telemetry, then apply caveman targeted at hot sections where it actually pays. caveman blanket before that could let us in a wrong order of operations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
