andreahlert commented on PR #87:
URL: https://github.com/apache/airflow-steward/pull/87#issuecomment-4398095052

   > Hey Everyone here: WDYT ? (@andreahlert @johnslavik @choo121600 ? 
especially). Since the mission of the project is to be affordable, I think this 
makes a lot of sense even if prose gets a little bit more difficult to read by 
human.
   > 
   > Let me know what you think -> if this will be acceptable, I plan to create 
an issue with tasks and we could attempt to convert all of the SKILLS/prose 
this way?
   > 
   > Particularly when testing wiht @shahar1 - the tokens on smaller 
subscription got very quickly eaten.
   
   hey! Thanks for that. i took some time to fully test with Claude, and 
measured this end to end before commenting. scripts and raw numbers: 
[[gist](https://gist.github.com/andreahlert/f76cd049b7418aa39b34f73a4c5df8d7)](https://gist.github.com/andreahlert/f76cd049b7418aa39b34f73a4c5df8d7)
   
   i'm against merging this as a pilot for the 80 file umbrella. two reasons.
   
   the `-24%` in the body is a word count. on the actual claude tokenizer the 
file drops `-9.4%` in isolation. in a realistic mid-session turn (system 
prompt, tools, CLAUDE.md, prior conversation, skill body) the saving is 
`-2.10%`. on long sessions (100k+ history) it sits between `-0.4%` and `-1%`. 
the pitch and the runtime saving are off by 12x to 60x depending on session 
length. that's not a margin worth committing review budget on across 80 files.
   
   second, the body claims frontmatter stayed verbatim. it didn't. the diff 
rewrites `description` and `when_to_use`, which is the trigger and recall 
surface. that needs to either revert or land with a recall test, not slip in 
under a "body only" framing.
   
   i'm aligned with the affordability mission, that's not in dispute. but the 
sequencing here, imho, is wrong. A/B/C/D modes aren't all shipped. we don't 
have telemetry on which sections are actually hot. optimizing prose density on 
cold paths before the platform is at full potential locks in style overhead for 
marginal real-world saving.
   
   two cheap moves that dominate this on ROI right now and i'd rather see land 
first:
   
   haiku 4.5 for skill routing. 20 probe A/B across haiku/sonnet/opus, haiku at 
90% accuracy, 15x cheaper than opus, 5x faster. one failure mode (hallucinated 
skill name), gateable with a name validator. that's a `-15x` cost cut on the 
routing layer with one PR.
   
   boilerplate dedup. `Adopter overrides`, `Snapshot drift`, `Prerequisites` 
repeat across nearly every SKILL.md. extracting them saves around 5,900 tokens 
one shot, no maintenance debt, no style lock in. one PR, three sections.
   
   my position: is to hold this PR and the umbrella. ship A/B/C/D, get 
telemetry, then apply caveman targeted at hot sections where it actually pays. 
caveman blanket before that could let us in a wrong order of operations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to