iprithv commented on issue #15884:
URL: https://github.com/apache/lucene/issues/15884#issuecomment-4498942331

   @romseygeek right now, writing doc values takes multiple passes over the 
values, and the number of passes depends on the field type. numeric and sorted 
types do the most work since they need a stats pass, a skip index pass, and the 
actual write. binary is simpler. sortedset falls in between, depending on 
whether the field is single- or multi-valued.
   
   one thing i noticed is we are recomputing stats (like min, max, doc count) 
even though we already have similar info from the skipper when merging.
   
   skipper already has:
   - min and max
   - doc count
   - max values per doc
   
   so in theory, we could reuse this instead of recomputing everything again.
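
   as a rough illustration of the reuse idea, assuming each source segment can 
hand over its top-level skipper stats at merge time: `SegmentSkipStats` and 
`merge` below are hypothetical names, not an existing Lucene API. the sketch 
also assumes the source segments have no deleted docs; with deletions the 
combined numbers would only be bounds, not exact values.

```java
import java.util.List;

// Hypothetical holder for the stats a segment's skipper already knows.
// Not a Lucene class; names are for illustration only.
final class SegmentSkipStats {
  final long min;
  final long max;
  final int docCount;

  SegmentSkipStats(long min, long max, int docCount) {
    this.min = min;
    this.max = max;
    this.docCount = docCount;
  }
}

final class MergeSkipStatsSketch {

  // Combines per-segment stats without touching any values.
  // Assumption: no deleted docs, so the per-segment stats are exact.
  static SegmentSkipStats merge(List<SegmentSkipStats> segments) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int docCount = 0;
    for (SegmentSkipStats s : segments) {
      if (s.docCount == 0) {
        continue; // segment has no values for this field, its min/max mean nothing
      }
      min = Math.min(min, s.min);
      max = Math.max(max, s.max);
      docCount += s.docCount;
    }
    return new SegmentSkipStats(min, max, docCount);
  }
}
```

   the point is only that min/max/docCount compose across segments with no 
iteration over the values themselves.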
   
   i see a few possible directions:
   
   1) just reuse skipper stats inside the consumer  
   we already compute them in writeSkipIndex, so we could pass them to 
writeValues and skip recomputing min/max/docCount  
   but this only saves a bit of work, we still need a full pass for gcd, unique 
values, etc.
   
   2) expose skipper from DocValuesProducer  
   during merge, source segments already have this info, so we could just read 
it instead of iterating again. this feels like the biggest win, especially when 
merging sorted indexes where iteration is expensive.
   
   3) try to merge passes  
   like combining stats + writing, or disi + writing. this could remove full 
passes, but is more complex.
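
   one way to picture direction 3 (stats + writing in a single pass), under the 
assumption that buffering the values in memory is acceptable: iterate the 
source once, collect min/max while buffering, then encode from the buffer 
instead of iterating the source again. `SinglePassSketch` is a hypothetical 
illustration, and delta-from-min is only a stand-in for the real encoding.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;

// Hypothetical illustration, not Lucene code.
final class SinglePassSketch {

  static final class Result {
    final long min;
    final long max;
    final long[] deltas; // each value minus min, stand-in for the encoded output

    Result(long min, long max, long[] deltas) {
      this.min = min;
      this.max = max;
      this.deltas = deltas;
    }
  }

  // Consumes the (potentially expensive) source iterator exactly once.
  static Result statsAndEncode(PrimitiveIterator.OfLong source) {
    long[] buffer = new long[16];
    int size = 0;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    while (source.hasNext()) {
      long v = source.nextLong();
      if (size == buffer.length) {
        buffer = Arrays.copyOf(buffer, size * 2); // grow the buffer
      }
      buffer[size++] = v;
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    // The encoding step reads the buffer, not the source.
    // Assumes v - min does not overflow.
    long[] deltas = new long[size];
    for (int i = 0; i < size; i++) {
      deltas[i] = buffer[i] - min;
    }
    return new Result(min, max, deltas);
  }
}
```

   this trades the second iteration for memory proportional to the number of 
values, which is part of why this direction is more complex.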
   
   from what i see, just caching min/max/docCount alone won’t help much since 
we still need to iterate for other stats anyway. the real cost is the iteration 
itself.
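
   a small sketch of why that is, using hypothetical stand-ins (`SkipStats`, 
`collectSkipStats`, `gcdOfDeltas`) on plain arrays rather than the real 
consumer: even with min/max/docCount handed over from the skip-index pass, the 
gcd still needs a full iteration over the values.

```java
// Hypothetical stand-ins for illustration; none of these are Lucene classes.
final class SkipStats {
  final long min;
  final long max;
  final int docCount;

  SkipStats(long min, long max, int docCount) {
    this.min = min;
    this.max = max;
    this.docCount = docCount;
  }
}

final class ReuseSkipStatsSketch {

  // Stand-in for the skip-index pass: min/max/docCount fall out of it.
  // valuesPerDoc[i] holds the values of doc i; an empty array means no value.
  static SkipStats collectSkipStats(long[][] valuesPerDoc) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int docCount = 0;
    for (long[] values : valuesPerDoc) {
      if (values.length == 0) {
        continue;
      }
      docCount++;
      for (long v : values) {
        min = Math.min(min, v);
        max = Math.max(max, v);
      }
    }
    return new SkipStats(min, max, docCount);
  }

  // Stand-in for the value pass: min is reused from the stats, but the gcd
  // still requires visiting every value. Assumes v - min does not overflow.
  static long gcdOfDeltas(long[][] valuesPerDoc, SkipStats stats) {
    long result = 0;
    for (long[] values : valuesPerDoc) {
      for (long v : values) {
        result = gcd(result, v - stats.min);
      }
    }
    return result;
  }

  // Euclid's algorithm; inputs are non-negative here.
  private static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }
}
```

   the second method skips the min/max bookkeeping, but its loop has the same 
shape and length as the first one.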
   
   so wanted to check:
   - is there a preferred direction here?
   - is exposing skipper via DocValuesProducer acceptable?
   - would it make sense to also track total value count in skipper so we can 
skip that part too?
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

