xiangfu0 opened a new pull request, #18772:
URL: https://github.com/apache/pinot/pull/18772

   ## Summary
   `VarByteChunkForwardIndexWriterV6` currently closes a chunk only when the next
   entry would overflow the `chunkSize`-byte buffer. This PR adds an optional
   `targetDocsPerChunk` parameter so a chunk can additionally be bounded by document
   count, letting callers control chunk granularity independently of the byte budget.
   
   - New 4-arg constructor `VarByteChunkForwardIndexWriterV6(file, compressionType,
     chunkSize, targetDocsPerChunk)`. The existing 3-arg constructor delegates with
     `targetDocsPerChunk = -1` (`DISABLE_DOCS_PER_CHUNK`).
   - When `targetDocsPerChunk > 0`, a chunk is flushed once it holds that many docs,
     even if the byte buffer isn't full; otherwise behavior is unchanged.
   - The buffer-overflow flush predicate is extracted into a protected
     `shouldStartNewChunk(int)` hook in `VarByteChunkForwardIndexWriterV4` (mirroring
     the existing `writeChunkHeader` hook), so V6 adds the cap without duplicating
     `putBytes()`.
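
The flush logic described in the list above can be illustrated with a minimal, self-contained sketch. This is not the Pinot code: `SketchWriterV4`, `SketchWriterV6`, `docsPerFlushedChunk`, and the field names are hypothetical stand-ins, the constructors omit `file` and `compressionType`, per-entry overhead such as length prefixes is ignored, and the doc-count cap is checked when the next entry arrives rather than immediately after the last one (the resulting chunk boundaries are the same). Only `shouldStartNewChunk(int)`, `putBytes`, `targetDocsPerChunk`, and `DISABLE_DOCS_PER_CHUNK` are names taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the V4 writer: closes a chunk only on byte overflow.
class SketchWriterV4 {
  protected final int chunkSize;   // byte budget per chunk
  protected int bytesInChunk = 0;  // bytes buffered in the open chunk
  protected int docsInChunk = 0;   // docs buffered in the open chunk
  // Records how many docs each flushed chunk held (for illustration only).
  final List<Integer> docsPerFlushedChunk = new ArrayList<>();

  SketchWriterV4(int chunkSize) {
    this.chunkSize = chunkSize;
  }

  // Hook: should the open chunk be closed before appending an entry of this size?
  protected boolean shouldStartNewChunk(int entrySizeBytes) {
    return docsInChunk > 0 && bytesInChunk + entrySizeBytes > chunkSize;
  }

  void putBytes(byte[] value) {
    if (shouldStartNewChunk(value.length)) {
      flush();
    }
    bytesInChunk += value.length;
    docsInChunk++;
  }

  void flush() {
    if (docsInChunk == 0) {
      return;
    }
    docsPerFlushedChunk.add(docsInChunk);
    bytesInChunk = 0;
    docsInChunk = 0;
  }
}

// Hypothetical stand-in for the V6 writer: adds the optional doc-count cap
// by overriding the hook, so putBytes() is not duplicated.
class SketchWriterV6 extends SketchWriterV4 {
  static final int DISABLE_DOCS_PER_CHUNK = -1;
  private final int targetDocsPerChunk;

  // Mirrors the existing constructor: the cap is disabled by default.
  SketchWriterV6(int chunkSize) {
    this(chunkSize, DISABLE_DOCS_PER_CHUNK);
  }

  SketchWriterV6(int chunkSize, int targetDocsPerChunk) {
    super(chunkSize);
    this.targetDocsPerChunk = targetDocsPerChunk;
  }

  @Override
  protected boolean shouldStartNewChunk(int entrySizeBytes) {
    boolean docCapReached = targetDocsPerChunk > 0 && docsInChunk >= targetDocsPerChunk;
    return docCapReached || super.shouldStartNewChunk(entrySizeBytes);
  }
}

class ChunkFlushSketch {
  public static void main(String[] args) {
    SketchWriterV6 capped = new SketchWriterV6(1024, 3);
    for (int i = 0; i < 7; i++) {
      capped.putBytes(new byte[10]);
    }
    capped.flush();
    System.out.println("docs per chunk with cap: " + capped.docsPerFlushedChunk);
  }
}
```

With the cap disabled, the override falls straight through to the inherited byte-overflow check, which is the property the `-1` default relies on.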
   
   ## Motivation
   For raw string/bytes columns, the ZSTD compression ratio depends heavily on how
   many repeated values fall within a single chunk (the dedup window). Being able to
   bound a chunk by document count, not only by bytes, gives finer control over the
   size/granularity tradeoff for repetitive columns.
   
   ## Backward compatibility
   - The default `-1` reproduces the exact current behavior.
   - The on-disk format and writer version (`6`) are unchanged; the target chunk size
     remains self-describing in the file header, so existing and new indexes stay
     mutually readable.
   
   ## Testing
   - New `VarByteChunkV6Test#testTargetDocsPerChunkCapsChunk` asserts each capped chunk
     holds exactly `targetDocsPerChunk` docs and that values round-trip.
   - All inherited V4/V5/V6 read/write tests pass (the `-1` default path).
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

