vigneshsiva11 opened a new pull request, #9369: URL: https://github.com/apache/arrow-rs/pull/9369
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

The Parquet → Arrow reader currently attempts to decode up to `batch_size` rows into a single `BinaryArray` / `StringArray`. Because these array types use `i32` offsets, the total variable-length data in one array cannot exceed `i32::MAX` bytes; when it does, the offsets overflow and the reader panics. The reader should instead emit smaller batches when necessary, without requiring schema changes or reduced batch sizes from users.

# What changes are included in this PR?

This PR updates the Parquet `RecordBatch` reader to stop decoding early when binary or string offsets would overflow, emit a partial `RecordBatch`, and continue reading the remaining rows in subsequent batches (see the sketch below). Existing behavior is unchanged in normal cases.

# Are these changes tested?

Yes. The behavior is covered by previously added regression tests that reproduce the overflow scenario. All Parquet and Arrow reader tests pass.

# Are there any user-facing changes?

No API changes. In rare cases involving very large binary/string values, the reader may return smaller `RecordBatch`es than the requested `batch_size` to avoid overflow.
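For illustration, here is a minimal sketch of the early-stop idea. The function name and shape are hypothetical and do not reflect the actual arrow-rs internals; it only demonstrates the core check: count how many rows fit before the running byte total would exceed `i32::MAX`, emit that many rows as a partial batch, and resume with the rest.

```rust
/// Hypothetical helper (not the real arrow-rs implementation): given the
/// variable-length values for a batch, return how many rows can be decoded
/// into a single i32-offset array before the offsets would overflow.
fn rows_before_offset_overflow(values: &[Vec<u8>]) -> usize {
    let mut total: usize = 0;
    for (i, v) in values.iter().enumerate() {
        // BinaryArray / StringArray store i32 offsets, so the cumulative
        // byte length of all values must stay within i32::MAX.
        match total.checked_add(v.len()) {
            Some(next) if next <= i32::MAX as usize => total = next,
            // Stop here: emit a partial batch of `i` rows and let the
            // reader pick up the remaining rows in the next batch.
            _ => return i,
        }
    }
    values.len()
}

fn main() {
    // Small values fit in one batch; values whose combined length exceeds
    // i32::MAX bytes would be split across batches instead of panicking.
    let small = vec![b"abc".to_vec(), b"defg".to_vec()];
    assert_eq!(rows_before_offset_overflow(&small), 2);
}
```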
