vigneshsiva11 opened a new pull request, #9369: URL: https://github.com/apache/arrow-rs/pull/9369
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

The Parquet → Arrow reader currently attempts to decode up to `batch_size` rows into a single `BinaryArray` / `StringArray`. Because these array types use `i32` offsets, the total variable-length data in one array cannot exceed `i32::MAX` bytes; when it does, the offsets overflow and the reader panics. The reader should instead emit smaller batches when necessary, without requiring schema changes or reduced batch sizes from users.

# What changes are included in this PR?

This PR updates the Parquet `RecordBatch` reader to stop decoding early when binary or string offsets would overflow, emit a partial `RecordBatch`, and continue reading the remaining rows in subsequent batches (see the sketch below). Existing behavior is unchanged in normal cases.

# Are these changes tested?

Yes. The behavior is covered by previously added regression tests that reproduce the overflow scenario. All Parquet and Arrow reader tests pass.

# Are there any user-facing changes?

No API changes. In rare cases involving very large binary/string values, the reader may return smaller `RecordBatch`es than the requested `batch_size` to avoid overflow.
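For illustration, here is a minimal sketch of the early-stop idea. The function name and shape are hypothetical and do not reflect the actual arrow-rs internals; it only demonstrates the core check: count how many rows fit before the running byte total would exceed `i32::MAX`, emit that many rows as a partial batch, and resume with the rest.

```rust
/// Hypothetical helper (not the real arrow-rs implementation): given the
/// variable-length values for a batch, return how many rows can be decoded
/// into a single i32-offset array before the offsets would overflow.
fn rows_before_offset_overflow(values: &[Vec<u8>]) -> usize {
    let mut total: usize = 0;
    for (i, v) in values.iter().enumerate() {
        // BinaryArray / StringArray store i32 offsets, so the cumulative
        // byte length of all values must stay within i32::MAX.
        match total.checked_add(v.len()) {
            Some(next) if next <= i32::MAX as usize => total = next,
            // Stop here: emit a partial batch of `i` rows and let the
            // reader pick up the remaining rows in the next batch.
            _ => return i,
        }
    }
    values.len()
}

fn main() {
    // Small values fit in one batch; values whose combined length exceeds
    // i32::MAX bytes would be split across batches instead of panicking.
    let small = vec![b"abc".to_vec(), b"defg".to_vec()];
    assert_eq!(rows_before_offset_overflow(&small), 2);
}
```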
