This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new f18dadd709 Document the `ParquetRecordBatchStream` buffering (#6947)
f18dadd709 is described below
commit f18dadd7093cbed66ee42738d6564950168d3fe3
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Jan 8 09:02:23 2025 -0500
Document the `ParquetRecordBatchStream` buffering (#6947)
* Document the ParquetRecordBatchStream buffering
* Update parquet/src/arrow/async_reader/mod.rs
Co-authored-by: Raphael Taylor-Davies <[email protected]>
---------
Co-authored-by: Raphael Taylor-Davies <[email protected]>
---
parquet/src/arrow/async_reader/mod.rs | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/parquet/src/arrow/async_reader/mod.rs b/parquet/src/arrow/async_reader/mod.rs
index 4f3befe426..5323251b07 100644
--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -611,11 +611,23 @@ impl<T> std::fmt::Debug for StreamState<T> {
}
}
-/// An asynchronous [`Stream`](https://docs.rs/futures/latest/futures/stream/trait.Stream.html) of [`RecordBatch`]
-/// for a parquet file that can be constructed using [`ParquetRecordBatchStreamBuilder`].
+/// An asynchronous [`Stream`] of [`RecordBatch`] constructed using [`ParquetRecordBatchStreamBuilder`] to read parquet files.
///
/// `ParquetRecordBatchStream` also provides [`ParquetRecordBatchStream::next_row_group`] for fetching row groups,
/// allowing users to decode record batches separately from I/O.
+///
+/// # I/O Buffering
+///
+/// `ParquetRecordBatchStream` buffers *all* data pages selected after predicates
+/// (projection + filtering, etc) and decodes the rows from those buffered pages.
+///
+/// For example, if all rows and columns are selected, the entire row group is
+/// buffered in memory during decode. This minimizes the number of IO operations
+/// required, which is especially important for object stores, where IO operations
+/// have latencies in the hundreds of milliseconds.
+///
+///
+/// [`Stream`]: https://docs.rs/futures/latest/futures/stream/trait.Stream.html
pub struct ParquetRecordBatchStream<T> {
metadata: Arc<ParquetMetaData>,
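For context, the buffering behavior documented in this commit is visible in the typical use of the API. The sketch below, assuming the `parquet`, `arrow`, `tokio`, and `futures` crates as dependencies (and a hypothetical `data.parquet` file), shows reading with `ParquetRecordBatchStreamBuilder`; as the stream is polled, the selected pages of each row group are fetched and buffered before rows are decoded:

```rust
use futures::stream::StreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an async file; `tokio::fs::File` implements the
    // `AsyncFileReader` bound required by the builder.
    let file = File::open("data.parquet").await?;

    // The builder reads the footer and metadata up front.
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_batch_size(8192)
        .build()?;

    // Each row group's selected data pages are buffered in memory,
    // then decoded into `RecordBatch`es as the stream is polled.
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```

With a projection or row filter applied via the builder, only the pages needed to satisfy the selection are fetched and buffered, which is the trade-off the new documentation describes: fewer, larger IO operations at the cost of holding the selected pages in memory during decode.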