This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new f18dadd709 Document the `ParquetRecordBatchStream` buffering (#6947)
f18dadd709 is described below
commit f18dadd7093cbed66ee42738d6564950168d3fe3
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Jan 8 09:02:23 2025 -0500
Document the `ParquetRecordBatchStream` buffering (#6947)
* Document the ParquetRecordBatchStream buffering
* Update parquet/src/arrow/async_reader/mod.rs
Co-authored-by: Raphael Taylor-Davies <[email protected]>
---------
Co-authored-by: Raphael Taylor-Davies <[email protected]>
---
parquet/src/arrow/async_reader/mod.rs | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/parquet/src/arrow/async_reader/mod.rs b/parquet/src/arrow/async_reader/mod.rs
index 4f3befe426..5323251b07 100644
--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -611,11 +611,23 @@ impl<T> std::fmt::Debug for StreamState<T> {
}
}
-/// An asynchronous [`Stream`](https://docs.rs/futures/latest/futures/stream/trait.Stream.html) of [`RecordBatch`]
-/// for a parquet file that can be constructed using [`ParquetRecordBatchStreamBuilder`].
+/// An asynchronous [`Stream`] of [`RecordBatch`] constructed using [`ParquetRecordBatchStreamBuilder`] to read parquet files.
///
/// `ParquetRecordBatchStream` also provides [`ParquetRecordBatchStream::next_row_group`] for fetching row groups,
/// allowing users to decode record batches separately from I/O.
+///
+/// # I/O Buffering
+///
+/// `ParquetRecordBatchStream` buffers *all* data pages selected after predicates
+/// (projection + filtering, etc) and decodes the rows from those buffered pages.
+///
+/// For example, if all rows and columns are selected, the entire row group is
+/// buffered in memory during decode. This minimizes the number of IO operations
+/// required, which is especially important for object stores, where IO operations
+/// have latencies in the hundreds of milliseconds.
+///
+///
+/// [`Stream`]: https://docs.rs/futures/latest/futures/stream/trait.Stream.html
pub struct ParquetRecordBatchStream<T> {
metadata: Arc<ParquetMetaData>,
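For context, the buffering behavior documented in this commit is visible in the typical use of the API. The sketch below, assuming the `parquet`, `arrow`, `tokio`, and `futures` crates as dependencies (and a hypothetical `data.parquet` file), shows reading with `ParquetRecordBatchStreamBuilder`; as the stream is polled, the selected pages of each row group are fetched and buffered before rows are decoded:

```rust
use futures::stream::StreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an async file; `tokio::fs::File` implements the
    // `AsyncFileReader` bound required by the builder.
    let file = File::open("data.parquet").await?;

    // The builder reads the footer and metadata up front.
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_batch_size(8192)
        .build()?;

    // Each row group's selected data pages are buffered in memory,
    // then decoded into `RecordBatch`es as the stream is polled.
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```

With a projection or row filter applied via the builder, only the pages needed to satisfy the selection are fetched and buffered, which is the trade-off the new documentation describes: fewer, larger IO operations at the cost of holding the selected pages in memory during decode.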