Re: [PR] Document the `ParquetRecordBatchStream` buffering [arrow-rs]

via GitHub Tue, 07 Jan 2025 01:50:54 -0800


tustvold commented on code in PR #6947:
URL: https://github.com/apache/arrow-rs/pull/6947#discussion_r1905188237



##########
parquet/src/arrow/async_reader/mod.rs:
##########
@@ -611,11 +611,22 @@ impl<T> std::fmt::Debug for StreamState<T> {
     }
 }
 
-/// An asynchronous [`Stream`](https://docs.rs/futures/latest/futures/stream/trait.Stream.html) of [`RecordBatch`]
-/// for a parquet file that can be constructed using [`ParquetRecordBatchStreamBuilder`].
+/// An asynchronous [`Stream`]of [`RecordBatch`] constructed using [`ParquetRecordBatchStreamBuilder`] to read parquet files.
 ///
 /// `ParquetRecordBatchStream` also provides [`ParquetRecordBatchStream::next_row_group`] for fetching row groups,
 /// allowing users to decode record batches separately from I/O.
+///
+/// # I/O Buffering
+///
+/// `ParquetRecordBatchStream` buffers *all* data pages selected after predicates
+/// (projection + filtering, etc) and decodes the rows from those buffered pages.
+///
+/// For example, if all rows and columns are selected, the entire row group is
+/// buffered in memory during decode. This minimized the number of IO operations
+/// required.

Review Comment:
   ```suggestion
   /// buffered in memory during decode. This minimizes the number of IO operations
   /// required, which is especially important for object stores, where IO operations
   /// have latencies in the hundreds of milliseconds
   ```
   
   I think it is important to provide the why as well.



