tustvold commented on code in PR #6947:
URL: https://github.com/apache/arrow-rs/pull/6947#discussion_r1905188237
##########
parquet/src/arrow/async_reader/mod.rs:
##########
@@ -611,11 +611,22 @@ impl<T> std::fmt::Debug for StreamState<T> {
}
}
-/// An asynchronous [`Stream`](https://docs.rs/futures/latest/futures/stream/trait.Stream.html) of [`RecordBatch`]
-/// for a parquet file that can be constructed using [`ParquetRecordBatchStreamBuilder`].
+/// An asynchronous [`Stream`]of [`RecordBatch`] constructed using [`ParquetRecordBatchStreamBuilder`] to read parquet files.
///
/// `ParquetRecordBatchStream` also provides [`ParquetRecordBatchStream::next_row_group`] for fetching row groups,
/// allowing users to decode record batches separately from I/O.
+///
+/// # I/O Buffering
+///
+/// `ParquetRecordBatchStream` buffers *all* data pages selected after predicates
+/// (projection + filtering, etc) and decodes the rows from those buffered pages.
+///
+/// For example, if all rows and columns are selected, the entire row group is
+/// buffered in memory during decode. This minimized the number of IO operations
+/// required.
Review Comment:
```suggestion
/// buffered in memory during decode. This minimizes the number of IO operations
/// required, which is especially important for object stores, where IO operations
/// have latencies in the hundreds of milliseconds
```
I think it is important to provide the why as well.
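
To make the "why" concrete, here is a back-of-envelope sketch of how per-request latency adds up. It is not taken from the reader's implementation: `latency_overhead` is a hypothetical helper, the 200 ms latency and 500 pages are assumed figures for illustration only, and the "one fetch" case is an idealized best case rather than a claim about how many requests the reader actually issues.

```rust
use std::time::Duration;

/// Estimated time spent waiting on request latency alone when `requests`
/// sequential fetches each pay `per_request` before any bytes arrive.
/// Transfer time is ignored because it is the same for both strategies.
/// (Hypothetical helper, not part of the parquet crate.)
fn latency_overhead(requests: u32, per_request: Duration) -> Duration {
    per_request * requests
}

fn main() {
    // Assumed figures, chosen only to illustrate the scale of the effect.
    let per_request = Duration::from_millis(200);
    let pages_in_row_group = 500;

    // Fetching each data page on demand pays the latency once per page.
    let page_at_a_time = latency_overhead(pages_in_row_group, per_request);
    // Buffering the selected pages up front pays it once (idealized).
    let buffered_up_front = latency_overhead(1, per_request);

    println!("one request per page: {:?}", page_at_a_time);
    println!("one buffered fetch:   {:?}", buffered_up_front);
}
```

Under these assumptions the latency cost scales linearly with the number of requests, so trading memory for fewer IO operations is the sensible default against an object store.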