westonpace commented on code in PR #13799:
URL: https://github.com/apache/arrow/pull/13799#discussion_r950598812
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
/// This option provides a control limiting the memory owned by any RecordBatch.
Status BatchSize(int64_t batch_size);
+ /// \brief Set the number of batches to read ahead within a fragment.
+ ///
+ /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+ /// might not work for all formats.
+ /// \returns An error if this number is less than 0.
+ ///
+ /// This option provides a control on RAM vs I/O tradeoff.
+ /// It might not be support by all file formats, in which case it will
+ /// simply be ignored.
Review Comment:
```suggestion
/// \param[in] batch_readahead How many batches to read ahead within a fragment
/// \returns an error if this number is less than 0.
///
/// This option provides a control on the RAM vs I/O tradeoff.
/// It might not be supported by all file formats, in which case it will
/// simply be ignored.
```
##########
python/pyarrow/_dataset.pyx:
##########
@@ -2328,6 +2341,13 @@ cdef class Scanner(_Weakrefable):
The maximum row count for scanned record batches. If scanned
record batches are overflowing memory then this method can be
called to reduce their size.
+ batch_readahead : int, default 16
+ The number of batches to read ahead in a file. This might not work
+ for all file formats like CSV. Increasing this number will increase
+ RAM usage but also improve IO utilization.
Review Comment:
```suggestion
The number of batches to read ahead in a file. This might not work
for all file formats. Increasing this number will increase
RAM usage but could also improve IO utilization.
```
We probably should tie batch readahead into CSV at some point. I think it's
fine to be vague for now.
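
For context, here is a minimal sketch of what "reading ahead" a fixed number of batches means in general. It is a pure-Python illustration and not Arrow's implementation. The name `read_with_readahead` and its parameters are hypothetical. A producer thread fills a bounded buffer, so a larger `readahead` keeps more batches in RAM while leaving the I/O side idle less often.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def read_with_readahead(source, readahead):
    """Return an iterator over `source` that buffers up to `readahead` items.

    Hypothetical helper for illustration only; it mirrors the documented
    rule that a negative readahead is an error.
    """
    if readahead < 0:
        raise ValueError("readahead must not be negative")
    # queue.Queue(maxsize=0) would be unbounded, so keep at least one slot.
    buffer = queue.Queue(maxsize=max(readahead, 1))

    def producer():
        try:
            for item in source:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always unblock the consumer

    def consume():
        threading.Thread(target=producer, daemon=True).start()
        while True:
            item = buffer.get()
            if item is _DONE:
                return
            yield item

    return consume()
```

The order of items is preserved. Only the amount of buffering changes with `readahead`, which is the RAM versus I/O tradeoff the docstring describes.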
##########
python/pyarrow/_dataset.pyx:
##########
@@ -2406,6 +2428,10 @@ cdef class Scanner(_Weakrefable):
The maximum row count for scanned record batches. If scanned
record batches are overflowing memory then this method can be
called to reduce their size.
+ batch_readahead : int, default 16
+ The number of batches to read ahead in a file. This might not work
+ for all file formats like CSV. Increasing this number will increase
+ RAM usage but also improve IO utilization.
Review Comment:
```suggestion
The number of batches to read ahead in a file. This might not work
for all file formats. Increasing this number will increase
RAM usage but could also improve IO utilization.
```
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
/// This option provides a control limiting the memory owned by any RecordBatch.
Status BatchSize(int64_t batch_size);
+ /// \brief Set the number of batches to read ahead within a fragment.
+ ///
+ /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+ /// might not work for all formats.
+ /// \returns An error if this number is less than 0.
+ ///
+ /// This option provides a control on RAM vs I/O tradeoff.
+ /// It might not be support by all file formats, in which case it will
+ /// simply be ignored.
+ Status BatchReadahead(int32_t batch_readahead);
+
+ /// \brief Set the number of fragments to read ahead
+ ///
+ /// \param[in] fragment_readahead How many fragments to read ahead
+ /// \returns An error if this number is less than 0.
+ ///
+ /// This option provides a control on RAM vs IO tradeoff.
Review Comment:
```suggestion
/// This option provides a control on the RAM vs IO tradeoff.
```
##########
python/pyarrow/_dataset.pyx:
##########
@@ -2254,6 +2259,12 @@ cdef class Scanner(_Weakrefable):
The maximum row count for scanned record batches. If scanned
record batches are overflowing memory then this method can be
called to reduce their size.
+ batch_readahead : int, default 16
+ The number of batches to read ahead in a file. Increasing this number
+ will increase RAM usage but also improve IO utilization.
+ fragment_readahead : int, default 4
+ The number of files to read ahead. Increasing this number will increase
+ RAM usage but also improve IO utilization.
Review Comment:
```suggestion
The number of batches to read ahead in a file. Increasing this number
will increase RAM usage but could also improve IO utilization.
fragment_readahead : int, default 4
The number of files to read ahead. Increasing this number will increase
RAM usage but could also improve IO utilization.
```
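
As a rough illustration of why both options affect RAM, the following back-of-the-envelope sketch estimates how many rows the readahead buffers could hold at once. It rests on a simplifying assumption that is not taken from the PR: every fragment being read ahead can hold up to `batch_readahead` full batches at the same time. The function name `max_buffered_rows` is hypothetical.

```python
def max_buffered_rows(batch_size, batch_readahead, fragment_readahead):
    """Estimate an upper bound on rows held in readahead buffers.

    Assumption (not Arrow's documented behavior): each read-ahead fragment
    buffers at most `batch_readahead` batches of `batch_size` rows.
    """
    for name, value in (
        ("batch_size", batch_size),
        ("batch_readahead", batch_readahead),
        ("fragment_readahead", fragment_readahead),
    ):
        if value < 0:
            raise ValueError(f"{name} must not be negative")
    return batch_size * batch_readahead * fragment_readahead


# With the defaults quoted in the docstring (16 batches, 4 fragments)
# and an example batch size of 1000 rows:
print(max_buffered_rows(1000, 16, 4))  # 64000
```

Under this model, doubling either readahead value doubles the estimate, while the I/O benefit depends on the file format and storage.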
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
/// This option provides a control limiting the memory owned by any RecordBatch.
Status BatchSize(int64_t batch_size);
+ /// \brief Set the number of batches to read ahead within a fragment.
+ ///
+ /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+ /// might not work for all formats.
+ /// \returns An error if this number is less than 0.
+ ///
+ /// This option provides a control on RAM vs I/O tradeoff.
+ /// It might not be support by all file formats, in which case it will
+ /// simply be ignored.
+ Status BatchReadahead(int32_t batch_readahead);
+
+ /// \brief Set the number of fragments to read ahead
+ ///
+ /// \param[in] fragment_readahead How many fragments to read ahead
+ /// \returns An error if this number is less than 0.
Review Comment:
```suggestion
/// \returns an error if this number is less than 0.
```
##########
python/pyarrow/_dataset.pyx:
##########
@@ -2328,6 +2341,13 @@ cdef class Scanner(_Weakrefable):
The maximum row count for scanned record batches. If scanned
record batches are overflowing memory then this method can be
called to reduce their size.
+ batch_readahead : int, default 16
+ The number of batches to read ahead in a file. This might not work
+ for all file formats like CSV. Increasing this number will increase
+ RAM usage but also improve IO utilization.
+ fragment_readahead : int, default 4
+ The number of files to read ahead. Increasing this number will increase
+ RAM usage but also improve IO utilization.
Review Comment:
```suggestion
RAM usage but could also improve IO utilization.
```