[GitHub] [arrow] westonpace commented on a diff in pull request #13799: Arrow 17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters

GitBox Thu, 04 Aug 2022 15:04:35 -0700


westonpace commented on code in PR #13799:
URL: https://github.com/apache/arrow/pull/13799#discussion_r938275153



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,23 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any 
RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read aheadw ithin a file.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a file,
+  ///  might not work for all formats. refer to comments above.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs IO tradeoff.

Review Comment:
   ```suggestion
     /// This option provides a control on RAM vs I/O tradeoff.
   ```
   
   Or maybe something like "higher readahead can improve I/O utilization but 
will consume more RAM"



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,23 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any 
RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read aheadw ithin a file.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a file,
+  ///  might not work for all formats. refer to comments above.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs IO tradeoff.
+  Status BatchReadahead(int32_t batch_readahead);
+
+  /// \brief Set the number of How many files to read ahead
+  ///
+  /// \param[in] fragment_readahead How many files to read ahead
+  /// \returns An error if this number is less than 0.

Review Comment:
   ```suggestion
     /// \param[in] fragment_readahead How many fragments to read ahead
     /// \returns An error if this number is less than 0.
   ```



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,23 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any 
RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read aheadw ithin a file.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a file,
+  ///  might not work for all formats. refer to comments above.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs IO tradeoff.
+  Status BatchReadahead(int32_t batch_readahead);
+
+  /// \brief Set the number of How many files to read ahead

Review Comment:
   ```suggestion
     /// \brief Set the number of fragments to read ahead
   ```



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,23 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any 
RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read aheadw ithin a file.

Review Comment:
   ```suggestion
     /// \brief Set the number of batches to read ahead within a file.
   ```



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,23 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any 
RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read aheadw ithin a file.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a file,
+  ///  might not work for all formats. refer to comments above.

Review Comment:
   ```suggestion
     ///  might not work for all formats.
   ```
   
   These comments are used to generate C++ API docs.  I'm not sure what 
comments you are hoping to refer to but this phrase "refer to comments above" 
will not make sense in the generated docs I think.



##########
cpp/src/arrow/dataset/scanner.cc:
##########
@@ -843,6 +834,24 @@ Status ScannerBuilder::BatchSize(int64_t batch_size) {
   return Status::OK();
 }
 
+Status ScannerBuilder::BatchReadahead(int32_t batch_readahead) {
+  if (batch_readahead < 0) {

Review Comment:
   What does it mean if `batch_readahead == 0`?



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2254,6 +2259,13 @@ cdef class Scanner(_Weakrefable):
         The maximum row count for scanned record batches. If scanned
         record batches are overflowing memory then this method can be
         called to reduce their size.
+    batch_readahead : int, default 16
+        The number of batches to read ahead in a file. This might not work
+        for all file formats like CSV. Increasing this number will increase
+        RAM usage but also improve IO utilization.

Review Comment:
   The readahead does apply on CSV, it just uses blocks instead of batches.



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2435,7 +2462,9 @@ cdef class Scanner(_Weakrefable):
         builder = make_shared[CScannerBuilder](pyarrow_unwrap_schema(schema),
                                                fragment.unwrap(), options)
         _populate_builder(builder, columns=columns, filter=filter,
-                          batch_size=batch_size, use_threads=use_threads,
+                          batch_size=batch_size, 
batch_readahead=batch_readahead,
+                          fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD,

Review Comment:
   ```suggestion
   ```
   I don't think we need to specify this kwarg if we're just going to specify 
the default.



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2508,7 +2537,8 @@ cdef class Scanner(_Weakrefable):
                           FutureWarning)
 
         _populate_builder(builder, columns=columns, filter=filter,
-                          batch_size=batch_size, use_threads=use_threads,
+                          batch_size=batch_size, 
batch_readahead=_DEFAULT_BATCH_READAHEAD,
+                          fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, 
use_threads=use_threads,

Review Comment:
   Same as above, I'm not sure we need to specify these arguments since we're 
using the defaults.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #13799: Arrow 17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters

Reply via email to