[GitHub] [arrow] westonpace commented on a diff in pull request #13799: ARROW-17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters

GitBox Fri, 19 Aug 2022 16:01:42 -0700


westonpace commented on code in PR #13799:
URL: https://github.com/apache/arrow/pull/13799#discussion_r950598812



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read ahead within a fragment.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+  ///  might not work for all formats.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs I/O tradeoff.
+  /// It might not be support by all file formats, in which case it will
+  /// simply be ignored.

Review Comment:
   ```suggestion
     /// \param[in] batch_readahead How many batches to read ahead within a fragment
     /// \returns an error if this number is less than 0.
     ///
     /// This option provides a control on the RAM vs I/O tradeoff.
     /// It might not be supported by all file formats, in which case it will
     /// simply be ignored.
   ```



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2328,6 +2341,13 @@ cdef class Scanner(_Weakrefable):
             The maximum row count for scanned record batches. If scanned
             record batches are overflowing memory then this method can be
             called to reduce their size.
+        batch_readahead : int, default 16
+            The number of batches to read ahead in a file. This might not work
+            for all file formats like CSV. Increasing this number will increase
+            RAM usage but also improve IO utilization.

Review Comment:
   ```suggestion
               The number of batches to read ahead in a file. This might not work
               for all file formats. Increasing this number will increase
               RAM usage but could also improve IO utilization.
   ```
   
   We probably should tie batch readahead into CSV at some point. I think it's fine to be vague for now.
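
To illustrate the tradeoff the docstring describes, here is a minimal sketch in plain Python. It is not Arrow's scanner implementation; `read_with_readahead` is a hypothetical helper, and the bounded queue only approximates what a readahead limit does: a larger value lets the reader run further ahead of the consumer, at the cost of holding more batches in memory. Error propagation from the reader thread is omitted for brevity.

```python
import queue
import threading


def read_with_readahead(batches, readahead):
    """Yield items from `batches` while a reader thread runs ahead of the consumer.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if readahead < 0:
        raise ValueError("readahead must be >= 0")
    # Queue(maxsize=0) would be unbounded, so reserve at least one slot.
    buffered = queue.Queue(maxsize=max(readahead, 1))
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffered.put(batch)  # blocks while the buffer is full
        buffered.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffered.get()
        if item is done:
            return
        yield item
```

The order of the batches is preserved, and because the function is a generator, the negative-value check fires on the first iteration rather than at call time.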



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2406,6 +2428,10 @@ cdef class Scanner(_Weakrefable):
             The maximum row count for scanned record batches. If scanned
             record batches are overflowing memory then this method can be
             called to reduce their size.
+        batch_readahead : int, default 16
+            The number of batches to read ahead in a file. This might not work
+            for all file formats like CSV. Increasing this number will increase
+            RAM usage but also improve IO utilization.

Review Comment:
   ```suggestion
               The number of batches to read ahead in a file. This might not work
               for all file formats. Increasing this number will increase
               RAM usage but could also improve IO utilization.
   ```



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read ahead within a fragment.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+  ///  might not work for all formats.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs I/O tradeoff.
+  /// It might not be support by all file formats, in which case it will
+  /// simply be ignored.
+  Status BatchReadahead(int32_t batch_readahead);
+
+  /// \brief Set the number of fragments to read ahead
+  ///
+  /// \param[in] fragment_readahead How many fragments to read ahead
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs IO tradeoff.

Review Comment:
   ```suggestion
     /// This option provides a control on the RAM vs IO tradeoff.
   ```



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2254,6 +2259,12 @@ cdef class Scanner(_Weakrefable):
         The maximum row count for scanned record batches. If scanned
         record batches are overflowing memory then this method can be
         called to reduce their size.
+    batch_readahead : int, default 16
+        The number of batches to read ahead in a file. Increasing this number 
+        will increase RAM usage but also improve IO utilization.
+    fragment_readahead : int, default 4
+        The number of files to read ahead. Increasing this number will increase
+        RAM usage but also improve IO utilization.

Review Comment:
   ```suggestion
           The number of batches to read ahead in a file. Increasing this number
           will increase RAM usage but could also improve IO utilization.
       fragment_readahead : int, default 4
           The number of files to read ahead. Increasing this number will increase
           RAM usage but could also improve IO utilization.
   ```



##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -384,6 +381,25 @@ class ARROW_DS_EXPORT ScannerBuilder {
   /// This option provides a control limiting the memory owned by any RecordBatch.
   Status BatchSize(int64_t batch_size);
 
+  /// \brief Set the number of batches to read ahead within a fragment.
+  ///
+  /// \param[in] batch_readahead How many batches to read ahead within a fragment,
+  ///  might not work for all formats.
+  /// \returns An error if this number is less than 0.
+  ///
+  /// This option provides a control on RAM vs I/O tradeoff.
+  /// It might not be support by all file formats, in which case it will
+  /// simply be ignored.
+  Status BatchReadahead(int32_t batch_readahead);
+
+  /// \brief Set the number of fragments to read ahead
+  ///
+  /// \param[in] fragment_readahead How many fragments to read ahead
+  /// \returns An error if this number is less than 0.

Review Comment:
   ```suggestion
     /// \returns an error if this number is less than 0.
   ```



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2328,6 +2341,13 @@ cdef class Scanner(_Weakrefable):
             The maximum row count for scanned record batches. If scanned
             record batches are overflowing memory then this method can be
             called to reduce their size.
+        batch_readahead : int, default 16
+            The number of batches to read ahead in a file. This might not work
+            for all file formats like CSV. Increasing this number will increase
+            RAM usage but also improve IO utilization.
+        fragment_readahead : int, default 4
+            The number of files to read ahead. Increasing this number will increase
+            RAM usage but also improve IO utilization.

Review Comment:
   ```suggestion
               RAM usage but could also improve IO utilization.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.