westonpace commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1029888840


##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -178,6 +178,13 @@ struct ARROW_DS_EXPORT ScanV2Options : public compute::ExecNodeOptions {
   ///
   /// A single guarantee-aware filtering operation should generally be applied to all
   /// resulting batches.  The scan node is not responsible for this.
+  ///
+  /// Fields that are referenced by the filter should be included in the `columns` vector.
+  /// The scan node will not automatically fetch fields referenced by the filter
+  /// expression. \see AddFieldsNeededForFilter
+  ///
+  /// If the filter references fields that are not included in `columns` this may or may
+  /// not be an error, depending on the format.

Review Comment:
   Yes. Formats like CSV and IPC, which simply ignore the filter, will not error. Even with Parquet, one could filter on a column's statistics without ever actually loading the column. That is generally not what is desired: the statistics-based filter is best-effort, so the column is still needed to apply the filter in memory afterwards.
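
   To illustrate why the filter's fields belong in `columns`, here is a minimal standalone sketch of the idea behind a helper like `AddFieldsNeededForFilter`. It does not use the Arrow API: `ToyScanOptions` and `AddMissingFilterFields` are hypothetical names, and fields are modeled as plain strings rather than field references.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a scan configuration; NOT the real ScanV2Options.
struct ToyScanOptions {
  std::vector<std::string> columns;        // fields the scan will load
  std::vector<std::string> filter_fields;  // fields the filter references
};

// Append every filter-referenced field that is missing from `columns`,
// keeping the existing column order intact.
void AddMissingFilterFields(ToyScanOptions* options) {
  for (const std::string& field : options->filter_fields) {
    auto it = std::find(options->columns.begin(), options->columns.end(), field);
    if (it == options->columns.end()) {
      options->columns.push_back(field);
    }
  }
}
```

   With `columns = {"a", "b"}` and a filter referencing `b` and `c`, calling the helper leaves `columns` as `a, b, c`, so a later in-memory filter would have `c` available.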


