westonpace commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1029888840
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -178,6 +178,13 @@ struct ARROW_DS_EXPORT ScanV2Options : public compute::ExecNodeOptions {
 ///
 /// A single guarantee-aware filtering operation should generally be applied to all
 /// resulting batches. The scan node is not responsible for this.
+ ///
+ /// Fields that are referenced by the filter should be included in the `columns` vector.
+ /// The scan node will not automatically fetch fields referenced by the filter
+ /// expression. \see AddFieldsNeededForFilter
+ ///
+ /// If the filter references fields that are not included in `columns` this may or may
+ /// not be an error, depending on the format.
Review Comment:
Yes, formats like CSV and IPC, which simply ignore the filter, will not error.
Even with Parquet, one could filter on a column's statistics without ever
actually loading that column. That's generally not what is desired, though:
the pushdown filter is best-effort, so the column would still be needed to
fully apply the filter in memory.
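
For illustration, a rough caller-side sketch of what the doc comment describes, assuming the `ScanV2Options` fields shown in this PR and the `AddFieldsNeededForFilter` helper it references (exact signatures and the field index used here are assumptions, not a definitive example):

```cpp
#include <memory>

#include "arrow/dataset/api.h"

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

arrow::Status ConfigureScan(std::shared_ptr<ds::Dataset> dataset) {
  ds::ScanV2Options options(dataset);

  // Filter on column "x". CSV/IPC ignore this at scan time; Parquet may only
  // prune row groups via statistics, so the filter is best-effort either way.
  options.filter = cp::greater(cp::field_ref("x"), cp::literal(0));

  // Only "y" (assumed to be field index 1) is projected here. "x" is NOT
  // fetched automatically just because the filter references it...
  options.columns = {arrow::FieldPath({1})};

  // ...so add the fields the filter needs, allowing a downstream node to
  // re-apply the filter in memory (helper assumed from this PR).
  ARROW_RETURN_NOT_OK(ds::ScanV2Options::AddFieldsNeededForFilter(&options));
  return arrow::Status::OK();
}
```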