westonpace commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1029888840


##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -178,6 +178,13 @@ struct ARROW_DS_EXPORT ScanV2Options : public compute::ExecNodeOptions {
   ///
   /// A single guarantee-aware filtering operation should generally be applied to all
   /// resulting batches. The scan node is not responsible for this.
+  ///
+  /// Fields that are referenced by the filter should be included in the `columns` vector.
+  /// The scan node will not automatically fetch fields referenced by the filter
+  /// expression. \see AddFieldsNeededForFilter
+  ///
+  /// If the filter references fields that are not included in `columns` this may or may
+  /// not be an error, depending on the format.

Review Comment:
   Yes, formats like CSV and IPC, which simply ignore the filter, will not raise an error. Even with Parquet as an example, one could filter on a column's statistics without ever actually loading the column. That's generally not what is desired, since that pushed-down filter is best-effort and the column would still be needed to apply the filter in memory afterwards.
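   For illustration, a rough sketch of how a caller might make sure the filter's fields end up in `columns`. It assumes the API shown in the diff above (a `columns` vector of `FieldPath`, a `filter` expression, and the `AddFieldsNeededForFilter` helper the docstring points to); the helper's exact signature and the field indices used here are assumptions, not the final API.

   ```cpp
   // Sketch only: assumes ScanV2Options has a `columns` vector of FieldPath,
   // a `filter` Expression, and a static AddFieldsNeededForFilter helper whose
   // exact signature (taking the options by pointer, returning Status) is assumed.
   #include <arrow/compute/expression.h>
   #include <arrow/dataset/dataset.h>
   #include <arrow/dataset/scanner.h>
   #include <arrow/status.h>

   namespace cp = arrow::compute;
   namespace ds = arrow::dataset;

   arrow::Status ConfigureScan(std::shared_ptr<ds::Dataset> dataset,
                               ds::ScanV2Options* out) {
     ds::ScanV2Options options(std::move(dataset));
     // Project only field 1 ("y" in this hypothetical schema)...
     options.columns = {arrow::FieldPath({1})};
     // ...but filter on field "x".  Per the doc comment, the scan node will not
     // fetch "x" on its own just because the filter references it.
     options.filter = cp::greater(cp::field_ref("x"), cp::literal(0));
     // The helper referenced by \see: adds any fields the filter needs (here "x")
     // to options.columns so the filter can actually be applied downstream.
     ARROW_RETURN_NOT_OK(ds::ScanV2Options::AddFieldsNeededForFilter(&options));
     *out = std::move(options);
     return arrow::Status::OK();
   }
   ```

   After the call one would expect `options.columns` to cover both the projected field and "x", so a guarantee-aware filter node downstream has the data it needs regardless of whether the format errored, ignored the filter, or applied it best-effort.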
########## cpp/src/arrow/dataset/scanner.h: ########## @@ -178,6 +178,13 @@ struct ARROW_DS_EXPORT ScanV2Options : public compute::ExecNodeOptions { /// /// A single guarantee-aware filtering operation should generally be applied to all /// resulting batches. The scan node is not responsible for this. + /// + /// Fields that are referenced by the filter should be included in the `columns` vector. + /// The scan node will not automatically fetch fields referenced by the filter + /// expression. \see AddFieldsNeededForFilter + /// + /// If the filter references fields that are not included in `columns` this may or may + /// not be an error, depending on the format. Review Comment: Yes, formats like CSV and IPC which just ignore the filter will not error. Even using parquet as an example, one could filter on statistics for a column, without ever actually loading the column. That's generally not what is desired, since that filter is best-effort, and the column would then be needed to fulfill the filter in-memory. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org