westonpace commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1029893649
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -243,10 +250,17 @@ struct ARROW_DS_EXPORT ScanV2Options : public
compute::ExecNodeOptions {
/// one fragment at a time.
int32_t fragment_readahead = kDefaultFragmentReadahead;
/// \brief Options specific to the file format
- FragmentScanOptions* format_options;
+ const FragmentScanOptions* format_options = NULLPTR;
/// \brief Utility method to get a selection representing all columns in a
dataset
- static std::vector<FieldPath> AllColumns(const Dataset& dataset);
+ static std::vector<FieldPath> AllColumns(const Schema& dataset_schema);
+
+ /// \brief Utility method to add fields needed for the current filter
+ ///
+ /// This method adds any fields that are needed by `filter` which are not
already
+ /// included in the list of columns. Any new fields added will be added to
the end
+ /// in no particular order.
+ static Status AddFieldsNeededForFilter(ScanV2Options* options);
Review Comment:
There is a (potentially rare) possibility that a format could fully satisfy
the best effort filter. In that case, if the field is only needed for
filtering, we shouldn't load it.
Although, in that case, @jacques-n has convinced me that we might want two
filter fields, `best_effort_filter` which does not have to be satisfied (and
thus should always be a selected column) and `filter` which MUST be satisfied
(and the format should error if it cannot), which doesn't have to be in the
columns list.
Either way, the main reason I'm not automagically including this is because
I am trying to move away from the concept of a scan node automagically adding
more columns than the user asked for. I will be doing something similar in
ARROW-16072 so that the __filename, etc. fields are only included if asked for.
The goal is that, a scan request with X columns will emit batches with X
columns.
> Why isn't this just a regular method?
I could be convinced otherwise. In my head all exec node options objects
are PODs and I never know whether a POD should have methods or not. I think,
however, it would be fine to make an immutable
```
Result<ScanV2Options> AddFIeldsNeededForFilter() const
```
Would that be preferred?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]