yyanyy commented on code in PR #55252:
URL: https://github.com/apache/spark/pull/55252#discussion_r3054447715
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala:
##########
@@ -197,7 +206,8 @@ case class DataSourceV2ScanRelation(
),
ordering = ordering.map(
_.map(o => o.copy(child = QueryPlan.normalizeExpressions(o.child,
output)))
- )
+ ),
+ pushedFilters = pushedFilters.map(QueryPlan.normalizeExpressions(_,
output))
Review Comment:
I debated with myself a bit on this and decided to not include pushed
filters whose references are no longer in the pruned output.
The reason for me to include them before, is that I was a bit worried that a
fully pushed filter can be lost from `pushedFilter` when the filtered column is
not projected:
> `SELECT j FROM t WHERE i > 3` with a connector that pushes both
GreaterThan and IsNotNull:
1. After analysis: `Project([j], Filter(i > 3, DataSourceV2Relation([i,
j])))`
2. InferFiltersFromConstraints (runs before V2ScanRelationPushDown): derives
`IsNotNull(i) → Project([j], Filter(i > 3 AND IsNotNull(i),
DataSourceV2Relation([i, j])))`
3. pushDownFilters: both `i > 3` and `IsNotNull(i)` fully pushed → Filter
node removed → `Project([j], ScanBuilderHolder), pushedFilterExpressions = [i >
3]`
4. pruneColumns: only j is referenced → `output = [j]`, column `i` pruned →
pushed filter `i > 3` references pruned `i` → dropped
And I was a bit worried since part of the intention for this field was to
keep what guarantee the connector can make, and now we couldn't have the
complete information.
But for now, I think this is acceptable since 1/ pushedFilters is
informational only; a potential future improvement could be to keep fully
pushed filters as post-scan Filter nodes, which would prevent the column from
being pruned in the first place, and 2/ it appears that spark plan validator
will also catch and reject this case when it realizes that some references in
expressions is not resolvable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]