huaxingao commented on a change in pull request #29415: URL: https://github.com/apache/spark/pull/29415#discussion_r469731345
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ########## @@ -143,7 +140,6 @@ case class RowDataSourceScanExec( // Don't care about `rdd` and `tableIdentifier` when canonicalizing. override def doCanonicalize(): SparkPlan = copy( - fullOutput.map(QueryPlan.normalizeExpressions(_, fullOutput)), Review comment: Sorry I didn't know that we need to use the normalized exprId in canonicalized plan. If we do, then probably we can't remove fullOutput from RowDataSourceScanExec, because using the normalized pruned output would cause problem. For example, in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68, normalized the pruned output will give `none#0,none#1` for both df1 and df2, and then both of them have exactly the same plan ``` *(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0]) +- Exchange hashpartitioning(none#0, 5), true, [id=#110] +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int> ``` Then in df1.union(df2), it takes `ReusedExchange` code path since both plans are equal ``` == Physical Plan == Union :- *(2) HashAggregate(keys=[a#0], functions=[min(b#1)], output=[a#0, min(b)#12]) : +- Exchange hashpartitioning(a#0, 5), true, [id=#34] : +- *(1) HashAggregate(keys=[a#0], functions=[partial_min(b#1)], output=[a#0, min#28]) : +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int> +- *(4) HashAggregate(keys=[a#0], functions=[min(c#2)], output=[a#0, min(c)#24]) +- ReusedExchange [a#0, min#30], Exchange hashpartitioning(a#0, 5), true, [id=#34] ``` The union result will be ``` +---+------+ | a|min(b)| +---+------+ | 1| 2| | 1| 2| +---+------+ ``` instead of ``` +---+------+ | a|min(b)| +---+------+ | 1| 2| | 1| 3| +---+------+ ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org