[GitHub] [spark] huaxingao commented on a change in pull request #29415: [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec

GitBox Wed, 12 Aug 2020 23:46:03 -0700


huaxingao commented on a change in pull request #29415:
URL: https://github.com/apache/spark/pull/29415#discussion_r469731345




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
##########
@@ -143,7 +140,6 @@ case class RowDataSourceScanExec(
   // Don't care about `rdd` and `tableIdentifier` when canonicalizing.
   override def doCanonicalize(): SparkPlan =
     copy(
-      fullOutput.map(QueryPlan.normalizeExpressions(_, fullOutput)),

Review comment:
       Sorry I didn't know that we need to use the normalized exprId in 
canonicalized plan. If we do, then probably we can't remove fullOutput from 
RowDataSourceScanExec, because using the normalized pruned output would cause 
problem. For example, in 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68,
 normalized the pruned output will give `none#0,none#1` for both df1 and df2, 
and then both of them have exactly the same plan
   
   ```
   *(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, 
#0])
   +- Exchange hashpartitioning(none#0, 5), true, [id=#110]
      +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], 
output=[none#0, none#4])
         +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] 
[none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int>
   ```
   Then in df1.union(df2), it takes `ReusedExchange` code path since both plans 
are equal
   ```
   == Physical Plan ==
   Union
   :- *(2) HashAggregate(keys=[a#0], functions=[min(b#1)], output=[a#0, 
min(b)#12])
   :  +- Exchange hashpartitioning(a#0, 5), true, [id=#34]
   :     +- *(1) HashAggregate(keys=[a#0], functions=[partial_min(b#1)], 
output=[a#0, min#28])
   :        +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] 
[A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int>
   +- *(4) HashAggregate(keys=[a#0], functions=[min(c#2)], output=[a#0, 
min(c)#24])
      +- ReusedExchange [a#0, min#30], Exchange hashpartitioning(a#0, 5), true, 
[id=#34]
   ```
   The union result will be 
   ```
   +---+------+
   |  a|min(b)|
   +---+------+
   |  1|     2|
   |  1|     2|
   +---+------+
   ```
   instead of 
   ```
   +---+------+
   |  a|min(b)|
   +---+------+
   |  1|     2|
   |  1|     3|
   +---+------+
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #29415: [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec

Reply via email to