Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71585784

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---

```
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
```

--- End diff --

me too. : (

I am also learning this part while refactoring the Hive table scan; that is how I found the behavioral inconsistency between the different types of table scans. Let me try to summarize it in a more organized way.

Currently, when converting logical plans to physical plans, we have two different strategies for the table scan (plus the adjacent `Filter` and `Project`, if any):

- Hive Table Scan and In-memory Table Scan
- Data Source Table Scan

The basic functionality is the same in both. The inputs are the filter predicates, the output project list, and a leaf relation (`LogicalRelation`, `MetastoreRelation`, or `InMemoryRelation`). These inputs are produced by the `PhysicalOperation` pattern, which extracts and normalizes Scans, Projects, and Filters. The target is to generate a physical plan like:

```
ProjectExec
+- FilterExec
   +- TableScan
```

`ProjectExec` is optional. When the filter predicates reference attributes that are not in the output project list, or the output project list contains aliases, we have to add `ProjectExec`; otherwise, it is ok to exclude it.
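To make the decision concrete, here is a toy sketch (not Spark's actual code; `ProjectEntry`, `needsProject`, and the string-based column model are all hypothetical simplifications) of when a `ProjectExec` must sit on top of the scan:

```scala
// Hypothetical sketch: models the "do we need ProjectExec?" decision.
// Real Spark works on Catalyst expressions; here a column is just a name.
object ProjectExecDecision {
  // One entry of the output project list: a column plus an optional alias.
  case class ProjectEntry(column: String, alias: Option[String] = None)

  // ProjectExec can be skipped only if the projection is a plain list of
  // column references (no aliases) and the filter touches no extra columns.
  def needsProject(projects: Seq[ProjectEntry], filterColumns: Set[String]): Boolean = {
    val projectSet = projects.map(_.column).toSet
    val hasAliases = projects.exists(_.alias.isDefined)
    hasAliases || !filterColumns.subsetOf(projectSet)
  }

  def main(args: Array[String]): Unit = {
    // SELECT a, b ... WHERE b > 1: filter only uses projected columns
    assert(!needsProject(Seq(ProjectEntry("a"), ProjectEntry("b")), Set("b")))
    // SELECT b AS alias_b ...: the alias forces a ProjectExec
    assert(needsProject(Seq(ProjectEntry("b", Some("alias_b"))), Set.empty))
    // SELECT a ... WHERE b > 1: the filter needs b, which is not in the output
    assert(needsProject(Seq(ProjectEntry("a")), Set("b")))
    println("ok")
  }
}
```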
The above logic is implemented by the following condition in both `pruneFilterProject` functions:

```Scala
AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
  filterSet.subsetOf(projectSet)
```

The source code:

- [The original `pruneFilterProject` for Data Source Table Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366)
- [The `pruneFilterProject` for Hive Table Scan and In-memory Table Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89)

The code above shows that the Data Source Table Scan has one extra condition:

```Scala
projectSet.size == projects.size
```

This condition targets a rare case: users selecting duplicate columns without any alias. Thus, to make the two code paths consistent, we are facing two options:

- add this condition to both scan scenarios
- remove this condition from both

Either is fine with me, but I think we need to make them consistent. Let me know if my explanation is clear.
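The reason the extra condition catches only the duplicate-column case is that a set collapses repeated attributes, so the set's size falls below the list's size exactly when a column appears more than once. A minimal sketch (using plain strings for attribute names; `isDuplicateFree` is a hypothetical helper, not Spark's API):

```scala
// Hypothetical illustration of `projectSet.size == projects.size`:
// the set view of the project list drops duplicates, so the sizes
// differ exactly when a column is selected more than once unaliased.
object DuplicateColumnCheck {
  def isDuplicateFree(projects: Seq[String]): Boolean =
    projects.toSet.size == projects.size

  def main(args: Array[String]): Unit = {
    assert(isDuplicateFree(Seq("a", "b")))   // SELECT a, b: 2 == 2
    assert(!isDuplicateFree(Seq("b", "b")))  // SELECT b, b: set has 1 element, list has 2
    println("ok")
  }
}
```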