Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71585784
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    me too. : ( I am also learning this part while refactoring the Hive table scan; that is how I found the behavior inconsistency between the different types of table scans. Let me try to summarize it in a more organized way.
    
    Currently, when converting logical plans to physical plans, we have two different strategies for planning a table scan (plus the adjacent `Filter` and `Project`, if any):
    - Hive Table Scan and In-memory Table Scan
    - Data Source Table Scan
    
    Despite being separate code paths, the basic functionality of the two strategies is the same.
    
    The inputs are the filter predicates, the output project list, and a leaf relation (`LogicalRelation`, `MetastoreRelation`, or `InMemoryRelation`). They are produced by the `PhysicalOperation` pattern, which extracts the leaf relation and normalizes the adjacent Projects and Filters above it.
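
    As a toy illustration (the `describeScan` helper is made up for this sketch; it is not how the strategies are actually written), a planner rule consumes that pattern roughly like this:
    ```Scala
    import org.apache.spark.sql.catalyst.planning.PhysicalOperation
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // PhysicalOperation collapses the chain of adjacent Project/Filter nodes
    // above a leaf into (projects, filters, leaf).
    def describeScan(plan: LogicalPlan): String = plan match {
      case PhysicalOperation(projects, filters, leaf) =>
        s"leaf = ${leaf.nodeName}, projects = ${projects.size}, filters = ${filters.size}"
    }
    ```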
    
    The target is to generate a physical plan like:
    ```
    ProjectExec 
    +- FilterExec
       +- TableScan
    ```
    
    `ProjectExec` is optional. When the filter predicates reference attributes that are not part of the output, or the output project list contains aliases, we have to add a `ProjectExec`; otherwise, it is safe to omit it. This logic is implemented by the following condition in both `pruneFilterProject` functions:
    ```Scala
    AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
      filterSet.subsetOf(projectSet)
    ```
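    As a rough, self-contained sketch (simplified; `canSkipProject` and the attributes are made up for illustration, not the planner's actual code), this is how that check behaves for project lists like the ones in the test, using Catalyst's expression DSL:
    ```Scala
    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeSet, NamedExpression}

    val a = 'a.int   // AttributeReference "a"
    val b = 'b.int   // AttributeReference "b"

    // The skip-ProjectExec condition shared by both pruneFilterProject functions.
    def canSkipProject(projects: Seq[NamedExpression], filterSet: AttributeSet): Boolean = {
      val projectSet = AttributeSet(projects.flatMap(_.references))
      AttributeSet(projects.map(_.toAttribute)) == projectSet &&
        filterSet.subsetOf(projectSet)
    }

    // SELECT a, b ... WHERE a > 1: no aliases, the filter only needs `a` -> skip ProjectExec.
    canSkipProject(Seq(a, b), AttributeSet(a :: Nil))              // true

    // SELECT b AS alias_b: the alias creates a new output attribute -> need ProjectExec.
    canSkipProject(Seq(Alias(b, "alias_b")()), AttributeSet.empty) // false

    // SELECT a ... WHERE b > 1: the filter needs `b`, which is not in the output -> need ProjectExec.
    canSkipProject(Seq(a), AttributeSet(b :: Nil))                 // false
    ```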
    The source code:
    - [The original `pruneFilterProject` for Data Source Table Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366)
    - [The `pruneFilterProject` for Hive Table Scan and In-memory Table Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89)
    
    The above code shows that Data Source Table Scan has one extra condition:
    ```Scala
    projectSet.size == projects.size
    ``` 
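
    To see what this extra check does, here is the duplicate-column case from the test, `SELECT b, b`, in the same simplified sketch style (again, not the actual planner code):
    ```Scala
    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.expressions.AttributeSet

    val b = 'b.int                    // AttributeReference "b"
    val projects = Seq(b, b)          // SELECT b, b (duplicate column, no alias)
    val projectSet = AttributeSet(projects.flatMap(_.references))

    // The shared condition is satisfied, because the set dedupes the duplicate...
    AttributeSet(projects.map(_.toAttribute)) == projectSet   // true

    // ...but the data-source-only extra check rejects it, forcing a ProjectExec:
    projectSet.size == projects.size                          // 1 == 2 -> false
    ```
    With the extra check, the scan reads `b` once and a `ProjectExec` on top duplicates it; without it (the Hive and in-memory path), the scan itself is asked to produce `b` twice. That is the inconsistency.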
    
    This condition targets a rare case: users selecting duplicate columns without an alias, like the `SELECT b, b` example above. Thus, to make the two strategies consistent, we are facing two options:
    - add this condition to both scan scenarios
    - remove this condition from both
    
    Either is fine with me, but I think we need to make them consistent. Let me know if my explanation is clear.
    
    


