[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71938022 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- Found something interesting in File Source Scan. Will submit a separate PR for resolving it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/14240 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71839551 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- `FileSourceStrategy` is worse. The order also matters! https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L148 The output of file source scan is always following `readDataColumns ++ partitionColumns`. https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L140 If the order is different from `readDataColumns ++ partitionColumns`, we will add a `ProjectExec`. This does not sound good to me. The next question is why `partitionColumns` is always part of the scan's output. Will try to find the answer tomorrow. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71646312 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- I like your point! Your concern is valid. We should be careful when making an assumption on the external data source implementation! : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71644651 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- so we are assuming the `PrunedScan.buildScan` can handle duplicated columns? I feel it's dangerous to assume that as we didn't document this semantic, the external implementation may break it. For the hive one, it's internal stuff and we are safe to make this assumption. If we do wanna make them consistent, I'd like to change the hive one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71643068 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- After this PR, Data Source Table Scan will return two columns too, just like what Hive Table Scan does. This is also shown in the the [test case](https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala#L117) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71634679 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- after your PR, will data source table return only one column for this case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71634358 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- ```SQL SELECT b, b FROM oneToTenPruned ``` For Hive Table Scan, we exclude `ProjectExec` from the physical plan but [the Hive table scan returns two duplicate columns](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L93). No matter whether `projectSet.size == projects.size` is added or not, the result is always right. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71631823 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- so for `SELECT b, b FROM oneToTenPruned` with hive table, we will exclude the `ProjectExec` and thus output only one column? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71585784 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- me too. : ( I am also learning this part for refactoring the Hive table scan. That is why I found the behavior inconsistency between different types of table scans. Let me try to summarize it in a more organized way. Currently, when converting logical plans to physical plans, we have two different strategies for the table scan (+ the adjacent `Filter` and `Project`, if any). - Hive Table Scan and In-memory Table Scan. - Data Source Table Scan. Thus, the basic functionalities are the same. The inputs include the filter predicates and the output project list and a leaf relation `LogicalRelation`, `MetastoreRelation`, and `InMemoryRelation`. The inputs are generated by the `PhysicalOperation` pattern, which extracts and normalizes Scan, Projects and Filters. The target is to generate a physical plan, like: ``` ProjectExec +- FilterExec +- TableScan ``` `ProjectExec` is optional. When the filter predicates contain the unnecessary attributes or output project list has aliases, we have to add `ProjectExec`; otherwise, it is ok to exclude `ProjectExec`. The above logics is implemented by the following conditions in both `pruneFilterProject` functions: ```Scala AttributeSet(projectList.map(_.toAttribute)) == projectSet && filterSet.subsetOf(projectSet)) ``` The source codes: - [The original `pruneFilterProject` for Data Source Table Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366) - [The `pruneFilterProject` for Hive Table Scan and In-memory Table Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89 ) The above code shows that Data Source Table Scan has one extra condition: ```Scala projectSet.size == projects.size ``` This condition is very specific for a rare case. Users select duplicate columns without using any alias name. Thus, to make them consistent, we are facing two options: - add this condition to both scan scenarios - remove this condition from both Either is fine to me, but I think we need to make them consistent. Let me know if my explanation is clear. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71565387 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- so what does the column number here means? Sorry I'm not familiar with this part of code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71562777 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- When we do not return two columns, we will always have a `ProjectExec` above the table scan: ``` *Project [b#222 AS alias_b#240, b#222] +- InMemoryTableScan [b#222] ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71548579 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- but this doesn't make sense, when user select 2 columns, how can we return only one column? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71545159 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- The following condition will be false if an alias is used: https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71506928 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- I don't quite understand this, why they are different? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71471878 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- Yeah! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14240#discussion_r71471414 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala --- @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext { testPruning("SELECT * FROM oneToTenPruned", "a", "b") testPruning("SELECT a, b FROM oneToTenPruned", "a", "b") testPruning("SELECT b, a FROM oneToTenPruned", "b", "a") - testPruning("SELECT b, b FROM oneToTenPruned", "b") + testPruning("SELECT b, b FROM oneToTenPruned", "b", "b") + testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b") --- End diff -- so `SELECT b, b FROM oneToTenPruned` will return 2 columns and `SELECT b as alias_b, b FROM oneToTenPruned` only returns one column? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14240 [SPARK-16594] [SQL] Remove Physical Plan Differences when Table Scan Having Duplicate Columns What changes were proposed in this pull request? Currently, we keep two implementations for planning scans over data sources. There is one difference between two implementation when deciding whether a `Project` is needed or not. - Data Source Table Scan: https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395 - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._** - Hive Table Scan and In-memory Table Scan: https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99 - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** **Note.** When alias is being used, we will always add `ProjectExec` in all the scan types. - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._** Because selecting the same column twice without adding `alias` is very weird, no clue why the code needs to behave differently here. This PR is to remove the differences. How was this patch tested? Updated/added a test case You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark removeProject Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14240.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14240 commit ba7dcff8829c379666c417b87fc1e5022ed93141 Author: gatorsmile Date: 2016-07-17T07:26:47Z fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org