[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71938022
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

Found something interesting in File Source Scan. Will submit a separate PR 
for resolving it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-22 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/14240


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71839551
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

`FileSourceStrategy` is worse. The order also matters!

https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L148

The output of file source scan is always following `readDataColumns ++ 
partitionColumns`.

https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L140

If the order is different from `readDataColumns ++ partitionColumns`, we 
will add a `ProjectExec`. This does not sound good to me.

The next question is why `partitionColumns` is always part of the scan's 
output. Will try to find the answer tomorrow. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71646312
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

I like your point! Your concern is valid. We should be careful when making 
an assumption on the external data source implementation! : ) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71644651
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

so we are assuming the `PrunedScan.buildScan` can handle duplicated 
columns? I feel it's dangerous to assume that as we didn't document this 
semantic, the external implementation may break it.

For the hive one, it's internal stuff and we are safe to make this 
assumption. If we do wanna make them consistent, I'd like to change the hive 
one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71643068
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

After this PR, Data Source Table Scan will return two columns too, just 
like what Hive Table Scan does. This is also shown in the the [test 
case](https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala#L117)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71634679
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

after your PR, will data source table return only one column for this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71634358
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

```SQL
SELECT b, b FROM oneToTenPruned
```
For Hive Table Scan, we exclude `ProjectExec` from the physical plan but 
[the Hive table scan returns two duplicate 
columns](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L93).
 No matter whether `projectSet.size == projects.size` is added or not, the 
result is always right. 






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71631823
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

so for `SELECT b, b FROM oneToTenPruned` with hive table, we will exclude 
the `ProjectExec` and thus output only one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71585784
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

me too. : ( I am also learning this part for refactoring the Hive table 
scan. That is why I found the behavior inconsistency between different types of 
table scans. Let me try to summarize it in a more organized way. 

Currently, when converting logical plans to physical plans, we have two 
different strategies for the table scan (+ the adjacent `Filter` and `Project`, 
if any). 
- Hive Table Scan and In-memory Table Scan.
- Data Source Table Scan.

Thus, the basic functionalities are the same. 

The inputs include the filter predicates and the output project list and a 
leaf relation `LogicalRelation`, `MetastoreRelation`, and `InMemoryRelation`. 
The inputs are generated by the `PhysicalOperation` pattern, which extracts and 
normalizes Scan, Projects and Filters.

The target is to generate a physical plan, like:
```
ProjectExec 
+- FilterExec
   +- TableScan
```

`ProjectExec` is optional. When the filter predicates contain the 
unnecessary attributes or output project list has aliases, we have to add 
`ProjectExec`; otherwise, it is ok to exclude `ProjectExec`. The above logics 
is implemented by the following conditions in both `pruneFilterProject` 
functions:
```Scala
AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
  filterSet.subsetOf(projectSet))
```
The source codes:
- [The original `pruneFilterProject` for Data Source Table 
Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366)
- [The `pruneFilterProject` for Hive Table Scan and In-memory Table 
Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89
)

The above code shows that Data Source Table Scan has one extra condition:
```Scala
projectSet.size == projects.size
``` 

This condition is very specific for a rare case. Users select duplicate 
columns without using any alias name. Thus, to make them consistent, we are 
facing two options:
- add this condition to both scan scenarios
- remove this condition from both

Either is fine to me, but I think we need to make them consistent. Let me 
know if my explanation is clear.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71565387
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

so what does the column number here means? Sorry I'm not familiar with this 
part of code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71562777
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

When we do not return two columns, we will always have a `ProjectExec` 
above the table scan:
```
*Project [b#222 AS alias_b#240, b#222]
+- InMemoryTableScan [b#222]
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71548579
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

but this doesn't make sense, when user select 2 columns, how can we return 
only one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71545159
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

The following condition will be false if an alias is used: 
https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71506928
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

I don't quite understand this, why they are different?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71471878
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

Yeah!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14240#discussion_r71471414
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with 
SharedSQLContext {
   testPruning("SELECT * FROM oneToTenPruned", "a", "b")
   testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
   testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
-  testPruning("SELECT b, b FROM oneToTenPruned", "b")
+  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff --

so `SELECT b, b FROM oneToTenPruned` will return 2 columns and `SELECT b as 
alias_b, b FROM oneToTenPruned` only returns one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

2016-07-17 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/14240

[SPARK-16594] [SQL] Remove Physical Plan Differences when Table Scan Having 
Duplicate Columns

 What changes were proposed in this pull request?
Currently, we keep two implementations for planning scans over data 
sources. There is one difference between two implementation when deciding 
whether a `Project` is needed or not. 

- Data Source Table Scan: 
https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395

  - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

- Hive Table Scan and In-memory Table Scan:

https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99
 

  - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 

**Note.** When alias is being used, we will always add `ProjectExec` in all 
the scan types.

  - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**

Because selecting the same column twice without adding `alias` is very 
weird, no clue why the code needs to behave differently here. This PR is to 
remove the differences. 

 How was this patch tested?
Updated/added a test case

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark removeProject

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14240


commit ba7dcff8829c379666c417b87fc1e5022ed93141
Author: gatorsmile 
Date:   2016-07-17T07:26:47Z

fix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org