from:"gatorsmile"

[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15218
  
The test case design is pretty good. It covers all the scenarios. 

- Could you add a check for the negative case? That means, when users do 
not provide the right TaskAssigner name, we fall back to the default round 
robin one
- For the existing unchanged test cases in `TaskSchedulerImplSuite.scala`, 
please add a check to verify whether it picks the default one. 
- If possible, please change one of the existing test case in 
`TaskSchedulerImplSuite.scala`, ensure that users are allowed to input the 
round robin as the task assigner.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
That is a design decision we need to make. Personally, MySQL and PostgreSQL 
are not good examples we should follow. Let us ask ourselves a simple question: 
> When users input `'\a'`, what is their expected results? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
Also checked SQL Server. It behaves differently from both `Postgre` and 
`DB2`. It simply ignores it. 

> If there is no character after an escape character in the LIKE pattern, 
the pattern is not valid and the LIKE returns FALSE. If the character after an 
escape character is not a wildcard character, the escape character is discarded 
and the character following the escape is treated as a regular character in the 
pattern. This includes the percent sign (%), underscore (_), and left bracket 
([) wildcard characters when they are enclosed in double brackets ([ ]). Also, 
within the double bracket characters ([ ]), escape characters can be used and 
the caret (^), hyphen (-), and right bracket (]) can be escaped.

Let me check Oracle now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
Checked Oracle. Oracle has the same behavior like DB2:
```
SQL> SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\' ESCAPE '\';
SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\' ESCAPE '\'
   *
ERROR at line 1:
ORA-01424: missing or illegal character following the escape character


SQL> SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\a' ESCAPE '\';
SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\a' ESCAPE '\'
   *
ERROR at line 1:
ORA-01424: missing or illegal character following the escape character
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
This is different. 
> If there is no character after an escape character in the LIKE pattern, 
the pattern is not valid and the LIKE returns FALSE.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
```
ORA-01424: missing or illegal character following the escape character
Cause:  The character following the escape character in LIKE pattern is 
missing or not one of the escape character, '%', or '_'.
Action: Remove the escape character or specify the missing character.
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
We need to answer all the three questions listed by @jodersky 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15398
  
If you check what we did in the other PRs, we do not need to strictly 
follow Hive. I do not have a strong opinion in all the above options. Let @rxin 
@yhuai make a decision. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15432: [SPARK-17854][SQL] rand/randn allows null/long as input ...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15432
  
@srowen Previously, I also assumed Spark SQL is following the behaviors 
defined by Hive. However, we already merged a few PRs in Spark SQL that do not 
follow Hive. Especially, Hive has so many [configuration 
flags](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties),
 but we do not want to add them for better usability.

In the long term, when enterprise customers start using Spark more and 
more, I expect they might not even care whether we are Hive compliant or not. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS ...

2016-10-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15502
  
cc @yhuai @hvanhovell @cloud-fan I guess this needs to be merged to 2.0.2 
ASAP?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15502#discussion_r83587470
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -510,7 +510,7 @@ private[hive] case class InsertIntoHiveTable(
 child: LogicalPlan,
 overwrite: Boolean,
 ifNotExists: Boolean)
-  extends LogicalPlan with Command {
--- End diff --

In the Command, this PR requires [the child must be 
empty](https://github.com/gatorsmile/spark/blob/9cfebc523e4b88c3df3ffae8ca5ea92e98a0a616/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Command.scala#L28)
 . Should we convert `InsertIntoHiveTable` to a non-child `Command`? 

Just FYI, in Spark 2.1, `InsertIntoTable` is still a `LogicalPlan` instead 
of a `Command`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15316
  
Also cc @cloud-fan and @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15316
  
cc @hvanhovell @rxin Any more comment about this PR? I assume Spark 2.0.2 
needs it. 

Recently, when we analyzing the JIRA 
https://issues.apache.org/jira/browse/SPARK-17709, we are unable to see the 
plan due the analyzer failure. The users have to manually rebuild it with this 
fix. Then, we can see the failed analyzed plan. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...

2016-10-17 Thread gatorsmile

Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/15502


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS ...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15502
  
Thanks! Close it now. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15316
  
Sure, will do it soon. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15495
  
Thank you! Will do it soon. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Incorrectly Set...

2016-10-17 Thread gatorsmile

GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/15523

[SPARK-17981] [SPARK-17957] [SQL] Incorrectly Set Nullability to False in 
FilterExec

### What changes were proposed in this pull request?
When `FilterExec` contains `isNotNull`, which could be inferred and pushed 
down or users specified, we convert the nullability of the involved columns if 
the top-layer expression is null-intolerant. However, this is not correct, if 
the top-layer expression is not a leaf expression, it could still tolerate the 
null when it has null-tolerant child expressions. 

For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a 
null-intolerant expression, but obviously` coalesce` is null-tolerant. Thus, it 
could eat null.

When the nullability is wrong, we could generate incorrect results in 
different cases. For example, 

```Scala
val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
val df3 = Seq((3, 1)).toDF("a", "d")
joinedDf.join(df3, "a").show
```

The optimized plan is like 
```
Project [a#29, b#30, c#31, d#42]
+- Join Inner, (a#29 = a#41)
   :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as 
int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, 
cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
   :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as 
double), 0.0) as int))
   : +- Join FullOuter, (a#5 = a#15)
   ::- LocalRelation [a#5, b#6]
   :+- LocalRelation [a#15, c#16]
   +- LocalRelation [a#41, d#42]
```

Without the fix, it returns an empty result. With the fix, it can return a 
correct answer:
```
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  3|  0|  4|  1|
+---+---+---+---+
```

### How was this patch tested?
Added test cases to verify the nullability changes in FilterExec. Also 
added a test case for verifying the reported incorrect result.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark nullabilityFilterExec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15523


commit 54c3cc849e91b9bedc6379e04edc9e23245be760
Author: gatorsmile 
Date:   2016-10-17T23:29:43Z

fix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15495
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15266: [SPARK-17693] [SQL] Fixed Insert Failure To Data Source ...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15266
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15495
  
Merging to master! Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15523
  
cc @cloud-fan @davies @sameeragarwal 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15459: [SPARK-17409] [SQL] [FOLLOW-UP] Do Not Optimize Query in...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15459
  
@yhuai Any further comment about it? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15417: [SPARK-17851][SQL][TESTS] Make sure all test sqls in cat...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15417
  
@jiangxb1987 Can you leave a comment on the PR changes to explain why you 
made these changes? You know, reviewing these changes is not easy. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15266: [SPARK-17693] [SQL] Fixed Insert Failure To Data Source ...

2016-10-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15266
  
Any other comment? @cloud-fan Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.s...

2016-10-18 Thread gatorsmile

GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/15529

[SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eagerAnalysis and 
Output the Plan if Existed in AnalysisException

### What changes were proposed in this pull request?
This PR is to backport the fix https://github.com/apache/spark/pull/15316 
to 2.0.

Dataset always does eager analysis now. Thus, `spark.sql.eagerAnalysis` is 
not used any more. Thus, we need to remove it.

This PR also outputs the plan. Without the fix, the analysis error is like
```
cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
```

After the fix, the analysis error becomes:
```
org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input 
columns: [k, v]; line 1 pos 12;
'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 
44 ELSE 0 END, None), v#6]
+- SubqueryAlias t
   +- Project [_1#2 AS k#5, _2#3 AS v#6]
  +- LocalRelation [_1#2, _2#3]
```

### How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark eagerAnalysis20

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15529.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15529


commit ebc6f73f225dd008503b0df3b0e575f7f4696d08
Author: gatorsmile 
Date:   2016-10-18T06:53:34Z

fix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eage...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15529
  
cc @hvanhovell @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eage...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15529
  
Thanks! Close it now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.s...

2016-10-18 Thread gatorsmile

Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/15529


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15546: [SPARK-17892][SQL] SQLBuilder should wrap the generated ...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15546
  
Wrong JIRA number. : )



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15541#discussion_r83995272
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala 
---
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.PriorityQueue
+import scala.util.Random
+
+import org.apache.spark.internal.{config, Logging}
+import org.apache.spark.SparkConf
+import org.apache.spark.util.Utils
+
+/** Tracking the current state of the workers with available cores and 
assigned task list. */
+class OfferState(val workOffer: WorkerOffer) {
+  // The current remaining cores that can be allocated to tasks.
+  var coresAvailable: Int = workOffer.cores
+  // The list of tasks that are assigned to this worker.
+  val tasks = new ArrayBuffer[TaskDescription](coresAvailable)
+}
+
+/**
+ * TaskAssigner is the base class for all task assigner implementations, 
and can be
+ * extended to implement different task scheduling algorithms.
+ * Together with [[org.apache.spark.scheduler.TaskScheduler 
TaskScheduler]], TaskAssigner
+ * is used to assign tasks to workers with available cores. Internally, 
TaskScheduler, requested
+ * to perform task assignment given available workers, first sorts the 
candidate tasksets,
+ * and then for each taskset, it takes a number of rounds to request 
TaskAssigner for task
+ * assignment with different the locality restrictions until there is 
either no qualified
+ * workers or no valid tasks to be assigned.
+ *
+ * TaskAssigner is responsible to maintain the worker availability state 
and task assignment
+ * information. The contract between 
[[org.apache.spark.scheduler.TaskScheduler TaskScheduler]]
+ * and TaskAssigner is as follows. First, TaskScheduler invokes 
construct() of TaskAssigner to
+ * initialize the its internal worker states at the beginning of resource 
offering. Before each
+ * round of task assignment for a taskset, TaskScheduler invoke the init() 
of TaskAssigner to
+ * initialize the data structure for the round. When performing real task 
assignment,
+ * hasNext()/getNext() is used by TaskScheduler to check the worker 
availability and retrieve
+ * current offering from TaskAssigner. Then offerAccepted is used by 
TaskScheduler to notify
+ * the TaskAssigner so that TaskAssigner can decide whether the current 
offer is valid or not for
+ * the next request. After task assignment is done, TaskScheduler invokes 
the tasks() to
+ * retrieve all the task assignment information, and eventually, invokes 
reset() method so that
+ * TaskAssigner can cleanup its internal maintained resources.
+ */
+
+private[scheduler] abstract class TaskAssigner {
+  var offer: Seq[OfferState] = _
+  var CPUS_PER_TASK = 1
+
+  def withCpuPerTask(CPUS_PER_TASK: Int): Unit = {
+this.CPUS_PER_TASK = CPUS_PER_TASK
+  }
+
+  // The final assigned offer returned to TaskScheduler.
+  final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks)
+
+  // Invoked at the beginning of resource offering to construct the offer 
with the workoffers.
+  def construct(workOffer: Seq[WorkerOffer]): Unit = {
+offer = workOffer.map(o => new OfferState(o))
+  }
+
+  // Invoked at each round of Taskset assignment to initialize the 
internal structure.
+  def init(): Unit
+
+  // Whether there is offer available to be used inside of one round of 
Taskset assignment.
+  def hasNext: Boolean
+
+  // Returned the next assigned offer based on the task assignment 
strategy.
+  def getNext(): OfferState
+
+  // Invoked by the TaskScheduler to indicate whether the current offer is 
accepted or not so that
+  // the assigner can decide whether the current worker is valid for the 
next offering.
+  def offerAcc

[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15541#discussion_r83995415
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala 
---
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.PriorityQueue
+import scala.util.Random
+
+import org.apache.spark.internal.{config, Logging}
+import org.apache.spark.SparkConf
+import org.apache.spark.util.Utils
+
+/** Tracking the current state of the workers with available cores and 
assigned task list. */
+class OfferState(val workOffer: WorkerOffer) {
+  // The current remaining cores that can be allocated to tasks.
+  var coresAvailable: Int = workOffer.cores
+  // The list of tasks that are assigned to this worker.
+  val tasks = new ArrayBuffer[TaskDescription](coresAvailable)
+}
+
+/**
+ * TaskAssigner is the base class for all task assigner implementations, 
and can be
+ * extended to implement different task scheduling algorithms.
+ * Together with [[org.apache.spark.scheduler.TaskScheduler 
TaskScheduler]], TaskAssigner
+ * is used to assign tasks to workers with available cores. Internally, 
TaskScheduler, requested
+ * to perform task assignment given available workers, first sorts the 
candidate tasksets,
+ * and then for each taskset, it takes a number of rounds to request 
TaskAssigner for task
+ * assignment with different the locality restrictions until there is 
either no qualified
+ * workers or no valid tasks to be assigned.
+ *
+ * TaskAssigner is responsible to maintain the worker availability state 
and task assignment
+ * information. The contract between 
[[org.apache.spark.scheduler.TaskScheduler TaskScheduler]]
+ * and TaskAssigner is as follows. First, TaskScheduler invokes 
construct() of TaskAssigner to
+ * initialize the its internal worker states at the beginning of resource 
offering. Before each
+ * round of task assignment for a taskset, TaskScheduler invoke the init() 
of TaskAssigner to
+ * initialize the data structure for the round. When performing real task 
assignment,
+ * hasNext()/getNext() is used by TaskScheduler to check the worker 
availability and retrieve
+ * current offering from TaskAssigner. Then offerAccepted is used by 
TaskScheduler to notify
+ * the TaskAssigner so that TaskAssigner can decide whether the current 
offer is valid or not for
+ * the next request. After task assignment is done, TaskScheduler invokes 
the tasks() to
+ * retrieve all the task assignment information, and eventually, invokes 
reset() method so that
+ * TaskAssigner can cleanup its internal maintained resources.
+ */
+
+private[scheduler] abstract class TaskAssigner {
+  var offer: Seq[OfferState] = _
+  var CPUS_PER_TASK = 1
+
+  def withCpuPerTask(CPUS_PER_TASK: Int): Unit = {
+this.CPUS_PER_TASK = CPUS_PER_TASK
+  }
+
+  // The final assigned offer returned to TaskScheduler.
+  final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks)
+
+  // Invoked at the beginning of resource offering to construct the offer 
with the workoffers.
+  def construct(workOffer: Seq[WorkerOffer]): Unit = {
+offer = workOffer.map(o => new OfferState(o))
+  }
+
+  // Invoked at each round of Taskset assignment to initialize the 
internal structure.
+  def init(): Unit
+
+  // Whether there is offer available to be used inside of one round of 
Taskset assignment.
+  def hasNext: Boolean
+
+  // Returned the next assigned offer based on the task assignment 
strategy.
+  def getNext(): OfferState
+
+  // Invoked by the TaskScheduler to indicate whether the current offer is 
accepted or not so that
+  // the assigner can decide whether the current worker is valid for the 
next offering.
+  def offerAcc

[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...

2016-10-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15541#discussion_r83996232
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala 
---
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.PriorityQueue
+import scala.util.Random
+
+import org.apache.spark.internal.{config, Logging}
+import org.apache.spark.SparkConf
+import org.apache.spark.util.Utils
+
+/** Tracking the current state of the workers with available cores and 
assigned task list. */
+class OfferState(val workOffer: WorkerOffer) {
+  // The current remaining cores that can be allocated to tasks.
+  var coresAvailable: Int = workOffer.cores
+  // The list of tasks that are assigned to this worker.
+  val tasks = new ArrayBuffer[TaskDescription](coresAvailable)
+}
+
+/**
+ * TaskAssigner is the base class for all task assigner implementations, 
and can be
+ * extended to implement different task scheduling algorithms.
+ * Together with [[org.apache.spark.scheduler.TaskScheduler 
TaskScheduler]], TaskAssigner
+ * is used to assign tasks to workers with available cores. Internally, 
TaskScheduler, requested
+ * to perform task assignment given available workers, first sorts the 
candidate tasksets,
+ * and then for each taskset, it takes a number of rounds to request 
TaskAssigner for task
+ * assignment with different the locality restrictions until there is 
either no qualified
+ * workers or no valid tasks to be assigned.
+ *
+ * TaskAssigner is responsible to maintain the worker availability state 
and task assignment
+ * information. The contract between 
[[org.apache.spark.scheduler.TaskScheduler TaskScheduler]]
+ * and TaskAssigner is as follows. First, TaskScheduler invokes 
construct() of TaskAssigner to
+ * initialize the its internal worker states at the beginning of resource 
offering. Before each
+ * round of task assignment for a taskset, TaskScheduler invoke the init() 
of TaskAssigner to
+ * initialize the data structure for the round. When performing real task 
assignment,
+ * hasNext()/getNext() is used by TaskScheduler to check the worker 
availability and retrieve
+ * current offering from TaskAssigner. Then offerAccepted is used by 
TaskScheduler to notify
+ * the TaskAssigner so that TaskAssigner can decide whether the current 
offer is valid or not for
+ * the next request. After task assignment is done, TaskScheduler invokes 
the tasks() to
+ * retrieve all the task assignment information, and eventually, invokes 
reset() method so that
+ * TaskAssigner can cleanup its internal maintained resources.
+ */
+
+private[scheduler] abstract class TaskAssigner {
+  var offer: Seq[OfferState] = _
+  var CPUS_PER_TASK = 1
+
+  def withCpuPerTask(CPUS_PER_TASK: Int): Unit = {
+this.CPUS_PER_TASK = CPUS_PER_TASK
+  }
+
+  // The final assigned offer returned to TaskScheduler.
+  final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks)
+
+  // Invoked at the beginning of resource offering to construct the offer 
with the workoffers.
+  def construct(workOffer: Seq[WorkerOffer]): Unit = {
+offer = workOffer.map(o => new OfferState(o))
+  }
+
+  // Invoked at each round of Taskset assignment to initialize the 
internal structure.
+  def init(): Unit
+
+  // Whether there is offer available to be used inside of one round of 
Taskset assignment.
+  def hasNext: Boolean
+
+  // Returned the next assigned offer based on the task assignment 
strategy.
+  def getNext(): OfferState
+
+  // Invoked by the TaskScheduler to indicate whether the current offer is 
accepted or not so that
+  // the assigner can decide whether the current worker is valid for the 
next offering.
+  def offerAcc

[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15523#discussion_r84008907
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: 
SparkPlan)
 
   // Split out all the IsNotNulls from condition.
   private val (notNullPreds, otherPreds) = 
splitConjunctivePredicates(condition).partition {
-case IsNotNull(a: NullIntolerant) if 
a.references.subsetOf(child.outputSet) => true
+case IsNotNull(a) => isNullIntolerant(a) && 
a.references.subsetOf(child.outputSet)
+case _ => false
+  }
+
+  // One expression is null intolerant iff it and its children are null 
intolerant
+  private def isNullIntolerant(expr: Expression): Boolean = expr match {
+case e: NullIntolerant =>
+  if (e.isInstanceOf[LeafExpression]) true else 
e.children.forall(isNullIntolerant)
--- End diff --

Just realized the original code was from your PR. Then, in your above code, 
why you still need to keep `a.references.subsetOf(child.outputSet)`? It looks 
confusing to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15523#discussion_r84010709
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: 
SparkPlan)
 
   // Split out all the IsNotNulls from condition.
   private val (notNullPreds, otherPreds) = 
splitConjunctivePredicates(condition).partition {
-case IsNotNull(a: NullIntolerant) if 
a.references.subsetOf(child.outputSet) => true
+case IsNotNull(a) => isNullIntolerant(a) && 
a.references.subsetOf(child.outputSet)
+case _ => false
+  }
+
+  // One expression is null intolerant iff it and its children are null 
intolerant
+  private def isNullIntolerant(expr: Expression): Boolean = expr match {
+case e: NullIntolerant =>
+  if (e.isInstanceOf[LeafExpression]) true else 
e.children.forall(isNullIntolerant)
--- End diff --

Could you show me an example? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15523#discussion_r84014760
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: 
SparkPlan)
 
   // Split out all the IsNotNulls from condition.
   private val (notNullPreds, otherPreds) = 
splitConjunctivePredicates(condition).partition {
-case IsNotNull(a: NullIntolerant) if 
a.references.subsetOf(child.outputSet) => true
+case IsNotNull(a) => isNullIntolerant(a) && 
a.references.subsetOf(child.outputSet)
+case _ => false
+  }
+
+  // One expression is null intolerant iff it and its children are null 
intolerant
+  private def isNullIntolerant(expr: Expression): Boolean = expr match {
+case e: NullIntolerant =>
+  if (e.isInstanceOf[LeafExpression]) true else 
e.children.forall(isNullIntolerant)
--- End diff --

uh, I see. 

First, we definitely need test cases to cover each positive and negative 
scenario. Previously, we did not have any test case to check the validity of 
nullability changes. Second, the code needs more comments when the 
variable/function names are not able to explain the codes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15523
  
@viirya I have not changed the algorithm. I just tried to improve the test 
case coverage. 

Thanks to `constructIsNotNullConstraints`, the existing solution already 
covers all the cases, right?

Can you help me check anything scenario is missing? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15382
  
cc @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...

2016-10-19 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15523
  
The parm name of the verification function is wrong. It should be 
`expectedNonNullableColumns`. Please check the test case again. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15363
  
See [the list of paper that cited K. 
Ono](https://scholar.google.com/scholar?start=0&hl=en&as_sdt=0,5&sciodt=0,5&cites=5144610119819043766&scipsc=)
 of `Measuring the Complexity of Join Enumeration in Query Optimization`. There 
are a lot of good references we can investigate. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106298859
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -51,6 +51,11 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
 
   def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
 val (items, conditions) = extractInnerJoins(plan)
+// Find the star schema joins. Currently, it returns the star join 
with the largest
+// fact table. In the future, it can return more than one star join 
(e.g. F1-D1-D2
+// and F2-D3-D4).
+val starJoinPlans = StarSchemaDetection(conf).findStarJoins(items, 
conditions.toSeq)
--- End diff --

See [the list of paper that cited K. 
Ono](https://scholar.google.com/scholar?start=0&hl=en&as_sdt=0,5&sciodt=0,5&cites=5144610119819043766&scipsc=)
 of `Measuring the Complexity of Join Enumeration in Query Optimization`. There 
are a lot of good references we can investigate. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17270: [SPARK-19929] [SQL] Showing Hive Managed table's ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17270#discussion_r106313999
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -910,7 +910,7 @@ case class ShowCreateTableCommand(table: 
TableIdentifier) extends RunnableComman
   }
 }
 
-if (metadata.tableType == EXTERNAL) {
+if (metadata.tableType == EXTERNAL || metadata.tableType == MANAGED) {
--- End diff --

> SHOW CREATE TABLE shows the CREATE TABLE statement that creates a given 
table, or the CREATE VIEW statement that creates a given view.

If we add the location clause for managed tables, CREATE TABLE will create 
an external table instead of a managed table. That is not what we want. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17270: [SPARK-19929] [SQL] Showing Hive Managed table's LOATION...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17270
  
If you want to know hive table's location, you can use the command 
`DESCRIBE EXTENDED` , right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
Could you please do a check which systems support it and which systems 
disallow it? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17289
  
We introduced a behavior change in Spark 2.2. In Spark 2.1, we reported an 
error if the underlying JDBC table exists. We changed [the mode to 
`SaveMode.Overwrite`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L159)
 if the table does not exist in the catalog.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17289
  
```Scala
  test("saveAsTable API with SaveMode.Overwrite") {
val df = spark.createDataFrame(sparkContext.parallelize(arr1x2), 
schema2)
spark.read.jdbc(url1, "test.people", properties).show()

df.write.format("jdbc").mode(SaveMode.ErrorIfExists)
  .option("url", url1)
  .option("dbtable", "test.people")
  .options(properties.asScala)
  .saveAsTable("j1")
spark.read.jdbc(url1, "test.people", properties).show()
  }
```

This is a test case I used. Previously, we respected the user-specified 
mode `SaveMode.ErrorIfExists`. Now, we are not sending the[ mode to the 
_createRelation_ API 
](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L181).
 It might be an unexpected behavior change to the external data source 
connector.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106333835
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalSessionCatalogSuite.scala
 ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.catalog.{CatalogTestUtils, 
ExternalCatalog, SessionCatalogSuite}
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class HiveExternalSessionCatalogSuite extends SessionCatalogSuite with 
TestHiveSingleton {
+
+  protected override val isHiveExternalCatalog = true
+
+  private val externalCatalog = {
+val catalog = spark.sharedState.externalCatalog
+catalog.asInstanceOf[HiveExternalCatalog].client.reset()
+catalog
+  }
+
+  protected val utils = new CatalogTestUtils {
+override val tableInputFormat: String = 
"org.apache.hadoop.mapred.SequenceFileInputFormat"
+override val tableOutputFormat: String =
+  "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"
+override val defaultProvider: String = "parquet"
--- End diff --

The above input and output formats does not match what you specified here. 
Let us change it to `hive`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106334485
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -1270,6 +1376,7 @@ class SessionCatalogSuite extends PlanTest {
   }
 
   assert(cause.getMessage.contains("Undefined function: 
'undefined_fn'"))
+  catalog.reset()
--- End diff --

Instead of adding `reset`, why not using your new function 
`withBasicCatalog`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106334626
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest {
   expectedParts: CatalogTablePartition*): Boolean = {
 // ExternalCatalog may set a default location for partitions, here we 
ignore the partition
 // location when comparing them.
-actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet ==
-  expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet
+val actualPartsNormalize = actualParts.map(p =>
+  p.copy(parameters = Map.empty, storage = p.storage.copy(
+properties = Map.empty, locationUri = None, serde = None))).toSet
+
+val expectedPartsNormalize = expectedParts.map(p =>
+p.copy(parameters = Map.empty, storage = p.storage.copy(
+  properties = Map.empty, locationUri = None, serde = None))).toSet
+
+actualPartsNormalize == expectedPartsNormalize
+//actualParts.map(p =>
+//  p.copy(storage = p.storage.copy(
+//properties = Map.empty, locationUri = None))).toSet ==
+//  expectedParts.map(p =>
+//p.copy(storage = p.storage.copy(properties = Map.empty, 
locationUri = None))).toSet
--- End diff --

?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106334827
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -999,257 +1094,279 @@ class SessionCatalogSuite extends PlanTest {
   expectedParts: CatalogTablePartition*): Boolean = {
 // ExternalCatalog may set a default location for partitions, here we 
ignore the partition
 // location when comparing them.
-actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet ==
-  expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet
+val actualPartsNormalize = actualParts.map(p =>
--- End diff --

Because Hive metastore fills the values after we call the Hive APIs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106334939
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest {
   expectedParts: CatalogTablePartition*): Boolean = {
 // ExternalCatalog may set a default location for partitions, here we 
ignore the partition
 // location when comparing them.
-actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet ==
-  expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet
+val actualPartsNormalize = actualParts.map(p =>
--- End diff --

Because Hive metastore fills the values after we calling the Hive APIs? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16971
  
ping @zhengruifeng  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106335040
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
 val result = new Array[Double](percentages.length)
 var i = 0
 while (i < percentages.length) {
-  result(i) = summaries.query(percentages(i))
+  result(i) = summaries.query(percentages(i)).get
--- End diff --

Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16981
  
cc @maropu https://github.com/apache/spark/pull/17171 is merged. Are you 
interested in working on `from_json`?

JIRA: https://issues.apache.org/jira/browse/SPARK-19967


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17289
  
This is the design decision we need to make here. 

Spark SQL is kind of a federation system. Two write APIs behave 
differently. The `saveAsTable` API expects users to register it in the **global 
catalog** before usage. The `save` API skips the global catalog and relies on 
the connectors to communicate with the **local catalog**. The users might not 
realize the difference.

```Scala
df.write.format("xyz").mode(SaveMode.ErrorIfExists)
  .saveAsTable("j1")
```

```Scala
df.write.format("xyz").mode(SaveMode.ErrorIfExists)
  .save()
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16209
  
@sureshthalamati https://github.com/apache/spark/pull/17171 has been 
resolved. Can you update your PR by allowing users to specify the schema in DDL 
format?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106340219
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest {
   expectedParts: CatalogTablePartition*): Boolean = {
 // ExternalCatalog may set a default location for partitions, here we 
ignore the partition
 // location when comparing them.
-actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet ==
-  expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = 
None))).toSet
+val actualPartsNormalize = actualParts.map(p =>
--- End diff --

You need to leave a comment to explain it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106340900
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -1270,6 +1376,7 @@ class SessionCatalogSuite extends PlanTest {
   }
 
   assert(cause.getMessage.contains("Undefined function: 
'undefined_fn'"))
+  catalog.reset()
--- End diff --

Then, this `reset()` could be skipped if hitting an exception.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17286#discussion_r106341181
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -696,6 +696,13 @@ object SQLConf {
   .intConf
   .createWithDefault(12)
 
+  val JOIN_REORDER_CARD_WEIGHT =
+buildConf("spark.sql.cbo.joinReorder.card.weight")
+  .doc("The weight of cardinality (number of rows) for plan cost 
comparison in join reorder: " +
+"rows * weight + size * (1 - weight).")
+  .doubleConf
+  .createWithDefault(0.7)
--- End diff --

What is boundary of this? adding `check`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106341499
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -175,6 +178,78 @@ case class AlterTableRenameCommand(
 }
 
 /**
+ * A command that add columns to a table
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   ALTER TABLE table_identifier
+ *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
+ * }}}
+*/
+case class AlterTableAddColumnsCommand(
+table: TableIdentifier,
+columns: Seq[StructField]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val catalogTable = verifyAlterTableAddColumn(catalog, table)
+
+// If an exception is thrown here we can just assume the table is 
uncached;
+// this can happen with Hive tables when the underlying catalog is 
in-memory.
+val wasCached = 
Try(sparkSession.catalog.isCached(table.unquotedString)).getOrElse(false)
+if (wasCached) {
+  try {
+sparkSession.catalog.uncacheTable(table.unquotedString)
+  } catch {
+case NonFatal(e) => log.warn(e.toString, e)
+  }
+}
--- End diff --

No need to check if it is cached or not. Just uncache it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106341966
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -1860,4 +1860,72 @@ class HiveDDLSuite
   }
 }
   }
+
+  Seq("PARQUET", "ORC", "TEXTFILE", "SEQUENCEFILE", "RCFILE", 
"AVRO").foreach { tableType =>
--- End diff --

If the list is complete, we can create a variable and reuse it in the 
future test cases in `HiveCatalogedDDLSuite `. Let us create it now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106342123
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -1860,4 +1860,72 @@ class HiveDDLSuite
   }
 }
   }
+
+  Seq("PARQUET", "ORC", "TEXTFILE", "SEQUENCEFILE", "RCFILE", 
"AVRO").foreach { tableType =>
+test(s"alter hive serde table add columns -- partitioned - 
$tableType") {
+  withTable("alter_add_partitioned") {
--- End diff --

The name is confusing. Let us just simplify it to `tab`. We already can 
know the scenario by the test case name.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106342233
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -71,7 +71,6 @@ class JDBCSuite extends SparkFunSuite
 conn.prepareStatement("insert into test.people values ('mary', 
2)").executeUpdate()
 conn.prepareStatement(
   "insert into test.people values ('joe ''foo'' \"bar\"', 
3)").executeUpdate()
-conn.commit()
--- End diff --

Why?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16626
  
Could you add a scenario when users add a column name that already exists 
in the table schema?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17270: [SPARK-19929] [SQL] Showing Hive Managed table's LOATION...

2017-03-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17270
  
Yeah. you need to close the PR by yourself. We are unable to close it. 
thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16672
  
To backport https://github.com/apache/spark/pull/17097, we need to backport 
multiple PRs. This is one of it.

@windpiger Could you please submit a PR to backport it to Spark 2.1? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106465419
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -175,6 +178,87 @@ case class AlterTableRenameCommand(
 }
 
 /**
+ * A command that add columns to a table
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   ALTER TABLE table_identifier
+ *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
+ * }}}
+*/
+case class AlterTableAddColumnsCommand(
+table: TableIdentifier,
+columns: Seq[StructField]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val catalogTable = verifyAlterTableAddColumn(catalog, table)
+
+try {
+  sparkSession.catalog.uncacheTable(table.unquotedString)
--- End diff --

`table.quotedString`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106467818
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -175,6 +178,87 @@ case class AlterTableRenameCommand(
 }
 
 /**
+ * A command that add columns to a table
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   ALTER TABLE table_identifier
+ *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
+ * }}}
+*/
+case class AlterTableAddColumnsCommand(
+table: TableIdentifier,
+columns: Seq[StructField]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val catalogTable = verifyAlterTableAddColumn(catalog, table)
+
+try {
+  sparkSession.catalog.uncacheTable(table.unquotedString)
+} catch {
+  case NonFatal(e) =>
+log.warn(s"Exception when attempting to uncache table 
${table.unquotedString}", e)
+}
+
+// Invalidate the table last, otherwise uncaching the table would load 
the logical plan
+// back into the hive metastore cache
+catalog.refreshTable(table)
+val partitionFields = 
catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length)
+val newSchemaFields = catalogTable.schema
+  .take(catalogTable.schema.length - 
catalogTable.partitionColumnNames.length) ++
+  columns ++ partitionFields
+checkDuplication(newSchemaFields.map(_.name))
+catalog.alterTableSchema(table, newSchema =
+  catalogTable.schema.copy(fields = newSchemaFields.toArray))
+
+Seq.empty[Row]
+  }
+
+  private def checkDuplication(colNames: Seq[String]): Unit = {
+if (colNames.distinct.length != colNames.length) {
+  val duplicateColumns = colNames.groupBy(identity).collect {
--- End diff --

This does not consider the case sensitivity 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106472274
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -175,6 +178,87 @@ case class AlterTableRenameCommand(
 }
 
 /**
+ * A command that add columns to a table
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   ALTER TABLE table_identifier
+ *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
+ * }}}
+*/
+case class AlterTableAddColumnsCommand(
+table: TableIdentifier,
+columns: Seq[StructField]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val catalogTable = verifyAlterTableAddColumn(catalog, table)
+
+try {
+  sparkSession.catalog.uncacheTable(table.unquotedString)
+} catch {
+  case NonFatal(e) =>
+log.warn(s"Exception when attempting to uncache table 
${table.unquotedString}", e)
+}
+
+// Invalidate the table last, otherwise uncaching the table would load 
the logical plan
+// back into the hive metastore cache
+catalog.refreshTable(table)
+val partitionFields = 
catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length)
+val newSchemaFields = catalogTable.schema
+  .take(catalogTable.schema.length - 
catalogTable.partitionColumnNames.length) ++
+  columns ++ partitionFields
+checkDuplication(newSchemaFields.map(_.name))
+catalog.alterTableSchema(table, newSchema =
+  catalogTable.schema.copy(fields = newSchemaFields.toArray))
+
+Seq.empty[Row]
+  }
+
+  private def checkDuplication(colNames: Seq[String]): Unit = {
+if (colNames.distinct.length != colNames.length) {
+  val duplicateColumns = colNames.groupBy(identity).collect {
+case (x, ys) if ys.length > 1 => x
+  }
+  throw new AnalysisException(
+s"Found duplicate column(s): ${duplicateColumns.mkString(", ")}")
+}
+  }
+
+  /**
+   * ALTER TABLE ADD COLUMNS command does not support temporary view/table,
+   * view, or datasource table with text, orc formats or external provider.
+   * For datasource table, it currently only supports parquet, json, csv.
+   */
+  private def verifyAlterTableAddColumn(
+  catalog: SessionCatalog,
+  table: TableIdentifier): CatalogTable = {
+val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table)
+
+if (catalogTable.tableType == CatalogTableType.VIEW) {
+  throw new AnalysisException(
+s"${table.toString} is a VIEW, which does not support ALTER ADD 
COLUMNS.")
--- End diff --

How about?
> ALTER ADD COLUMNS does not support views. You must drop and re-create the 
views for adding the new columns. Views: $table.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16626#discussion_r106472336
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -175,6 +178,87 @@ case class AlterTableRenameCommand(
 }
 
 /**
+ * A command that add columns to a table
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   ALTER TABLE table_identifier
+ *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
+ * }}}
+*/
+case class AlterTableAddColumnsCommand(
+table: TableIdentifier,
+columns: Seq[StructField]) extends RunnableCommand {
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val catalogTable = verifyAlterTableAddColumn(catalog, table)
+
+try {
+  sparkSession.catalog.uncacheTable(table.unquotedString)
+} catch {
+  case NonFatal(e) =>
+log.warn(s"Exception when attempting to uncache table 
${table.unquotedString}", e)
+}
+
+// Invalidate the table last, otherwise uncaching the table would load 
the logical plan
+// back into the hive metastore cache
+catalog.refreshTable(table)
+val partitionFields = 
catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length)
+val newSchemaFields = catalogTable.schema
+  .take(catalogTable.schema.length - 
catalogTable.partitionColumnNames.length) ++
+  columns ++ partitionFields
+checkDuplication(newSchemaFields.map(_.name))
+catalog.alterTableSchema(table, newSchema =
+  catalogTable.schema.copy(fields = newSchemaFields.toArray))
+
+Seq.empty[Row]
+  }
+
+  private def checkDuplication(colNames: Seq[String]): Unit = {
+if (colNames.distinct.length != colNames.length) {
+  val duplicateColumns = colNames.groupBy(identity).collect {
+case (x, ys) if ys.length > 1 => x
+  }
+  throw new AnalysisException(
+s"Found duplicate column(s): ${duplicateColumns.mkString(", ")}")
+}
+  }
+
+  /**
+   * ALTER TABLE ADD COLUMNS command does not support temporary view/table,
+   * view, or datasource table with text, orc formats or external provider.
+   * For datasource table, it currently only supports parquet, json, csv.
+   */
+  private def verifyAlterTableAddColumn(
+  catalog: SessionCatalog,
+  table: TableIdentifier): CatalogTable = {
+val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table)
+
+if (catalogTable.tableType == CatalogTableType.VIEW) {
+  throw new AnalysisException(
+s"${table.toString} is a VIEW, which does not support ALTER ADD 
COLUMNS.")
+}
+
+if (DDLUtils.isDatasourceTable(catalogTable)) {
+  DataSource.lookupDataSource(catalogTable.provider.get).newInstance() 
match {
+// For datasource table, this command can only support the 
following File format.
+// TextFileFormat only default to one column "value"
+// OrcFileFormat can not handle difference between user-specified 
schema and
+// inferred schema yet. TODO, once this issue is resolved , we can 
add Orc back.
+// Hive type is already considered as hive serde table, so the 
logic will not
+// come in here.
+case _: JsonFileFormat | _: CSVFileFormat | _: ParquetFileFormat =>
+case s =>
+  throw new AnalysisException(
+s"Datasource table $table with type $s, which does not support 
ALTER ADD COLUMNS.")
--- End diff --

The same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16626
  
Please also add a test case: ALTER TABLE does not affect any view that 
references the table being altered. Also includes the views that have an "*" in 
their SELECT list. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17317: [SPARK-19329][SQL][BRANCH-2.1]Reading from or writing to...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17317
  
Thanks! Merging to 2.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17317: [SPARK-19329][SQL][BRANCH-2.1]Reading from or writing to...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17317
  
@windpiger Could you please close it? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-197...

2017-03-16 Thread gatorsmile

GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/17319

[SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] 
Backport Three Cache-related PRs to Spark 2.1

### What changes were proposed in this pull request?

Backport a few cache related PRs: 

---
[[SPARK-19093][SQL] Cached tables are not used in 
SubqueryExpression](https://github.com/apache/spark/pull/16493) 

Consider the plans inside subquery expressions while looking up cache 
manager to make
use of cached data. Currently CacheManager.useCachedData does not consider 
the
subquery expressions in the plan.


---
[[SPARK-19736][SQL] refreshByPath should clear all cached plans with the 
specified path](https://github.com/apache/spark/pull/17064)

Catalog.refreshByPath can refresh the cache entry and the associated 
metadata for all dataframes (if any), that contain the given data source path.

However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
with the specified path. It causes some strange behaviors reported in 
SPARK-15678.

---
[[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached 
plans that refer to this table](https://github.com/apache/spark/pull/17097)

When un-cache a table, we should not only remove the cache entry for this 
table, but also un-cache any other cached plans that refer to this table. The 
following commands trigger the table uncache: `DropTableCommand`, 
`TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, 
`RefreshTable` and `InsertIntoHiveTable`

This PR also includes some refactors:
- use java.util.LinkedList to store the cache entries, so that it's safer 
to remove elements while iterating
- rename invalidateCache to recacheByPlan, which is more obvious about what 
it does.




### How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark backport-17097

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17319.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17319


commit 3f1895f315ef357ec9f9201748d760293deb4f88
Author: Xiao Li 
Date:   2017-03-16T07:36:35Z

fix.

commit 11a8f31d5954c14eb8e546d001688f93357da676
Author: Xiao Li 
Date:   2017-03-16T17:35:21Z

Merge remote-tracking branch 'upstream/branch-2.1' into backport-17097




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
We hit this issue multiple times. Although it looks not right to support it 
for the users of major enterprise RDBMS, two popular open source RDBMS 
PostgreSQL and MySQL support it. 

This is the design decision we need to make. Should we keep the previous 
decision? @rxin @marmbrus @hvanhovell @cloud-fan Thanks!






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17287: [SPARK-19945][SQL]add test suite for SessionCatalog with...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17287
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17287: [SPARK-19945][SQL]add test suite for SessionCatalog with...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17287
  
Thanks! Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BAC...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17319
  
cc @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17320#discussion_r106550372
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -634,7 +661,12 @@ case class StructToJson(
   override def inputTypes: Seq[AbstractDataType] = StructType :: Nil
 }
 
-object StructToJson {
+object JsonExprUtils {
+
+  def validateSchemaLiteral(exp: Expression): StructType = exp match {
+case Literal(s, StringType) => 
CatalystSqlParser.parseTableSchema(s.toString)
+case e => throw new AnalysisException(s"Must be a string literal, but: 
$e")
--- End diff --

How about? 
> Expected a string literal instead of $e


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17320#discussion_r106550986
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -202,12 +202,12 @@ class JsonFunctionsSuite extends QueryTest with 
SharedSQLContext {
 val df1 = Seq(Tuple1(Tuple1(1))).toDF("a")
 checkAnswer(
   df1.selectExpr("to_json(a)"),
-  Row("""{"_1":1}""") :: Nil)
+  Row( """{"_1":1}""") :: Nil)
 
 val df2 = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 
18:00:00.0".toDF("a")
 checkAnswer(
   df2.selectExpr("to_json(a, map('timestampFormat', 'dd/MM/ 
HH:mm'))"),
-  Row("""{"_1":"26/08/2015 18:00"}""") :: Nil)
+  Row( """{"_1":"26/08/2015 18:00"}""") :: Nil)
--- End diff --

revert it back


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17320#discussion_r106550987
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -202,12 +202,12 @@ class JsonFunctionsSuite extends QueryTest with 
SharedSQLContext {
 val df1 = Seq(Tuple1(Tuple1(1))).toDF("a")
 checkAnswer(
   df1.selectExpr("to_json(a)"),
-  Row("""{"_1":1}""") :: Nil)
+  Row( """{"_1":1}""") :: Nil)
--- End diff --

revert it back


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17320
  
All the valid examples are using a single column. Could you also add a test 
case to verify the schema having multiple columns? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16330
  
The code changes are now very specific to R. Let me know if you still need 
me. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15363#discussion_r106562261
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -167,8 +167,8 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
   : (Seq[(LogicalPlan, InnerLike)], Seq[Expression]) = plan match {
 case Join(left, right, joinType: InnerLike, cond) =>
   val (plans, conditions) = flattenJoin(left, joinType)
-  (plans ++ Seq((right, joinType)), conditions ++ cond.toSeq)
-
+   (plans ++ Seq((right, joinType)), conditions ++
--- End diff --

Nit: remove this heading space.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15363
  
It sounds like all the comments have been addressed. 

LGTM except one minor comment. cc @sameeragarwal @cloud-fan @hvanhovell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17322: [SPARK-19987][SQL] Pass all filters into FileIndex

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17322
  
LGTM pending Jenkins 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106566183
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ---
@@ -391,9 +288,9 @@ class JacksonParser(
 
 case token =>
   // We cannot parse this token based on the given data type. So, we 
throw a
-  // SparkSQLJsonProcessingException and this exception will be caught 
by
+  // SparkSQLRuntimeException and this exception will be caught by
--- End diff --

`RuntimeException `?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17322: [SPARK-19987][SQL] Pass all filters into FileIndex

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17322
  
Thanks! Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17315#discussion_r106570171
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -65,7 +65,7 @@ private[sql] class JSONOptions(
   val allowBackslashEscapingAnyCharacter =
 
parameters.get("allowBackslashEscapingAnyCharacter").map(_.toBoolean).getOrElse(false)
   val compressionCodec = 
parameters.get("compression").map(CompressionCodecs.getCodecClassName)
-  private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
+  val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
--- End diff --

How about creating an enum, like what we are doing for `SaveMode`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...

2017-03-16 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17287#discussion_r106570569
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -27,41 +27,67 @@ import 
org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.PlanTest
 import org.apache.spark.sql.catalyst.plans.logical.{Range, SubqueryAlias, 
View}
 
+class InMemorySessionCatalogSuite extends SessionCatalogSuite {
+  protected val utils = new CatalogTestUtils {
+override val tableInputFormat: String = 
"com.fruit.eyephone.CameraInputFormat"
+override val tableOutputFormat: String = 
"com.fruit.eyephone.CameraOutputFormat"
+override val defaultProvider: String = "parquet"
+override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog
+  }
+}
+
 /**
- * Tests for [[SessionCatalog]] that assume that [[InMemoryCatalog]] is 
correctly implemented.
+ * Tests for [[SessionCatalog]]
  *
  * Note: many of the methods here are very similar to the ones in 
[[ExternalCatalogSuite]].
  * This is because [[SessionCatalog]] and [[ExternalCatalog]] share many 
similar method
  * signatures but do not extend a common parent. This is largely by design 
but
  * unfortunately leads to very similar test code in two places.
  */
-class SessionCatalogSuite extends PlanTest {
-  private val utils = new CatalogTestUtils {
-override val tableInputFormat: String = 
"com.fruit.eyephone.CameraInputFormat"
-override val tableOutputFormat: String = 
"com.fruit.eyephone.CameraOutputFormat"
-override val defaultProvider: String = "parquet"
-override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog
-  }
+abstract class SessionCatalogSuite extends PlanTest {
+  protected val utils: CatalogTestUtils
+
+  protected val isHiveExternalCatalog = false
 
   import utils._
 
+  private def withBasicCatalog(f: SessionCatalog => Unit): Unit = {
+val catalog = new SessionCatalog(newBasicCatalog())
+catalog.createDatabase(newDb("default"), ignoreIfExists = true)
+try {
+  f(catalog)
+} finally {
+  catalog.reset()
+}
+  }
+
+  private def withEmptyCatalog(f: SessionCatalog => Unit): Unit = {
+val catalog = new SessionCatalog(newEmptyCatalog())
+catalog.createDatabase(newDb("default"), ignoreIfExists = true)
+try {
+  f(catalog)
+} finally {
+  catalog.reset()
+}
+  }
   // 
--
   // Databases
   // 
--
 
   test("basic create and list databases") {
-val catalog = new SessionCatalog(newEmptyCatalog())
-catalog.createDatabase(newDb("default"), ignoreIfExists = true)
-assert(catalog.databaseExists("default"))
-assert(!catalog.databaseExists("testing"))
-assert(!catalog.databaseExists("testing2"))
-catalog.createDatabase(newDb("testing"), ignoreIfExists = false)
-assert(catalog.databaseExists("testing"))
-assert(catalog.listDatabases().toSet == Set("default", "testing"))
-catalog.createDatabase(newDb("testing2"), ignoreIfExists = false)
-assert(catalog.listDatabases().toSet == Set("default", "testing", 
"testing2"))
-assert(catalog.databaseExists("testing2"))
-assert(!catalog.databaseExists("does_not_exist"))
+withEmptyCatalog { catalog =>
+  catalog.createDatabase(newDb("default"), ignoreIfExists = true)
--- End diff --

This is a basic test case to test `create and list databases`. Thus, it 
will be good to create a database without the existing one.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-197...

2017-03-17 Thread gatorsmile

Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/17319


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17320
  
LGTM cc @


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17191: [SPARK-14471][SQL] Aliases in SELECT could be use...

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17191#discussion_r106717092
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2598,4 +2598,26 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
 assert(!jobStarted.get(), "Command should not trigger a Spark job.")
   }
+
+  test("SPARK-14471 When groupByAliasesEnabled=true, aliases in SELECT 
could exist in GROUP BY") {
+withSQLConf(SQLConf.GROUP_BY_ALIASES_ENABLED.key -> "true") {
--- End diff --

Since you test the mixed case, you also need to enable `GROUP_BY_ORDINAL`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17320
  
Thanks! Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
What is the behaviors of MySQL and Postgres when we use digits as alias?
```SQL
SELECT k1 AS `2`, k2 AS a, SUM(v) FROM t GROUP BY 2, k2
```

You might need to replace backticks by the other symbols for quoting the 
column names


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r106772853
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -36,27 +36,24 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
 if (!conf.cboEnabled || !conf.joinReorderEnabled) {
   plan
 } else {
-  val result = plan transform {
-case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
-  reorder(p, p.outputSet)
-case j @ Join(_, _, _: InnerLike, _) =>
-  reorder(j, j.outputSet)
+  val result = plan transformDown {
+case j @ Join(_, _, _: InnerLike, _) => reorder(j)
   }
   // After reordering is finished, convert OrderedJoin back to Join
-  result transform {
+  result transformDown {
 case oj: OrderedJoin => oj.join
   }
 }
   }
 
-  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+  def reorder(plan: LogicalPlan): LogicalPlan = {
--- End diff --

private


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773013
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
--- End diff --

How about

`Extracts the join conditions and sub-plans of consecutive inner joins`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773023
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
--- End diff --

Nit: `items` -> `subplans`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106773079
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = {
+plan match {
+  case Join(left, right, _: InnerLike, cond) =>
+val (leftPlans, leftConditions) = extractInnerJoins(left)
+val (rightPlans, rightConditions) = extractInnerJoins(right)
+(leftPlans ++ rightPlans, 
cond.toSet.flatMap(splitConjunctivePredicates) ++
+  leftConditions ++ rightConditions)
+  case Project(projectList, join) if 
projectList.forall(_.isInstanceOf[Attribute]) =>
+extractInnerJoins(join)
+  case _ =>
+(Seq(plan), Set())
+}
+  }
+
+  private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = 
plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
--- End diff --

Nit: `cond` -> `_`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-17 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106774778
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  // Do reordering if the number of items is appropriate and join 
conditions exist.
+  // We also need to check if costs of all items can be evaluated.
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty &&
+  items.forall(_.stats(conf).rowCount.isDefined)) {
+JoinReorderDP.search(conf, items, conditions, 
output).getOrElse(plan)
+  } else {
+plan
+  }
+// Set consecutive join nodes ordered.
+replaceWithOrderedJoin(result)
+  }
+
+  /**
+   * Extract consecutive inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = {
+plan match {
+  case Join(left, right, _: InnerLike, cond) =>
+val (leftPlans, leftConditions) = extractInnerJoins(left)
+val (rightPlans, rightConditions) = extractInnerJoins(right)
+(leftPlans ++ rightPlans, 
cond.toSet.flatMap(splitConjunctivePredicates) ++
+  leftConditions ++ rightConditions)
+  case Project(projectList, join) if 
projectList.forall(_.isInstanceOf[Attribute]) =>
+extractInnerJoins(join)
+  case _ =>
+(Seq(plan), Set())
+}
+  }
+
+  private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = 
plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
+  val replacedLeft = replaceWithOrderedJoin(left)
+  val replacedRight = replaceWithOrderedJoin(right)
+  OrderedJoin(j.copy(left = replacedLeft, right = replacedRight))
+case p @ Project(_, join) =>
+  p.copy(child = replaceWithOrderedJoin(join))
+case _ =>
+  plan
+  }
+
+  /** This is a wrapper class for a join node that has been ordered. */
+  private case class OrderedJoin(join: Join) extends BinaryNode {
+override def left: LogicalPlan = join.left
+override def right: LogicalPlan = join.right
+override def output: Se

< 5 6 7 8 9 10 11 12 13 14 >

901 - 1000 of 14069 matches

Mail list logo