spark git commit: [SPARK-8184][SQL] Add additional function description for weekofyear

2017-05-29 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master c9749068e -> 1c7db00c7


[SPARK-8184][SQL] Add additional function description for weekofyear

## What changes were proposed in this pull request?

Add additional function description for weekofyear.

## How was this patch tested?

Manual tests.

![weekofyear](https://cloud.githubusercontent.com/assets/5399861/26525752/08a1c278-4394-11e7-8988-7cbf82c3a999.gif)

Author: Yuming Wang 

Closes #18132 from wangyum/SPARK-8184.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c7db00c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c7db00c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c7db00c

Branch: refs/heads/master
Commit: 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a
Parents: c974906
Author: Yuming Wang 
Authored: Mon May 29 16:10:22 2017 -0700
Committer: Reynold Xin 
Committed: Mon May 29 16:10:22 2017 -0700

--
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1c7db00c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
index 43ca2cf..4098300 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@@ -402,13 +402,15 @@ case class DayOfMonth(child: Expression) extends 
UnaryExpression with ImplicitCa
   }
 }
 
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(date) - Returns the week of the year of the given date.",
+  usage = "_FUNC_(date) - Returns the week of the year of the given date. A 
week is considered to start on a Monday and week 1 is the first week with >3 
days.",
   extended = """
 Examples:
   > SELECT _FUNC_('2008-02-20');
8
   """)
+// scalastyle:on line.size.limit
 case class WeekOfYear(child: Expression) extends UnaryExpression with 
ImplicitCastInputTypes {
 
   override def inputTypes: Seq[AbstractDataType] = Seq(DateType)
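
As an aside for readers of this archive, a hedged illustration of the ISO-style week numbering the new usage string describes (a sketch against a local SparkSession, not part of the patch itself):

```
// Illustration only: weeks start on Monday and week 1 is the first week with
// more than 3 days, so a date can fall into the last week of the previous year.
import org.apache.spark.sql.SparkSession

object WeekOfYearExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("weekofyear").getOrCreate()
    // The example from the usage string above: 2008-02-20 is in week 8.
    spark.sql("SELECT weekofyear('2008-02-20')").show()
    // 2017-01-01 is a Sunday; under this rule it is expected to report week 52 of 2016.
    spark.sql("SELECT weekofyear('2017-01-01')").show()
    spark.stop()
  }
}
```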





[GitHub] spark issue #18132: [SPARK-8184][SQL] Add additional function description fo...

2017-05-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18132
  
Thanks - merging in master/branch-2.2.






[GitHub] spark pull request #18086: [SPARK-20854][SQL] Extend hint syntax to support ...

2017-05-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18086#discussion_r118473083
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -533,13 +533,16 @@ class AstBuilder(conf: SQLConf) extends 
SqlBaseBaseVisitor[AnyRef] with Logging
   }
 
   /**
-   * Add a [[UnresolvedHint]] to a logical plan.
+   * Add a [[UnresolvedHint]]s to a logical plan.
*/
   private def withHints(
   ctx: HintContext,
   query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
-val stmt = ctx.hintStatement
-UnresolvedHint(stmt.hintName.getText, 
stmt.parameters.asScala.map(_.getText), query)
+var plan = query
--- End diff --

Honestly I think foldLeft is almost always a bad idea ...
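
For readers following along, a minimal sketch of the two styles being debated, using made-up stand-in types rather than the real parser and plan classes:

```
// Illustrative only: Hint and Plan are stand-ins, not Spark's actual classes.
case class Hint(name: String, parameters: Seq[String])
case class Plan(description: String)

def applyHint(hint: Hint, plan: Plan): Plan =
  Plan(s"${hint.name}(${hint.parameters.mkString(",")}) over ${plan.description}")

// Style 1: foldLeft, compact but arguably harder to read at a glance.
def withHintsFold(hints: Seq[Hint], query: Plan): Plan =
  hints.foldLeft(query)((plan, hint) => applyHint(hint, plan))

// Style 2: explicit var + loop, the shape the diff above uses.
def withHintsLoop(hints: Seq[Hint], query: Plan): Plan = {
  var plan = query
  for (hint <- hints) {
    plan = applyHint(hint, plan)
  }
  plan
}
```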






[GitHub] spark issue #18042: [SPARK-20817][core] Fix to return "Unknown processor" on...

2017-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18042
  
Does this really matter? I'd rather not complicate the actual code just so it displays properly on some niche hardware that very few people use.






[GitHub] spark issue #18086: [SPARK-20854][SQL] Extend hint syntax to support express...

2017-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18086
  
cc @gatorsmile @cloud-fan @hvanhovell 





[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...

2017-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18016
  
Hm, guys, please don't use the end-to-end tests to test expression behavior. Use unit tests, which automatically test codegen, interpreted evaluation, and different data types.
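
A hedged sketch of what such an expression-level unit test looks like with the Catalyst test helpers (Ceil is used purely as an example expression; this assumes the catalyst test module where ExpressionEvalHelper lives):

```
// Sketch: checkEvaluation exercises both the interpreted and the codegen path.
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.expressions.{Ceil, ExpressionEvalHelper, Literal}
import org.apache.spark.sql.types.DoubleType

class CeilBehaviorSuite extends SparkFunSuite with ExpressionEvalHelper {
  test("ceil on double input") {
    checkEvaluation(Ceil(Literal(1.1)), 2L)
    checkEvaluation(Ceil(Literal.create(null, DoubleType)), null)
  }
}
```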





[GitHub] spark pull request #18087: [SPARK-20867][SQL] Move hints from Statistics int...

2017-05-24 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18087#discussion_r118353924
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -195,9 +195,9 @@ case class Intersect(left: LogicalPlan, right: 
LogicalPlan) extends SetOperation
 val leftSize = left.stats(conf).sizeInBytes
 val rightSize = right.stats(conf).sizeInBytes
 val sizeInBytes = if (leftSize < rightSize) leftSize else rightSize
-val isBroadcastable = left.stats(conf).isBroadcastable || 
right.stats(conf).isBroadcastable
-
-Statistics(sizeInBytes = sizeInBytes, isBroadcastable = 
isBroadcastable)
+Statistics(
+  sizeInBytes = sizeInBytes,
+  hints = left.stats(conf).hints.resetForJoin())
--- End diff --

It's actually a no-op since Intersect is always rewritten to a join.





[GitHub] spark issue #18087: [SPARK-20867][SQL] Move hints from Statistics into HintI...

2017-05-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18087
  
cc @hvanhovell, @bogdanrdc 





[GitHub] spark pull request #18087: [SPARK-20867][SQL] Move hints from Statistics int...

2017-05-24 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/18087

[SPARK-20867][SQL] Move hints from Statistics into HintInfo class

## What changes were proposed in this pull request?
This is a follow-up to SPARK-20857 to move the broadcast hint from 
Statistics into a new HintInfo class, so we can be more flexible in adding new 
hints in the future.

## How was this patch tested?
Updated test cases to reflect the change.
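
To make the refactoring concrete, a simplified sketch of the shape this moves toward; the field and method names are inferred from the hints used elsewhere in this thread (e.g. resetForJoin) and are not the exact Spark source:

```
// Simplified sketch, not the actual Spark classes: statistics carry a HintInfo
// value instead of a bare isBroadcastable flag, so new hints can be added
// without widening Statistics itself.
case class HintInfo(broadcast: Boolean = false) {
  // After a join, a broadcast hint on an input no longer applies to the output.
  def resetForJoin(): HintInfo = copy(broadcast = false)
}

case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    hints: HintInfo = HintInfo())
```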

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20867

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18087


commit 19232cfc8ad54229b73fce4792c8b4c6d3d72495
Author: Reynold Xin <r...@databricks.com>
Date:   2017-05-24T12:57:36Z

[SPARK-20867][SQL] Move individual hints from Statistics into HintInfo class







[GitHub] spark issue #18082: [SPARK-20665][SQL][FOLLOW-UP]Move test case to SQLQueryT...

2017-05-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18082
  
Hm, I'm not sure it is a good idea to run so many "unit test" style tests for expressions in the end-to-end suites. It takes a lot more time than just running unit tests.






spark git commit: [SPARK-20857][SQL] Generic resolved hint node

2017-05-23 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 dbb068f4f -> d20c64695


[SPARK-20857][SQL] Generic resolved hint node

## What changes were proposed in this pull request?
This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) 
so the hint framework is more generic and would allow us to introduce other 
hint types in the future without introducing new hint nodes.

## How was this patch tested?
Updated test cases.
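
A simplified, hedged sketch of the renamed node pair and how resolution might treat it (stand-in types, not the actual Spark definitions; the authoritative version is the hints.scala file listed in the diffstat below):

```
// Illustrative only: UnresolvedHint keeps the raw hint name from the parser,
// and analysis turns recognized hints into a generic ResolvedHint node.
sealed trait Node
case class Relation(name: String) extends Node
case class UnresolvedHint(name: String, parameters: Seq[String], child: Node) extends Node
case class ResolvedHint(child: Node, broadcast: Boolean) extends Node

def resolve(plan: Node): Node = plan match {
  case UnresolvedHint(name, _, child) if name.equalsIgnoreCase("BROADCAST") =>
    ResolvedHint(resolve(child), broadcast = true)
  case UnresolvedHint(_, _, child) =>
    resolve(child) // in this sketch, unrecognized hints are simply dropped
  case other => other
}
```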

Author: Reynold Xin <r...@databricks.com>

Closes #18072 from rxin/SPARK-20857.

(cherry picked from commit 0d589ba00b5d539fbfef5174221de046a70548cd)
Signed-off-by: Reynold Xin <r...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d20c6469
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d20c6469
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d20c6469

Branch: refs/heads/branch-2.2
Commit: d20c6469565c4f7687f9af14a6f12a775b0c6e62
Parents: dbb068f
Author: Reynold Xin <r...@databricks.com>
Authored: Tue May 23 18:44:49 2017 +0200
Committer: Reynold Xin <r...@databricks.com>
Committed: Tue May 23 18:45:08 2017 +0200

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  |  2 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala   |  2 +-
 .../sql/catalyst/analysis/ResolveHints.scala| 12 ++---
 .../sql/catalyst/optimizer/Optimizer.scala  |  2 +-
 .../sql/catalyst/optimizer/expressions.scala|  2 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala  |  4 +-
 .../spark/sql/catalyst/planning/patterns.scala  |  4 +-
 .../sql/catalyst/plans/logical/Statistics.scala |  5 ++
 .../plans/logical/basicLogicalOperators.scala   | 22 +
 .../sql/catalyst/plans/logical/hints.scala  | 49 
 .../catalyst/analysis/ResolveHintsSuite.scala   | 41 
 .../catalyst/optimizer/ColumnPruningSuite.scala |  5 +-
 .../optimizer/FilterPushdownSuite.scala |  4 +-
 .../optimizer/JoinOptimizationSuite.scala   |  4 +-
 .../sql/catalyst/parser/PlanParserSuite.scala   | 15 +++---
 .../BasicStatsEstimationSuite.scala |  2 +-
 .../scala/org/apache/spark/sql/Dataset.scala|  2 +-
 .../spark/sql/execution/SparkStrategies.scala   |  2 +-
 .../scala/org/apache/spark/sql/functions.scala  |  5 +-
 .../execution/joins/BroadcastJoinSuite.scala| 14 +++---
 20 files changed, 118 insertions(+), 80 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 5be67ac..9979642 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1311,7 +1311,7 @@ class Analyzer(
 
 // Category 1:
 // BroadcastHint, Distinct, LeafNode, Repartition, and SubqueryAlias
-case _: BroadcastHint | _: Distinct | _: LeafNode | _: Repartition | 
_: SubqueryAlias =>
+case _: ResolvedHint | _: Distinct | _: LeafNode | _: Repartition | _: 
SubqueryAlias =>
 
 // Category 2:
 // These operators can be anywhere in a correlated subquery.

http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index ea4560a..2e3ac3e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -399,7 +399,7 @@ trait CheckAnalysis extends PredicateHelper {
  |in operator ${operator.simpleString}
""".stripMargin)
 
-  case _: Hint =>
+  case _: UnresolvedHint =>
 throw new IllegalStateException(
   "Internal error: logical hint operator should have been removed 
during analysis")
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sq

spark git commit: [SPARK-20857][SQL] Generic resolved hint node

2017-05-23 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master ad09e4ca0 -> 0d589ba00


[SPARK-20857][SQL] Generic resolved hint node

## What changes were proposed in this pull request?
This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) 
so the hint framework is more generic and would allow us to introduce other 
hint types in the future without introducing new hint nodes.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <r...@databricks.com>

Closes #18072 from rxin/SPARK-20857.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d589ba0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d589ba0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d589ba0

Branch: refs/heads/master
Commit: 0d589ba00b5d539fbfef5174221de046a70548cd
Parents: ad09e4c
Author: Reynold Xin <r...@databricks.com>
Authored: Tue May 23 18:44:49 2017 +0200
Committer: Reynold Xin <r...@databricks.com>
Committed: Tue May 23 18:44:49 2017 +0200

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  |  2 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala   |  2 +-
 .../sql/catalyst/analysis/ResolveHints.scala| 12 ++---
 .../sql/catalyst/optimizer/Optimizer.scala  |  2 +-
 .../sql/catalyst/optimizer/expressions.scala|  2 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala  |  4 +-
 .../spark/sql/catalyst/planning/patterns.scala  |  4 +-
 .../sql/catalyst/plans/logical/Statistics.scala |  5 ++
 .../plans/logical/basicLogicalOperators.scala   | 22 +
 .../sql/catalyst/plans/logical/hints.scala  | 49 
 .../catalyst/analysis/ResolveHintsSuite.scala   | 41 
 .../catalyst/optimizer/ColumnPruningSuite.scala |  5 +-
 .../optimizer/FilterPushdownSuite.scala |  4 +-
 .../optimizer/JoinOptimizationSuite.scala   |  4 +-
 .../sql/catalyst/parser/PlanParserSuite.scala   | 15 +++---
 .../BasicStatsEstimationSuite.scala |  2 +-
 .../scala/org/apache/spark/sql/Dataset.scala|  2 +-
 .../spark/sql/execution/SparkStrategies.scala   |  2 +-
 .../scala/org/apache/spark/sql/functions.scala  |  5 +-
 .../execution/joins/BroadcastJoinSuite.scala| 14 +++---
 20 files changed, 118 insertions(+), 80 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index d58b8ac..d130962 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1336,7 +1336,7 @@ class Analyzer(
 
 // Category 1:
 // BroadcastHint, Distinct, LeafNode, Repartition, and SubqueryAlias
-case _: BroadcastHint | _: Distinct | _: LeafNode | _: Repartition | 
_: SubqueryAlias =>
+case _: ResolvedHint | _: Distinct | _: LeafNode | _: Repartition | _: 
SubqueryAlias =>
 
 // Category 2:
 // These operators can be anywhere in a correlated subquery.

http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index ea4560a..2e3ac3e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -399,7 +399,7 @@ trait CheckAnalysis extends PredicateHelper {
  |in operator ${operator.simpleString}
""".stripMargin)
 
-  case _: Hint =>
+  case _: UnresolvedHint =>
 throw new IllegalStateException(
   "Internal error: logical hint operator should have been removed 
during analysis")
 

http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
index df688fa..9dfd84c 100644
--- 
a/sql/cata

[GitHub] spark issue #18072: [SPARK-20857][SQL] Generic resolved hint node

2017-05-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18072
  
Merging in master / branch-2.2 ...





[GitHub] spark pull request #18072: [SPARK-20857][SQL] Generic resolved hint node

2017-05-23 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/18072

[SPARK-20857][SQL] Generic resolved hint node

## What changes were proposed in this pull request?
This patch renames BroadcastHint to ResolvedHint so it is more generic and 
would allow us to introduce other hint types in the future without introducing 
new hint nodes.

## How was this patch tested?
Updated test cases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20857

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18072









[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...

2017-05-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18064
  
That works too, if we can attach metrics to these commands.






[GitHub] spark issue #18070: [SPARK-20713][Spark Core] Convert CommitDenied to TaskKi...

2017-05-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18070
  
cc @ericl 





[GitHub] spark issue #17999: [SPARK-20751][SQL] Add built-in SQL Function - COT

2017-05-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17999
  
Hmm, seems like we should be following how we test tan, cos, etc. in MathExpressionsSuite?







[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-05-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18023#discussion_r117540055
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2624,4 +2624,92 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 val e = intercept[AnalysisException](sql("SELECT nvl(1, 2, 3)"))
 assert(e.message.contains("Invalid number of arguments"))
   }
+
+  test("SPARK-12139: REGEX Column Specification for Hive Queries") {
--- End diff --

Yes, let's use those rather than adding more files to SQLQuerySuite. I'd love to get rid of SQLQuerySuite.






[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-05-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18023#discussion_r117539904
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -795,6 +795,12 @@ object SQLConf {
   .intConf
   
.createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)
 
+  val SUPPORT_QUOTED_REGEX_COLUMN_NAME = 
buildConf("spark.sql.parser.quotedRegexColumnNames")
+.internal()
--- End diff --

should be public
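
For readers unfamiliar with the feature under review, a hedged sketch of how the flag is meant to be used once the PR lands (the exact regex-matching rules are defined by the PR, so treat the column pattern as illustrative):

```
// Illustrative only: with the flag on, a backquoted identifier in a SELECT
// list is treated as a regular expression over column names.
import org.apache.spark.sql.SparkSession

object QuotedRegexColumnsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
    spark.range(5)
      .selectExpr("id AS key_a", "id * 2 AS key_b", "id * 3 AS other")
      .createOrReplaceTempView("t")
    // Intended to select key_a and key_b, but not other.
    spark.sql("SELECT `key_.*` FROM t").show()
    spark.stop()
  }
}
```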





[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)

2017-05-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16478
  
I don't know how important it is. It seems like it's primarily used by 
MLlib and very few other things ... 





[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...

2017-05-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17997#discussion_r116878495
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -601,22 +601,32 @@ object DateTimeUtils {
* The calculation uses the fact that the period 1.1.2001 until 
31.12.2400 is
* equals to the period 1.1.1601 until 31.12.2000.
*/
-  private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, 
Int) = {
-// add the difference (in days) between 1.1.1970 and the artificial 
year 0 (-17999)
-val daysNormalized = daysSince1970 + toYearZero
-val numOfQuarterCenturies = daysNormalized / daysIn400Years
-val daysInThis400 = daysNormalized % daysIn400Years + 1
-val (years, dayInYear) = numYears(daysInThis400)
-val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years
-(year, dayInYear)
+  private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, 
Int, Int) = {
+val date = new Date(daysToMillis(daysSince1970))
+val YMD = date.toString.trim.split("-")
--- End diff --

this would cause massive performance regression.
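
To make the concern concrete, a hedged micro-comparison of the two shapes (the arithmetic variant below is a deliberate simplification of DateTimeUtils, and the timing harness is illustrative rather than a rigorous benchmark):

```
// Rough illustration of why per-row Date allocation + toString + split is
// costly compared to pure integer arithmetic in a hot path.
import java.sql.Date

object YearExtractionComparison {
  private val MillisPerDay = 86400000L

  // String-based shape from the diff: allocate a Date, format it, split it.
  def yearViaString(daysSince1970: Int): Int =
    new Date(daysSince1970 * MillisPerDay).toString.split("-")(0).toInt

  // Arithmetic shape (simplified approximation, for comparison only).
  def yearViaArithmetic(daysSince1970: Int): Int =
    1970 + (daysSince1970 / 365.2425).toInt

  private def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  }

  def main(args: Array[String]): Unit = {
    val days = Array.tabulate(1000000)(i => i % 20000)
    time("string parsing") { days.foreach(yearViaString) }
    time("integer arithmetic") { days.foreach(yearViaArithmetic) }
  }
}
```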






[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-05-15 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15821
  
@BryanCutler even though the json is long, it is still so much clearer than 
reading a pile of code that generates json ...






[GitHub] spark issue #17941: [SPARK-20684][R] Expose createGlobalTempView and dropGlo...

2017-05-15 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17941
  

@felixcheung what's your concern with this one? Seems like we should add this just for API parity's sake?






[GitHub] spark issue #17711: [SPARK-19951][SQL] Add string concatenate operator || to...

2017-05-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17711
  
I feel both are pretty complicated. Can we just do something similar to 
CombineUnions:

```
/**
 * Combines all adjacent [[Union]] operators into a single [[Union]].
 */
object CombineUnions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case u: Union => flattenUnion(u, false)
    case Distinct(u: Union) => Distinct(flattenUnion(u, true))
  }

  private def flattenUnion(union: Union, flattenDistinct: Boolean): Union = {
    val stack = mutable.Stack[LogicalPlan](union)
    val flattened = mutable.ArrayBuffer.empty[LogicalPlan]
    while (stack.nonEmpty) {
      stack.pop() match {
        case Distinct(Union(children)) if flattenDistinct =>
          stack.pushAll(children.reverse)
        case Union(children) =>
          stack.pushAll(children.reverse)
        case child =>
          flattened += child
      }
    }
    Union(flattened)
  }
}
```

It's going to be simpler because you don't need to handle distinct here.
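
A hedged sketch of what the analogous flattening for the || operator could look like, following the CombineUnions shape above (the types and the flattenConcat name are made up for illustration; the real rule would work on Catalyst's Expression/Concat):

```
// Illustrative only: a simplified expression tree standing in for Catalyst's,
// with the same stack-based flattening minus the Distinct handling unions need.
import scala.collection.mutable

sealed trait Expr
case class Lit(value: String) extends Expr
case class Concat(children: Seq[Expr]) extends Expr

def flattenConcat(concat: Concat): Concat = {
  val stack = mutable.Stack[Expr](concat)
  val flattened = mutable.ArrayBuffer.empty[Expr]
  while (stack.nonEmpty) {
    stack.pop() match {
      case Concat(children) => stack.pushAll(children.reverse)
      case child => flattened += child
    }
  }
  Concat(flattened)
}

// Concat(Concat(a, b), c) flattens to Concat(a, b, c):
// flattenConcat(Concat(Seq(Concat(Seq(Lit("a"), Lit("b"))), Lit("c"))))
```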






[GitHub] spark pull request #17942: [SPARK-20702][Core]TaskContextImpl.markTaskComple...

2017-05-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17942#discussion_r116143097
  
--- Diff: core/src/main/scala/org/apache/spark/util/taskListeners.scala ---
@@ -55,14 +55,16 @@ class TaskCompletionListenerException(
   extends RuntimeException {
 
   override def getMessage: String = {
-if (errorMessages.size == 1) {
--- End diff --

It's a common pattern in Scala.





[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17923
  
Sorry, too long ago.





[GitHub] spark issue #17931: [SPARK-12837][CORE][FOLLOWUP] getting name should not fa...

2017-05-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17931
  
What's the issue with SQL metrics?






spark git commit: Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps"

2017-05-09 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 1b85bcd92 -> ac1ab6b9d


Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps"

This reverts commit 22691556e5f0dfbac81b8cc9ca0a67c70c1711ca.

See JIRA ticket for more information.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac1ab6b9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac1ab6b9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac1ab6b9

Branch: refs/heads/master
Commit: ac1ab6b9db188ac54c745558d57dd0a031d0b162
Parents: 1b85bcd
Author: Reynold Xin 
Authored: Tue May 9 11:35:59 2017 -0700
Committer: Reynold Xin 
Committed: Tue May 9 11:35:59 2017 -0700

--
 .../spark/sql/catalyst/catalog/interface.scala  |   4 +-
 .../spark/sql/catalyst/util/DateTimeUtils.scala |   5 -
 .../parquet/VectorizedColumnReader.java |  28 +-
 .../parquet/VectorizedParquetRecordReader.java  |   6 +-
 .../spark/sql/execution/command/tables.scala|   8 +-
 .../datasources/parquet/ParquetFileFormat.scala |   2 -
 .../parquet/ParquetReadSupport.scala|   3 +-
 .../parquet/ParquetRecordMaterializer.scala |   9 +-
 .../parquet/ParquetRowConverter.scala   |  53 +--
 .../parquet/ParquetWriteSupport.scala   |  25 +-
 .../spark/sql/hive/HiveExternalCatalog.scala|  11 +-
 .../spark/sql/hive/HiveMetastoreCatalog.scala   |  12 +-
 .../hive/ParquetHiveCompatibilitySuite.scala| 379 +--
 13 files changed, 29 insertions(+), 516 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index c39017e..cc0cbba 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -132,10 +132,10 @@ case class CatalogTablePartition(
   /**
* Given the partition schema, returns a row with that schema holding the 
partition values.
*/
-  def toRow(partitionSchema: StructType, defaultTimeZoneId: String): 
InternalRow = {
+  def toRow(partitionSchema: StructType, defaultTimeZondId: String): 
InternalRow = {
 val caseInsensitiveProperties = CaseInsensitiveMap(storage.properties)
 val timeZoneId = caseInsensitiveProperties.getOrElse(
-  DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId)
+  DateTimeUtils.TIMEZONE_OPTION, defaultTimeZondId)
 InternalRow.fromSeq(partitionSchema.map { field =>
   val partValue = if (spec(field.name) == 
ExternalCatalogUtils.DEFAULT_PARTITION_NAME) {
 null

http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index bf596fa..6c1592f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -498,11 +498,6 @@ object DateTimeUtils {
 false
   }
 
-  lazy val validTimezones = TimeZone.getAvailableIDs().toSet
-  def isValidTimezone(timezoneId: String): Boolean = {
-validTimezones.contains(timezoneId)
-  }
-
   /**
* Returns the microseconds since year zero (-17999) from microseconds since 
epoch.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
index dabbc2b..9d641b5 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
@@ -18,9 +18,7 @@
 package org.apache.spark.sql.execution.datasources.parquet;
 
 import java.io.IOException;
-import java.util.TimeZone;
 
-import org.apache.hadoop.conf.Configuration;
 import org.apache.parquet.bytes.BytesUtils;
 import 

[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

2017-05-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16781
  
Did we conduct any performance tests on this patch?






[GitHub] spark pull request #17915: [SPARK-20674][SQL] Support registering UserDefine...

2017-05-09 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17915

[SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF

## What changes were proposed in this pull request?
For some reason we don't have an API to register a UserDefinedFunction as a named UDF. It is a no-brainer to add one, in addition to the existing register functions we have.

## How was this patch tested?
Added a test case in UDFSuite for the new API.
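
A hedged sketch of the usage this PR is proposing (the exact method signature is defined by the patch itself; this is how it is intended to be called):

```
// Sketch: register an existing UserDefinedFunction under a name so it can be
// used from SQL as well as from the DataFrame API.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object RegisterNamedUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    val plusOne = udf((x: Long) => x + 1L)
    // The new API proposed here: register a UserDefinedFunction by name.
    spark.udf.register("plus_one", plusOne)
    spark.range(3).createOrReplaceTempView("t")
    spark.sql("SELECT plus_one(id) FROM t").show()
    spark.stop()
  }
}
```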

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20674

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17915.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17915









spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master b31648c08 -> 5d75b14bf


[SPARK-20616] RuleExecutor logDebug of batch results should show diff to start 
of batch

## What changes were proposed in this pull request?

Due to a likely typo, the logDebug message printing the diff of query plans shows a diff against the initial plan, not against the plan at the start of the batch.

## How was this patch tested?

Now the debug message prints the diff between start and end of batch.

Author: Juliusz Sompolski 

Closes #17875 from juliuszsompolski/SPARK-20616.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d75b14b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d75b14b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d75b14b

Branch: refs/heads/master
Commit: 5d75b14bf0f4c1f0813287efaabf49797908ed55
Parents: b31648c
Author: Juliusz Sompolski 
Authored: Fri May 5 15:31:06 2017 -0700
Committer: Reynold Xin 
Committed: Fri May 5 15:31:06 2017 -0700

--
 .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5d75b14b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
index 6fc828f..85b368c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
@@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] 
extends Logging {
 logDebug(
   s"""
   |=== Result of Batch ${batch.name} ===
-  |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
+  |${sideBySide(batchStartPlan.treeString, 
curPlan.treeString).mkString("\n")}
 """.stripMargin)
   } else {
 logTrace(s"Batch ${batch.name} has no effect.")





spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 f59c74a94 -> 1d9b7a74a


[SPARK-20616] RuleExecutor logDebug of batch results should show diff to start 
of batch

## What changes were proposed in this pull request?

Due to a likely typo, the logDebug message printing the diff of query plans shows a diff against the initial plan, not against the plan at the start of the batch.

## How was this patch tested?

Now the debug message prints the diff between start and end of batch.

Author: Juliusz Sompolski 

Closes #17875 from juliuszsompolski/SPARK-20616.

(cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d9b7a74
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d9b7a74
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d9b7a74

Branch: refs/heads/branch-2.2
Commit: 1d9b7a74a839021814ab28d3eba3636c64483130
Parents: f59c74a
Author: Juliusz Sompolski 
Authored: Fri May 5 15:31:06 2017 -0700
Committer: Reynold Xin 
Committed: Fri May 5 15:31:13 2017 -0700

--
 .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1d9b7a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
index 6fc828f..85b368c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
@@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] 
extends Logging {
 logDebug(
   s"""
   |=== Result of Batch ${batch.name} ===
-  |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
+  |${sideBySide(batchStartPlan.treeString, 
curPlan.treeString).mkString("\n")}
 """.stripMargin)
   } else {
 logTrace(s"Batch ${batch.name} has no effect.")





spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 704b249b6 -> a1112c615


[SPARK-20616] RuleExecutor logDebug of batch results should show diff to start 
of batch

## What changes were proposed in this pull request?

Due to a likely typo, the logDebug message printing the diff of query plans shows a diff against the initial plan, not against the plan at the start of the batch.

## How was this patch tested?

Now the debug message prints the diff between start and end of batch.

Author: Juliusz Sompolski 

Closes #17875 from juliuszsompolski/SPARK-20616.

(cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a1112c61
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a1112c61
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a1112c61

Branch: refs/heads/branch-2.1
Commit: a1112c615b05d615048159c9d324aa10a4391d4e
Parents: 704b249
Author: Juliusz Sompolski 
Authored: Fri May 5 15:31:06 2017 -0700
Committer: Reynold Xin 
Committed: Fri May 5 15:31:23 2017 -0700

--
 .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a1112c61/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
index 6fc828f..85b368c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
@@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] 
extends Logging {
 logDebug(
   s"""
   |=== Result of Batch ${batch.name} ===
-  |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
+  |${sideBySide(batchStartPlan.treeString, 
curPlan.treeString).mkString("\n")}
 """.stripMargin)
   } else {
 logTrace(s"Batch ${batch.name} has no effect.")





[GitHub] spark issue #17875: [SPARK-20616] RuleExecutor logDebug of batch results sho...

2017-05-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17875
  
Merging in master/branch-2.2.






[GitHub] spark issue #17875: [SPARK-20616] RuleExecutor logDebug of batch results sho...

2017-05-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17875
  
Jenkins, test this please.






[GitHub] spark issue #17851: [SPARK-20585][SPARKR] R generic hint support

2017-05-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17851
  
@felixcheung was this merged only in master but not branch-2.2?






[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
@srinathshankar also thinks it's weird to add a barrier node. I suggest 
@hvanhovell and @srinathshankar duke it out.






[GitHub] spark issue #17723: [SPARK-20434][YARN][CORE] Move kerberos delegation token...

2017-05-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17723
  
I'm saying avoid exposing Hadoop APIs. Wrap them in something if possible.






[GitHub] spark issue #17723: [SPARK-20434][YARN][CORE] Move kerberos delegation token...

2017-05-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17723
  
I didn't read through the super long debate here, but I have a strong 
preference to not expose Hadoop APIs directly. I'm seeing more and more 
deployments out there that do not use Hadoop (e.g. connect directly to cloud 
storage, connect to some on-premise object store, connect to Redis, connect to 
some netapp appliance, connect directly to a message queue or just run Spark on 
a laptop).

Hadoop APIs were designed for a different world, pre-Spark. Serialization is painful to deal with (Configuration?), API-breaking changes are painful to deal with, and the size of the dependencies is painful to deal with (especially considering the single-node use cases, in which ideally we'd just want a super trimmed-down jar).

As you can see (although most of you who have chimed in here don't know much about the new components), the newer components (Spark SQL) do not expose Hadoop APIs.
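
A hedged sketch of the kind of wrapping being advocated here (the trait and class names are hypothetical, invented for illustration; the point is simply that Hadoop types stay out of the public signature):

```
// Hypothetical illustration: callers see only a Spark-owned interface, and the
// Hadoop dependency is confined to one implementation class.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Spark-facing abstraction: no Hadoop types leak into the signature.
trait CredentialProvider {
  def obtainCredentials(renewer: String): Array[Byte]
}

// Hadoop-backed implementation, kept behind the interface.
class HadoopCredentialProvider(hadoopConf: Configuration) extends CredentialProvider {
  override def obtainCredentials(renewer: String): Array[Byte] = {
    val ugi = UserGroupInformation.getCurrentUser
    // Real code would fetch and serialize delegation tokens here; elided.
    ugi.getUserName.getBytes("UTF-8")
  }
}
```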






spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 13eb37c86 -> 02bbe7311


[SPARK-20584][PYSPARK][SQL] Python generic hint support

## What changes were proposed in this pull request?

Adds `hint` method to PySpark `DataFrame`.

## How was this patch tested?

Unit tests, doctests.

Author: zero323 

Closes #17850 from zero323/SPARK-20584.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02bbe731
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02bbe731
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02bbe731

Branch: refs/heads/master
Commit: 02bbe73118a39e2fb378aa2002449367a92f6d67
Parents: 13eb37c
Author: zero323 
Authored: Wed May 3 19:15:28 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 19:15:28 2017 -0700

--
 python/pyspark/sql/dataframe.py | 29 +
 python/pyspark/sql/tests.py | 16 
 2 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ab6d35b..7b67985 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -380,6 +380,35 @@ class DataFrame(object):
 jdf = self._jdf.withWatermark(eventTime, delayThreshold)
 return DataFrame(jdf, self.sql_ctx)
 
+@since(2.2)
+def hint(self, name, *parameters):
+"""Specifies some hint on the current DataFrame.
+
+:param name: A name of the hint.
+:param parameters: Optional parameters.
+:return: :class:`DataFrame`
+
+>>> df.join(df2.hint("broadcast"), "name").show()
+++---+--+
+|name|age|height|
+++---+--+
+| Bob|  5|85|
+++---+--+
+"""
+if len(parameters) == 1 and isinstance(parameters[0], list):
+parameters = parameters[0]
+
+if not isinstance(name, str):
+raise TypeError("name should be provided as str, got 
{0}".format(type(name)))
+
+for p in parameters:
+if not isinstance(p, str):
+raise TypeError(
+"all parameters should be str, got {0} of type 
{1}".format(p, type(p)))
+
+jdf = self._jdf.hint(name, self._jseq(parameters))
+return DataFrame(jdf, self.sql_ctx)
+
 @since(1.3)
 def count(self):
 """Returns the number of rows in this :class:`DataFrame`.

http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index ce4abf8..f644624 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase):
 # planner should not crash without a join
 broadcast(df1)._jdf.queryExecution().executedPlan()
 
+def test_generic_hints(self):
+from pyspark.sql import DataFrame
+
+df1 = self.spark.range(10e10).toDF("id")
+df2 = self.spark.range(10e10).toDF("id")
+
+self.assertIsInstance(df1.hint("broadcast"), DataFrame)
+self.assertIsInstance(df1.hint("broadcast", []), DataFrame)
+
+# Dummy rules
+self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame)
+self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame)
+
+plan = df1.join(df2.hint("broadcast"), 
"id")._jdf.queryExecution().executedPlan()
+self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
+
 def test_toDF_with_schema_string(self):
 data = [Row(key=i, value=str(i)) for i in range(100)]
 rdd = self.sc.parallelize(data, 5)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 a3a5fcfef -> d8bd213f1


[SPARK-20584][PYSPARK][SQL] Python generic hint support

## What changes were proposed in this pull request?

Adds `hint` method to PySpark `DataFrame`.

## How was this patch tested?

Unit tests, doctests.

Author: zero323 

Closes #17850 from zero323/SPARK-20584.

(cherry picked from commit 02bbe73118a39e2fb378aa2002449367a92f6d67)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8bd213f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8bd213f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8bd213f

Branch: refs/heads/branch-2.2
Commit: d8bd213f13279664d50ffa57c1814d0b16fc5d23
Parents: a3a5fcf
Author: zero323 
Authored: Wed May 3 19:15:28 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 19:15:42 2017 -0700

--
 python/pyspark/sql/dataframe.py | 29 +
 python/pyspark/sql/tests.py | 16 
 2 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index f567cc4..d62ba96 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -371,6 +371,35 @@ class DataFrame(object):
 jdf = self._jdf.withWatermark(eventTime, delayThreshold)
 return DataFrame(jdf, self.sql_ctx)
 
+@since(2.2)
+def hint(self, name, *parameters):
+"""Specifies some hint on the current DataFrame.
+
+:param name: A name of the hint.
+:param parameters: Optional parameters.
+:return: :class:`DataFrame`
+
+>>> df.join(df2.hint("broadcast"), "name").show()
++----+---+------+
+|name|age|height|
++----+---+------+
+| Bob|  5|    85|
++----+---+------+
+"""
+if len(parameters) == 1 and isinstance(parameters[0], list):
+parameters = parameters[0]
+
+if not isinstance(name, str):
+raise TypeError("name should be provided as str, got 
{0}".format(type(name)))
+
+for p in parameters:
+if not isinstance(p, str):
+raise TypeError(
+"all parameters should be str, got {0} of type 
{1}".format(p, type(p)))
+
+jdf = self._jdf.hint(name, self._jseq(parameters))
+return DataFrame(jdf, self.sql_ctx)
+
 @since(1.3)
 def count(self):
 """Returns the number of rows in this :class:`DataFrame`.

http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index cd92148..2aa2d23 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase):
 # planner should not crash without a join
 broadcast(df1)._jdf.queryExecution().executedPlan()
 
+def test_generic_hints(self):
+from pyspark.sql import DataFrame
+
+df1 = self.spark.range(10e10).toDF("id")
+df2 = self.spark.range(10e10).toDF("id")
+
+self.assertIsInstance(df1.hint("broadcast"), DataFrame)
+self.assertIsInstance(df1.hint("broadcast", []), DataFrame)
+
+# Dummy rules
+self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame)
+self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame)
+
+plan = df1.join(df2.hint("broadcast"), 
"id")._jdf.queryExecution().executedPlan()
+self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
+
 def test_toDF_with_schema_string(self):
 data = [Row(key=i, value=str(i)) for i in range(100)]
 rdd = self.sc.parallelize(data, 5)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] spark issue #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17850
  
Merging in master/2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint support

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17850
  
LGTM pending Jenkins.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint s...

2017-05-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17850#discussion_r114677412
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -380,6 +380,35 @@ def withWatermark(self, eventTime, delayThreshold):
 jdf = self._jdf.withWatermark(eventTime, delayThreshold)
 return DataFrame(jdf, self.sql_ctx)
 
+@since(2.3)
--- End diff --

2.2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 6b9e49d12 -> 13eb37c86


[MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and 
add a test for =!=

## What changes were proposed in this pull request?

This PR proposes the following three changes:

- This test does not appear to test `<=>` and is identical to the `===` test 
above, so the test is removed.

  ```diff
  -   test("<=>") {
  - checkAnswer(
  -  testData2.filter($"a" === 1),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
  -
  -checkAnswer(
  -  testData2.filter($"a" === $"b"),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
  -   }
  ```

- Rename the test title from `=!=` to `<=>`, since the test actually 
exercises `<=>`.

  ```diff
  +  private lazy val nullData = Seq(
  +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
  +
...
  -  test("=!=") {
  +  test("<=>") {
  -val nullData = spark.createDataFrame(sparkContext.parallelize(
  -  Row(1, 1) ::
  -  Row(1, 2) ::
  -  Row(1, null) ::
  -  Row(null, null) :: Nil),
  -  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
  -
 checkAnswer(
   nullData.filter($"b" <=> 1),
...
  ```

- Add tests for `=!=`, which were missing.

  ```diff
  +  test("=!=") {
  +checkAnswer(
  +  nullData.filter($"b" =!= 1),
  +  Row(1, 2) :: Nil)
  +
  +checkAnswer(nullData.filter($"b" =!= null), Nil)
  +
  +checkAnswer(
  +  nullData.filter($"a" =!= $"b"),
  +  Row(1, 2) :: Nil)
  +  }
  ```

## How was this patch tested?

Manually running the tests.
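
Not part of the patch: a minimal, self-contained sketch (assuming a local 
SparkSession on Spark 2.x) showing how `===`, `<=>` and `=!=` behave on null 
input, which is what the renamed and added tests exercise:

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustration of ===, <=> and =!= on nulls.
object NullSafeEqualityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-safe-eq").getOrCreate()
    import spark.implicits._

    val nullData = Seq(
      (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b")

    // === is null-unsafe: any comparison involving null is filtered out.
    nullData.filter($"a" === $"b").show()   // keeps only (1, 1)

    // <=> is null-safe: null <=> null evaluates to true.
    nullData.filter($"a" <=> $"b").show()   // keeps (1, 1) and (null, null)

    // =!= negates ===, so null comparisons are still dropped.
    nullData.filter($"a" =!= $"b").show()   // keeps only (1, 2)

    spark.stop()
  }
}
```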

Author: hyukjinkwon 

Closes #17842 from HyukjinKwon/minor-test-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13eb37c8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13eb37c8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13eb37c8

Branch: refs/heads/master
Commit: 13eb37c860c8f672d0e9d9065d0333f981db71e3
Parents: 6b9e49d
Author: hyukjinkwon 
Authored: Wed May 3 13:08:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 13:08:25 2017 -0700

--
 .../spark/sql/ColumnExpressionSuite.scala   | 31 +---
 1 file changed, 14 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/13eb37c8/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
index b0f398d..bc708ca 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
   StructType(Seq(StructField("a", BooleanType), StructField("b", 
BooleanType
   }
 
+  private lazy val nullData = Seq(
+(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
+
   test("column names with space") {
 val df = Seq((1, "a")).toDF("name with space", "name.with.dot")
 
@@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
   test("<=>") {
 checkAnswer(
-  testData2.filter($"a" === 1),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
-
-checkAnswer(
-  testData2.filter($"a" === $"b"),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
-  }
-
-  test("=!=") {
-val nullData = spark.createDataFrame(sparkContext.parallelize(
-  Row(1, 1) ::
-  Row(1, 2) ::
-  Row(1, null) ::
-  Row(null, null) :: Nil),
-  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
-
-checkAnswer(
   nullData.filter($"b" <=> 1),
   Row(1, 1) :: Nil)
 
@@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(
   nullData2.filter($"a" <=> null),
   Row(null) :: Nil)
+  }
 
+  test("=!=") {
+checkAnswer(
+  nullData.filter($"b" =!= 1),
+  Row(1, 2) :: Nil)
+
+checkAnswer(nullData.filter($"b" =!= null), Nil)
+
+checkAnswer(
+  nullData.filter($"a" =!= $"b"),
+  Row(1, 2) :: Nil)
   }
 
   test(">") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 36d807906 -> 2629e7c7a


[MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and 
add a test for =!=

## What changes were proposed in this pull request?

This PR proposes the following three changes:

- This test does not appear to test `<=>` and is identical to the `===` test 
above, so the test is removed.

  ```diff
  -   test("<=>") {
  - checkAnswer(
  -  testData2.filter($"a" === 1),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
  -
  -checkAnswer(
  -  testData2.filter($"a" === $"b"),
  -  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
  -   }
  ```

- Rename the test title from `=!=` to `<=>`, since the test actually 
exercises `<=>`.

  ```diff
  +  private lazy val nullData = Seq(
  +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
  +
...
  -  test("=!=") {
  +  test("<=>") {
  -val nullData = spark.createDataFrame(sparkContext.parallelize(
  -  Row(1, 1) ::
  -  Row(1, 2) ::
  -  Row(1, null) ::
  -  Row(null, null) :: Nil),
  -  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
  -
 checkAnswer(
   nullData.filter($"b" <=> 1),
...
  ```

- Add tests for `=!=`, which were missing.

  ```diff
  +  test("=!=") {
  +checkAnswer(
  +  nullData.filter($"b" =!= 1),
  +  Row(1, 2) :: Nil)
  +
  +checkAnswer(nullData.filter($"b" =!= null), Nil)
  +
  +checkAnswer(
  +  nullData.filter($"a" =!= $"b"),
  +  Row(1, 2) :: Nil)
  +  }
  ```

## How was this patch tested?

Manually running the tests.

Author: hyukjinkwon 

Closes #17842 from HyukjinKwon/minor-test-fix.

(cherry picked from commit 13eb37c860c8f672d0e9d9065d0333f981db71e3)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2629e7c7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2629e7c7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2629e7c7

Branch: refs/heads/branch-2.2
Commit: 2629e7c7a1dacfb267d866cf825fa8a078612462
Parents: 36d8079
Author: hyukjinkwon 
Authored: Wed May 3 13:08:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed May 3 13:08:31 2017 -0700

--
 .../spark/sql/ColumnExpressionSuite.scala   | 31 +---
 1 file changed, 14 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2629e7c7/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
index b0f398d..bc708ca 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
   StructType(Seq(StructField("a", BooleanType), StructField("b", 
BooleanType
   }
 
+  private lazy val nullData = Seq(
+(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, 
None)).toDF("a", "b")
+
   test("column names with space") {
 val df = Seq((1, "a")).toDF("name with space", "name.with.dot")
 
@@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
   test("<=>") {
 checkAnswer(
-  testData2.filter($"a" === 1),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
-
-checkAnswer(
-  testData2.filter($"a" === $"b"),
-  testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
-  }
-
-  test("=!=") {
-val nullData = spark.createDataFrame(sparkContext.parallelize(
-  Row(1, 1) ::
-  Row(1, 2) ::
-  Row(1, null) ::
-  Row(null, null) :: Nil),
-  StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType
-
-checkAnswer(
   nullData.filter($"b" <=> 1),
   Row(1, 1) :: Nil)
 
@@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(
   nullData2.filter($"a" <=> null),
   Row(null) :: Nil)
+  }
 
+  test("=!=") {
+checkAnswer(
+  nullData.filter($"b" =!= 1),
+  Row(1, 2) :: Nil)
+
+checkAnswer(nullData.filter($"b" =!= null), Nil)
+
+checkAnswer(
+  nullData.filter($"a" =!= $"b"),
+  Row(1, 2) :: Nil)
   }
 
   test(">") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[GitHub] spark issue #17842: [MINOR][SQL] Fix the test title from =!= to <=>, remove ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17842
  
Merging in master/branch-2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17839
  
BTW I filed follow-up tickets for Python/R at 
https://issues.apache.org/jira/browse/SPARK-20576



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b1a732fea -> f0e80aa2d


[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in 
SQL and in the DataFrame API. However, while SQL has a standard hint syntax 
(/*+ ... */), the DataFrame API does not, and users are sometimes unsure how 
to apply a broadcast hint to a DataFrame. This ticket adds a generic hint 
function on DataFrame so that the same hints can be used on DataFrames as 
well as in SQL.

As an example, after this patch, the following will apply a broadcast hint on a 
DataFrame using the new hint function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?
Added a test case in DataFrameJoinSuite.
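
Not part of the patch: a small usage sketch (assuming a local SparkSession on 
Spark 2.2+) contrasting the no-parameter DataFrame form, which marks the whole 
hinted Dataset for broadcast, with the SQL form, which names the relations to 
broadcast:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastHintDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint").getOrCreate()
    import spark.implicits._

    val df1 = spark.range(100).toDF("id")
    val df2 = spark.range(10).toDF("id")

    // DataFrame form: with no parameters, the entire hinted Dataset is marked for broadcast.
    df1.join(df2.hint("broadcast"), "id").explain()

    // SQL form: the hint takes table names/aliases, so only the named relation is broadcast.
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")
    spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()

    spark.stop()
  }
}
```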

Author: Reynold Xin <r...@databricks.com>

Closes #17839 from rxin/SPARK-20576.

(cherry picked from commit 527fc5d0c990daaacad4740f62cfe6736609b77b)
Signed-off-by: Reynold Xin <r...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0e80aa2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0e80aa2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0e80aa2

Branch: refs/heads/branch-2.2
Commit: f0e80aa2ddee80819ef33ee24eb6a15a73bc02d5
Parents: b1a732f
Author: Reynold Xin <r...@databricks.com>
Authored: Wed May 3 09:22:25 2017 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Wed May 3 09:22:41 2017 -0700

--
 .../sql/catalyst/analysis/ResolveHints.scala  |  8 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala | 16 
 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +-
 3 files changed, 40 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
index c4827b8..df688fa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
@@ -86,7 +86,13 @@ object ResolveHints {
 
 def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case h: Hint if 
BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
-applyBroadcastHint(h.child, h.parameters.toSet)
+if (h.parameters.isEmpty) {
+  // If there is no table alias specified, turn the entire subtree 
into a BroadcastHint.
+  BroadcastHint(h.child)
+} else {
+  // Otherwise, find within the subtree query plans that should be 
broadcasted.
+  applyBroadcastHint(h.child, h.parameters.toSet)
+}
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 06dd550..5f602dc 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1074,6 +1074,22 @@ class Dataset[T] private[sql](
   def apply(colName: String): Column = col(colName)
 
   /**
+   * Specifies some hint on the current Dataset. As an example, the following 
code specifies
+   * that one of the plan can be broadcasted:
+   *
+   * {{{
+   *   df1.join(df2.hint("broadcast"))
+   * }}}
+   *
+   * @group basic
+   * @since 2.2.0
+   */
+  @scala.annotation.varargs
+  def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan {
+Hint(name, parameters, logicalPlan)
+  }
+
+  /**
* Selects column based on the column name and return it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.

http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
index 541ffb5..4a52af6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@@

spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

2017-05-03 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 27f543b15 -> 527fc5d0c


[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in 
SQL and in the DataFrame API. However, while SQL has a standard hint syntax 
(/*+ ... */), the DataFrame API does not, and users are sometimes unsure how 
to apply a broadcast hint to a DataFrame. This ticket adds a generic hint 
function on DataFrame so that the same hints can be used on DataFrames as 
well as in SQL.

As an example, after this patch, the following will apply a broadcast hint on a 
DataFrame using the new hint function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?
Added a test case in DataFrameJoinSuite.

Author: Reynold Xin <r...@databricks.com>

Closes #17839 from rxin/SPARK-20576.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/527fc5d0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/527fc5d0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/527fc5d0

Branch: refs/heads/master
Commit: 527fc5d0c990daaacad4740f62cfe6736609b77b
Parents: 27f543b
Author: Reynold Xin <r...@databricks.com>
Authored: Wed May 3 09:22:25 2017 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Wed May 3 09:22:25 2017 -0700

--
 .../sql/catalyst/analysis/ResolveHints.scala  |  8 +++-
 .../main/scala/org/apache/spark/sql/Dataset.scala | 16 
 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +-
 3 files changed, 40 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
index c4827b8..df688fa 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala
@@ -86,7 +86,13 @@ object ResolveHints {
 
 def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case h: Hint if 
BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
-applyBroadcastHint(h.child, h.parameters.toSet)
+if (h.parameters.isEmpty) {
+  // If there is no table alias specified, turn the entire subtree 
into a BroadcastHint.
+  BroadcastHint(h.child)
+} else {
+  // Otherwise, find within the subtree query plans that should be 
broadcasted.
+  applyBroadcastHint(h.child, h.parameters.toSet)
+}
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 147e765..620c8bd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1161,6 +1161,22 @@ class Dataset[T] private[sql](
   def apply(colName: String): Column = col(colName)
 
   /**
+   * Specifies some hint on the current Dataset. As an example, the following 
code specifies
+   * that one of the plan can be broadcasted:
+   *
+   * {{{
+   *   df1.join(df2.hint("broadcast"))
+   * }}}
+   *
+   * @group basic
+   * @since 2.2.0
+   */
+  @scala.annotation.varargs
+  def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan {
+Hint(name, parameters, logicalPlan)
+  }
+
+  /**
* Selects column based on the column name and return it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.

http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
index 541ffb5..4a52af6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with 
SharedSQLContext {
   Row(1, 1, 1, 1) :: Row(2, 1, 2, 2) :: Nil)
   }

[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17839
  
Merging in master/branch-2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17839
  
@felixcheung do you worry about conflicts?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17678: [SPARK-20381][SQL] Add SQL metrics of numOutputRows for ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17678
  
cc @gatorsmile can you review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
Let's see what other people say before going too far...

cc @cloud-fan / @hvanhovell / @marmbrus / @gatorsmile see my proposal: 
https://github.com/apache/spark/pull/17770#issuecomment-298833348


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
Which self-join case are you talking about? The one where we manually rewrite 
half of the plan? That one would be a special case anyway, wouldn't it?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
I'm actually wondering if we should just introduce a variant of transform 
that takes a stop condition, e.g.

```
def transform(stopCondition: BaseType => Boolean)(rule: 
PartialFunction[BaseType, BaseType])
```

and then in analyzer we can use

```
def transform(_.resolved) {
  case ...
}
```

I worry that adding this extra node everywhere will make the plan look ugly 
and break certain assumptions we make elsewhere.
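
A rough sketch of that idea on a toy tree, to make the shape concrete (this is 
not Catalyst's TreeNode API; all names below are illustrative):

```scala
// Toy tree with a transform variant that takes a stop condition: subtrees for
// which the condition holds are neither descended into nor rewritten.
sealed trait Node {
  def children: Seq[Node]
  def resolved: Boolean

  def transformUp(stop: Node => Boolean)(rule: PartialFunction[Node, Node]): Node = {
    if (stop(this)) {
      this  // skip the whole subtree, e.g. an already-resolved plan fragment
    } else {
      val withNewChildren = this match {
        case Branch(cs) => Branch(cs.map(_.transformUp(stop)(rule)))
        case leaf       => leaf
      }
      rule.applyOrElse(withNewChildren, (n: Node) => n)
    }
  }
}

case class Leaf(value: Int, resolved: Boolean) extends Node {
  override val children: Seq[Node] = Nil
}

case class Branch(children: Seq[Node]) extends Node {
  override def resolved: Boolean = children.forall(_.resolved)
}

object StopConditionDemo extends App {
  val tree = Branch(Seq(Leaf(1, resolved = true), Leaf(2, resolved = false)))
  // Only the unresolved leaf is rewritten; the resolved one is skipped.
  val rewritten = tree.transformUp(_.resolved) {
    case Leaf(v, false) => Leaf(v * 10, resolved = true)
  }
  println(rewritten)  // Branch(List(Leaf(1,true), Leaf(20,true)))
}
```

Catalyst's real transformUp/transformDown signatures differ, so this only 
shows the stop-condition shape, not a drop-in change.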



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17839
  
Actually somebody should add the Python / R wrapper.

cc @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
Why don't we always add this to the Dataset's logicalPlan? Then we can change 
it in one place.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17770: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-05-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r114478015
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1134,7 +1138,7 @@ class Dataset[T] private[sql](
*/
   @scala.annotation.varargs
   def select(cols: Column*): DataFrame = withPlan {
-Project(cols.map(_.named), logicalPlan)
+Project(cols.map(_.named), AnalysisBarrier(logicalPlan))
--- End diff --

Does this work if we turn off eager analysis?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17839: [SPARK-20576][SQL] Support generic hint function ...

2017-05-03 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17839

[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame

## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in 
SQL and in the DataFrame API. However, while SQL has a standard hint syntax 
(/*+ ... */), the DataFrame API does not, and users are sometimes unsure how 
to apply a broadcast hint to a DataFrame. This ticket adds a generic hint 
function on DataFrame so that the same hints can be used on DataFrames as 
well as in SQL.

As an example, after this patch, the following will apply a broadcast hint 
on a DataFrame using the new hint function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?
Added a test case in DataFrameJoinSuite.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20576

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17839.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17839


commit 921eb21d0bbb65715617a019ce13b53e2c868121
Author: Reynold Xin <r...@databricks.com>
Date:   2017-05-03T06:02:51Z

[SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17806: [SPARK-20487][SQL] Display `serde` for `HiveTableScan` n...

2017-04-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17806
  
@gatorsmile I will let you merge ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17806: [SPARK-20487][SQL] Display `serde` for `HiveTableScan` n...

2017-04-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17806
  
Maybe get rid of the Some? If it is not defined, we probably just shouldn't 
show anything.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17780: [SPARK-20487][SQL] `HiveTableScan` node is quite verbose...

2017-04-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17780
  
Can we at least include the serde?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-20474] Fixing OnHeapColumnVector reallocation

2017-04-26 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 6709bcf6e -> e278876ba


[SPARK-20474] Fixing OnHeapColumnVector reallocation

## What changes were proposed in this pull request?
OnHeapColumnVector reallocation copies data to the new storage only up to 
'elementsAppended'. That counter is only advanced by the ColumnVector.appendX 
API, while ColumnVector.putX is more commonly used, so values written with 
putX could be lost when the vector grows.

## How was this patch tested?
Tested using existing unit tests.
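
Not part of the patch: a simplified sketch of the bug pattern using a toy 
class (the names only mirror the real OnHeapColumnVector; this is not the 
actual implementation):

```scala
// Toy growable int column illustrating why reallocation must copy `capacity`
// elements, not just `elementsAppended`.
class ToyIntColumn(initialCapacity: Int) {
  private var data = new Array[Int](initialCapacity)
  private var elementsAppended = 0

  // Positional write (like putInt): does NOT advance elementsAppended.
  def putInt(rowId: Int, value: Int): Unit = data(rowId) = value

  // Append-style write (like appendInt): advances elementsAppended.
  def appendInt(value: Int): Unit = {
    reserve(elementsAppended + 1)
    data(elementsAppended) = value
    elementsAppended += 1
  }

  def getInt(rowId: Int): Int = data(rowId)

  def reserve(requiredCapacity: Int): Unit = {
    if (requiredCapacity > data.length) {
      val newData = new Array[Int](requiredCapacity * 2)
      // The buggy version copied only `elementsAppended` values, silently
      // dropping anything written through putInt; copying the old capacity
      // (data.length here) preserves them.
      System.arraycopy(data, 0, newData, 0, data.length)
      data = newData
    }
  }
}

object ReserveDemo extends App {
  val col = new ToyIntColumn(4)
  (0 until 4).foreach(i => col.putInt(i, i + 1))  // positional writes only
  col.reserve(8)                                  // force a reallocation
  // Prints Vector(1, 2, 3, 4); copying only elementsAppended (0 here) would
  // instead yield Vector(0, 0, 0, 0).
  println((0 until 4).map(col.getInt))
}
```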

Author: Michal Szafranski 

Closes #17773 from michal-databricks/spark-20474.

(cherry picked from commit a277ae80a2836e6533b338d2b9c4e59ed8a1daae)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e278876b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e278876b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e278876b

Branch: refs/heads/branch-2.2
Commit: e278876ba3d66d3fb249df59c3de8d78ca25c5f0
Parents: 6709bcf
Author: Michal Szafranski 
Authored: Wed Apr 26 12:47:37 2017 -0700
Committer: Reynold Xin 
Committed: Wed Apr 26 12:47:50 2017 -0700

--
 .../vectorized/OnHeapColumnVector.java  | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e278876b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index 9b410ba..94ed322 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -410,53 +410,53 @@ public final class OnHeapColumnVector extends 
ColumnVector {
   int[] newLengths = new int[newCapacity];
   int[] newOffsets = new int[newCapacity];
   if (this.arrayLengths != null) {
-System.arraycopy(this.arrayLengths, 0, newLengths, 0, 
elementsAppended);
-System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, 
elementsAppended);
+System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity);
+System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity);
   }
   arrayLengths = newLengths;
   arrayOffsets = newOffsets;
 } else if (type instanceof BooleanType) {
   if (byteData == null || byteData.length < newCapacity) {
 byte[] newData = new byte[newCapacity];
-if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
elementsAppended);
+if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
capacity);
 byteData = newData;
   }
 } else if (type instanceof ByteType) {
   if (byteData == null || byteData.length < newCapacity) {
 byte[] newData = new byte[newCapacity];
-if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
elementsAppended);
+if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
capacity);
 byteData = newData;
   }
 } else if (type instanceof ShortType) {
   if (shortData == null || shortData.length < newCapacity) {
 short[] newData = new short[newCapacity];
-if (shortData != null) System.arraycopy(shortData, 0, newData, 0, 
elementsAppended);
+if (shortData != null) System.arraycopy(shortData, 0, newData, 0, 
capacity);
 shortData = newData;
   }
 } else if (type instanceof IntegerType || type instanceof DateType ||
   DecimalType.is32BitDecimalType(type)) {
   if (intData == null || intData.length < newCapacity) {
 int[] newData = new int[newCapacity];
-if (intData != null) System.arraycopy(intData, 0, newData, 0, 
elementsAppended);
+if (intData != null) System.arraycopy(intData, 0, newData, 0, 
capacity);
 intData = newData;
   }
 } else if (type instanceof LongType || type instanceof TimestampType ||
 DecimalType.is64BitDecimalType(type)) {
   if (longData == null || longData.length < newCapacity) {
 long[] newData = new long[newCapacity];
-if (longData != null) System.arraycopy(longData, 0, newData, 0, 
elementsAppended);
+if (longData != null) System.arraycopy(longData, 0, newData, 0, 
capacity);
 longData = newData;
   }
 } else if (type instanceof FloatType) {
   if (floatData == null || floatData.length < newCapacity) {
 float[] newData = new float[newCapacity];
-if (floatData != null) System.arraycopy(floatData, 0, 

spark git commit: [SPARK-20474] Fixing OnHeapColumnVector reallocation

2017-04-26 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 99c6cf9ef -> a277ae80a


[SPARK-20474] Fixing OnHeapColumnVector reallocation

## What changes were proposed in this pull request?
OnHeapColumnVector reallocation copies data to the new storage only up to 
'elementsAppended'. That counter is only advanced by the ColumnVector.appendX 
API, while ColumnVector.putX is more commonly used, so values written with 
putX could be lost when the vector grows.

## How was this patch tested?
Tested using existing unit tests.

Author: Michal Szafranski 

Closes #17773 from michal-databricks/spark-20474.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a277ae80
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a277ae80
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a277ae80

Branch: refs/heads/master
Commit: a277ae80a2836e6533b338d2b9c4e59ed8a1daae
Parents: 99c6cf9
Author: Michal Szafranski 
Authored: Wed Apr 26 12:47:37 2017 -0700
Committer: Reynold Xin 
Committed: Wed Apr 26 12:47:37 2017 -0700

--
 .../vectorized/OnHeapColumnVector.java  | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a277ae80/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index 9b410ba..94ed322 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -410,53 +410,53 @@ public final class OnHeapColumnVector extends 
ColumnVector {
   int[] newLengths = new int[newCapacity];
   int[] newOffsets = new int[newCapacity];
   if (this.arrayLengths != null) {
-System.arraycopy(this.arrayLengths, 0, newLengths, 0, 
elementsAppended);
-System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, 
elementsAppended);
+System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity);
+System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity);
   }
   arrayLengths = newLengths;
   arrayOffsets = newOffsets;
 } else if (type instanceof BooleanType) {
   if (byteData == null || byteData.length < newCapacity) {
 byte[] newData = new byte[newCapacity];
-if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
elementsAppended);
+if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
capacity);
 byteData = newData;
   }
 } else if (type instanceof ByteType) {
   if (byteData == null || byteData.length < newCapacity) {
 byte[] newData = new byte[newCapacity];
-if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
elementsAppended);
+if (byteData != null) System.arraycopy(byteData, 0, newData, 0, 
capacity);
 byteData = newData;
   }
 } else if (type instanceof ShortType) {
   if (shortData == null || shortData.length < newCapacity) {
 short[] newData = new short[newCapacity];
-if (shortData != null) System.arraycopy(shortData, 0, newData, 0, 
elementsAppended);
+if (shortData != null) System.arraycopy(shortData, 0, newData, 0, 
capacity);
 shortData = newData;
   }
 } else if (type instanceof IntegerType || type instanceof DateType ||
   DecimalType.is32BitDecimalType(type)) {
   if (intData == null || intData.length < newCapacity) {
 int[] newData = new int[newCapacity];
-if (intData != null) System.arraycopy(intData, 0, newData, 0, 
elementsAppended);
+if (intData != null) System.arraycopy(intData, 0, newData, 0, 
capacity);
 intData = newData;
   }
 } else if (type instanceof LongType || type instanceof TimestampType ||
 DecimalType.is64BitDecimalType(type)) {
   if (longData == null || longData.length < newCapacity) {
 long[] newData = new long[newCapacity];
-if (longData != null) System.arraycopy(longData, 0, newData, 0, 
elementsAppended);
+if (longData != null) System.arraycopy(longData, 0, newData, 0, 
capacity);
 longData = newData;
   }
 } else if (type instanceof FloatType) {
   if (floatData == null || floatData.length < newCapacity) {
 float[] newData = new float[newCapacity];
-if (floatData != null) System.arraycopy(floatData, 0, newData, 0, 
elementsAppended);
+if (floatData != null) System.arraycopy(floatData, 0, newData, 0, 
capacity);
 

[GitHub] spark issue #17773: [SPARK-20474] Fixing OnHeapColumnVector reallocation

2017-04-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17773
  
Merging in master/branch-2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-20473] Enabling missing types in ColumnVector.Array

2017-04-26 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 b65858bb3 -> 6709bcf6e


[SPARK-20473] Enabling missing types in ColumnVector.Array

## What changes were proposed in this pull request?
ColumnVector implementations originally did not support some Catalyst types 
(float, short, and boolean). Now that they do, those types should also be 
supported by ColumnVector.Array.

## How was this patch tested?
Tested using existing unit tests.
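
The added overrides simply delegate to the backing vector at an offset; a toy 
equivalent of that access pattern (illustrative names only, not the real API):

```scala
// Toy "array slice" over a flat buffer: element i of the slice lives at
// offset + i in the backing data, which is all the new getBoolean/getShort/
// getFloat overrides delegate to.
class FloatSlice(data: Array[Float], offset: Int, length: Int) {
  def getFloat(ordinal: Int): Float = {
    require(ordinal >= 0 && ordinal < length, s"ordinal $ordinal out of bounds")
    data(offset + ordinal)
  }
}

object SliceDemo extends App {
  val backing = Array(1.0f, 2.0f, 3.0f, 4.0f)
  val slice = new FloatSlice(backing, offset = 1, length = 2)
  println(slice.getFloat(0))  // 2.0
  println(slice.getFloat(1))  // 3.0
}
```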

Author: Michal Szafranski 

Closes #17772 from michal-databricks/spark-20473.

(cherry picked from commit 99c6cf9ef16bf8fae6edb23a62e46546a16bca80)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6709bcf6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6709bcf6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6709bcf6

Branch: refs/heads/branch-2.2
Commit: 6709bcf6e66e99e17ba2a3b1482df2dba1a15716
Parents: b65858b
Author: Michal Szafranski 
Authored: Wed Apr 26 11:21:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed Apr 26 11:21:57 2017 -0700

--
 .../apache/spark/sql/execution/vectorized/ColumnVector.java| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6709bcf6/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
index 354c878..b105e60 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
@@ -180,7 +180,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public boolean getBoolean(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getBoolean(offset + ordinal);
 }
 
 @Override
@@ -188,7 +188,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public short getShort(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getShort(offset + ordinal);
 }
 
 @Override
@@ -199,7 +199,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public float getFloat(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getFloat(offset + ordinal);
 }
 
 @Override


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-20473] Enabling missing types in ColumnVector.Array

2017-04-26 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 66dd5b83f -> 99c6cf9ef


[SPARK-20473] Enabling missing types in ColumnVector.Array

## What changes were proposed in this pull request?
ColumnVector implementations originally did not support some Catalyst types 
(float, short, and boolean). Now that they do, those types should also be 
supported by ColumnVector.Array.

## How was this patch tested?
Tested using existing unit tests.

Author: Michal Szafranski 

Closes #17772 from michal-databricks/spark-20473.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99c6cf9e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99c6cf9e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99c6cf9e

Branch: refs/heads/master
Commit: 99c6cf9ef16bf8fae6edb23a62e46546a16bca80
Parents: 66dd5b8
Author: Michal Szafranski 
Authored: Wed Apr 26 11:21:25 2017 -0700
Committer: Reynold Xin 
Committed: Wed Apr 26 11:21:25 2017 -0700

--
 .../apache/spark/sql/execution/vectorized/ColumnVector.java| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/99c6cf9e/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
index 354c878..b105e60 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
@@ -180,7 +180,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public boolean getBoolean(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getBoolean(offset + ordinal);
 }
 
 @Override
@@ -188,7 +188,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public short getShort(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getShort(offset + ordinal);
 }
 
 @Override
@@ -199,7 +199,7 @@ public abstract class ColumnVector implements AutoCloseable 
{
 
 @Override
 public float getFloat(int ordinal) {
-  throw new UnsupportedOperationException();
+  return data.getFloat(offset + ordinal);
 }
 
 @Override


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] spark issue #17772: [SPARK-20473] Enabling missing types in ColumnVector.Arr...

2017-04-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17772
  
Merging in master / branch-2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re-enteri...

2017-04-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17770
  
Can we fix the description? It is really confusing since it uses the word 
"exchange". Also, can we just skip a plan in transform if it is already 
resolved?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17727: [SQL][MINOR] Remove misleading comment (and tags do bett...

2017-04-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17727
  
Hm I don't think the comment makes sense ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT

2017-04-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 5280d93e6 -> f44c8a843


[SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT

This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen 

Closes #17753 from JoshRosen/SPARK-20453.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f44c8a84
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f44c8a84
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f44c8a84

Branch: refs/heads/master
Commit: f44c8a843ca512b319f099477415bc13eca2e373
Parents: 5280d93
Author: Josh Rosen 
Authored: Mon Apr 24 21:48:04 2017 -0700
Committer: Reynold Xin 
Committed: Mon Apr 24 21:48:04 2017 -0700

--
 assembly/pom.xml  | 2 +-
 common/network-common/pom.xml | 2 +-
 common/network-shuffle/pom.xml| 2 +-
 common/network-yarn/pom.xml   | 2 +-
 common/sketch/pom.xml | 2 +-
 common/tags/pom.xml   | 2 +-
 common/unsafe/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 docs/_config.yml  | 4 ++--
 examples/pom.xml  | 2 +-
 external/docker-integration-tests/pom.xml | 2 +-
 external/flume-assembly/pom.xml   | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-0-10-assembly/pom.xml  | 2 +-
 external/kafka-0-10-sql/pom.xml   | 2 +-
 external/kafka-0-10/pom.xml   | 2 +-
 external/kafka-0-8-assembly/pom.xml   | 2 +-
 external/kafka-0-8/pom.xml| 2 +-
 external/kinesis-asl-assembly/pom.xml | 2 +-
 external/kinesis-asl/pom.xml  | 2 +-
 external/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml| 2 +-
 launcher/pom.xml  | 2 +-
 mllib-local/pom.xml   | 2 +-
 mllib/pom.xml | 2 +-
 pom.xml   | 2 +-
 project/MimaExcludes.scala| 5 +
 repl/pom.xml  | 2 +-
 resource-managers/mesos/pom.xml   | 2 +-
 resource-managers/yarn/pom.xml| 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 37 files changed, 42 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 9d8607d..742a4a1 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.2.0-SNAPSHOT
+2.3.0-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-common/pom.xml
--
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 8657af7..066970f 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.2.0-SNAPSHOT
+2.3.0-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-shuffle/pom.xml
--
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 24c10fb..2de882a 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.2.0-SNAPSHOT
+2.3.0-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-yarn/pom.xml
--
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 5e5a80b..a8488d8 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.11
-2.2.0-SNAPSHOT
+2.3.0-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/sketch/pom.xml
--
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index 1356c47..6b81fc2 100644
--- a/common/sketch/pom.xml
+++ b/common/sketch/pom.xml
@@ -22,7 +22,7 @@
  

[GitHub] spark issue #17753: [SPARK-20453] Bump master branch version to 2.3.0-SNAPSH...

2017-04-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17753
  
Merging in master.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-04-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14731
  
Steve, I think the main point is that you should also respect the time of 
reviewers. The way most of your pull requests manifest has been suboptimal: 
they often start with a very early WIP (which is not necessarily a problem), 
and then once in a while (e.g. after a month or two) you update it to almost 
completely change it. That cadence itself is a problem; it requires a lot of 
context switching to review your pull requests. In addition, every time you 
update it, it looks like a completely new giant pull request.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2017-04-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17648
  
sgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17736: [SPARK-20399][SQL] Can't use same regex pattern between ...

2017-04-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17736
  
cc @hvanhovell for review ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17712
  
Why use a map? That's super unstructured and easy to break ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17712
  
cc @gatorsmile 

This is related to the deterministic thing you want to do?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17717: [SPARK-20430][SQL] Initialise RangeExec parameters in a ...

2017-04-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17717
  
LGTM pending Jenkins.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17717: [SPARK-20430][SQL] Initialise RangeExec parameter...

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17717#discussion_r112803232
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -1732,4 +1732,10 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   .filter($"x1".isNotNull || !$"y".isin("a!"))
   .count
   }
+
+  test("SPARK-20430 Initialize Range parameters in a deriver side") {
--- End diff --

driver


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17717: [SPARK-20430][SQL] Initialise RangeExec parameter...

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17717#discussion_r112803234
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -1732,4 +1732,10 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   .filter($"x1".isNotNull || !$"y".isin("a!"))
   .count
   }
+
+  test("SPARK-20430 Initialize Range parameters in a deriver side") {
--- End diff --

also move this into dataframe range suite?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17712#discussion_r112803097
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -45,14 +45,33 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+name: Option[String]) {
+
+  // Optionally used for printing an UDF name in EXPLAIN
+  def withName(name: String): UserDefinedFunction = {
+UserDefinedFunction(f, dataType, inputTypes, Option(name))
+  }
 
   /**
* Returns an expression that invokes the UDF, using the given arguments.
*
* @since 1.3.0
*/
   def apply(exprs: Column*): Column = {
-Column(ScalaUDF(f, dataType, exprs.map(_.expr), 
inputTypes.getOrElse(Nil)))
+Column(ScalaUDF(f, dataType, exprs.map(_.expr), 
inputTypes.getOrElse(Nil), name))
+  }
+}
+
+object UserDefinedFunction {
--- End diff --

ah ok - that sucks. that means this will break compatibility ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17712#discussion_r112800640
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -45,14 +45,33 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[DataType]]) {
+inputTypes: Option[Seq[DataType]],
+name: Option[String]) {
+
+  // Optionally used for printing an UDF name in EXPLAIN
+  def withName(name: String): UserDefinedFunction = {
+UserDefinedFunction(f, dataType, inputTypes, Option(name))
+  }
 
   /**
* Returns an expression that invokes the UDF, using the given arguments.
*
* @since 1.3.0
*/
   def apply(exprs: Column*): Column = {
-Column(ScalaUDF(f, dataType, exprs.map(_.expr), 
inputTypes.getOrElse(Nil)))
+Column(ScalaUDF(f, dataType, exprs.map(_.expr), 
inputTypes.getOrElse(Nil), name))
+  }
+}
+
+object UserDefinedFunction {
--- End diff --

also need an unapply function


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2017-04-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17648
  
I was saying rather than implementing them, just rewrite them into an 
aggregate on the conditions and compare them against the value.
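For illustration, a minimal, hypothetical sketch of that rewrite idea (not code 
from any patch; the data and column names are invented): EVERY/ANY over a 
predicate become MIN/MAX over a 0/1 encoding of the predicate, compared against 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min, when}

object EveryAnyRewriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("every-any-sketch").getOrCreate()

    // Invented sample data: a boolean predicate column `p` and a grouping key `grp`.
    val df = spark.range(10).selectExpr("id % 2 = 0 AS p", "id % 3 AS grp")

    df.groupBy("grp").agg(
      // EVERY(p): the minimum of the 0/1-encoded predicate must be 1.
      (min(when(col("p"), 1).otherwise(0)) === 1).as("every_p"),
      // ANY(p) / SOME(p): the maximum of the 0/1-encoded predicate must be 1.
      (max(when(col("p"), 1).otherwise(0)) === 1).as("any_p")
    ).show()

    spark.stop()
  }
}
```

NULL handling is glossed over in this sketch; a real rewrite would have to 
preserve SQL's three-valued semantics for the predicate.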



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17712#discussion_r112754224
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -47,12 +47,20 @@ case class UserDefinedFunction protected[sql] (
 dataType: DataType,
 inputTypes: Option[Seq[DataType]]) {
 
+  // Optionally used for printing UDF names in EXPLAIN
+  private var nameOption: Option[String] = None
--- End diff --

it will be fine if we add an explicit apply method and unapply method.
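A generic sketch of that apply/unapply pattern, with invented names rather than 
the real UDF class: a hand-written companion keeps old-arity constructor calls 
and `case`-style matches compiling while new fields are added. This only 
illustrates the idea, not the shape the PR ended up with.

```scala
// Invented example of the explicit apply/unapply compatibility pattern.
class Descriptor(val f: AnyRef, val tag: String, val name: Option[String] = None)

object Descriptor {
  // Factory with the original arity, so existing `Descriptor(f, tag)` calls compile.
  def apply(f: AnyRef, tag: String): Descriptor = new Descriptor(f, tag)

  // Extractor with the original shape, so `case Descriptor(f, tag)` matches keep working.
  def unapply(d: Descriptor): Option[(AnyRef, String)] = Some((d.f, d.tag))
}

// Usage: both the old constructor-style call and the old pattern still work.
// val d = Descriptor("fn", "int")
// d match { case Descriptor(fn, tag) => println(tag) }
```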



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-20420][SQL] Add events to the external catalog

2017-04-21 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 48d760d02 -> e2b3d2367


[SPARK-20420][SQL] Add events to the external catalog

## What changes were proposed in this pull request?
It is often useful to be able to track changes to the `ExternalCatalog`. This 
PR makes the `ExternalCatalog` emit events when a catalog object is changed. 
Events are fired before and after the change.

The following events are fired per object:

- Database
  - CreateDatabasePreEvent: event fired before the database is created.
  - CreateDatabaseEvent: event fired after the database has been created.
  - DropDatabasePreEvent: event fired before the database is dropped.
  - DropDatabaseEvent: event fired after the database has been dropped.
- Table
  - CreateTablePreEvent: event fired before the table is created.
  - CreateTableEvent: event fired after the table has been created.
  - RenameTablePreEvent: event fired before the table is renamed.
  - RenameTableEvent: event fired after the table has been renamed.
  - DropTablePreEvent: event fired before the table is dropped.
  - DropTableEvent: event fired after the table has been dropped.
- Function
  - CreateFunctionPreEvent: event fired before the function is created.
  - CreateFunctionEvent: event fired after the function has been created.
  - RenameFunctionPreEvent: event fired before the function is renamed.
  - RenameFunctionEvent: event fired after the function has been renamed.
  - DropFunctionPreEvent: event fired before the function is dropped.
  - DropFunctionEvent: event fired after the function has been dropped.

The events currently only contain the names of the modified objects; more 
events and more details can be added at a later point.

A user can monitor changes to the external catalog by adding a listener to the 
Spark listener bus and checking for `ExternalCatalogEvent`s using the 
`SparkListener.onOtherEvent` hook. A more direct approach is to add a listener 
directly to the `ExternalCatalog`.
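
For reference, a rough sketch of the listener-bus approach (only the event 
class names listed above are taken from this commit; the listener itself is 
illustrative, and no event fields are accessed since they are not shown here):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.catalyst.catalog.{CreateTableEvent, ExternalCatalogEvent}

// Minimal sketch: catalog events arrive on the listener bus through the
// generic onOtherEvent hook, so a listener just pattern-matches on them.
class CatalogAuditListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: CreateTableEvent     => println(s"table created: $e")
    case e: ExternalCatalogEvent => println(s"catalog changed: $e")
    case _                       => // not a catalog event; ignore
  }
}

// Registration (assuming an active SparkSession named `spark`):
// spark.sparkContext.addSparkListener(new CatalogAuditListener())
```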

## How was this patch tested?
Added the `ExternalCatalogEventSuite`.

Author: Herman van Hovell 

Closes #17710 from hvanhovell/SPARK-20420.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2b3d236
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2b3d236
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2b3d236

Branch: refs/heads/master
Commit: e2b3d2367a563d4600d8d87b5317e71135c362f0
Parents: 48d760d
Author: Herman van Hovell 
Authored: Fri Apr 21 00:05:03 2017 -0700
Committer: Reynold Xin 
Committed: Fri Apr 21 00:05:03 2017 -0700

--
 .../sql/catalyst/catalog/ExternalCatalog.scala  |  85 -
 .../sql/catalyst/catalog/InMemoryCatalog.scala  |  22 ++-
 .../spark/sql/catalyst/catalog/events.scala | 158 
 .../catalog/ExternalCatalogEventSuite.scala | 188 +++
 .../apache/spark/sql/internal/SharedState.scala |   7 +
 .../spark/sql/hive/HiveExternalCatalog.scala|  22 ++-
 6 files changed, 457 insertions(+), 25 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e2b3d236/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
index 08a01e8..974ef90 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.catalog
 import org.apache.spark.sql.catalyst.analysis.{FunctionAlreadyExistsException, 
NoSuchDatabaseException, NoSuchFunctionException, NoSuchTableException}
 import org.apache.spark.sql.catalyst.expressions.Expression
 import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.ListenerBus
 
 /**
  * Interface for the system catalog (of functions, partitions, tables, and 
databases).
@@ -30,7 +31,8 @@ import org.apache.spark.sql.types.StructType
  *
  * Implementations should throw [[NoSuchDatabaseException]] when databases 
don't exist.
  */
-abstract class ExternalCatalog {
+abstract class ExternalCatalog
+  extends ListenerBus[ExternalCatalogEventListener, ExternalCatalogEvent] {
   import CatalogTypes.TablePartitionSpec
 
   protected def requireDbExists(db: String): Unit = {
@@ -61,9 +63,22 @@ abstract class ExternalCatalog {
   // Databases
   // --
 
-  def createDatabase(dbDefinition: CatalogDatabase, ignoreIfExists: Boolean): 

spark git commit: [SPARK-20420][SQL] Add events to the external catalog

2017-04-21 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 6cd2f16b1 -> cddb4b7db


[SPARK-20420][SQL] Add events to the external catalog

## What changes were proposed in this pull request?
It is often useful to be able to track changes to the `ExternalCatalog`. This 
PR makes the `ExternalCatalog` emit events when a catalog object is changed. 
Events are fired before and after the change.

The following events are fired per object:

- Database
  - CreateDatabasePreEvent: event fired before the database is created.
  - CreateDatabaseEvent: event fired after the database has been created.
  - DropDatabasePreEvent: event fired before the database is dropped.
  - DropDatabaseEvent: event fired after the database has been dropped.
- Table
  - CreateTablePreEvent: event fired before the table is created.
  - CreateTableEvent: event fired after the table has been created.
  - RenameTablePreEvent: event fired before the table is renamed.
  - RenameTableEvent: event fired after the table has been renamed.
  - DropTablePreEvent: event fired before the table is dropped.
  - DropTableEvent: event fired after the table has been dropped.
- Function
  - CreateFunctionPreEvent: event fired before the function is created.
  - CreateFunctionEvent: event fired after the function has been created.
  - RenameFunctionPreEvent: event fired before the function is renamed.
  - RenameFunctionEvent: event fired after the function has been renamed.
  - DropFunctionPreEvent: event fired before the function is dropped.
  - DropFunctionEvent: event fired after the function has been dropped.

The events currently only contain the names of the modified objects; more 
events and more details can be added at a later point.

A user can monitor changes to the external catalog by adding a listener to the 
Spark listener bus and checking for `ExternalCatalogEvent`s using the 
`SparkListener.onOtherEvent` hook. A more direct approach is to add a listener 
directly to the `ExternalCatalog`.

## How was this patch tested?
Added the `ExternalCatalogEventSuite`.

Author: Herman van Hovell 

Closes #17710 from hvanhovell/SPARK-20420.

(cherry picked from commit e2b3d2367a563d4600d8d87b5317e71135c362f0)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cddb4b7d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cddb4b7d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cddb4b7d

Branch: refs/heads/branch-2.2
Commit: cddb4b7db81b01b4abf2ab683aba97e4eabb9769
Parents: 6cd2f16
Author: Herman van Hovell 
Authored: Fri Apr 21 00:05:03 2017 -0700
Committer: Reynold Xin 
Committed: Fri Apr 21 00:05:10 2017 -0700

--
 .../sql/catalyst/catalog/ExternalCatalog.scala  |  85 -
 .../sql/catalyst/catalog/InMemoryCatalog.scala  |  22 ++-
 .../spark/sql/catalyst/catalog/events.scala | 158 
 .../catalog/ExternalCatalogEventSuite.scala | 188 +++
 .../apache/spark/sql/internal/SharedState.scala |   7 +
 .../spark/sql/hive/HiveExternalCatalog.scala|  22 ++-
 6 files changed, 457 insertions(+), 25 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cddb4b7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
index 08a01e8..974ef90 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
@@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.catalog
 import org.apache.spark.sql.catalyst.analysis.{FunctionAlreadyExistsException, 
NoSuchDatabaseException, NoSuchFunctionException, NoSuchTableException}
 import org.apache.spark.sql.catalyst.expressions.Expression
 import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.ListenerBus
 
 /**
  * Interface for the system catalog (of functions, partitions, tables, and 
databases).
@@ -30,7 +31,8 @@ import org.apache.spark.sql.types.StructType
  *
  * Implementations should throw [[NoSuchDatabaseException]] when databases 
don't exist.
  */
-abstract class ExternalCatalog {
+abstract class ExternalCatalog
+  extends ListenerBus[ExternalCatalogEventListener, ExternalCatalogEvent] {
   import CatalogTypes.TablePartitionSpec
 
   protected def requireDbExists(db: String): Unit = {
@@ -61,9 +63,22 @@ abstract class ExternalCatalog {
   // Databases
   // 

[GitHub] spark issue #17710: [SPARK-20420][SQL] Add events to the external catalog

2017-04-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17710
  
Merging in master/branch-2.2.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN

2017-04-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17712#discussion_r112622098
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -47,12 +47,20 @@ case class UserDefinedFunction protected[sql] (
 dataType: DataType,
 inputTypes: Option[Seq[DataType]]) {
 
+  // Optionally used for printing UDF names in EXPLAIN
+  private var nameOption: Option[String] = None
--- End diff --

can we create a new instance instead so this is immutable?
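A tiny invented illustration of the alternative being asked for here: instead 
of mutating a private var after construction, return a fresh instance carrying 
the name.

```scala
// Invented names; only the immutability pattern matters here.
final case class Udf(f: AnyRef, name: Option[String] = None) {
  // Immutable alternative to a private var: copy the instance with the name set.
  def withName(n: String): Udf = copy(name = Some(n))
}

// val named = Udf(f = "fn").withName("my_udf")   // the original instance is untouched
```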



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17711: [SPARK-19951][SQL] Add string concatenate operator || to...

2017-04-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17711
  
can you add a test case in sql query file tests? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17711: [SPARK-19951][SQL] Add string concatenate operato...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17711#discussion_r112590613
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1483,4 +1483,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
   query: LogicalPlan): LogicalPlan = {
 RepartitionByExpression(expressions, query, conf.numShufflePartitions)
   }
+
+  /**
+   * Create a [[Concat]] expression for pipeline concatenation.
+   */
+  override def visitConcat(ctx: ConcatContext): Expression = {
+val exprs = ctx.primaryExpression().asScala
+Concat(expression(exprs.head) +: exprs.drop(1).map(expression))
--- End diff --

isn't this just `expression(exprs)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17705: [SPARK-20410][SQL] Make sparkConf a def in SharedSQLCont...

2017-04-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/17705
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17699: [SPARK-20405][SQL] Dataset.withNewExecutionId sho...

2017-04-20 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17699

[SPARK-20405][SQL] Dataset.withNewExecutionId should be private

## What changes were proposed in this pull request?
Dataset.withNewExecutionId is only used in Dataset itself and should be 
private.

## How was this patch tested?
N/A - this is a simple visibility change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20405

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17699.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17699


commit 0c24ad2b38a9369399daf5a79b137c8b5495d2ca
Author: Reynold Xin <r...@databricks.com>
Date:   2017-04-20T07:37:07Z

[SPARK-20405][SQL] Dataset.withNewExecutionId should be private




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17698: [SPARK-20403][SQL][Documentation]Modify the instr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17698#discussion_r112383091
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -1036,3 +1036,8 @@ case class UpCast(child: Expression, dataType: 
DataType, walkedTypePath: Seq[Str
   extends UnaryExpression with Unevaluable {
   override lazy val resolved = false
 }
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Casts the value `expr` to the target data type 
`_FUNC_`.")
+class CastAlias(child: Expression, dataType: DataType, timeZoneId: 
Option[String] = None)
--- End diff --

will this work with our pattern matches in the query optimizer?
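A toy example of why this question matters (an invented mini expression tree, 
not Catalyst itself): optimizer-style rules that match on the concrete case 
class will silently pass over a look-alike class.

```scala
object MatchSketch {
  // Invented mini expression tree; only the matching behaviour is the point.
  sealed trait Expr
  case class Lit(v: Int) extends Expr
  case class Cast(child: Expr) extends Expr
  class CastAlias(val child: Expr) extends Expr   // looks like a Cast, but is not one

  def pushDown(e: Expr): Expr = e match {
    case Cast(c) => c        // fires for Cast ...
    case other   => other    // ... while a CastAlias falls through untouched
  }

  // pushDown(Cast(Lit(1)))          returns Lit(1)
  // pushDown(new CastAlias(Lit(1))) is returned unchanged by this rule
}
```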



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112382152
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala ---
@@ -0,0 +1,568 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
+
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.sql.{Date, Timestamp}
+import java.text.SimpleDateFormat
+import java.util.Locale
+
+import com.google.common.io.Files
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot}
+import org.apache.arrow.vector.file.json.JsonFileReader
+import org.apache.arrow.vector.util.Validator
+import org.json4s.jackson.JsonMethods._
+import org.json4s.JsonAST._
+import org.json4s.JsonDSL._
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+
+class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll 
{
+  import testImplicits._
+
+  private var tempDataPath: String = _
+
+  private def collectAsArrow(df: DataFrame,
+ converter: Option[ArrowConverters] = None): 
ArrowPayload = {
+val cnvtr = converter.getOrElse(new ArrowConverters)
+val payloadByteArrays = df.toArrowPayloadBytes().collect()
+cnvtr.readPayloadByteArrays(payloadByteArrays)
+  }
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+tempDataPath = Utils.createTempDir(namePrefix = 
"arrow").getAbsolutePath
+  }
+
+  test("collect to arrow record batch") {
+val indexData = (1 to 6).toDF("i")
+val arrowPayload = collectAsArrow(indexData)
+assert(arrowPayload.nonEmpty)
+val arrowBatches = arrowPayload.toArray
+assert(arrowBatches.length == indexData.rdd.getNumPartitions)
+val rowCount = arrowBatches.map(batch => batch.getLength).sum
+assert(rowCount === indexData.count())
+arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0))
+arrowBatches.foreach(batch => batch.close())
+  }
+
+  test("numeric type conversion") {
+collectAndValidate(indexData)
+collectAndValidate(shortData)
+collectAndValidate(intData)
+collectAndValidate(longData)
+collectAndValidate(floatData)
+collectAndValidate(doubleData)
+  }
+
+  test("mixed numeric type conversion") {
+collectAndValidate(mixedNumericData)
+  }
+
+  test("boolean type conversion") {
+collectAndValidate(boolData)
+  }
+
+  test("string type conversion") {
+collectAndValidate(stringData)
+  }
+
+  test("byte type conversion") {
+collectAndValidate(byteData)
+  }
+
+  ignore("timestamp conversion") {
+collectAndValidate(timestampData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("date conversion") {
+// collectAndValidate(dateTimeData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("binary type conversion") {
+// collectAndValidate(binaryData)
+  }
+
+  test("floating-point NaN") {
+collectAndValidate(floatNaNData)
+  }
+
+  test("partitioned DataFrame") {
+val converter = new ArrowConverters
+val schema = testData2.schema
+val arrowPayload = collectAsArrow(testData2, Some(converter))
+val arrowBatches = arrowPayload.toArray
+// NOTE: testData2 should have 2 partitions -> 2 arrow batches in 
payload
+assert(arrowBatches.length === 2)
+val pl1 = new ArrowStaticPayload(arrowBatches(0))
+val pl2 = new ArrowStaticPayload

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112381608
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala ---
@@ -0,0 +1,568 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
+
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.sql.{Date, Timestamp}
+import java.text.SimpleDateFormat
+import java.util.Locale
+
+import com.google.common.io.Files
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot}
+import org.apache.arrow.vector.file.json.JsonFileReader
+import org.apache.arrow.vector.util.Validator
+import org.json4s.jackson.JsonMethods._
+import org.json4s.JsonAST._
+import org.json4s.JsonDSL._
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+
+class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll 
{
+  import testImplicits._
+
+  private var tempDataPath: String = _
+
+  private def collectAsArrow(df: DataFrame,
+ converter: Option[ArrowConverters] = None): 
ArrowPayload = {
+val cnvtr = converter.getOrElse(new ArrowConverters)
+val payloadByteArrays = df.toArrowPayloadBytes().collect()
+cnvtr.readPayloadByteArrays(payloadByteArrays)
+  }
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+tempDataPath = Utils.createTempDir(namePrefix = 
"arrow").getAbsolutePath
+  }
+
+  test("collect to arrow record batch") {
+val indexData = (1 to 6).toDF("i")
+val arrowPayload = collectAsArrow(indexData)
+assert(arrowPayload.nonEmpty)
+val arrowBatches = arrowPayload.toArray
+assert(arrowBatches.length == indexData.rdd.getNumPartitions)
+val rowCount = arrowBatches.map(batch => batch.getLength).sum
+assert(rowCount === indexData.count())
+arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0))
+arrowBatches.foreach(batch => batch.close())
+  }
+
+  test("numeric type conversion") {
+collectAndValidate(indexData)
--- End diff --

separate these into different test cases, and please inline the data 
directly in each test case. It's pretty annoying to have to jump around.
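Roughly, the suggested layout would look like the sketch below; the suite name 
and the validation helper are placeholders for illustration, not the code under 
review.

```scala
// Rough sketch of the suggested style: one test per conversion, with the input
// data created inline in the test body instead of shared fixtures elsewhere.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.test.SharedSQLContext

class InlineDataStyleSuite extends SharedSQLContext {
  import testImplicits._

  private def collectAndValidate(df: DataFrame): Unit = {
    assert(df.collect().nonEmpty)   // stand-in for the real Arrow round-trip check
  }

  test("string type conversion") {
    collectAndValidate(Seq("a", "b", "c").toDF("str"))   // data lives in the test
  }

  test("numeric type conversion - int") {
    collectAndValidate(Seq(1, 2, 3).toDF("i"))           // data lives in the test
  }
}
```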



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112376143
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,432 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * TODO: This is available in arrow-vector now with ARROW-615, to be 
included in 0.2.1 release
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  /**
+   * Iterate over the rows and convert to an ArrowPayload, using 
RootAllocator from this class
+   */
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, _allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  /**
+   * Read an Array of Arrow Record batches as byte Arrays into an 
ArrowPayload, using
+   * RootAllocator from this class
+   */
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = p

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112376037
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,432 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * TODO: This is available in arrow-vector now with ARROW-615, to be 
included in 0.2.1 release
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  /**
+   * Iterate over the rows and convert to an ArrowPayload, using 
RootAllocator from this class
+   */
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, _allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  /**
+   * Read an Array of Arrow Record batches as byte Arrays into an 
ArrowPayload, using
+   * RootAllocator from this class
+   */
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = p

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112375921
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,432 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * TODO: This is available in arrow-vector now with ARROW-615, to be 
included in 0.2.1 release
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  /**
+   * Iterate over the rows and convert to an ArrowPayload, using 
RootAllocator from this class
+   */
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, _allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  /**
+   * Read an Array of Arrow Record batches as byte Arrays into an 
ArrowPayload, using
+   * RootAllocator from this class
+   */
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = p

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r112375496
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,432 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * TODO: This is available in arrow-vector now with ARROW-615, to be 
included in 0.2.1 release
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  /**
+   * Iterate over the rows and convert to an ArrowPayload, using 
RootAllocator from this class
+   */
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, _allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  /**
+   * Read an Array of Arrow Record batches as byte Arrays into an 
ArrowPayload, using
+   * RootAllocator from this class
+   */
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = p

<    3   4   5   6   7   8   9   10   11   12   >