[GitHub] spark pull request #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(...

2017-10-31 Thread DonnyZone
Github user DonnyZone closed the pull request at:

https://github.com/apache/spark/pull/19606


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...

2017-10-30 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19568
  
@rdblue Could you find any case that can trigger the wrong-INPUT_ROW problem 
in SortMergeJoinExec after the fix 
(https://github.com/apache/spark/pull/18656) for the CollapseCodegenStages rule? I 
tried to find one before.


---




[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...

2017-10-30 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19568
  
This PR is similar to the initial commit from my attempt to fix SPARK-21441 in 
https://github.com/apache/spark/pull/18656 
(https://github.com/apache/spark/pull/18656/commits/92dc106aec59a0f2755d7621d2d038312500).

(1) The INPUT_ROW in SortMergeJoinExec's codegen context points to the 
wrong row. In most cases it still works, but this behavior can potentially 
cause wrong results or even an NPE in `bindReference`, so I think it is necessary to 
fix it.
(2) The `CollapseCodegenStages` rule before 2.1.1 had an issue that could 
trigger code generation even when the SortMergeJoinExec contains CodegenFallback 
expressions, provided it has an umbrella SparkPlan (e.g., ProjectExec). 
Consequently, that triggers the potential issue above. 


---




[GitHub] spark issue #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(CURRENT...

2017-10-30 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19606
  
cc @gatorsmile 


---




[GitHub] spark pull request #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(...

2017-10-29 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/19606

[SPARK-22333][SQL][Backport-2.2]timeFunctionCall(CURRENT_DATE, 
CURRENT_TIMESTAMP) has conflicts with columnReference

## What changes were proposed in this pull request?

This is a backport PR of https://github.com/apache/spark/pull/19559
for branch-2.2.

## How was this patch tested?
unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19606.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19606


commit 2bcc2ea6fd0ca9f12959246bb9ee6796cb7a90a0
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-10-30T03:08:36Z

2.2-backport




---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-29 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
Sure, I will submit it later.


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-27 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
Yes, the ordering in `Sort(ordering, global, child)` is resolved in 
`resolveExpression`.


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-27 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
OK, I will close this PR after review and submit a new one once 
https://github.com/apache/spark/pull/19585 is merged.


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-27 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
@gatorsmile 
It seems we should also support this logic in `resolveExpressions` for the 
Sort plan, e.g.:
`select a from t order by current_date`
Therefore, I think the current `resolveAsLiteralFunctions` can be moved out 
of the `ResolveReferences` rule into a common function, `resolveLiteralFunctions`.


---




[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...

2017-10-27 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19559#discussion_r147330565
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/datetime.sql ---
@@ -8,3 +8,18 @@ select to_date(null), to_date('2016-12-31'), to_date('2016-12-31', 'yyyy-MM-dd');
 select to_timestamp(null), to_timestamp('2016-12-31 00:12:00'), to_timestamp('2016-12-31', 'yyyy-MM-dd');
 
 select dayofweek('2007-02-03'), dayofweek('2009-07-30'), 
dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15');
+
+-- [SPARK-22333]: timeFunctionCall has conflicts with columnReference
+create temporary view ttf1 as select * from values
+  (1, 2),
+  (2, 3),
+  as ttf1(current_date, current_timestamp);
+  
+select current_date, current_timestamp from ttf1
+
+create temporary view ttf2 as select * from values
+  (1, 2),
+  (2, 3),
+  as ttf2(a, b);
+  
+select current_date = current_date(), current_timestamp = 
current_timestamp(), a, b from ttf2
--- End diff --

Fixed.


---




[GitHub] spark issue #19573: [SPARK-22350][SQL] select grouping__id from subquery

2017-10-26 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19573
  
Is it similar to the issue below?
https://github.com/apache/spark/pull/19178


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-26 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
@gatorsmile 
There are still two issues that need to be figured out.
(1) It will be complicated to determine whether a literal function should be 
resolved as an Expression or a NamedExpression.
The current fix just resolves them as NamedExpressions (i.e., Alias). 
However, this leads to a different schema in some cases, for example, the 
end-to-end test SQL:
`select current_date = current_date()`
The output schema will be:
`struct<(current_date() AS 'current_date()' = current_date()):boolean>`
(2) Shall we also support the feature in the ResolveMissingReference rule?
E.g., `select id from table order by current_date`
Duplicating the same logic in different rules brings redundant code.


---




[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...

2017-10-26 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19559#discussion_r147068164
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -783,6 +783,25 @@ class Analyzer(
   }
 }
 
+/**
+ * Literal functions do not require the user to specify braces when calling them.
+ * When an attribute is not resolvable, we try to resolve it as a literal function.
+ */
+private def resolveAsLiteralFunctions(nameParts: Seq[String]): Option[NamedExpression] = {
+  if (nameParts.length != 1) {
+    return None
+  }
+  // support CURRENT_DATE and CURRENT_TIMESTAMP
+  val literalFunctions = Seq(CurrentDate(), CurrentTimestamp())
+  val name = nameParts.head
+  val func = literalFunctions.find(e => resolver(e.prettyName, name))
+  if (func.isDefined) {
--- End diff --

Thanks, I will refactor it.


---




[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...

2017-10-25 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19559#discussion_r147041980
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -139,6 +139,7 @@ class Analyzer(
   ExtractGenerator ::
   ResolveGenerate ::
   ResolveFunctions ::
+  ResolveLiteralFunctions ::
--- End diff --

Agree! I will refactor it.


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...

2017-10-25 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
@gatorsmile Thanks for your advice, I will work on it.


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...

2017-10-23 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
@hvanhovell Yes! I got something wrong. The `timeFunctionCall` has 
conflicts with `columnReference`, and this fix would break every use of 
CURRENT_DATE/CURRENT_TIMESTAMP.

For [SPARK-16836](https://github.com/apache/spark/pull/14442), 
I think this feature should be implemented in the analysis phase rather than in 
the parser phase. When there are no such columns, they can be transformed into 
functions. Another approach is to define a configuration.
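The priority the thread argues for (resolve a bare identifier as a column reference first, and only fall back to a brace-less literal function when no column matches) can be sketched in plain Python. This is an illustration only; `resolve_name`, `columns`, and `literal_functions` are hypothetical stand-ins for the analyzer's internals, not Spark's actual API.

```python
def resolve_name(name, columns, literal_functions, case_sensitive=False):
    """Resolve a bare identifier: column references win over
    brace-less literal function calls such as current_date."""
    def matches(a, b):
        return a == b if case_sensitive else a.lower() == b.lower()

    for col in columns:            # columns get higher priority
        if matches(col, name):
            return ("column", col)
    for fn in literal_functions:   # only then fall back to literal functions
        if matches(fn, name):
            return ("function", fn)
    return None

# A table that really has a column named "current_date":
print(resolve_name("CURRENT_DATE", ["current_date"],
                   ["current_date", "current_timestamp"]))
# A table without such a column falls back to the function:
print(resolve_name("CURRENT_DATE", ["a", "b"],
                   ["current_date", "current_timestamp"]))
```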


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...

2017-10-23 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
ping @gatorsmile @hvanhovell @cloud-fan 


---




[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...

2017-10-23 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19559
  
ping @hvanhovell @gatorsmile 


---




[GitHub] spark pull request #19559: [SPARK-22333][SQL]ColumnReference should get high...

2017-10-23 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/19559

[SPARK-22333][SQL]ColumnReference should get higher priority than 
timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-22333

In the current version, users can call CURRENT_DATE and CURRENT_TIMESTAMP 
without specifying braces.
However, when a table has columns named "current_date" or 
"current_timestamp", they are still parsed as function calls.

There are many such cases in our production cluster, and we get wrong 
answers due to this inappropriate behavior. In general, ColumnReference should 
get higher priority than timeFunctionCall.

## How was this patch tested?
unit tests
manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19559.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19559


commit 60a5a56a77245f3467920c60cc39ae6cd4989572
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-10-23T11:40:31Z

spark-22333




---




[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...

2017-09-28 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19175
  
cc @hvanhovell @gatorsmile 


---




[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...

2017-09-25 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19175#discussion_r140741265
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends 
Rule[LogicalPlan] {
   a.copy(groupingExpressions = newGrouping)
   }
 }
+
+/**
+ * Splits [[Aggregate]] on [[Expand]], which has large number of 
projections,
+ * into various [[Aggregate]]s.
+ */
+object SplitAggregateWithExpand extends Rule[LogicalPlan] {
+  /**
+   * Split [[Expand]] operator to a number of [[Expand]] operators
+   */
+  private def splitExpand(expand: Expand): Seq[Expand] = {
+val len = expand.projections.length
+val allProjections = expand.projections
+Seq.tabulate(len)(
+  i => Expand(Seq(allProjections(i)), expand.output, expand.child)
--- End diff --

And I think we may figure out some cost-based methods.


---




[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...

2017-09-25 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19175#discussion_r140741053
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends 
Rule[LogicalPlan] {
   a.copy(groupingExpressions = newGrouping)
   }
 }
+
+/**
+ * Splits [[Aggregate]] on [[Expand]], which has large number of 
projections,
+ * into various [[Aggregate]]s.
+ */
+object SplitAggregateWithExpand extends Rule[LogicalPlan] {
+  /**
+   * Split [[Expand]] operator to a number of [[Expand]] operators
+   */
+  private def splitExpand(expand: Expand): Seq[Expand] = {
+val len = expand.projections.length
+val allProjections = expand.projections
+Seq.tabulate(len)(
+  i => Expand(Seq(allProjections(i)), expand.output, expand.child)
--- End diff --

Can we set another config param here?


---




[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...

2017-09-25 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19175#discussion_r140729331
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends 
Rule[LogicalPlan] {
   a.copy(groupingExpressions = newGrouping)
   }
 }
+
+/**
+ * Splits [[Aggregate]] on [[Expand]], which has large number of 
projections,
+ * into various [[Aggregate]]s.
+ */
+object SplitAggregateWithExpand extends Rule[LogicalPlan] {
+  /**
+   * Split [[Expand]] operator to a number of [[Expand]] operators
+   */
+  private def splitExpand(expand: Expand): Seq[Expand] = {
+val len = expand.projections.length
+val allProjections = expand.projections
+Seq.tabulate(len)(
+  i => Expand(Seq(allProjections(i)), expand.output, expand.child)
--- End diff --

Shall we keep Expand here for further optimization? I think we may put 
more than one projection together per child of the Union operator.

In our production environment, we found some cube queries may have as many as 
22 dimensions, which results in 2^22 = 4194304 projections. In such a case, it is 
not appropriate to transform it into "one-projection-one-child" for the Union node.
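The blow-up mentioned above is just the power set of the cube dimensions: a CUBE over n dimensions expands into 2^n grouping sets, one projection per subset. A quick sanity check in plain Python (nothing Spark-specific):

```python
from itertools import combinations

def cube_grouping_sets(dims):
    """All grouping sets of a CUBE over dims: every subset of the dimensions."""
    return [set(c) for r in range(len(dims) + 1)
            for c in combinations(dims, r)]

print(len(cube_grouping_sets(["a", "b", "c"])))  # 8 projections for 3 dims
print(2 ** 22)                                    # 4194304 projections for 22 dims
```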


---




[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...

2017-09-25 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19175
  
@cloud-fan Do you have time to review this PR? We found it is useful in 
high-dimensional cube cases.


---




[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...

2017-09-14 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19175
  
ping! @hvanhovell 
@cloud-fan @gatorsmile Do you have any ideas about this optimization? We 
found it is useful in some scenarios.


---




[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...

2017-09-12 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19175
  
Could you help to review this PR? @jiangxb1987 @hvanhovell 


---




[GitHub] spark issue #19202: [SPARK-21980][SQL]References in grouping functions shoul...

2017-09-12 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19202
  
ping @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #19202: [SPARK-21980][SQL]References in grouping function...

2017-09-12 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/19202

[SPARK-21980][SQL]References in grouping functions should be indexed with 
resolver

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-21980

This PR fixes an issue in the ResolveGroupingAnalytics rule, which indexes the 
column references in grouping functions without considering the case-sensitivity 
configuration.
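The bug class described here, indexing columns with exact string equality instead of the session's resolver, can be illustrated in plain Python. The function name `index_of` is hypothetical, not Spark's API; the point is only that the lookup must honor the case-sensitivity setting.

```python
def index_of(columns, name, case_sensitive):
    """Find a grouping column's index, honoring the case-sensitivity setting."""
    def eq(a, b):
        return a == b if case_sensitive else a.lower() == b.lower()
    for i, col in enumerate(columns):
        if eq(col, name):
            return i
    return -1

# With case-insensitive resolution, "A" matches column "a":
print(index_of(["a", "b"], "A", case_sensitive=False))  # 0
# Exact (list.index-style) matching would fail to find it:
print(index_of(["a", "b"], "A", case_sensitive=True))   # -1
```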

## How was this patch tested?
unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark ResolveGroupingAnalytics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19202


commit ac61a6620e59447c575092bee5d4d7f0af99695c
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-09-12T09:28:01Z

SPARK-21980

commit b08fd9301cdbd4c1a29d5eb322eacd1cf2ffc546
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-09-12T09:34:53Z

rename




---




[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...

2017-09-11 Thread DonnyZone
Github user DonnyZone closed the pull request at:

https://github.com/apache/spark/pull/19178


---




[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...

2017-09-11 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/19178
  
Thanks, I will give it a try in our private repository, as there are several 
such cases and the users want them handled in a seamless way. A general solution 
is really complicated.
Should I close this PR now?
@gatorsmile @maropu 


---




[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...

2017-09-10 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/19178#discussion_r137979887
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1115,6 +1115,8 @@ class Analyzer(
   g.copy(join = true, child = addMissingAttr(g.child, missing))
 case d: Distinct =>
   throw new AnalysisException(s"Can't add $missingAttrs to $d")
+case u: Union =>
+  u.withNewChildren(u.children.map(addMissingAttr(_, 
missingAttrs)))
--- End diff --

Yeah, I agree with you. The current implementation only checks UnaryNode, 
and it is necessary to take all node types into consideration.
Thanks for the suggestion, I will work on a general solution.


---




[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...

2017-09-10 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/19178

[SPARK-21966][SQL]ResolveMissingReference rule should not ignore Union

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-21966

The problem can be reproduced by the following example:

```scala
val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
val df3 = df1.cube("a").sum("b")
val df4 = df2.cube("a").sum("b")
val df5 = df3.union(df4).filter("grouping_id()=0").show()
```

The `org.apache.spark.sql.AnalysisException: cannot resolve 
'`spark_grouping_id`' given input columns`
is thrown because the ResolveMissingReference rule ignores the Union operator. 
This PR fixes the issue.

## How was this patch tested?
unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark ResolveMissingReference

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19178


commit 29ae28598c69bb4cb26ad66c86e6a73054e5fbef
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-09-10T09:26:36Z

SPARK-21966




---




[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...

2017-09-09 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/19175

[SPARK-21964][SQL]Enable splitting the Aggregate (on Expand) into a number 
of Aggregates for grouping analytics

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-21964

The current implementation of grouping analytics (i.e., cube, rollup and 
grouping sets) is "heavyweight" for many scenarios (e.g., high-dimensional 
cubes), as the Expand operator produces a large number of projections, resulting 
in vast shuffle writes. It may lead to low performance or even OOM issues in 
direct buffer memory.

This PR provides another choice, which enables splitting the heavyweight 
aggregate into a number of lightweight aggregates, one per group. Actually, it 
implements the grouping analytics as a Union and executes the aggregates one by 
one. This optimization runs counter to the general principle of "avoiding 
redundant data scans".

The current splitting strategy is simple: one aggregation per group. 
In the future, we may figure out more intelligent splitting strategies (e.g., 
cost-based methods).
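A toy model of the rewrite in plain Python: instead of one aggregate over all expanded projections, run one aggregate per grouping set and union the results. Everything below is illustrative; it mirrors the idea of the rule, not Spark's actual operators.

```python
def aggregate(rows, group_cols, value_col):
    """SUM(value_col) GROUP BY group_cols, over a list of dict rows."""
    out = {}
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        out[key] = out.get(key, 0) + row[value_col]
    return out

def grouping_sets_as_union(rows, grouping_sets, value_col):
    # One lightweight aggregate per grouping set, unioned together,
    # instead of one heavyweight aggregate over an Expand of all sets.
    return [(gs, aggregate(rows, gs, value_col)) for gs in grouping_sets]

rows = [{"a": 1, "b": "x", "v": 2},
        {"a": 1, "b": "y", "v": 3},
        {"a": 2, "b": "x", "v": 4}]
# CUBE(a) corresponds to the grouping sets [["a"], []]:
for gs, result in grouping_sets_as_union(rows, [["a"], []], "v"):
    print(gs, result)
```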

## How was this patch tested?

Unit tests
Manual tests in production environment


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark spark-21964

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19175


commit da36e37df9c31901975c29dfa77cb7d648e94f40
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-09-09T13:46:48Z

grouping with union




---




[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-20 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18986
  
@gatorsmile For this issue, I think the behavior in the PromoteStrings rule is 
reasonable, but there are problems in the underlying converter, UTF8String.

As described in PR-15880 (https://github.com/apache/spark/pull/15880):
> It's more reasonable to follow postgres, i.e. cast string to the type of 
the other side, but return null if the string is not castable to keep hive 
compatibility.

However, the underlying UTF8String still returns true for cases that are 
not castable. From the code below, we get res=true and wrapper.value=0. 
Consequently, it produces the wrong answer in SPARK-21774.
```scala
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.types.UTF8String.IntWrapper

val x = UTF8String.fromString("0.1")
val wrapper = new IntWrapper
val res = x.toInt(wrapper)
// res is true and wrapper.value is 0, although "0.1" is not a valid int
```
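The postgres-like semantics quoted above (cast, but return SQL NULL when the string is not castable) would behave like this Python sketch; `safe_cast_int` is a hypothetical helper for illustration, not Spark's API.

```python
def safe_cast_int(s):
    """Cast a string to int; return None (SQL NULL) if not castable.
    "0.1" is not a valid integer literal, so it must not silently become 0."""
    try:
        return int(s)
    except ValueError:
        return None

print(safe_cast_int("42"))   # 42
print(safe_cast_int("0.1"))  # None, rather than the wrong value 0
```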


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18986: [SPARK-21774][SQL] The rule PromoteStrings should...

2017-08-18 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18986#discussion_r133916434
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -127,8 +127,10 @@ object TypeCoercion {
 case (DateType, TimestampType) => Some(StringType)
 case (StringType, NullType) => Some(StringType)
 case (NullType, StringType) => Some(StringType)
+case (StringType, IntegerType) => Some(DoubleType)
--- End diff --

How about other types (e.g., LongType)? Does the same issue still exist there?
I think the logic in PR-15880 (https://github.com/apache/spark/pull/15880) 
is reasonable.


---



[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...

2017-08-17 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18960
  
Moreover, how about using `CastSupport.cast`? Shall I initialize a 
`DataSourceAnalysis` or a `DataSourceStrategy`?


---



[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...

2017-08-17 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18960
  
Test case updated.


---



[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-17 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18960#discussion_r133631784
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -68,4 +68,25 @@ class QueryPartitionSuite extends QueryTest with 
SQLTestUtils with TestHiveSingl
   sql("DROP TABLE IF EXISTS createAndInsertTest")
 }
   }
+
+  test("SPARK-21739: Cast expression should initialize timezoneId " +
--- End diff --

Oh, it should select the TimestampType column. Thanks for the reminder, I will 
fix it.


---



[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18960#discussion_r133618679
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -227,7 +228,8 @@ class HadoopTableReader(
   def fillPartitionKeys(rawPartValues: Array[String], row: 
InternalRow): Unit = {
 partitionKeyAttrs.foreach { case (attr, ordinal) =>
   val partOrdinal = partitionKeys.indexOf(attr)
-  row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), 
attr.dataType).eval(null)
+  row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), 
attr.dataType,
+Option(SQLConf.get.sessionLocalTimeZone)).eval(null)
--- End diff --

Do you mean a test case for HadoopTableReader? That is a little confusing.





[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18960#discussion_r133614757
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
 ---
@@ -104,7 +105,7 @@ case class HiveTableScanExec(
 hadoopConf)
 
   private def castFromString(value: String, dataType: DataType) = {
-Cast(Literal(value), dataType).eval(null)
+Cast(Literal(value), dataType, 
Option(SQLConf.get.sessionLocalTimeZone)).eval(null)
--- End diff --

Here, we can obtain SQLConf directly with `sparkSession.sessionState.conf`





[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18960#discussion_r133612759
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
 ---
@@ -104,7 +105,7 @@ case class HiveTableScanExec(
 hadoopConf)
 
   private def castFromString(value: String, dataType: DataType) = {
-Cast(Literal(value), dataType).eval(null)
+Cast(Literal(value), dataType, 
Option(SQLConf.get.sessionLocalTimeZone)).eval(null)
--- End diff --

BTW, is it elegant to initialize a `CastSupport` (`DataSourceAnalysis` rule 
or `DataSourceStrategy`) here, in which we still need to pass `SQLConf`?





[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18960#discussion_r133607798
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -227,7 +228,8 @@ class HadoopTableReader(
   def fillPartitionKeys(rawPartValues: Array[String], row: 
InternalRow): Unit = {
 partitionKeyAttrs.foreach { case (attr, ordinal) =>
   val partOrdinal = partitionKeys.indexOf(attr)
-  row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), 
attr.dataType).eval(null)
+  row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), 
attr.dataType,
+Option(SQLConf.get.sessionLocalTimeZone)).eval(null)
--- End diff --

OK, I will work on it





[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18960
  
@cloud-fan @gatorsmile 





[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...

2017-08-16 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18960
  
@cloud-fan @gatorsmile 





[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...

2017-08-16 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/18960

[SPARK-21739][SQL]Cast expression should initialize timezoneId when it is 
called statically to convert something into TimestampType

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739

This issue is caused by the introduction of TimeZoneAwareExpression.
When the **Cast** expression converts something into TimestampType, it should 
be resolved with its `timezoneId` set. In general, this happens during the 
LogicalPlan analysis phase.

However, there are still some places that use the Cast expression statically to 
convert datatypes without setting `timezoneId`. In such cases, 
`NoSuchElementException: None.get` will be thrown for TimestampType.

This PR is proposed to fix the issue. We have checked the whole project and 
found two such usages (i.e., in `TableReader` and `HiveTableScanExec`). 
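For context, the failure mode can be illustrated with a minimal self-contained sketch. These classes are stand-ins, not Spark's actual `Cast`/`TimeZoneAwareExpression` API: the point is only that a timezone stored as an `Option` that was never filled in surfaces as `NoSuchElementException: None.get` at evaluation time.

```scala
// Toy stand-in for a timezone-aware expression: the timezone is an Option
// that the analyzer normally fills in during resolution; evaluating a
// timestamp conversion without it dereferences None.
object CastTimezoneSketch {
  final case class Cast(value: String, timeZoneId: Option[String] = None) {
    // Resolution step that sets the timezone (what the analyzer would do).
    def withTimeZone(tz: String): Cast = copy(timeZoneId = Some(tz))
    // None.get throws NoSuchElementException("None.get") if never resolved.
    def eval(): String = s"$value @ ${timeZoneId.get}"
  }

  def tryEval(c: Cast): Either[String, String] =
    try Right(c.eval())
    catch { case e: NoSuchElementException => Left(e.getMessage) }
}
```

In this analogy, a static usage like `Cast(Literal(v), dataType).eval(null)` is the `None` case, and the fix in this PR corresponds to constructing with the session timezone, i.e. `Option(SQLConf.get.sessionLocalTimeZone)`.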

## How was this patch tested?

unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark spark-21739

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18960


commit 2537c1ea5028541a68ae63e8f5eea8eb8f62dadf
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-16T07:53:19Z

spark-21739

commit 86331e37550f4204c8c4ee2c409fbd6000654f43
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-16T07:55:17Z

fix code style

commit 492b756fde5008854d1351ed423c3897c683c662
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-16T07:56:34Z

correct answer







[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-15 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18946
  
Is Jenkins unstable?





[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-14 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18946
  
@gatorsmile 





[GitHub] spark pull request #18946: [SPARK-19471][SQL]AggregationIterator does not in...

2017-08-14 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/18946

[SPARK-19471][SQL]AggregationIterator does not initialize the generated 
result projection before using it

## What changes were proposed in this pull request?

This is a follow-up PR that moves the test case in PR-18920 
(https://github.com/apache/spark/pull/18920) to DataFrameAggregateSuite.

## How was this patch tested?
unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark branch-19471-followingPR

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18946.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18946


commit 9434fc767e6b7907b9beacb2f4358d767c9d4d32
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-15T02:29:25Z

move test case to DataFrameAggregateSuite

commit f057ff8400076fce615fe9b6521ed1b3d66cb669
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-15T02:37:33Z

change import order







[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-14 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
Sure, I will do it later.





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-14 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
retest please





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-14 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
Updated, thanks for reviewing.





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-13 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
Jenkins, retest this please.





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-13 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
updated





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-13 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
Jenkins, retest this please.





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-11 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
@hvanhovell, @yangw1234, @gatorsmile 





[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-11 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18920
  
Jenkins, test this please





[GitHub] spark pull request #18920: [SPARK-19471][SQL]AggregationIterator does not in...

2017-08-11 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/18920

[SPARK-19471][SQL]AggregationIterator does not initialize the generated 
result projection before using it

## What changes were proposed in this pull request?

Recently, we have also encountered such NPE issues in our production 
environment as described in:
https://issues.apache.org/jira/browse/SPARK-19471

This issue can be reproduced by the following examples:
```
val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")

//HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), sum("y")).show()

//ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()

//SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false &_OBJECT_HASH_AGG.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
```

This PR is based on PR-16820 (https://github.com/apache/spark/pull/16820) 
with test cases for all aggregation paths. We want to push it forward. 

> When AggregationIterator generates result projection, it does not call 
the initialize method of the Projection class. This will cause a runtime 
NullPointerException when the projection involves nondeterministic expressions.
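The quoted failure mode can be sketched with toy stand-ins (not Spark's actual `Projection`/`Nondeterministic` API): a nondeterministic expression keeps null internal state until `initialize` is called, so a result projection that is never initialized throws an NPE on its first evaluation.

```scala
object ProjectionInitSketch {
  // Stand-in for a nondeterministic expression such as rand(): its RNG is
  // created only by initialize(partitionIndex), mirroring the contract that
  // a generated projection must be initialized before use.
  final class RandExpr {
    private var rng: java.util.Random = _ // null until initialize() is called
    def initialize(partitionIndex: Int): Unit =
      rng = new java.util.Random(partitionIndex)
    def eval(): Double = rng.nextDouble() // NPE if initialize() was skipped
  }

  def evalWith(init: Boolean): Either[String, Double] = {
    val expr = new RandExpr
    if (init) expr.initialize(0) // the missing call the PR adds, by analogy
    try Right(expr.eval())
    catch { case _: NullPointerException => Left("projection was not initialized") }
  }
}
```

This is why the reproductions above only involve nondeterministic expressions like `rand()`: deterministic projections have no partition-dependent state, so the missing `initialize` call goes unnoticed.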

## How was this patch tested?

unit test
verified in production environment


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark Branch-spark-19471

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18920.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18920


commit b932d2f3a6741a8ef052cbd8087f4b0836c617d6
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-11T13:00:00Z

spark-19471







[GitHub] spark issue #18919: [SPARK-19471][SQL]AggregationIterator does not initializ...

2017-08-11 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18919
  
There are some conflicts, closing it first





[GitHub] spark pull request #18919: [SPARK-19471][SQL]AggregationIterator does not in...

2017-08-11 Thread DonnyZone
Github user DonnyZone closed the pull request at:

https://github.com/apache/spark/pull/18919





[GitHub] spark pull request #18919: [SPARK-19471][SQL]AggregationIterator does not in...

2017-08-11 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/18919

[SPARK-19471][SQL]AggregationIterator does not initialize the generated 
result projection before using it

## What changes were proposed in this pull request?

Recently, we have also encountered such NPE issues in our production 
environment, as described in: 
https://issues.apache.org/jira/browse/SPARK-19471

This issue can be reproduced by the following examples:

```
val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")

//HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), sum("y")).show()

//ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()

//SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false &_OBJECT_HASH_AGG.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
```

This PR is based on PR-16820 (https://github.com/apache/spark/pull/16820) 
with test cases for all aggregation paths.

> When AggregationIterator generates result projection, it does not call 
the initialize method of the Projection class. This will cause a runtime 
NullPointerException when the projection involves nondeterministic expressions.

## How was this patch tested?
unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark Branch-SPARK-19471

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18919


commit 9a22b71f428f4e78978a8f87445949080edfc717
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-11T10:03:12Z

  SPARK-19471 with test cases

commit 8f813cfac545a9ee77802fc691f1a16167f6cf3d
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-08-11T11:07:31Z

error fix







[GitHub] spark issue #18656: [SPARK-21441][SQL]Incorrect Codegen in SortMergeJoinExec...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
Thanks for reviewing, I will add a test later.





[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18656#discussion_r128142467
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -489,13 +489,13 @@ case class CollapseCodegenStages(conf: SQLConf) 
extends Rule[SparkPlan] {
* Inserts an InputAdapter on top of those that do not support codegen.
*/
   private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
+case p if !supportCodegen(p) =>
+  // collapse them recursively
+  InputAdapter(insertWholeStageCodegen(p))
 case j @ SortMergeJoinExec(_, _, _, _, left, right) if 
j.supportCodegen =>
--- End diff --

Therefore, I think we should still verify it.





[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on a diff in the pull request:

https://github.com/apache/spark/pull/18656#discussion_r128142370
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -489,13 +489,13 @@ case class CollapseCodegenStages(conf: SQLConf) 
extends Rule[SparkPlan] {
* Inserts an InputAdapter on top of those that do not support codegen.
*/
   private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
+case p if !supportCodegen(p) =>
+  // collapse them recursively
+  InputAdapter(insertWholeStageCodegen(p))
 case j @ SortMergeJoinExec(_, _, _, _, left, right) if 
j.supportCodegen =>
--- End diff --

SortMergeJoinExec.supportCodegen checks whether 
`joinType.isInstanceOf[InnerLike]`





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
I have validated both cases with and without CodegenFallback expressions 
for `SortMergeJoinExec`.
The fix works well.





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
Great! I'm also considering disabling codegen for `SortMergeJoinExec` with 
CodegenFallback expressions. 
Thanks for your advice. I will work on it and validate it in our environment.

Moreover, I just wonder whether the current pattern order in 
`insertInputAdapter` is specifically designed to generate code for 
`SortMergeJoinExec` in all cases.

Could you give any ideas? @davies





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
I notice that the CollapseCodegenStages rule will still enable codegen for 
SortMergeJoinExec without checking CodegenFallback expressions.

The logic in `insertInputAdapter` seems to skip validating 
SortMergeJoinExec. 

Actually, I'm not familiar with this part, so please correct me if I get 
something wrong.

```
private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
case j @ SortMergeJoinExec(_, _, _, _, left, right) if j.supportCodegen 
=>
  // The children of SortMergeJoin should do codegen separately.
  j.copy(left = InputAdapter(insertWholeStageCodegen(left)),
right = InputAdapter(insertWholeStageCodegen(right)))
  ..
```

 





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
That's interesting, I will take a look at why the codegen is enabled





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
Yeah, CodegenFallback just provides a fallback mode. 
However, in such cases, SortMergeJoinExec passes an incomplete row as input to 
a Hive UDF that implements CodegenFallback.





[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...

2017-07-18 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18656
  
Hi, @cloud-fan, @vanzin , could you help to take a look?





[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...

2017-07-17 Thread DonnyZone
GitHub user DonnyZone opened a pull request:

https://github.com/apache/spark/pull/18656

[SPARK-21441] Incorrect codegen in SortMergeJoinExec results in failures in 
some cases

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441

SortMergeJoinExec sets `ctx.INPUT_ROW = rightRow` in the createRightVar() 
function, but generates BoundReferences with the schema of (left.output ++ 
right.output).
 
However, expressions implementing CodegenFallback (e.g., SimpleHiveUDF) 
take ctx.INPUT_ROW as their input.

```
protected void processNext() throws java.io.IOException {
  while (findNextInnerJoinRows(smj_leftInput, smj_rightInput)) {
int smj_size = smj_matches.size();
..
for (int smj_i = 0; smj_i < smj_size; smj_i ++) {
  ..
  Object smj_obj = ((Expression) references[1]).eval(smj_rightRow1);
..
```

In such cases, a NegativeArraySizeException will be thrown.
This patch fixes the issue by passing the joined row for condition 
evaluation.
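The fix can be sketched as follows (illustrative names only, not Spark's actual API — Spark's real `JoinedRow` lives in `org.apache.spark.sql.catalyst.expressions`): a joined-row view concatenates the two input rows so that ordinals over (left.output ++ right.output) resolve correctly.

```java
import java.util.Arrays;
import java.util.List;

public class JoinedRowSketch {
    // A JoinedRow-style view over two rows: ordinals below left.size()
    // resolve to the left row, the rest to the right row.
    static class JoinedRowView {
        final List<?> left, right;

        JoinedRowView(List<?> l, List<?> r) {
            left = l;
            right = r;
        }

        Object get(int ordinal) {
            return ordinal < left.size()
                ? left.get(ordinal)
                : right.get(ordinal - left.size());
        }
    }

    public static void main(String[] args) {
        JoinedRowView row = new JoinedRowView(Arrays.asList(1, 2), Arrays.asList(3));
        System.out.println(row.get(0)); // prints 1 (from the left row)
        System.out.println(row.get(2)); // prints 3 (from the right row)
    }
}
```

Evaluating the fallback expression against such a combined view is what the generated `smj_joinedRow` below achieves.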

## How was this patch tested?

Manual verification in a cluster.

We also checked the generated code snippet after the change:

```
protected void processNext() throws java.io.IOException {
  while (findNextInnerJoinRows(smj_leftInput, smj_rightInput)) {
int smj_size = smj_matches.size();
..
for (int smj_i = 0; smj_i < smj_size; smj_i ++) {
  ..
  InternalRow smj_joinedRow = new JoinedRow(smj_leftRow, smj_rightRow1);
  ..
  Object smj_obj = ((Expression) references[1]).eval(smj_joinedRow);
..
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DonnyZone/spark Fix_SortMergeJoinExec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18656


commit 92dc106aec59a0f2755d7621d2d038312500
Author: donnyzone <wellfeng...@gmail.com>
Date:   2017-07-17T13:06:18Z

Passing the joined row for condition evaluation




---