[GitHub] spark pull request #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(...
Github user DonnyZone closed the pull request at: https://github.com/apache/spark/pull/19606 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19568 @rdblue Could you find any cases that can trigger the problem of the wrong INPUT_ROW in SortMergeJoinExec after the fix (https://github.com/apache/spark/pull/18656) for the CollapseCodegenStages rule? I tried to find one before.
[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19568 This PR is similar to the initial commit from when I tried to fix SPARK-21441 in https://github.com/apache/spark/pull/18656 (https://github.com/apache/spark/pull/18656/commits/92dc106aec59a0f2755d7621d2d038312500). (1) The INPUT_ROW in SortMergeJoinExec's codegen context points to the wrong row. In general it works well, but this behavior can potentially cause wrong results or even an NPE in `bindReference`, so I think it is necessary to fix it. (2) The `CollapseCodegenStages` rule before 2.1.1 has an issue that may lead to code generation even when the SortMergeJoinExec contains CodegenFallback expressions, if it has an umbrella SparkPlan (e.g., ProjectExec). Consequently, it triggers the potential issue above.
[GitHub] spark issue #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19606 cc @gatorsmile
[GitHub] spark pull request #19606: [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/19606 [SPARK-22333][SQL][Backport-2.2]timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference ## What changes were proposed in this pull request? This is a backport PR of https://github.com/apache/spark/pull/19559 for branch-2.2. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19606.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19606 commit 2bcc2ea6fd0ca9f12959246bb9ee6796cb7a90a0 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-10-30T03:08:36Z 2.2-backport
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 Sure, I will submit it later.
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 Yes, ordering in `Sort(ordering, global, child)` is resolved in `resolveExpression`
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 OK, I will close this PR after review and submit a new one once https://github.com/apache/spark/pull/19585 is merged.
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 @gatorsmile It seems that we should also support this logic in `resolveExpressions` for the Sort plan: `select a from t order by current_date`. Therefore, I think the current `resolveAsLiteralFunctions` can be moved out of the `ResolveReference` rule into a common function `resolveLiteralFunctions`.
[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19559#discussion_r147330565 --- Diff: sql/core/src/test/resources/sql-tests/inputs/datetime.sql --- @@ -8,3 +8,18 @@ select to_date(null), to_date('2016-12-31'), to_date('2016-12-31', 'yyyy-MM-dd') select to_timestamp(null), to_timestamp('2016-12-31 00:12:00'), to_timestamp('2016-12-31', 'yyyy-MM-dd'); select dayofweek('2007-02-03'), dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15'); + +-- [SPARK-22333]: timeFunctionCall has conflicts with columnReference +create temporary view ttf1 as select * from values + (1, 2), + (2, 3), + as ttf1(current_date, current_timestamp); + +select current_date, current_timestamp from ttf1 + +create temporary view ttf2 as select * from values + (1, 2), + (2, 3), + as ttf2(a, b); + +select current_date = current_date(), current_timestamp = current_timestamp(), a, b from ttf2 --- End diff -- Fixed.
[GitHub] spark issue #19573: [SPARK-22350][SQL] select grouping__id from subquery
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19573 Is it similar to the issue below? https://github.com/apache/spark/pull/19178
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 @gatorsmile There are still two issues that need to be figured out. (1) It will be complicated to determine whether a literal function should be resolved as an Expression or a NamedExpression. The current fix just resolves them as NamedExpressions (i.e., Alias). However, this leads to a different schema in some cases, for example, the end-to-end test SQL `select current_date = current_date()`. The output schema will be struct<(current_date() AS `current_date()` = current_date()):boolean>. (2) Shall we also support the feature in the ResolveMissingReference rule? e.g., `select id from table order by current_date`. The same logic in different rules brings redundant code.
[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19559#discussion_r147068164 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -783,6 +783,25 @@ class Analyzer( } } +/** + * Literal functions do not require the user to specify braces when calling them + * When an attributes is not resolvable, we try to resolve it as a literal function. + */ +private def resolveAsLiteralFunctions(nameParts: Seq[String]): Option[NamedExpression] = { + if (nameParts.length != 1) { +return None + } + // support CURRENT_DATE and CURRENT_TIMESTAMP + val literalFunctions = Seq(CurrentDate(), CurrentTimestamp()) + val name = nameParts.head + val func = literalFunctions.find(e => resolver(e.prettyName, name)) + if (func.isDefined) { --- End diff -- Thanks, I will refactor it.
[GitHub] spark pull request #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19559#discussion_r147041980 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -139,6 +139,7 @@ class Analyzer( ExtractGenerator :: ResolveGenerate :: ResolveFunctions :: + ResolveLiteralFunctions :: --- End diff -- Agree! I will refactor it.
[GitHub] spark issue #19559: [SPARK-22333][SQL]timeFunctionCall(CURRENT_DATE, CURRENT...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 @gatorsmile Thanks for your advice, I will work on it.
[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 @hvanhovell Yes! I got something wrong. The `timeFunctionCall` has conflicts with `columnReference`, but this fix will break every use of CURRENT_DATE/CURRENT_TIMESTAMP. For [SPARK-16836](https://github.com/apache/spark/pull/14442), I think this feature should be implemented in the analysis phase rather than in the parser phase: when there are no such columns, they can be transformed into functions. Another approach is to define a configuration.
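The analysis-phase idea above (resolve a bare identifier as a column first, and only fall back to a literal function when no such column exists) can be sketched as a toy resolver. This is a minimal pure-Python illustration of the resolution order being proposed, not Spark's actual Analyzer code; the names `resolve` and `LITERAL_FUNCTIONS` are hypothetical.

```python
# Toy sketch of analysis-phase resolution: column reference wins over the
# literal functions CURRENT_DATE / CURRENT_TIMESTAMP (hypothetical helper,
# not Spark code).
import datetime

LITERAL_FUNCTIONS = {
    "current_date": lambda: datetime.date.today(),
    "current_timestamp": lambda: datetime.datetime.now(),
}

def resolve(name, schema):
    """Return ('column', name) if the table has the column, else try a literal function."""
    key = name.lower()
    if key in schema:                    # a real column takes priority
        return ("column", key)
    if key in LITERAL_FUNCTIONS:         # otherwise fall back to the function
        return ("function", LITERAL_FUNCTIONS[key]())
    raise ValueError(f"cannot resolve '{name}'")

# A table that happens to have a column named current_date:
print(resolve("CURRENT_DATE", schema={"current_date", "x"}))  # ('column', 'current_date')
# A table without such a column: resolved as a function call instead.
kind, _ = resolve("current_date", schema={"a", "b"})
print(kind)  # function
```

With this ordering, existing uses of CURRENT_DATE without braces keep working, while tables that define a `current_date` column no longer get a function call silently substituted for the column.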
[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 ping @gatorsmile @hvanhovell @cloud-fan
[GitHub] spark issue #19559: [SPARK-22333][SQL]ColumnReference should get higher prio...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19559 ping @hvanhovell @gatorsmile
[GitHub] spark pull request #19559: [SPARK-22333][SQL]ColumnReference should get high...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/19559 [SPARK-22333][SQL]ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-22333 In the current version, users can use CURRENT_DATE() and CURRENT_TIMESTAMP without specifying braces. However, when a table has columns named "current_date" or "current_timestamp", they will still be parsed as function calls. There are many such cases in our production cluster, and we get wrong answers due to this inappropriate behavior. In general, ColumnReference should get higher priority than timeFunctionCall. ## How was this patch tested? unit test manual test You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19559.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19559 commit 60a5a56a77245f3467920c60cc39ae6cd4989572 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-10-23T11:40:31Z spark-22333
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19175 cc @hvanhovell @gatorsmile
[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19175#discussion_r140741265 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] { a.copy(groupingExpressions = newGrouping) } } + +/** + * Splits [[Aggregate]] on [[Expand]], which has large number of projections, + * into various [[Aggregate]]s. + */ +object SplitAggregateWithExpand extends Rule[LogicalPlan] { + /** + * Split [[Expand]] operator to a number of [[Expand]] operators + */ + private def splitExpand(expand: Expand): Seq[Expand] = { +val len = expand.projections.length +val allProjections = expand.projections +Seq.tabulate(len)( + i => Expand(Seq(allProjections(i)), expand.output, expand.child) --- End diff -- And I think we may figure out some cost-based methods.
[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19175#discussion_r140741053 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] { a.copy(groupingExpressions = newGrouping) } } + +/** + * Splits [[Aggregate]] on [[Expand]], which has large number of projections, + * into various [[Aggregate]]s. + */ +object SplitAggregateWithExpand extends Rule[LogicalPlan] { + /** + * Split [[Expand]] operator to a number of [[Expand]] operators + */ + private def splitExpand(expand: Expand): Seq[Expand] = { +val len = expand.projections.length +val allProjections = expand.projections +Seq.tabulate(len)( + i => Expand(Seq(allProjections(i)), expand.output, expand.child) --- End diff -- Can we set another config param here?
[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19175#discussion_r140729331 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1287,3 +1288,33 @@ object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] { a.copy(groupingExpressions = newGrouping) } } + +/** + * Splits [[Aggregate]] on [[Expand]], which has large number of projections, + * into various [[Aggregate]]s. + */ +object SplitAggregateWithExpand extends Rule[LogicalPlan] { + /** + * Split [[Expand]] operator to a number of [[Expand]] operators + */ + private def splitExpand(expand: Expand): Seq[Expand] = { +val len = expand.projections.length +val allProjections = expand.projections +Seq.tabulate(len)( + i => Expand(Seq(allProjections(i)), expand.output, expand.child) --- End diff -- Shall we keep Expand here for further optimization? I think we may put more than one projection together for the Union operator. In our production, we found some cube queries may have as many as 22 dimensions, which results in 2^22 = 4194304 projections. In such cases, it is not appropriate to transform it into "one-projection-one-child" for the Union node.
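The projection blow-up mentioned above follows from how a cube expands: one projection per subset of the grouping columns, i.e. 2^d for d dimensions. A small pure-Python sketch (illustrative only, not Spark's Expand implementation; `cube_grouping_sets` is a hypothetical helper) makes the count concrete:

```python
# A cube over d grouping columns produces one projection per subset of the
# columns, i.e. 2**d projections; 22 dimensions therefore yield 4194304.
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the grouping columns: the projections an Expand would emit."""
    return [set(c) for r in range(len(columns) + 1)
                   for c in combinations(columns, r)]

print(len(cube_grouping_sets(["a", "b", "c"])))  # 8
print(2 ** 22)  # 4194304 projections for a 22-dimension cube
```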
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19175 @cloud-fan Do you have time to review this PR? We found it is useful in high dimensional cube cases.
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19175 ping! @hvanhovell @cloud-fan @gatorsmile Do you have any ideas about this optimization? We found it is useful in some scenarios.
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19175 Could you help to review this PR? @jiangxb1987 @hvanhovell
[GitHub] spark issue #19202: [SPARK-21980][SQL]References in grouping functions shoul...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19202 ping @cloud-fan @gatorsmile
[GitHub] spark pull request #19202: [SPARK-21980][SQL]References in grouping function...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/19202 [SPARK-21980][SQL]References in grouping functions should be indexed with resolver ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-21980 This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark ResolveGroupingAnalytics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19202 commit ac61a6620e59447c575092bee5d4d7f0af99695c Author: donnyzone <wellfeng...@gmail.com> Date: 2017-09-12T09:28:01Z SPARK-21980 commit b08fd9301cdbd4c1a29d5eb322eacd1cf2ffc546 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-09-12T09:34:53Z rename
[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...
Github user DonnyZone closed the pull request at: https://github.com/apache/spark/pull/19178
[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19178 Thanks, I will give it a try in our private repository, as there are several such cases and the users want them handled in a seamless way. A general solution is really complicated. Should I close this PR now? @gatorsmile @maropu
[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19178#discussion_r137979887 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1115,6 +1115,8 @@ class Analyzer( g.copy(join = true, child = addMissingAttr(g.child, missing)) case d: Distinct => throw new AnalysisException(s"Can't add $missingAttrs to $d") +case u: Union => + u.withNewChildren(u.children.map(addMissingAttr(_, missingAttrs))) --- End diff -- Yeah, I agree with you. The current implementation only checks UnaryNode; it is necessary to take all node types into consideration. Thanks for the suggestion, I will work on a general solution.
[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/19178 [SPARK-21966][SQL]ResolveMissingReference rule should not ignore Union ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-21966 The problem can be reproduced by the following example. `val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b") val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b") val df3 = df1.cube("a").sum("b") val df4 = df2.cube("a").sum("b") val df5 = df3.union(df4).filter("grouping_id()=0").show()` The `org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns` is thrown because the ResolveMissingReference rule ignores the Union operator. This PR fixes the issue. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark ResolveMissingReference Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19178 commit 29ae28598c69bb4cb26ad66c86e6a73054e5fbef Author: donnyzone <wellfeng...@gmail.com> Date: 2017-09-10T09:26:36Z SPARK-21966
[GitHub] spark pull request #19175: [SPARK-21964][SQL]Enable splitting the Aggregate ...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/19175 [SPARK-21964][SQL]Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-21964 The current implementation of grouping analytics (i.e., cube, rollup and grouping sets) is "heavyweight" for many scenarios (e.g., high-dimension cubes), as the Expand operator produces a large number of projections, resulting in vast shuffle writes. It may lead to low performance or even OOM issues for direct buffer memory. This PR provides another choice, which enables splitting the heavyweight aggregate into a number of lightweight aggregates, one for each group. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. This optimization is opposite to the general sense of "avoiding redundant data scans". The current splitting strategy is simple: one aggregation per group. In the future, we may figure out more intelligent splitting strategies (e.g., cost-based methods). ## How was this patch tested? Unit tests Manual tests in production environment You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark spark-21964 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19175 commit da36e37df9c31901975c29dfa77cb7d648e94f40 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-09-09T13:46:48Z grouping with union
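The rewrite described above (one lightweight aggregate per grouping set, unioned together, instead of one Aggregate over a huge Expand) can be sketched in plain Python. This is an assumed illustration of the semantics, not Spark code; `aggregate` and `cube_as_union` are hypothetical names, and `None` stands in for a rolled-up (null) grouping column.

```python
# Pure-Python sketch of cube-as-union-of-aggregates: run one aggregate per
# grouping set and union the results (illustrative semantics, not Spark code).
from itertools import combinations
from collections import defaultdict

rows = [{"a": 1, "b": "x", "m": 10},
        {"a": 1, "b": "y", "m": 5},
        {"a": 2, "b": "x", "m": 7}]

def aggregate(rows, group_cols):
    """sum(m) grouped by group_cols; None marks a rolled-up column."""
    acc = defaultdict(int)
    for r in rows:
        key = tuple(r[c] if c in group_cols else None for c in ("a", "b"))
        acc[key] += r["m"]
    return dict(acc)

def cube_as_union(rows, cols=("a", "b")):
    out = {}
    for r in range(len(cols) + 1):
        for gset in combinations(cols, r):   # one lightweight Aggregate per grouping set
            out.update(aggregate(rows, set(gset)))
    return out

result = cube_as_union(rows)
print(result[(None, None)])  # grand total: 22
print(result[(1, None)])     # sum(m) for a=1: 15
```

Each inner `aggregate` call touches the data once, so the trade-off is exactly the one the PR names: redundant scans in exchange for never materializing all 2^d projections in a single Expand.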
[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18986 @gatorsmile For this issue, I think the behavior in the PromoteStrings rule is reasonable, but there are problems in the underlying converter UTF8String. As described in PR-15880 (https://github.com/apache/spark/pull/15880): > It's more reasonable to follow postgres, i.e. cast string to the type of the other side, but return null if the string is not castable to keep hive compatibility. However, the underlying UTF8String still returns true for cases that are not castable. From the code below, we get res = true and wrapper.value = 0. Consequently, it results in a wrong answer in SPARK-21774.
```
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.types.UTF8String.IntWrapper

val x = UTF8String.fromString("0.1")
val wrapper = new IntWrapper
// toInt reports success (true) and leaves 0 in wrapper even though "0.1"
// is not a valid integer string
val res = x.toInt(wrapper)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
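The Postgres-style semantics the comment argues for (cast the string, but yield NULL when it is not castable, never a bogus 0) can be sketched in a few lines of Python. This is a hypothetical illustration of the intended behavior, not the Spark or UTF8String implementation; `cast_string_to_int` is an assumed name.

```python
# Sketch of "return null if the string is not castable" integer-cast semantics
# (assumed behavior, not Spark code): "0.1" must become None, not 0.
def cast_string_to_int(s):
    try:
        return int(s)          # only whole-number strings are castable
    except (TypeError, ValueError):
        return None            # not castable -> NULL, never a silent 0

print(cast_string_to_int("42"))   # 42
print(cast_string_to_int("0.1"))  # None (where UTF8String.toInt wrongly reported success with 0)
```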
[GitHub] spark pull request #18986: [SPARK-21774][SQL] The rule PromoteStrings should...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18986#discussion_r133916434 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -127,8 +127,10 @@ object TypeCoercion { case (DateType, TimestampType) => Some(StringType) case (StringType, NullType) => Some(StringType) case (NullType, StringType) => Some(StringType) +case (StringType, IntegerType) => Some(DoubleType) --- End diff -- How about other types (e.g., LongType)? Does a similar issue still exist? I think the logic in PR-15880 (https://github.com/apache/spark/pull/15880) is reasonable.
[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18960 Moreover, how about using `CastSupport.cast`? Shall I initialize a `DataSourceAnalysis` or `DataSourceStrategy`?
[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18960 Test case updated.
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18960#discussion_r133631784 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala --- @@ -68,4 +68,25 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl sql("DROP TABLE IF EXISTS createAndInsertTest") } } + + test("SPARK-21739: Cast expression should initialize timezoneId " + --- End diff -- Oh, it should select the TimestampType column. Thanks for the reminder, I will fix it.
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18960#discussion_r133618679 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -227,7 +228,8 @@ class HadoopTableReader( def fillPartitionKeys(rawPartValues: Array[String], row: InternalRow): Unit = { partitionKeyAttrs.foreach { case (attr, ordinal) => val partOrdinal = partitionKeys.indexOf(attr) - row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), attr.dataType).eval(null) + row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), attr.dataType, +Option(SQLConf.get.sessionLocalTimeZone)).eval(null) --- End diff -- Do you mean a test case for HadoopTableReader? A little confusing.
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18960#discussion_r133614757 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala --- @@ -104,7 +105,7 @@ case class HiveTableScanExec( hadoopConf) private def castFromString(value: String, dataType: DataType) = { -Cast(Literal(value), dataType).eval(null) +Cast(Literal(value), dataType, Option(SQLConf.get.sessionLocalTimeZone)).eval(null) --- End diff -- Here, we can obtain SQLConf directly with `sparkSession.sessionState.conf`
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18960#discussion_r133612759 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala --- @@ -104,7 +105,7 @@ case class HiveTableScanExec( hadoopConf) private def castFromString(value: String, dataType: DataType) = { -Cast(Literal(value), dataType).eval(null) +Cast(Literal(value), dataType, Option(SQLConf.get.sessionLocalTimeZone)).eval(null) --- End diff -- BTW, would it be elegant to initialize a `CastSupport` here (as in the `DataSourceAnalysis` rule or `DataSourceStrategy`), where we would still need to pass `SQLConf`?
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18960#discussion_r133607798 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -227,7 +228,8 @@ class HadoopTableReader( def fillPartitionKeys(rawPartValues: Array[String], row: InternalRow): Unit = { partitionKeyAttrs.foreach { case (attr, ordinal) => val partOrdinal = partitionKeys.indexOf(attr) - row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), attr.dataType).eval(null) + row(ordinal) = Cast(Literal(rawPartValues(partOrdinal)), attr.dataType, +Option(SQLConf.get.sessionLocalTimeZone)).eval(null) --- End diff -- OK, I will work on it.
[GitHub] spark issue #18960: [SPARK-21739][SQL]Cast expression should initialize time...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18960 @cloud-fan @gatorsmile
[GitHub] spark pull request #18960: [SPARK-21739][SQL]Cast expression should initiali...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18960 [SPARK-21739][SQL]Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739 This issue was caused by the introduction of TimeZoneAwareExpression. When the **Cast** expression converts a value to TimestampType, it must be resolved with its `timezoneId` set. In general, this happens during logical plan analysis. However, there are still some places that use the Cast expression statically to convert data types without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` is thrown for TimestampType. This PR fixes the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`). ## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark spark-21739 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18960.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18960 commit 2537c1ea5028541a68ae63e8f5eea8eb8f62dadf Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-16T07:53:19Z spark-21739 commit 86331e37550f4204c8c4ee2c409fbd6000654f43 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-16T07:55:17Z fix code style commit 492b756fde5008854d1351ed423c3897c683c662 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-16T07:56:34Z correct answer
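The failure mode described above can be sketched without Spark. Below is a minimal, self-contained illustration; `TzCast` and its fields are hypothetical stand-ins, not Spark's actual `Cast` or `TimeZoneAwareExpression`. The point is only the mechanism: `eval()` assumes an `Option` was resolved earlier, so a statically constructed instance throws `NoSuchElementException: None.get`.

```scala
// Hypothetical sketch (NOT Spark's actual classes) of the SPARK-21739 failure
// mode: a timezone-aware expression whose eval() assumes timeZoneId was
// resolved during analysis, and throws None.get when it was not.
case class TzCast(value: String, timeZoneId: Option[String] = None) {
  def eval(): String = s"$value@${timeZoneId.get}"
}

object TzCastDemo {
  def main(args: Array[String]): Unit = {
    // Constructed statically without a timezone, as in the two fixed usages:
    // NoSuchElementException ("None.get") at eval time.
    val outcome =
      try { TzCast("2017-08-16").eval() }
      catch { case e: NoSuchElementException => e.getMessage }
    println(outcome)                                    // None.get
    // The fix: pass the session-local timezone at construction time.
    println(TzCast("2017-08-16", Option("UTC")).eval()) // 2017-08-16@UTC
  }
}
```

This mirrors the patch's shape: both fixed call sites thread the session-local timezone into `Cast` at construction time instead of relying on analysis-time resolution.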
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18946 Is Jenkins unstable?
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18946 @gatorsmile
[GitHub] spark pull request #18946: [SPARK-19471][SQL]AggregationIterator does not in...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18946 [SPARK-19471][SQL]AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? This is a follow-up PR that moves the test case in PR-18920 (https://github.com/apache/spark/pull/18920) to DataFrameAggregateSuite. ## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark branch-19471-followingPR Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18946.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18946 commit 9434fc767e6b7907b9beacb2f4358d767c9d4d32 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-15T02:29:25Z move test case to DataFrameAggregateSuite commit f057ff8400076fce615fe9b6521ed1b3d66cb669 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-15T02:37:33Z change import order
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 Sure, I will do it later.
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 retest please
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 Updated, thanks for reviewing.
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 Jenkins, retest this please.
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 updated
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 Jenkins, retest this please.
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 @hvanhovell, @yangw1234, @gatorsmile
[GitHub] spark issue #18920: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18920 Jenkins, test this please
[GitHub] spark pull request #18920: [SPARK-19471][SQL]AggregationIterator does not in...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18920 [SPARK-19471][SQL]AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? Recently, we have also encountered such NPE issues in our production environment, as described in https://issues.apache.org/jira/browse/SPARK-19471. The issue can be reproduced by the following examples:
```
val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")
// HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), sum("y")).show()
// ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
// SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false & SQLConf.USE_OBJECT_HASH_AGG.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
```
This PR is based on PR-16820 (https://github.com/apache/spark/pull/16820), with test cases for all aggregation paths. We want to push it forward. > When AggregationIterator generates the result projection, it does not call the initialize method of the Projection class. This causes a runtime NullPointerException when the projection involves nondeterministic expressions. ## How was this patch tested? unit test; verified in production environment You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark Branch-spark-19471 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18920.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18920 commit b932d2f3a6741a8ef052cbd8087f4b0836c617d6 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-11T13:00:00Z spark-19471
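The bug class this PR fixes can be illustrated without Spark. The sketch below uses hypothetical names (it is NOT Spark's actual `AggregationIterator` or `Projection`): a nondeterministic expression only creates its internal state inside `initialize(partitionIndex)`, so evaluating a projection that was never initialized dereferences null.

```scala
import scala.util.Random

// Hypothetical sketch of why an uninitialized result projection NPEs:
// nondeterministic expressions (like rand()) allocate their state only in
// initialize(partitionIndex), which the buggy code path never called.
class NondeterministicExpr {
  private var rng: Random = null                          // state created lazily
  def initialize(partitionIndex: Int): Unit = { rng = new Random(partitionIndex) }
  def eval(): Double = rng.nextDouble()                   // NPE if never initialized
}

object ProjectionDemo {
  def main(args: Array[String]): Unit = {
    val uninitialized = new NondeterministicExpr
    val npe = try { uninitialized.eval(); false }
              catch { case _: NullPointerException => true }
    println(s"uninitialized eval NPEs: $npe")

    val initialized = new NondeterministicExpr
    initialized.initialize(partitionIndex = 0)            // the call the fix adds
    println(s"initialized eval works: ${initialized.eval() >= 0.0}")
  }
}
```

This is why the repro above only triggers with whole-stage codegen disabled: only the interpreted aggregation paths go through the projection that was missing the `initialize` call.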
[GitHub] spark issue #18919: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18919 There are some conflicts; closing it for now.
[GitHub] spark pull request #18919: [SPARK-19471][SQL]AggregationIterator does not in...
Github user DonnyZone closed the pull request at: https://github.com/apache/spark/pull/18919
[GitHub] spark pull request #18919: [SPARK-19471][SQL]AggregationIterator does not in...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18919 [SPARK-19471][SQL]AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? Recently, we have also encountered such NPE issues in our production environment, as described in https://issues.apache.org/jira/browse/SPARK-19471. The issue can be reproduced by the following examples:
```
val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")
// HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), sum("y")).show()
// ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
// SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false & SQLConf.USE_OBJECT_HASH_AGG.key=false
df.groupBy("x").agg(rand(), collect_list("y")).show()
```
This PR is based on PR-16820 (https://github.com/apache/spark/pull/16820), with test cases for all aggregation paths. > When AggregationIterator generates the result projection, it does not call the initialize method of the Projection class. This causes a runtime NullPointerException when the projection involves nondeterministic expressions. ## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark Branch-SPARK-19471 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18919.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18919 commit 9a22b71f428f4e78978a8f87445949080edfc717 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-11T10:03:12Z SPARK-19471 with test cases commit 8f813cfac545a9ee77802fc691f1a16167f6cf3d Author: donnyzone <wellfeng...@gmail.com> Date: 2017-08-11T11:07:31Z error fix
[GitHub] spark issue #18656: [SPARK-21441][SQL]Incorrect Codegen in SortMergeJoinExec...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Thanks for reviewing, I will add a test later.
[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18656#discussion_r128142467 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -489,13 +489,13 @@ case class CollapseCodegenStages(conf: SQLConf) extends Rule[SparkPlan] { * Inserts an InputAdapter on top of those that do not support codegen. */ private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match { +case p if !supportCodegen(p) => + // collapse them recursively + InputAdapter(insertWholeStageCodegen(p)) case j @ SortMergeJoinExec(_, _, _, _, left, right) if j.supportCodegen => --- End diff -- Therefore, I think we should still verify it.
[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/18656#discussion_r128142370 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -489,13 +489,13 @@ case class CollapseCodegenStages(conf: SQLConf) extends Rule[SparkPlan] { * Inserts an InputAdapter on top of those that do not support codegen. */ private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match { +case p if !supportCodegen(p) => + // collapse them recursively + InputAdapter(insertWholeStageCodegen(p)) case j @ SortMergeJoinExec(_, _, _, _, left, right) if j.supportCodegen => --- End diff -- SortMergeJoinExec.supportCodegen checks whether `joinType.isInstanceOf[InnerLike]`
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 I have validated both cases, with and without CodegenFallback expressions, for `SortMergeJoinExec`. The fix works well.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Great! I'm also considering disabling codegen for `SortMergeJoinExec` when it contains CodegenFallback expressions. Thanks for your advice; I will work on it and validate it in our environment. Moreover, I wonder whether the current pattern order in `insertInputAdapter` is specifically designed to generate code for `SortMergeJoinExec` in all cases. Could you share any ideas? @davies
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 I notice that the CollapseCodegenStages rule still enables codegen for SortMergeJoinExec without checking for CodegenFallback expressions. The logic in `insertInputAdapter` seems to skip validating SortMergeJoinExec. Actually, I'm not familiar with this part, so please correct me if I get something wrong.
```
private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
  case j @ SortMergeJoinExec(_, _, _, _, left, right) if j.supportCodegen =>
    // The children of SortMergeJoin should do codegen separately.
    j.copy(left = InputAdapter(insertWholeStageCodegen(left)),
      right = InputAdapter(insertWholeStageCodegen(right)))
  ..
```
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 That's interesting; I will take a look at why codegen is enabled.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Yeah, CodegenFallback just provides a fallback mode. However, in this case SortMergeJoinExec passes an incomplete row as input to a Hive UDF that implements CodegenFallback.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Hi, @cloud-fan, @vanzin, could you help to take a look?
[GitHub] spark pull request #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinEx...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18656 [SPARK-21441]Incorrect Codegen in SortMergeJoinExec results failures in some cases ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441 SortMergeJoinExec sets "ctx.INPUT_ROW = rightRow" in its createRightVar() function, but generates BoundReferences against the schema of (left.output ++ right.output). However, expressions implementing CodegenFallback (e.g., SimpleHiveUDF) take ctx.INPUT_ROW as their input.
```
protected void processNext() throws java.io.IOException {
  while (findNextInnerJoinRows(smj_leftInput, smj_rightInput)) {
    int smj_size = smj_matches.size();
    ..
    for (int smj_i = 0; smj_i < smj_size; smj_i ++) {
      ..
      Object smj_obj = ((Expression) references[1]).eval(smj_rightRow1);
      ..
```
In such cases, a NegativeArraySizeException will be thrown. This patch fixes the issue by passing the joined row for condition evaluation. ## How was this patch tested? Manual verification in a cluster. We also checked the generated code after the change:
```
protected void processNext() throws java.io.IOException {
  while (findNextInnerJoinRows(smj_leftInput, smj_rightInput)) {
    int smj_size = smj_matches.size();
    ..
    for (int smj_i = 0; smj_i < smj_size; smj_i ++) {
      ..
      InternalRow smj_joinedRow = new JoinedRow(smj_leftRow, smj_rightRow1);
      ..
      Object smj_obj = ((Expression) references[1]).eval(smj_joinedRow);
      ..
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/DonnyZone/spark Fix_SortMergeJoinExec Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18656.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18656 commit 92dc106aec59a0f2755d7621d2d038312500 Author: donnyzone <wellfeng...@gmail.com> Date: 2017-07-17T13:06:18Z Passing the joined row for condition evaluation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
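The shape of the bug can be sketched outside Spark. The following is a simplified, hypothetical illustration (not Spark's generated code or its real `BoundReference`/`JoinedRow` classes): a reference whose ordinal was bound against the joined schema (left ++ right) fails, or reads the wrong value, when evaluated against the right-side row alone, and works once evaluated against the joined row.

```scala
// Hypothetical sketch of the SPARK-21441 fix. A "bound reference" here is just
// an index into a row represented as Array[Any].
case class BoundRef(ordinal: Int) {
  def eval(row: Array[Any]): Any = row(ordinal)
}

object JoinedRowDemo {
  def main(args: Array[String]): Unit = {
    val leftRow  = Array[Any]("l0", "l1")
    val rightRow = Array[Any]("r0")
    // Ordinal 1 refers to the LEFT side of the joined schema (size 3).
    val ref = BoundRef(1)
    // Buggy codegen: INPUT_ROW points at the right row only.
    val buggy = try { ref.eval(rightRow).toString }
                catch { case _: ArrayIndexOutOfBoundsException => "out of bounds" }
    println(buggy)                 // out of bounds
    // Fixed codegen: evaluate the condition against the joined row.
    val joinedRow = leftRow ++ rightRow
    println(ref.eval(joinedRow))   // l1
  }
}
```

In the real generated code the symptom was a NegativeArraySizeException or wrong results inside the fallback expression rather than a clean index error, but the root cause is the same ordinal/row mismatch.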