[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705057#comment-14705057 ]

Yin Huai commented on SPARK-10100:
----------------------------------

I am changing the title of this jira to "Eliminate hash table lookup if there is no grouping key in aggregation." since https://github.com/apache/spark/pull/8332 is using this JIRA as the issue. For the Max and Min expressions, we can revisit them later if we find a better way to improve the performance.

> AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
> --------------------------------------------------------------------------
>
>                 Key: SPARK-10100
>                 URL: https://issues.apache.org/jira/browse/SPARK-10100
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Herman van Hovell
>             Fix For: 1.5.0
>
>         Attachments: SPARK-10100.perf.test.scala
>
>
> Looks like Max (probably Min) implemented based on AggregateFunction2 is
> slower than the old MaxFunction.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
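The new title refers to skipping the hash table lookup when an aggregation has no grouping key. The following is a minimal sketch of that idea in plain Scala, without Spark. It is not the code from the pull request; `NoGroupingKeySketch`, `maxByGroup`, and `maxGlobal` are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

object NoGroupingKeySketch {
  // Grouped path: every input row pays for a hash table lookup to find its buffer.
  def maxByGroup(rows: Seq[(Int, Int)]): Map[Int, Int] = {
    val buffers = mutable.HashMap.empty[Int, Int]
    rows.foreach { case (key, value) =>
      val current = buffers.getOrElse(key, Int.MinValue)
      buffers(key) = math.max(current, value)
    }
    buffers.toMap
  }

  // No grouping key: there is exactly one buffer, so it is updated directly
  // and no hash table is involved at all.
  def maxGlobal(values: Seq[Int]): Option[Int] =
    if (values.isEmpty) None
    else {
      var buffer = Int.MinValue
      values.foreach(v => buffer = math.max(buffer, v))
      Some(buffer)
    }
}
```

With a grouping key, the per-row cost includes hashing the key and probing the map; without one, the same aggregate reduces to a loop over a single mutable buffer.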
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704742#comment-14704742 ]

Herman van Hovell commented on SPARK-10100:
-------------------------------------------

Let's leave it for 1.6.

> AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
> --------------------------------------------------------------------------
>
>                 Key: SPARK-10100
>                 URL: https://issues.apache.org/jira/browse/SPARK-10100
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Herman van Hovell
>         Attachments: SPARK-10100.perf.test.scala
>
>
> Looks like Max (probably Min) implemented based on AggregateFunction2 is
> slower than the old MaxFunction.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704418#comment-14704418 ]

Apache Spark commented on SPARK-10100:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8332
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704351#comment-14704351 ]

Yin Huai commented on SPARK-10100:
----------------------------------

How about we leave these functions as is for now? It looks like the improvement from updating the expressions is not very significant, and leaving them alone also avoids code changes during the QA period.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704322#comment-14704322 ]

Yin Huai commented on SPARK-10100:
----------------------------------

The dataset I created has 11 columns and 2 groups. The query was applying 10 max functions
{code}
sqlContext.sql("""
  select
    i, sum(j1), sum(j2), sum(j3), sum(j4), sum(j5),
    sum(j6), sum(j7), sum(j8), sum(j9), sum(j10)
  from testAgg
  group by i""")
{code}
On my laptop, 1.5 is about 5% slower than 1.4.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704309#comment-14704309 ]

Yin Huai commented on SPARK-10100:
----------------------------------

I was comparing 1.4 with 1.5 and found that 1.5 is slower. I also tweaked the update expression in master, but that does not seem to bring a significant improvement.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702398#comment-14702398 ]

Yin Huai commented on SPARK-10100:
----------------------------------

[~hvanhovell] How's the performance?

> AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
> --------------------------------------------------------------------------
>
>                 Key: SPARK-10100
>                 URL: https://issues.apache.org/jira/browse/SPARK-10100
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> Looks like Max (probably Min) implemented based on AggregateFunction2 is
> slower than the old MaxFunction.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702343#comment-14702343 ]

Apache Spark commented on SPARK-10100:
--------------------------------------

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/8298
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702344#comment-14702344 ]

Herman van Hovell commented on SPARK-10100:
-------------------------------------------

PR is in.
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702279#comment-14702279 ]

Yin Huai commented on SPARK-10100:
----------------------------------

[~hvanhovell] I think the expression we are using causes the slowness. In the new version of Max, we have
{code}
override val updateExpressions = Seq(
  /* max = */ If(IsNull(child), max, If(IsNull(max), child, Greatest(Seq(max, child))))
)
{code}
For the old MaxFunction, we have
{code}
val currentMax: MutableLiteral = MutableLiteral(null, expr.dataType)
val cmp = LessThan(currentMax, expr)

override def update(input: InternalRow): Unit = {
  if (currentMax.value == null) {
    currentMax.value = expr.eval(input)
  } else if (cmp.eval(input) == true) {
    currentMax.value = expr.eval(input)
  }
}
{code}
I feel we are just using a more expensive expression to calculate max (and probably min). Will you have time to look at it? I think the fix will be pretty small and we can get it into 1.5.
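The contrast between the two update paths quoted above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: SQL NULL is modeled as `None`, `greatest` is a stand-in that skips nulls in the way Spark's `Greatest` is documented to, and `MaxUpdateSketch`, `updateNew`, and `updateOld` are hypothetical names rather than Spark code.

```scala
object MaxUpdateSketch {
  // Stand-in for Greatest: it inspects every argument for null again
  // before comparing, even if the caller has already ruled nulls out.
  def greatest(values: Seq[Option[Int]]): Option[Int] = {
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(nonNull.max)
  }

  // Shape of the new update expression:
  // If(IsNull(child), max, If(IsNull(max), child, Greatest(Seq(max, child))))
  def updateNew(max: Option[Int], child: Option[Int]): Option[Int] =
    if (child.isEmpty) max
    else if (max.isEmpty) child
    else greatest(Seq(max, child))

  // Shape of the old MaxFunction.update: one null check on the buffer,
  // then a single less-than comparison (a null input never compares as true).
  def updateOld(currentMax: Option[Int], input: Option[Int]): Option[Int] =
    (currentMax, input) match {
      case (None, _)                   => input
      case (Some(m), Some(v)) if m < v => input
      case _                           => currentMax
    }
}
```

Both functions produce the same result for the same inputs; the difference is that the new shape performs two explicit null checks and then a null-aware `greatest`, while the old shape performs one null check and one comparison.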
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702252#comment-14702252 ]

Herman van Hovell commented on SPARK-10100:
-------------------------------------------

Any idea why? JoinedRow?