[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension
[ https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736118#comment-15736118 ] Jihong MA commented on SPARK-18799:
---

Are we looking at the first quarter of 2017 for Spark 2.2? Is it now too late to squeeze this into the Spark 2.1 release? Thanks!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension
[ https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736100#comment-15736100 ] Jihong MA commented on SPARK-18799:
---

DML statement support, for instance.
[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension
[ https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736003#comment-15736003 ] Jihong MA commented on SPARK-18799:
---

[~hyukjin.kwon] The intention behind removing it back then was different: Spark didn't have its own parser yet and preferred that the community contribute parser changes directly. For Spark 2.x, it is essential to provide an extension interface for data source integration, especially for syntax/statements Spark has no intention of supporting, even in the future.
[jira] [Created] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension
Jihong MA created SPARK-18799:
---

Summary: Spark SQL expose interface for plug-gable parser extension
Key: SPARK-18799
URL: https://issues.apache.org/jira/browse/SPARK-18799
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Jihong MA

There used to be an interface for plugging in a parser extension through ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x releases, Apache Spark moved to a new parser (ANTLR4), and there is no longer a way to extend the default SQL parser through the SparkSession interface. This is a real pain and hard to work around when integrating other data sources with Spark that need extended support, such as Insert, Update, or Delete statements, or any other data management statement.

It would be very nice to continue exposing an interface for parser extension, to make data source integration easier and smoother.
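The kind of pluggable parser the issue asks for is usually built by delegation: a custom parser recognizes the extended DML statements itself and falls back to the default parser for everything else. Below is a minimal sketch of that pattern; the `SqlParser` interface and `ExtensionParser` class are illustrative names, not Spark's actual API.

```java
// Hypothetical sketch of a delegating parser extension. A real Spark
// integration would return LogicalPlan objects; here plans are just
// strings so the pattern stands alone.
interface SqlParser {
    String parsePlan(String sql);
}

class ExtensionParser implements SqlParser {
    private final SqlParser fallback;

    ExtensionParser(SqlParser fallback) {
        this.fallback = fallback;
    }

    @Override
    public String parsePlan(String sql) {
        String s = sql.trim().toUpperCase();
        // Handle statements the built-in parser does not support.
        if (s.startsWith("DELETE") || s.startsWith("UPDATE")) {
            return "ExtendedDmlPlan(" + sql.trim() + ")";
        }
        // Delegate standard SQL to the default parser.
        return fallback.parsePlan(sql);
    }
}
```

Spark 2.2 eventually shipped a hook along these lines (SparkSessionExtensions.injectParser), which hands the extension a reference to the previous parser precisely so it can delegate like this.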
[jira] [Updated] (SPARK-11720) Handle edge cases when count = 0 or 1 for Stats function
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-11720:
---

Description: update the behavior of stats functions when count = 0 or 1 to make it consistent across Spark SQL (was: change the default behavior of mean in the case of count = 0 from null to Double.NaN, to bring it in line with all other univariate stats functions.)
[jira] [Updated] (SPARK-11720) Handle edge cases when count = 0 or 1 for Stats function
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-11720:
---

Summary: Handle edge cases when count = 0 or 1 for Stats function (was: Return Double.NaN instead of null for Mean and Average when count = 0)
[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003589#comment-15003589 ] Jihong MA commented on SPARK-11720:
---

Also, for mean we treat Decimal differently from the other numeric types: all other numeric types are converted to Double, while for the Decimal type we keep the type but increase its precision/scale to avoid precision loss. So for types converted to Double we can return Double.NaN, but for the Decimal type we can't return a double value; should it still be null?
[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003558#comment-15003558 ] Jihong MA commented on SPARK-11720:
---

[~mengxr] The implementation of average is not numerically stable; should we update it to leverage CentralMomentAgg as part of this JIRA?
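For reference, the numerically stable alternative the comment alludes to replaces "accumulate sum, divide at the end" with an incremental mean update, the same style of recurrence the central-moment aggregates use. A minimal sketch (not Spark's implementation), which also returns Double.NaN rather than null when count = 0, per this JIRA:

```java
// Numerically stable running mean: mean += (x - mean) / n avoids the
// large intermediate sums that make the naive sum/count version lose
// precision. The merge step lets partial aggregates be combined, as a
// distributed aggregate must.
class StableMean {
    private long n = 0;
    private double mean = 0.0;

    void update(double x) {
        n += 1;
        mean += (x - mean) / n;
    }

    void merge(StableMean other) {
        long total = n + other.n;
        if (total == 0) return;
        double delta = other.mean - mean;
        mean += delta * other.n / total;
        n = total;
    }

    // NaN, not null, when no rows were seen.
    double result() {
        return n == 0 ? Double.NaN : mean;
    }
}
```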
[jira] [Created] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
Jihong MA created SPARK-11720:
---

Summary: Return Double.NaN instead of null for Mean and Average when count = 0
Key: SPARK-11720
URL: https://issues.apache.org/jira/browse/SPARK-11720
Project: Spark
Issue Type: Improvement
Reporter: Jihong MA

Change the default behavior of mean in the case of count = 0 from null to Double.NaN, to bring it in line with all other univariate stats functions.
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-11720:
---

Issue Type: Sub-task (was: Improvement)
Parent: SPARK-10384
[jira] [Updated] (SPARK-11420) Updating Stddev support with Imperative Aggregate
[ https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-11420:
---

Summary: Updating Stddev support with Imperative Aggregate (was: Changing Stddev support with Imperative Aggregate)
[jira] [Updated] (SPARK-11420) Changing Stddev support with Imperative Aggregate
[ https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-11420:
---

Issue Type: Sub-task (was: Improvement)
Parent: SPARK-10384
[jira] [Created] (SPARK-11420) Changing Stddev support with Imperative Aggregate
Jihong MA created SPARK-11420:
---

Summary: Changing Stddev support with Imperative Aggregate
Key: SPARK-11420
URL: https://issues.apache.org/jira/browse/SPARK-11420
Project: Spark
Issue Type: Improvement
Components: ML, SQL
Reporter: Jihong MA

Based on the performance comparison of DeclarativeAggregate vs. ImperativeAggregate (SPARK-10953), switch to the imperative aggregate for stddev.
[jira] [Commented] (SPARK-9297) covar_pop and covar_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965662#comment-14965662 ] Jihong MA commented on SPARK-9297:
---

[~viirya] I just noticed that your initial PR for Pearson's correlation is based on ImperativeAggregate, which calculates covariance. Would you like to take this JIRA? If not, I will work on it, as we would like to have this merged in for 1.6 if possible.

> covar_pop and covar_samp aggregate functions
> ---
>
> Key: SPARK-9297
> URL: https://issues.apache.org/jira/browse/SPARK-9297
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
[jira] [Commented] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965298#comment-14965298 ] Jihong MA commented on SPARK-10646:
---

[~mengxr] To add chi-squared test support through the UDAF framework, I would need to keep a HashMap around to track the count for each category encountered. I prototyped it using the ImperativeAggregate interface and realized that the current UDAF infrastructure doesn't support variable-length aggregation buffers, and that it causes GC pressure even with that support in place. I discussed this with [~yhuai] offline some time back. It looks like adding it through a UDAF is not feasible at this point; please let me know how we should proceed. Thanks!

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> ---
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
> Issue Type: Sub-task
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Jihong MA
>
> Pearson's chi-squared goodness-of-fit test of an observed distribution against an expected distribution.
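The HashMap-of-counts state the comment describes is exactly what makes this aggregate variable-length: the buffer grows with the number of distinct categories. A self-contained sketch of the computation itself (illustrative, not Spark code):

```java
import java.util.HashMap;
import java.util.Map;

// Pearson's chi-squared goodness-of-fit statistic over category counts:
// sum over categories of (observed - expected)^2 / expected.
class ChiSquared {
    private final Map<String, Long> observed = new HashMap<>();
    private long total = 0;

    // Per-row update: bump the count of the row's category.
    void update(String category) {
        observed.merge(category, 1L, Long::sum);
        total += 1;
    }

    // expected maps each category to its probability under the null
    // hypothesis; expected counts are probability * total.
    double statistic(Map<String, Double> expected) {
        double chi2 = 0.0;
        for (Map.Entry<String, Double> e : expected.entrySet()) {
            double exp = e.getValue() * total;
            double obs = observed.getOrDefault(e.getKey(), 0L);
            chi2 += (obs - exp) * (obs - exp) / exp;
        }
        return chi2;
    }
}
```

Merging two partial aggregates would mean merging the two HashMaps key by key, which is where the fixed-width UDAF buffer layout gets in the way.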
[jira] [Commented] (SPARK-9297) covar_pop and covar_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964370#comment-14964370 ] Jihong MA commented on SPARK-9297:
---

I will work on this and prepare a PR using the ImperativeAggregate interface.
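The state such an ImperativeAggregate maintains is a running co-moment alongside the two means. A sketch of the standard one-pass update (illustrative only, not the Spark implementation):

```java
// Running covariance via the co-moment C_k = sum of (x - meanX)(y - meanY),
// updated incrementally so no second pass over the data is needed.
class Covariance {
    private long n = 0;
    private double meanX = 0.0, meanY = 0.0, ck = 0.0;

    void update(double x, double y) {
        n += 1;
        double dx = x - meanX;       // delta against the pre-update meanX
        meanX += dx / n;
        meanY += (y - meanY) / n;
        ck += dx * (y - meanY);      // uses the post-update meanY
    }

    double covarPop()  { return n == 0 ? Double.NaN : ck / n; }
    double covarSamp() { return n < 2 ? Double.NaN : ck / (n - 1); }
}
```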
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958386#comment-14958386 ] Jihong MA commented on SPARK-10953:
---

[~mengxr] Merged PR 9038. Below are the average elapsed times collected, which are in line with what we observed earlier. Seth has started preparing an ImperativeAggregate implementation for SPARK-10641 (skewness, kurtosis).

#rows  ImperativeAggregate  DeclarativeAggregate
100    90ms                 0.2s
1000   0.4s                 0.8s
1      4s                   7s

> Benchmark codegen vs. hand-written code for univariate statistics
> ---
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Xiangrui Meng
> Assignee: Jihong MA
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and complex. I'm not sure whether we can benefit from code generation for univariate statistics. We should benchmark it against a Scala implementation.
> {code} > 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if > (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if > (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, > DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, > DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else > input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] > else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, > DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if > (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, > DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, > DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, > DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))): > public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > return new SpecificMutableProjection(expr); > } > class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow; > public > SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > expressions = expr; > mutableRow = new > org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5); > } > public > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection > target(org.apache.spark.sql.catalyst.expressions.MutableRow row) { > mutableRow = row; > return this; > } > /* Provide immutable access to the last projected row. 
*/ > public InternalRow currentValue() { > return (InternalRow) mutableRow; > } > public Object apply(Object _i) { > InternalRow i = (InternalRow) _i; > /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, > DoubleType] */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull4 = i.isNullAt(1); > double primitive5 = isNull4 ? -1.0 : (i.getDouble(1)); > boolean isNull0 = false; > double primitive1 = -1.0; > if (!false && isNull4) { > /* cast(0 as double) */ > /* 0 */ > boolean isNull6 = false; > double primitive7 = -1.0; > if (!false) { > primitive7 = (double) 0; > } > isNull0 = isNull6; > primitive1 = primitive7; > } else { > /* input[1, DoubleType] */ > boolean isNull10 = i.isNullAt(1); > double primitive11 = isNull10 ? -1.0 : (i.getDouble(1)); > isNull0 = isNull10; > primitive1 = primitive11; > } > if (isNull0) { > mutableRow.setNullAt(0); > } else { > mutableRow.setDouble(0, primitive1); > } > /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if > (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, > DoubleType] + input[6, DoubleType]) */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull16 = i.isNullAt(1); > double primitive17 = isNull16 ? -1.0 : (i.getDouble(1)); > boolean isNull12 = fal
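The generated expressions quoted above combine two partial (count, mean, m2) states: counts are summed, the means are count-weighted, and the squared mean-delta is folded into m2. The hand-written equivalent of that merge is short, which is the point of the benchmark. A sketch (not Spark's code):

```java
// Partial state for stddev: count n, mean, and m2 = sum of squared
// deviations from the mean. merge() is the hand-written counterpart of
// the generated merge expressions: summed counts, weighted mean, and
// m2a + m2b + delta^2 * na * nb / (na + nb).
class StddevState {
    double n = 0, mean = 0, m2 = 0;

    void add(double x) {
        n += 1;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    static StddevState merge(StddevState a, StddevState b) {
        StddevState r = new StddevState();
        r.n = a.n + b.n;
        if (r.n == 0) return r;
        r.mean = (a.mean * a.n + b.mean * b.n) / r.n;
        double delta = b.mean - a.mean;
        r.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / r.n;
        return r;
    }

    double stddevPop() { return n == 0 ? Double.NaN : Math.sqrt(m2 / n); }
}
```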
[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447 ] Jihong MA edited comment on SPARK-10953 at 10/13/15 6:05 AM:
---

[~mengxr] I had a quick run on my laptop with a stddev implementation based on ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still uses SortBasedAggregate at runtime and DeclarativeAggregate uses TungstenAggregate, on a single double column of a DataFrame (cached):

#rows  ImperativeAggregate  DeclarativeAggregate
100    58ms                 0.1s
1000   0.4s                 0.6s
1      4s                   7s

Overall it seems ImperativeAggregate performs better. Once TungstenAggregate support for ImperativeAggregate is in good shape (PR 9038), I will merge them in and have another try.
[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447 ] Jihong MA edited comment on SPARK-10953 at 10/13/15 5:59 AM: - had a quick run on my laptop, with a stddev implementation based on ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still use SortBasedAggregate at runtime, and DeclarativeAggregate uses TungstenAggregate. with a single double column of DF (cached): #rowsImperativeAggregate DeclarativeAggregate 100 58ms 0.1s 1000 0.4s 0.6s 1 4s 7s overall it seems ImperativeAggregate perform better. if enabling TungstenAggregate support for ImperativeAggregate is in good shape (PR 9038), I will merge them in and have another try. was (Author: jihongma): had a quick run on my laptop, with a stddev implementation based on ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still use SortBasedAggregate at runtime, where DeclarativeAggregate uses TungstenAggregate. with a single double column of DF (cached): #rowsImperativeAggregate DeclarativeAggregate 100 58ms 0.1s 1000 0.4s 0.6s 1 4s 7s overall it seems ImperativeAggregate perform better. if enabling TungstenAggregate support for ImperativeAggregate is in good shape (PR 9038), I will merge them in and have another try. > Benchmark codegen vs. hand-written code for univariate statistics > - > > Key: SPARK-10953 > URL: https://issues.apache.org/jira/browse/SPARK-10953 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Xiangrui Meng >Assignee: Jihong MA > > I checked the generated code for a simple stddev_pop call: > {code} > val df = sqlContext.range(100) > df.select(stddev_pop(col("id"))).show() > {code} > This is the generated code for the merge part, which is very long and > complex. I'm not sure whether we can get benefit from the code generation for > univariate statistics. We should benchmark it against Scala implementation. 
> {code} > 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if > (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if > (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, > DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, > DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else > input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] > else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, > DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if > (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, > DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, > DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, > DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))): > public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > return new SpecificMutableProjection(expr); > } > class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow; > public > SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > expressions = expr; > mutableRow = new > org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5); > } > public > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection > target(org.apache.spark.sql.catalyst.expressions.MutableRow row) { > mutableRow = row; > return this; > } > /* Provide immutable access to the last projected row. 
*/ > public InternalRow currentValue() { > return (InternalRow) mutableRow; > } > public Object apply(Object _i) { > InternalRow i = (InternalRow) _i; > /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, > DoubleType] */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull4 = i.isNullAt(1); > double primitive5 = isNull4 ? -1.0 : (i.getDouble(1)); >
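Underneath the generated null checks, the quoted {code} dump above is the merge step for stddev_pop: a pairwise combine of (count, mean, M2) partial aggregates that both the imperative and declarative paths must compute. A minimal pure-Scala sketch of that algebra, with no Spark dependency (the names are illustrative, not Spark's internal buffer layout):

```scala
// Welford single-value update plus the standard parallel-variance merge.
// VarBuffer is an illustrative name, not a Spark class.
final case class VarBuffer(n: Double, mean: Double, m2: Double) {
  // Fold one value into the running (count, mean, M2) state.
  def add(x: Double): VarBuffer = {
    val n1 = n + 1
    val delta = x - mean
    val mean1 = mean + delta / n1
    VarBuffer(n1, mean1, m2 + delta * (x - mean1))
  }
  // Combine two partial aggregates (e.g. from two partitions).
  def merge(o: VarBuffer): VarBuffer =
    if (n == 0) o
    else if (o.n == 0) this
    else {
      val nT = n + o.n
      val delta = o.mean - mean
      VarBuffer(nT, (mean * n + o.mean * o.n) / nT,
        m2 + o.m2 + delta * delta * n * o.n / nT)
    }
  def stddevPop: Double = math.sqrt(m2 / n)
}

// Usage: split 0..99 into two "partitions", aggregate each, merge.
val xs = (0 until 100).map(_.toDouble)
val (left, right) = xs.splitAt(40)
val merged = left.foldLeft(VarBuffer(0.0, 0.0, 0.0))(_ add _)
  .merge(right.foldLeft(VarBuffer(0.0, 0.0, 0.0))(_ add _))
println(merged.stddevPop) // ~ 28.866, the population stddev of 0..99
```

The merge term delta^2 * n * o.n / nT is exactly what the generated expression above spells out field by field for the nullable case.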
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954281#comment-14954281 ] Jihong MA commented on SPARK-10953: --- [~mengxr] As Yin indicated in the comment, we would like to merge the pull request enabling TungstenAggregate support for ImperativeAggregate (PR 9038), to make the comparison between ImperativeAggregate and DeclarativeAggregate fair and to eliminate the perf impact of the runtime difference (ImperativeAggregate used to fall back to SortBasedAggregate). I noticed there were issues with the pull request and Josh merged a couple more commits later this afternoon. [~yhuai] would you say it is OK now to merge that code for perf testing? Thanks! > Benchmark codegen vs. hand-written code for univariate statistics
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10641: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > > Implementing skewness and kurtosis support based on the following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
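The Wikipedia page linked in the description gives one-pass updates for the higher-order central moments M2..M4, from which skewness and kurtosis follow. A pure-Scala sketch of that update (a sketch of the cited algorithm, not the eventual Spark implementation):

```scala
// One-pass accumulation of count, mean and central moments M2..M4, following
// the "Higher-order statistics" section of the linked Wikipedia article.
final class MomentAgg {
  private var n, mean, m2, m3, m4 = 0.0
  def update(x: Double): Unit = {
    val n1 = n + 1           // new count
    val delta = x - mean
    val dn = delta / n1
    val dn2 = dn * dn
    val t = delta * dn * n   // term1 in the article's notation
    mean += dn
    // M4 and M3 must read the *old* M2/M3, so update highest order first.
    m4 += t * dn2 * (n1 * n1 - 3 * n1 + 3) + 6 * dn2 * m2 - 4 * dn * m3
    m3 += t * dn * (n1 - 2) - 3 * dn * m2
    m2 += t
    n = n1
  }
  def skewness: Double = math.sqrt(n) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0 // excess kurtosis
}

// Usage: symmetric input has zero skewness.
val agg = new MomentAgg
Seq(1.0, 2.0, 3.0, 4.0, 5.0).foreach(agg.update)
println((agg.skewness, agg.kurtosis))
```

For 1..5 the skewness is 0 and the excess kurtosis is 5*34/100 - 3 = -1.3, which makes a convenient sanity check.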
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949797#comment-14949797 ] Jihong MA commented on SPARK-10953: --- We should have a cluster for testing next Monday; we will run the performance comparison between ImperativeAggregate and DeclarativeAggregate (initially named AggregateFunction2 and AlgebraicAggregate). [~yhuai] my understanding of the difference between the two: DeclarativeAggregate directly manipulates the aggregate buffer, while ImperativeAggregate expresses aggregates as expressions; at runtime, DeclarativeAggregate uses TungstenAggregate while ImperativeAggregate uses SortBasedAggregate? Please clarify whether that is correct. Thanks! > Benchmark codegen vs. hand-written code for univariate statistics
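Roughly, the contrast being benchmarked: an ImperativeAggregate-style operator exposes initialize/update/merge callbacks that mutate a fixed-size buffer in place, while a DeclarativeAggregate states the same three steps as Catalyst expressions for the code generator to compile. A plain-Scala sketch of the imperative shape only (illustrative names, not Spark's actual interface):

```scala
// The imperative contract: three callbacks that mutate a flat buffer in place,
// plus a final eval. Illustrative only; Spark's real ImperativeAggregate works
// over InternalRow buffers, not Array[Double].
object AvgAgg {
  // buffer layout: buf(0) = running sum, buf(1) = running count
  def initialize(buf: Array[Double]): Unit = { buf(0) = 0.0; buf(1) = 0.0 }
  def update(buf: Array[Double], x: Double): Unit = { buf(0) += x; buf(1) += 1.0 }
  def merge(buf: Array[Double], other: Array[Double]): Unit = {
    buf(0) += other(0); buf(1) += other(1)
  }
  def eval(buf: Array[Double]): Double = buf(0) / buf(1)
}

// Usage: aggregate two "partitions", then merge their buffers.
val a = new Array[Double](2); AvgAgg.initialize(a)
Seq(1.0, 2.0, 3.0).foreach(AvgAgg.update(a, _))
val b = new Array[Double](2); AvgAgg.initialize(b)
Seq(4.0, 5.0).foreach(AvgAgg.update(b, _))
AvgAgg.merge(a, b)
println(AvgAgg.eval(a)) // average of 1..5
```

A declarative version would instead hand the planner expression trees for the same sum/count arithmetic, which is what produces the long generated merge code quoted in this issue.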
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946061#comment-14946061 ] Jihong MA commented on SPARK-10953: --- It should not be too hard to put together an implementation based on the AggregateFunction2 interface (e.g. HyperLogLogPlusPlus). Would it make more sense to compare an AggregateFunction2 implementation against AlgebraicAggregate, if we will go for AggregateFunction2 as the better alternative based on the result? Or we could do as you suggested and compare rdd.stats() with df.describe(), where describe uses the AlgebraicAggregate UDAF internally. We need a bigger cluster (bare-metal or on the cloud); not sure when we can have it. Also, are there performance tools to generate data for testing purposes? > Benchmark codegen vs. hand-written code for univariate statistics
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945532#comment-14945532 ] Jihong MA commented on SPARK-10953: --- [~mengxr] do you mean comparing an implementation that operates directly at the RDD level vs. one leveraging the UDAF framework, like what has been done under sql/core/src/main/scala/org/apache/spark/sql/execution/stat/? > Benchmark codegen vs. hand-written code for univariate statistics
[jira] [Updated] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10862: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > Univariate Statistics: Adding median & quantile support as UDAF > --- > > Key: SPARK-10862 > URL: https://issues.apache.org/jira/browse/SPARK-10862 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10862: -- Summary: Univariate Statistics: Adding median & quantile support as UDAF (was: Univariate Statistics: Adding median support as UDAF) > Univariate Statistics: Adding median & quantile support as UDAF > --- > > Key: SPARK-10862 > URL: https://issues.apache.org/jira/browse/SPARK-10862 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10862) Univariate Statistics: Adding median support as UDAF
Jihong MA created SPARK-10862: - Summary: Univariate Statistics: Adding median support as UDAF Key: SPARK-10862 URL: https://issues.apache.org/jira/browse/SPARK-10862 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
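Unlike mean or stddev, the median has no constant-size mergeable buffer, which is what makes a distributed median UDAF nontrivial. As a single-node baseline for comparison (a sketch, not the proposed implementation):

```scala
// Exact median by sorting: O(n log n), requires all values in one place.
// A distributed UDAF would need an approximate sketch (e.g. quantile summaries)
// to keep the aggregation buffer bounded.
def median(xs: Seq[Double]): Double = {
  require(xs.nonEmpty, "median of empty input")
  val s = xs.sorted
  val n = s.length
  if (n % 2 == 1) s(n / 2)
  else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

println(median(Seq(3.0, 1.0, 2.0)))      // odd count: middle element
println(median(Seq(1.0, 2.0, 3.0, 4.0))) // even count: mean of the middle pair
```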
[jira] [Commented] (SPARK-10861) Univariate Statistics: Adding range support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934109#comment-14934109 ] Jihong MA commented on SPARK-10861: --- I will send a PR soon. > Univariate Statistics: Adding range support as UDAF > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support for continuous -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
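Range, by contrast, does reduce to a constant-size mergeable buffer: track (min, max) per partition and combine. A pure-Scala sketch of that buffer algebra (hypothetical names, not the PR's actual code):

```scala
// Two-value aggregation buffer for range = max - min. The empty buffer uses
// the infinities so that merging it is a no-op.
final case class RangeBuf(min: Double, max: Double) {
  def add(x: Double): RangeBuf = RangeBuf(math.min(min, x), math.max(max, x))
  def merge(o: RangeBuf): RangeBuf =
    RangeBuf(math.min(min, o.min), math.max(max, o.max))
  def range: Double = max - min
}
object RangeBuf {
  val empty = RangeBuf(Double.PositiveInfinity, Double.NegativeInfinity)
}

// Usage: aggregate per "partition", then merge.
val parts = Seq(Seq(3.0, 7.0), Seq(1.0, 5.0))
val merged = parts.map(_.foldLeft(RangeBuf.empty)(_ add _)).reduce(_ merge _)
println(merged.range) // 7.0 - 1.0
```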
[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10861: -- Summary: Univariate Statistics: Adding range support as UDAF (was: Univariate Statistics: Adding range support for continuous ) > Univariate Statistics: Adding range support as UDAF > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support as UDAF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10861: -- Description: Range support for continuous (was: Range support as UDAF ) > Univariate Statistics: Adding range support as UDAF > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support for continuous -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support for continuous
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10861: -- Description: Range support as UDAF > Univariate Statistics: Adding range support for continuous > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support as UDAF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support for continuous
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10861: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > Univariate Statistics: Adding range support for continuous > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10861) Univariate Statistics: Adding range support for continuous
Jihong MA created SPARK-10861: - Summary: Univariate Statistics: Adding range support for continuous Key: SPARK-10861 URL: https://issues.apache.org/jira/browse/SPARK-10861 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
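The requested range statistic reduces to a mergeable (min, max) pair, which is what makes it expressible as a single-pass UDAF. A minimal sketch of the aggregation state such a UDAF might keep, written in Python for illustration (hypothetical names; this is not Spark's implementation):

```python
class RangeAgg:
    """Single-pass, mergeable accumulator for range = max - min.
    Hypothetical sketch of a range UDAF's buffer, not Spark code."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def update(self, x):
        # One pass over the data: track running min and max.
        if x < self.lo:
            self.lo = x
        if x > self.hi:
            self.hi = x

    def merge(self, other):
        # Combine partial results from two partitions.
        self.lo = min(self.lo, other.lo)
        self.hi = max(self.hi, other.hi)

    def result(self):
        return self.hi - self.lo
```

The `merge` step is what lets the statistic be computed distributively across partitions without a second pass.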
[jira] [Commented] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test
[ https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934098#comment-14934098 ] Jihong MA commented on SPARK-10860: --- [~josephkb] please assign this JIRA to me. Thanks! > Bivariate Statistics: Chi-Squared independence test > --- > > Key: SPARK-10860 > URL: https://issues.apache.org/jira/browse/SPARK-10860 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Pearson's chi-squared independence test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test
[ https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10860: -- Description: Pearson's chi-squared independence test > Bivariate Statistics: Chi-Squared independence test > --- > > Key: SPARK-10860 > URL: https://issues.apache.org/jira/browse/SPARK-10860 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Pearson's chi-squared independence test -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test
[ https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10860: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics: Chi-Squared independence test > --- > > Key: SPARK-10860 > URL: https://issues.apache.org/jira/browse/SPARK-10860 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test
Jihong MA created SPARK-10860: - Summary: Bivariate Statistics: Chi-Squared independence test Key: SPARK-10860 URL: https://issues.apache.org/jira/browse/SPARK-10860 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
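Pearson's chi-squared independence test compares observed cell counts in a contingency table against the counts expected if the two categorical variables were independent. A compact Python sketch of the statistic (illustration only, not the ML library's implementation):

```python
def chi_squared_independence(table):
    """Pearson's chi-squared statistic for independence on a 2-D
    contingency table given as a list of rows of observed counts.
    Returns (statistic, degrees of freedom). Sketch for illustration."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * col total / N.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

The statistic is then compared against a chi-squared distribution with the returned degrees of freedom to obtain a p-value.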
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Description: Pearson's chi-squared goodness of fit test for observed against the expected distribution. (was: Pearson's chi-squared goodness of fit test for observed against the expected distribution & independence test. ) Summary: Bivariate Statistics: Pearson's Chi-Squared goodness of fit test (was: Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical) > Bivariate Statistics: Pearson's Chi-Squared goodness of fit test > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
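The goodness-of-fit variant tests observed counts against a single expected distribution rather than a contingency table. A one-line sketch in Python (illustrative only):

```python
def chi_squared_gof(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: observed counts
    against expected counts under the hypothesized distribution.
    Sketch for illustration, not Spark's implementation."""
    assert len(observed) == len(expected)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```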
[jira] [Updated] (SPARK-10645) Bivariate Statistics: Spearman's Correlation support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10645: -- Description: Spearman's rank correlation coefficient : https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient (was: this is an umbrella jira, which covers Bivariate Statistics for continuous vs. continuous columns, including covariance, Pearson's correlation, Spearman's correlation (for both continuous & categorical).) Summary: Bivariate Statistics: Spearman's Correlation support as UDAF (was: Bivariate Statistics for continuous vs. continuous) > Bivariate Statistics: Spearman's Correlation support as UDAF > > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Spearman's rank correlation coefficient : > https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
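Spearman's rank correlation is the Pearson correlation computed on the rank vectors of the two columns, with ties assigned average ranks. A self-contained Python sketch of that definition (illustration only, not the proposed UDAF):

```python
def _ranks(xs):
    """Average ranks (1-based), so tied values share the mean of their
    positions in the sorted order."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors. Sketch."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

The ranking step is what makes Spearman harder to express as a single-pass UDAF than covariance or Pearson correlation: ranks require a global sort of each column.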
[jira] [Commented] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803378#comment-14803378 ] Jihong MA commented on SPARK-10646: --- [~josephkb] please assign this JIRA to me; I will start working on it. Thanks! > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution & independence test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Description: Pearson's chi-squared goodness of fit test for observed against the expected distribution & independence test. (was: Pearson's chi-squared goodness of fit test for observed against the expected distribution.) > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution & independence test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10645: -- Component/s: SQL ML > Bivariate Statistics for continuous vs. continuous > -- > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > this is an umbrella jira, which covers Bivariate Statistics for continuous > vs. continuous columns, including covariance, Pearson's correlation, > Spearman's correlation (for both continuous & categorical). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Component/s: SQL ML > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Description: Pearson's chi-squared goodness of fit test for observed against the expected distribution. > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > Pearson's chi-squared goodness of fit test for observed against the expected > distribution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
[ https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10646: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. > categorical > > > Key: SPARK-10646 > URL: https://issues.apache.org/jira/browse/SPARK-10646 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical
Jihong MA created SPARK-10646: - Summary: Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical Key: SPARK-10646 URL: https://issues.apache.org/jira/browse/SPARK-10646 Project: Spark Issue Type: New Feature Reporter: Jihong MA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10645: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10385 > Bivariate Statistics for continuous vs. continuous > -- > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task >Reporter: Jihong MA > > this is an umbrella jira, which covers Bivariate Statistics for continuous > vs. continuous columns, including covariance, Pearson's correlation, > Spearman's correlation (for both continuous & categorical). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous
Jihong MA created SPARK-10645: - Summary: Bivariate Statistics for continuous vs. continuous Key: SPARK-10645 URL: https://issues.apache.org/jira/browse/SPARK-10645 Project: Spark Issue Type: New Feature Reporter: Jihong MA this is an umbrella jira, which covers Bivariate Statistics for continuous vs. continuous columns, including covariance, Pearson's correlation, Spearman's correlation (for both continuous & categorical). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816 ] Jihong MA commented on SPARK-10602: --- I went ahead and created SPARK-10641. Since this JIRA is not listed as an umbrella, I couldn't link to it directly and instead linked to SPARK-10384. @Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-10641: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-10384 > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10641) skewness and kurtosis support
Jihong MA created SPARK-10641: - Summary: skewness and kurtosis support Key: SPARK-10641 URL: https://issues.apache.org/jira/browse/SPARK-10641 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Jihong MA Implementing skewness and kurtosis support based on following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
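The Wikipedia section cited in the issue gives single-pass update formulas for the central moments M2, M3, M4, from which skewness and kurtosis follow. A direct Python transcription of those update formulas (a sketch of the algorithm, not Spark's actual aggregate):

```python
class MomentsAgg:
    """Single-pass accumulator for central moments up to order 4, using
    the incremental update formulas from the linked Wikipedia article."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations
        self.m3 = 0.0  # third central moment * n
        self.m4 = 0.0  # fourth central moment * n

    def update(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def skewness(self):
        # Population skewness: sqrt(n) * m3 / m2^(3/2).
        return (self.n ** 0.5) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        # Excess kurtosis: n * m4 / m2^2 - 3.
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0
```

A mergeable variant of the same formulas exists (also in the linked article), which is what a distributed aggregate would need for combining per-partition buffers.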
[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743959#comment-14743959 ] Jihong MA commented on SPARK-6548: -- [~davies] please fix the assignee to Jihong. Thanks! > stddev_pop and stddev_samp aggregate functions > -- > > Key: SPARK-6548 > URL: https://issues.apache.org/jira/browse/SPARK-6548 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Labels: DataFrame, starter > Fix For: 1.6.0 > > > Add it to the list of aggregate functions: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala > Also add it to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala > We can either add a Stddev Catalyst expression, or just compute it using > existing functions like here: > https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731140#comment-14731140 ] Jihong MA commented on SPARK-8951: -- This commit causes the R style check to fail. Running R style checks Loading required package: methods Attaching package: 'SparkR' The following objects are masked from 'package:stats': filter, na.omit The following objects are masked from 'package:base': intersect, rbind, sample, subset, summary, table, transform Attaching package: 'testthat' The following object is masked from 'package:SparkR': describe R/deserialize.R:63:9: style: Trailing whitespace is superfluous. string ^ lintr checks failed. [error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; received return code 1 Archiving unit tests logs... > No log files found. Attempting to post to Github... > Post successful. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Test FAILed. Refer to this link for build results (access rights to CI server needed): > support CJK characters in collect() > --- > > Key: SPARK-8951 > URL: https://issues.apache.org/jira/browse/SPARK-8951 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Jaehong Choi >Assignee: Jaehong Choi >Priority: Minor > Fix For: 1.6.0 > > Attachments: SerDe.scala.diff > > > Spark gives an error message and does not show the output when a field of the > result DataFrame contains characters in CJK. > I found out that SerDe in R API only supports ASCII format for strings right > now as commented in source code. > So, I fixed SerDe.scala a little to support CJK as the file attached. > I did not care about efficiency, but just wanted to see if it works. 
> {noformat} > people.json > {"name":"가나"} > {"name":"테스트123", "age":30} > {"name":"Justin", "age":19} > df <- read.df(sqlContext, "./people.json", "json") > head(df) > Error in rawtochar(string) : embedded nul in string : '\0 \x98' > {noformat} > {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala} > // NOTE: Only works for ASCII right now > def writeString(out: DataOutputStream, value: String): Unit = { > val len = value.length > out.writeInt(len + 1) // For the \0 > out.writeBytes(value) > out.writeByte(0) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
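The quoted `writeString` prefixes the payload with the character count (`value.length + 1`) while the bytes actually written depend on the encoding, so for multi-byte CJK characters the declared length and the payload disagree and the reader desynchronizes. One way to see the mismatch, and the byte-length fix, sketched in Python (illustration of the failure mode only, not the SparkR SerDe code):

```python
import struct

def write_string_char_len(value):
    """Mirrors the buggy pattern: the length prefix counts *characters*,
    while the payload is the encoded bytes plus a NUL terminator."""
    return struct.pack(">i", len(value) + 1) + value.encode("utf-8") + b"\x00"

def write_string_byte_len(value):
    """The fix: encode first, then prefix with the *byte* count, so the
    reader always consumes exactly as many bytes as declared."""
    data = value.encode("utf-8") + b"\x00"
    return struct.pack(">i", len(data)) + data
```

For pure ASCII the two agree, which is why the bug only surfaces with CJK input.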
[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
[ https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627317#comment-14627317 ] Jihong MA commented on SPARK-8800: -- I would like to suggest reverting the initial code changes for SPARK-8359, SPARK-8677 and SPARK-8800, and thinking through whether supporting unlimited-precision decimal multiplication is the right way of fixing it, as Hive only supports limited-precision decimal multiplication. > Spark SQL Decimal Division operation loss of precision/scale when type is > defined as DecimalType.Unlimited > -- > > Key: SPARK-8800 > URL: https://issues.apache.org/jira/browse/SPARK-8800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jihong MA >Assignee: Liang-Chi Hsieh >Priority: Blocker > Fix For: 1.5.0 > > > According to specification defined in Java doc over BigDecimal : > http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html > When a MathContext object is supplied with a precision setting of 0 (for > example, MathContext.UNLIMITED), arithmetic operations are exact, as are the > arithmetic methods which take no MathContext object. (This is the only > behavior that was supported in releases prior to 5.) As a corollary of > computing the exact result, the rounding mode setting of a MathContext object > with a precision setting of 0 is not used and thus irrelevant. In the case of > divide, the exact quotient could have an infinitely long decimal expansion; > for example, 1 divided by 3. If the quotient has a nonterminating decimal > expansion and the operation is specified to return an exact result, an > ArithmeticException is thrown. Otherwise, the exact result of the division is > returned, as done for other operations. > when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact > result of the division should be returned or truncated to precision = 38 > which is in align with what Hive supports. 
the current behavior is as shown > following, which cause we lose the accuracy of Decimal division operation. > scala> val aa = Decimal(2) / Decimal(3); > aa: org.apache.spark.sql.types.Decimal = 1 > here is another example where we should return 0.125 instead of 0 > scala> val aa = Decimal(1) /Decimal(8) > aa: org.apache.spark.sql.types.Decimal = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
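The behavior the report asks for, returning the quotient truncated to 38 significant digits rather than collapsing 2/3 to 1, can be sketched with Python's decimal module (an illustration of the proposed semantics, not Spark's Decimal type):

```python
from decimal import Decimal, localcontext

def div38(a, b):
    """Divide two decimals at a fixed precision of 38 significant
    digits, the cap the report says Hive uses. Non-terminating
    quotients like 2/3 round to 38 digits instead of failing or
    losing their fractional part."""
    with localcontext() as ctx:
        ctx.prec = 38
        return Decimal(a) / Decimal(b)
```

Terminating quotients such as 1/8 come back exact, and non-terminating ones are rounded at the cap rather than raising, which is the BigDecimal behavior one gets by supplying a bounded MathContext instead of MathContext.UNLIMITED.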
[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
[ https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627306#comment-14627306 ] Jihong MA commented on SPARK-8800: -- I applied the fix and noticed the same. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1198.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1198.0 (TID 3847, localhost): java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result. at java.math.BigDecimal.divide(BigDecimal.java:1616) at java.math.BigDecimal.divide(BigDecimal.java:1650) at scala.math.BigDecimal.$div(BigDecimal.scala:256) at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:282) at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:348) at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:347) at org.apache.spark.sql.catalyst.expressions.Divide$$anonfun$div$1.apply(arithmetic.scala:193) at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:206) > Spark SQL Decimal Division operation loss of precision/scale when type is > defined as DecimalType.Unlimited > -- > > Key: SPARK-8800 > URL: https://issues.apache.org/jira/browse/SPARK-8800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jihong MA >Assignee: Liang-Chi Hsieh >Priority: Blocker > Fix For: 1.5.0 > > > According to specification defined in Java doc over BigDecimal : > http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html > When a MathContext object is supplied with a precision setting of 0 (for > example, MathContext.UNLIMITED), arithmetic operations are exact, as are the > arithmetic methods which take no MathContext object. (This is the only > behavior that was supported in releases prior to 5.) 
As a corollary of > computing the exact result, the rounding mode setting of a MathContext object > with a precision setting of 0 is not used and thus irrelevant. In the case of > divide, the exact quotient could have an infinitely long decimal expansion; > for example, 1 divided by 3. If the quotient has a nonterminating decimal > expansion and the operation is specified to return an exact result, an > ArithmeticException is thrown. Otherwise, the exact result of the division is > returned, as done for other operations. > when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact > result of the division should be returned or truncated to precision = 38 > which is in align with what Hive supports. the current behavior is as shown > following, which cause we lose the accuracy of Decimal division operation. > scala> val aa = Decimal(2) / Decimal(3); > aa: org.apache.spark.sql.types.Decimal = 1 > here is another example where we should return 0.125 instead of 0 > scala> val aa = Decimal(1) /Decimal(8) > aa: org.apache.spark.sql.types.Decimal = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
[ https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jihong MA updated SPARK-8800: - Description: According to specification defined in Java doc over BigDecimal : http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html When a MathContext object is supplied with a precision setting of 0 (for example, MathContext.UNLIMITED), arithmetic operations are exact, as are the arithmetic methods which take no MathContext object. (This is the only behavior that was supported in releases prior to 5.) As a corollary of computing the exact result, the rounding mode setting of a MathContext object with a precision setting of 0 is not used and thus irrelevant. In the case of divide, the exact quotient could have an infinitely long decimal expansion; for example, 1 divided by 3. If the quotient has a nonterminating decimal expansion and the operation is specified to return an exact result, an ArithmeticException is thrown. Otherwise, the exact result of the division is returned, as done for other operations. when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact result of the division should be returned or truncated to precision = 38 which is in align with what Hive supports. the current behavior is as shown following, which cause we lose the accuracy of Decimal division operation. scala> val aa = Decimal(2) / Decimal(3); aa: org.apache.spark.sql.types.Decimal = 1 here is another example where we should return 0.125 instead of 0 scala> val aa = Decimal(1) /Decimal(8) aa: org.apache.spark.sql.types.Decimal = 0 was: According to specification defined in Java doc over BigDecimal : http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html When a MathContext object is supplied with a precision setting of 0 (for example, MathContext.UNLIMITED), arithmetic operations are exact, as are the arithmetic methods which take no MathContext object. 
(This is the only behavior that was supported in releases prior to 5.) As a corollary of computing the exact result, the rounding mode setting of a MathContext object with a precision setting of 0 is not used and thus irrelevant. In the case of divide, the exact quotient could have an infinitely long decimal expansion; for example, 1 divided by 3. If the quotient has a nonterminating decimal expansion and the operation is specified to return an exact result, an ArithmeticException is thrown. Otherwise, the exact result of the division is returned, as done for other operations. when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact result of the division should be returned or truncated to precision = 38 which is in align with what Hive supports. the current behavior is as shown following, which cause we lose the accuracy of Decimal division operation. scala> val aa = Decimal(2) / Decimal(3); aa: org.apache.spark.sql.types.Decimal = 1 > Spark SQL Decimal Division operation loss of precision/scale when type is > defined as DecimalType.Unlimited > -- > > Key: SPARK-8800 > URL: https://issues.apache.org/jira/browse/SPARK-8800 > Project: Spark > Issue Type: Bug >Reporter: Jihong MA > > According to specification defined in Java doc over BigDecimal : > http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html > When a MathContext object is supplied with a precision setting of 0 (for > example, MathContext.UNLIMITED), arithmetic operations are exact, as are the > arithmetic methods which take no MathContext object. (This is the only > behavior that was supported in releases prior to 5.) As a corollary of > computing the exact result, the rounding mode setting of a MathContext object > with a precision setting of 0 is not used and thus irrelevant. In the case of > divide, the exact quotient could have an infinitely long decimal expansion; > for example, 1 divided by 3. 
If the quotient has a nonterminating decimal > expansion and the operation is specified to return an exact result, an > ArithmeticException is thrown. Otherwise, the exact result of the division is > returned, as done for other operations. > when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact > result of the division should be returned or truncated to precision = 38 > which is in align with what Hive supports. the current behavior is as shown > following, which cause we lose the accuracy of Decimal division operation. > scala> val aa = Decimal(2) / Decimal(3); > aa: org.apache.spark.sql.types.Decimal = 1 > here is another example where we should return 0.125 instead of 0 > scala> val aa = Decimal(1) /Decimal(8) > aa: org.apache.spark.sql.types.Decimal = 0 -- This message was sent by Atlassian JIR
[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
[ https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612524#comment-14612524 ]

Jihong MA commented on SPARK-8800:
----------------------------------

This is an issue noticed after we opened up the precision limit for decimal
multiplication in SPARK-8359. SPARK-8677 only solves part of the issue.

> Spark SQL Decimal Division operation loss of precision/scale when type is
> defined as DecimalType.Unlimited
> -------------------------------------------------------------------------
>
>                 Key: SPARK-8800
>                 URL: https://issues.apache.org/jira/browse/SPARK-8800
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Jihong MA
>
> According to the specification defined in the Java doc for BigDecimal:
> http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html
> When a MathContext object is supplied with a precision setting of 0 (for
> example, MathContext.UNLIMITED), arithmetic operations are exact, as are the
> arithmetic methods which take no MathContext object. (This is the only
> behavior that was supported in releases prior to 5.) As a corollary of
> computing the exact result, the rounding mode setting of a MathContext object
> with a precision setting of 0 is not used and thus irrelevant. In the case of
> divide, the exact quotient could have an infinitely long decimal expansion;
> for example, 1 divided by 3. If the quotient has a nonterminating decimal
> expansion and the operation is specified to return an exact result, an
> ArithmeticException is thrown. Otherwise, the exact result of the division is
> returned, as done for other operations.
> When Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact
> result of the division should be returned, or truncated to precision = 38,
> which is in line with what Hive supports. The current behavior is shown
> below; it loses the accuracy of the Decimal division operation.
> scala> val aa = Decimal(2) / Decimal(3);
> aa: org.apache.spark.sql.types.Decimal = 1
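The trade-off described in this report is visible directly in java.math.BigDecimal, which backs Spark's Decimal type: demanding an exact result makes 2/3 throw, while capping precision at 38 (the Hive-compatible limit suggested above) yields a correctly rounded quotient instead of 1. A minimal plain-Java sketch with no Spark dependency (the class name is mine):

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class DecimalDivideSketch {
    public static void main(String[] args) {
        BigDecimal two = new BigDecimal(2);
        BigDecimal three = new BigDecimal(3);

        // MathContext.UNLIMITED (precision 0) requires an exact result;
        // 2/3 has a non-terminating expansion, so divide throws.
        try {
            two.divide(three, MathContext.UNLIMITED);
        } catch (ArithmeticException e) {
            System.out.println("unlimited: " + e.getMessage());
        }

        // Capping precision at 38 with an explicit rounding mode returns
        // a quotient with 38 significant digits (0.666...67) rather than
        // losing the fractional part entirely.
        MathContext mc38 = new MathContext(38, RoundingMode.HALF_UP);
        System.out.println("precision 38: " + two.divide(three, mc38));
    }
}
```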
[jira] [Created] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
Jihong MA created SPARK-8800:
--------------------------------

             Summary: Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited
                 Key: SPARK-8800
                 URL: https://issues.apache.org/jira/browse/SPARK-8800
             Project: Spark
          Issue Type: Bug
            Reporter: Jihong MA

According to the specification defined in the Java doc for BigDecimal:
http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

When a MathContext object is supplied with a precision setting of 0 (for
example, MathContext.UNLIMITED), arithmetic operations are exact, as are the
arithmetic methods which take no MathContext object. (This is the only
behavior that was supported in releases prior to 5.) As a corollary of
computing the exact result, the rounding mode setting of a MathContext object
with a precision setting of 0 is not used and thus irrelevant. In the case of
divide, the exact quotient could have an infinitely long decimal expansion;
for example, 1 divided by 3. If the quotient has a nonterminating decimal
expansion and the operation is specified to return an exact result, an
ArithmeticException is thrown. Otherwise, the exact result of the division is
returned, as done for other operations.

When Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact
result of the division should be returned, or truncated to precision = 38,
which is in line with what Hive supports. The current behavior is shown
below; it loses the accuracy of the Decimal division operation.

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1
[jira] [Commented] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610865#comment-14610865 ]

Jihong MA commented on SPARK-8677:
----------------------------------

I am not sure if there is a guideline for DecimalType.Unlimited; can we go for
accuracy at least equivalent to Double?

> Decimal divide operation throws ArithmeticException
> ---------------------------------------------------
>
>                 Key: SPARK-8677
>                 URL: https://issues.apache.org/jira/browse/SPARK-8677
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>             Fix For: 1.5.0
>
> Please refer to the [BigDecimal
> doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]:
> {quote}
> ... the rounding mode setting of a MathContext object with a precision
> setting of 0 is not used and thus irrelevant. In the case of divide, the
> exact quotient could have an infinitely long decimal expansion; for example,
> 1 divided by 3.
> {quote}
> Because we provide a MathContext.UNLIMITED in toBigDecimal, the Decimal
> divide operation will throw the following exception:
> {code}
> val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3)
> [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no
> exact representable decimal result.
> [info] at java.math.BigDecimal.divide(BigDecimal.java:1690)
> [info] at java.math.BigDecimal.divide(BigDecimal.java:1723)
> [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256)
> [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272)
> {code}
[jira] [Commented] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610831#comment-14610831 ]

Jihong MA commented on SPARK-8677:
----------------------------------

Thanks for fixing the division problem, but this fix introduces one more issue
w.r.t. the accuracy of Decimal computation.

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1

When a Decimal is defined as Decimal.Unlimited, we do not expect the division
result's scale to be inherited from its parent; this causes a big accuracy
issue after a couple of rounds of division over decimal data vs. double data.
Below is a sample output from my run.

10:27:46.042 WARN org.apache.spark.sql.catalyst.expressions.CombinePartialStdFunction: COMBINE STDDEV DOUBLE---4.0 , 0.8VALUE
10:27:46.137 WARN org.apache.spark.sql.catalyst.expressions.CombinePartialStdFunction: COMBINE STDDEV DECIMAL---4.29000 , 0.858VALUE

> Decimal divide operation throws ArithmeticException
> ---------------------------------------------------
>
>                 Key: SPARK-8677
>                 URL: https://issues.apache.org/jira/browse/SPARK-8677
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>             Fix For: 1.5.0
>
> Please refer to the [BigDecimal
> doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]:
> {quote}
> ... the rounding mode setting of a MathContext object with a precision
> setting of 0 is not used and thus irrelevant. In the case of divide, the
> exact quotient could have an infinitely long decimal expansion; for example,
> 1 divided by 3.
> {quote}
> Because we provide a MathContext.UNLIMITED in toBigDecimal, the Decimal
> divide operation will throw the following exception:
> {code}
> val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3)
> [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no
> exact representable decimal result.
> [info] at java.math.BigDecimal.divide(BigDecimal.java:1690)
> [info] at java.math.BigDecimal.divide(BigDecimal.java:1723)
> [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256)
> [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272)
> {code}
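The quoted failure reproduces in plain java.math.BigDecimal, which is what Decimal#toBigDecimal wraps. A small sketch (the class name is mine) showing both the exception and the explicit scale/rounding variant that avoids it:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class NonTerminatingDemo {
    public static void main(String[] args) {
        // Mirrors Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3): scale-3 operands.
        BigDecimal a = new BigDecimal("1.000");
        BigDecimal b = new BigDecimal("3.000");

        try {
            a.divide(b); // no MathContext: exact quotient required, so this throws
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }

        // Requesting an explicit result scale and rounding mode succeeds.
        System.out.println(a.divide(b, 3, RoundingMode.HALF_UP)); // 0.333
    }
}
```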
[jira] [Commented] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604041#comment-14604041 ]

Jihong MA commented on SPARK-8359:
----------------------------------

This fix causes an issue with divide over the Decimal.Unlimited type when
precision and scale are not defined, as shown below:

java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result.
	at java.math.BigDecimal.divide(BigDecimal.java:1616)
	at java.math.BigDecimal.divide(BigDecimal.java:1650)
	at scala.math.BigDecimal.$div(BigDecimal.scala:256)
	at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:269)
	at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:333)
	at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:332)
	at org.apache.spark.sql.catalyst.expressions.Divide$$anonfun$div$1.apply(arithmetic.scala:193)
	at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:206)

> Spark SQL Decimal type precision loss on multiplication
> -------------------------------------------------------
>
>                 Key: SPARK-8359
>                 URL: https://issues.apache.org/jira/browse/SPARK-8359
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Rene Treffer
>            Assignee: Liang-Chi Hsieh
>             Fix For: 1.5.0
>
> It looks like the precision of decimal cannot be raised beyond ~2^112
> without causing full value truncation.
> The following code computes the powers of two up to a specific point:
> {code}
> import org.apache.spark.sql.types.Decimal
> val one = Decimal(1)
> val two = Decimal(2)
> def pow(n : Int) : Decimal = if (n <= 0) { one } else {
>   val a = pow(n - 1)
>   a.changePrecision(n, 0)
>   two.changePrecision(n, 0)
>   a * two
> }
> (109 to 120).foreach(n =>
>   println(pow(n).toJavaBigDecimal.unscaledValue.toString))
> 649037107316853453566312041152512
> 1298074214633706907132624082305024
> 2596148429267413814265248164610048
> 5192296858534827628530496329220096
> 1038459371706965525706099265844019
> 2076918743413931051412198531688038
> 4153837486827862102824397063376076
> 8307674973655724205648794126752152
> 1661534994731144841129758825350430
> 3323069989462289682259517650700860
> 6646139978924579364519035301401720
> 1329227995784915872903807060280344
> {code}
> Beyond ~2^112 the precision is truncated, even though the precision was set
> to n and should thus handle 10^n without problems.
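For comparison, the truncation in the quoted run is specific to the Decimal/changePrecision path; java.math.BigDecimal multiplication with no MathContext stays exact at any magnitude. A plain-Java sketch (the class name is mine):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ExactPowerOfTwo {
    public static void main(String[] args) {
        // 113 exact doublings; multiply without a MathContext never rounds.
        BigDecimal p = BigDecimal.ONE;
        for (int i = 0; i < 113; i++) {
            p = p.multiply(BigDecimal.valueOf(2));
        }
        // All 35 digits survive, unlike the truncated Spark output
        // 1038459371706965525706099265844019 quoted in the report.
        System.out.println(p.toBigIntegerExact()); // 10384593717069655257060992658440192
        System.out.println(p.toBigIntegerExact().equals(BigInteger.valueOf(2).pow(113))); // true
    }
}
```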
[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562298#comment-14562298 ]

Jihong MA commented on SPARK-6548:
----------------------------------

Hi sdfox, I thought you were no longer working on this, so I submitted a pull
request a week ago that uses a one-pass online algorithm to calculate the
standard deviation. You are welcome to take a look. Sorry, I hope we are not
duplicating the effort.

> Adding stddev to DataFrame functions
> ------------------------------------
>
>                 Key: SPARK-6548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: DataFrame, starter
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using
> existing functions, as done here:
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
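The comment mentions a one-pass online algorithm for standard deviation; the standard choice for this is Welford's method. A plain-Java sketch of the technique (an illustration only, not necessarily the pull request's actual code; class and method names are mine):

```java
// Welford's one-pass online algorithm: numerically stable running
// mean and sum of squared deviations, no second pass over the data.
public class OnlineStddev {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // note: uses the *updated* mean
    }

    public double sampleStddev() {
        // Bessel's correction (n - 1) for the sample standard deviation.
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : Double.NaN;
    }

    public static void main(String[] args) {
        OnlineStddev s = new OnlineStddev();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            s.add(x);
        }
        System.out.println(s.sampleStddev()); // sqrt(32/7) ≈ 2.138
    }
}
```

The one-pass form matters here because a naive `sum(x^2) - sum(x)^2 / n` computation suffers catastrophic cancellation, which also interacts badly with the Decimal scale issues discussed in the other threads above.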
[jira] [Created] (SPARK-7357) Improving HBaseTest example
Jihong MA created SPARK-7357:
--------------------------------

             Summary: Improving HBaseTest example
                 Key: SPARK-7357
                 URL: https://issues.apache.org/jira/browse/SPARK-7357
             Project: Spark
          Issue Type: Improvement
          Components: Examples
    Affects Versions: 1.3.1
            Reporter: Jihong MA
            Priority: Minor
             Fix For: 1.4.0

A minor improvement to the HBaseTest example: when HBase-related
configurations (e.g. the ZooKeeper quorum, the ZooKeeper client port, or
zookeeper.znode.parent) are not set to the default (localhost:2181), the
connection to ZooKeeper might hang, as shown in the following stack:

15/03/26 18:31:20 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=xxx.xxx.xxx:2181 sessionTimeout=9 watcher=hconnection-0x322a4437, quorum=xxx.xxx.xxx:2181, baseZNode=/hbase
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Opening socket connection to server 9.30.94.121:2181. Will not attempt to authenticate using SASL (unknown error)
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Socket connection established to xxx.xxx.xxx/9.30.94.121:2181, initiating session
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Session establishment complete on server xxx.xxx.xxx/9.30.94.121:2181, sessionid = 0x14c53cd311e004b, negotiated timeout = 4
15/03/26 18:31:21 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null

This happens because hbase-site.xml is not placed on the Spark classpath.
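One way to avoid the hang without code changes is to put an hbase-site.xml with the cluster's actual settings on the Spark classpath. A sketch of the relevant entries (host names are placeholders; the property keys are the standard HBase client configuration names):

```xml
<configuration>
  <!-- Placeholder hosts: replace with your cluster's ZooKeeper ensemble. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
</configuration>
```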
[jira] [Commented] (SPARK-7265) Improving documentation for Spark SQL Hive support
[ https://issues.apache.org/jira/browse/SPARK-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522720#comment-14522720 ]

Jihong MA commented on SPARK-7265:
----------------------------------

This is a placeholder for changes I am planning to contribute; I will make a
PR very soon.

> Improving documentation for Spark SQL Hive support
> --------------------------------------------------
>
>                 Key: SPARK-7265
>                 URL: https://issues.apache.org/jira/browse/SPARK-7265
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.3.1
>            Reporter: Jihong MA
>            Priority: Minor
>
> Miscellaneous documentation improvements for Spark SQL Hive support and YARN
> cluster deployment.
[jira] [Updated] (SPARK-7265) Improving documentation for Spark SQL Hive support
[ https://issues.apache.org/jira/browse/SPARK-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jihong MA updated SPARK-7265:
-----------------------------
    Priority: Minor  (was: Trivial)

> Improving documentation for Spark SQL Hive support
> --------------------------------------------------
>
>                 Key: SPARK-7265
>                 URL: https://issues.apache.org/jira/browse/SPARK-7265
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.3.1
>            Reporter: Jihong MA
>            Priority: Minor
>             Fix For: 1.4.0
>
> Miscellaneous documentation improvements for Spark SQL Hive support and YARN
> cluster deployment.
[jira] [Created] (SPARK-7265) Improving documentation for Spark SQL Hive support
Jihong MA created SPARK-7265:
--------------------------------

             Summary: Improving documentation for Spark SQL Hive support
                 Key: SPARK-7265
                 URL: https://issues.apache.org/jira/browse/SPARK-7265
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 1.3.1
            Reporter: Jihong MA
            Priority: Trivial
             Fix For: 1.4.0

Miscellaneous documentation improvements for Spark SQL Hive support and YARN
cluster deployment.