[jira] [Created] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-08 Thread Jihong MA (JIRA)
Jihong MA created SPARK-18799:
-

 Summary: Spark SQL expose interface for plug-gable parser 
extension 
 Key: SPARK-18799
 URL: https://issues.apache.org/jira/browse/SPARK-18799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jihong MA


All Spark 1.x versions exposed an interface for plugging in a parser extension through 
ParserDialect in HiveContext. Starting with the Spark 2.x releases, Apache Spark moved 
to a new parser (ANTLR 4) and there is no longer a way to extend the default SQL parser 
through the SparkSession interface. This is a real pain point and is hard to work around 
when integrating other data sources with Spark with extended syntax support, such as 
Insert, Update, Delete, or any other data management statement.

It would be very nice to continue to expose an interface for parser extension, to make 
data source integration easier and smoother.
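For context, a minimal sketch of the kind of hook being requested: a delegating parser that intercepts the extra data management statements and falls back to Spark's default parser for everything else. It is written against the ParserInterface trait from org.apache.spark.sql.catalyst.parser (an internal API; method set shown as of Spark 2.0), and the data-management handler is a hypothetical placeholder, not an existing Spark API.

{code}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch only: handle statements the default parser rejects, delegate the rest.
class ExtendedSparkParser(delegate: ParserInterface) extends ParserInterface {

  override def parsePlan(sqlText: String): LogicalPlan = {
    val upper = sqlText.trim.toUpperCase
    if (upper.startsWith("UPDATE") || upper.startsWith("DELETE")) {
      parseDataManagementStatement(sqlText)
    } else {
      delegate.parsePlan(sqlText)  // fall back to the default ANTLR 4 parser
    }
  }

  // Hypothetical: a data source would build its own logical plan here.
  private def parseDataManagementStatement(sqlText: String): LogicalPlan = ???

  override def parseExpression(sqlText: String): Expression =
    delegate.parseExpression(sqlText)

  override def parseTableIdentifier(sqlText: String): TableIdentifier =
    delegate.parseTableIdentifier(sqlText)
}
{code}

The missing piece this issue asks for is a supported way to register such a parser on a SparkSession, comparable to what ParserDialect allowed in 1.x.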






[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-09 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736003#comment-15736003
 ] 

Jihong MA commented on SPARK-18799:
---

[~hyukjin.kwon] The intent behind removing it at that time was different: Spark didn't 
yet have its own parser and preferred that the community contribute parser changes 
directly. For Spark 2.x, it is essential to provide an extension interface for data 
source integration, especially for syntax/statements Spark has no intention of 
supporting, even in the future.







[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-09 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736100#comment-15736100
 ] 

Jihong MA commented on SPARK-18799:
---

DML statement support, for instance.







[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-09 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736118#comment-15736118
 ] 

Jihong MA commented on SPARK-18799:
---

Are we looking at the first quarter of 2017 for Spark 2.2? Is it now too late to 
squeeze this into the Spark 2.1 release? Thanks!







[jira] [Updated] (SPARK-10645) Bivariate Statistics: Spearman's Correlation support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10645:
--
Description: Spearman's rank correlation coefficient : 
https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient  (was: 
this is an umbrella jira, which covers Bivariate Statistics for continuous vs. 
continuous columns, including covariance, Pearson's correlation, Spearman's 
correlation (for both continuous & categorical).)
Summary: Bivariate Statistics: Spearman's Correlation support as UDAF  
(was: Bivariate Statistics for continuous vs. continuous)

> Bivariate Statistics: Spearman's Correlation support as UDAF
> 
>
> Key: SPARK-10645
> URL: https://issues.apache.org/jira/browse/SPARK-10645
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Spearman's rank correlation coefficient : 
> https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
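For reference, the statistic being added: Spearman's coefficient is Pearson's correlation applied to the per-column ranks; with distinct ranks it reduces to

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.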






[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10646:
--
Description: Pearson's chi-squared goodness of fit test for observed 
against the expected distribution.  (was: Pearson's chi-squared goodness of fit 
test for observed against the expected distribution & independence test. )
Summary: Bivariate Statistics: Pearson's Chi-Squared goodness of fit 
test  (was: Bivariate Statistics: Pearson's Chi-Squared Test for categorical 
vs. categorical)

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.
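For reference: with observed counts $O_i$ and expected counts $E_i$ over $k$ categories, the statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

compared against a chi-squared distribution with $k - 1$ degrees of freedom (fewer if distribution parameters are estimated from the data).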






[jira] [Created] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2015-09-28 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10860:
-

 Summary: Bivariate Statistics: Chi-Squared independence test
 Key: SPARK-10860
 URL: https://issues.apache.org/jira/browse/SPARK-10860
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Reporter: Jihong MA









[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10860:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10385

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>







[jira] [Updated] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10860:
--
Description: Pearson's chi-squared independence test

> Bivariate Statistics: Chi-Squared independence test
> ---
>
> Key: SPARK-10860
> URL: https://issues.apache.org/jira/browse/SPARK-10860
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared independence test
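For reference: for an $r \times c$ contingency table with observed counts $O_{ij}$ and grand total $N$, the expected counts under independence are $E_{ij} = (\text{row}_i\ \text{total}) \cdot (\text{column}_j\ \text{total}) / N$, and

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

is compared against a chi-squared distribution with $(r-1)(c-1)$ degrees of freedom.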






[jira] [Commented] (SPARK-10860) Bivariate Statistics: Chi-Squared independence test

2015-09-28 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934098#comment-14934098
 ] 

Jihong MA commented on SPARK-10860:
---

[~josephkb] please assign this JIRA to me.  Thanks!







[jira] [Created] (SPARK-10861) Univariate Statistics: Adding range support for continuous

2015-09-28 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10861:
-

 Summary: Univariate Statistics: Adding range support for 
continuous 
 Key: SPARK-10861
 URL: https://issues.apache.org/jira/browse/SPARK-10861
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Reporter: Jihong MA









[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support for continuous

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10861:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10384

> Univariate Statistics: Adding range support for continuous 
> ---
>
> Key: SPARK-10861
> URL: https://issues.apache.org/jira/browse/SPARK-10861
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>







[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support for continuous

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10861:
--
Description: Range support as UDAF 

> Univariate Statistics: Adding range support for continuous 
> ---
>
> Key: SPARK-10861
> URL: https://issues.apache.org/jira/browse/SPARK-10861
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Range support as UDAF 
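As an illustration of what range-as-a-UDAF can look like against the public API (a sketch using UserDefinedAggregateFunction from org.apache.spark.sql.expressions, available since Spark 1.5; the built-in version this issue targets would instead live alongside the other Catalyst aggregates):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Range = max - min over a continuous (double) column.
class RangeUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Double.PositiveInfinity   // running min
    buffer(1) = Double.NegativeInfinity   // running max
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = math.min(buffer.getDouble(0), input.getDouble(0))
      buffer(1) = math.max(buffer.getDouble(1), input.getDouble(0))
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = math.min(buffer1.getDouble(0), buffer2.getDouble(0))
    buffer1(1) = math.max(buffer1.getDouble(1), buffer2.getDouble(1))
  }
  def evaluate(buffer: Row): Any = buffer.getDouble(1) - buffer.getDouble(0)
}
{code}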






[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10861:
--
Summary: Univariate Statistics: Adding range support as UDAF  (was: 
Univariate Statistics: Adding range support for continuous )







[jira] [Updated] (SPARK-10861) Univariate Statistics: Adding range support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10861:
--
Description: Range support for continuous  (was: Range support as UDAF )







[jira] [Commented] (SPARK-10861) Univariate Statistics: Adding range support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934109#comment-14934109
 ] 

Jihong MA commented on SPARK-10861:
---

I will send a PR soon. 







[jira] [Created] (SPARK-10862) Univariate Statistics: Adding median support as UDAF

2015-09-28 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10862:
-

 Summary: Univariate Statistics: Adding median support as UDAF
 Key: SPARK-10862
 URL: https://issues.apache.org/jira/browse/SPARK-10862
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Reporter: Jihong MA









[jira] [Updated] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10862:
--
Summary: Univariate Statistics: Adding median & quantile support as UDAF  
(was: Univariate Statistics: Adding median support as UDAF)

> Univariate Statistics: Adding median & quantile support as UDAF
> ---
>
> Key: SPARK-10862
> URL: https://issues.apache.org/jira/browse/SPARK-10862
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>
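For context, a hedged stopgap: with a HiveContext in Spark 1.x, the Hive UDAF percentile_approx already exposes approximate median/quantiles through SQL; this issue would add a native equivalent.

{code}
// Illustration only; assumes sqlContext is a HiveContext so Hive UDAFs resolve.
val df = sqlContext.range(1000000L).selectExpr("cast(id as double) as v")
df.selectExpr(
  "percentile_approx(v, 0.5) as median",
  "percentile_approx(v, array(0.25, 0.5, 0.75)) as quartiles"
).show()
{code}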







[jira] [Updated] (SPARK-10862) Univariate Statistics: Adding median & quantile support as UDAF

2015-09-28 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10862:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10384








[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-06 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945532#comment-14945532
 ] 

Jihong MA commented on SPARK-10953:
---

[~mengxr] Do you mean comparing an implementation that operates directly at the RDD 
level vs. one that leverages the UDAF framework, like what has been done under 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/?
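For concreteness, the two paths under comparison might look like this (scale and numbers arbitrary):

{code}
// RDD-level, hand-written aggregation (DoubleRDDFunctions.stats):
// count, mean, stdev, max, min in a single pass.
val rdd = sc.parallelize(1L to 10000000L).map(_.toDouble)
val summary = rdd.stats()

// DataFrame path that goes through the aggregate-expression / UDAF framework.
val df = sqlContext.range(10000000L)
df.describe("id").show()
{code}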

> Benchmark codegen vs. hand-written code for univariate statistics
> -
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and 
> complex. I'm not sure whether we can get benefit from the code generation for 
> univariate statistics. We should benchmark it against Scala implementation.
> {code}
> 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if 
> (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if 
> (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, 
> DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, 
> DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else 
> input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] 
> else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, 
> DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if 
> (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, 
> DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, 
> DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, 
> DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))):
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
>   return new SpecificMutableProjection(expr);
> }
> class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow;
>   public 
> SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expr) {
> expressions = expr;
> mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5);
>   }
>   public 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
> target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
> mutableRow = row;
> return this;
>   }
>   /* Provide immutable access to the last projected row. */
>   public InternalRow currentValue() {
> return (InternalRow) mutableRow;
>   }
>   public Object apply(Object _i) {
> InternalRow i = (InternalRow) _i;
> /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, 
> DoubleType] */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull4 = i.isNullAt(1);
> double primitive5 = isNull4 ? -1.0 : (i.getDouble(1));
> boolean isNull0 = false;
> double primitive1 = -1.0;
> if (!false && isNull4) {
>   /* cast(0 as double) */
>   /* 0 */
>   boolean isNull6 = false;
>   double primitive7 = -1.0;
>   if (!false) {
> primitive7 = (double) 0;
>   }
>   isNull0 = isNull6;
>   primitive1 = primitive7;
> } else {
>   /* input[1, DoubleType] */
>   boolean isNull10 = i.isNullAt(1);
>   double primitive11 = isNull10 ? -1.0 : (i.getDouble(1));
>   isNull0 = isNull10;
>   primitive1 = primitive11;
> }
> if (isNull0) {
>   mutableRow.setNullAt(0);
> } else {
>   mutableRow.setDouble(0, primitive1);
> }
> /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if 
> (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, 
> DoubleType] + input[6, DoubleType]) */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull16 = i.isNullAt(1);
> double primitive17 = isNull16 ? -1.0 : (i.getDouble(1));
> boolean isNull12 = false;
> double primitive13 = -1.0;
> if (!false && isNull16) {
>   /* input[6, DoubleType] */
>   boolean isNull18 = i.isNullAt(6);
>   double primitive19 = isNull18 ? -1.0 : (i.getDouble(6));
>   isNull12 = isNull18;
>   primitive13 = primitive19;
> } else {
>

[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-06 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946061#comment-14946061
 ] 

Jihong MA commented on SPARK-10953:
---


It should not be too hard to put together an implementation based on the 
AggregateFunction2 interface (e.g., HyperLogLogPlusPlus). Would it make more sense to 
compare an AggregateFunction2 implementation against an AlgebraicAggregate one, if we 
are going to adopt AggregateFunction2 as the better alternative based on the result?

Or we could do as you suggested and compare rdd.stats() against df.describe(), where 
describe uses the AlgebraicAggregate UDAF framework internally. We need a bigger 
cluster (bare-metal or in the cloud), and I'm not sure when we can have one. Also, are 
there performance tools to generate data for testing purposes?
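On data generation: absent a dedicated tool, one option is to synthesize columns directly in parallel with range() plus rand/randn (both in org.apache.spark.sql.functions since 1.4), so no test data has to be loaded from storage:

{code}
import org.apache.spark.sql.functions._

// 100M rows: a sequential id plus uniform and normal double columns.
val df = sqlContext.range(0L, 100000000L)
  .withColumn("u", rand(42))
  .withColumn("n", randn(42))
df.describe("u", "n").show()
{code}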




[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-08 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949797#comment-14949797
 ] 

Jihong MA commented on SPARK-10953:
---

We should have a cluster for testing next Monday. We will run the performance 
comparison between ImperativeAggregate and DeclarativeAggregate (initially named 
AggregateFunction2 and AlgebraicAggregate, respectively).
[~yhuai] My understanding of the difference between the two: ImperativeAggregate 
directly manipulates the aggregation buffer, whereas DeclarativeAggregate expresses 
the aggregate as Catalyst expressions; at runtime, DeclarativeAggregate uses 
TungstenAggregate whereas ImperativeAggregate uses SortBasedAggregate. Please clarify 
whether that is correct. Thanks!
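For reference, a rough sketch of the shape of the two interfaces (internal Catalyst APIs circa Spark 1.5; member names abbreviated from the actual traits, bodies elided):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, MutableRow}

// Declarative: the aggregate is described as Catalyst expressions, which
// codegen can compile (the TungstenAggregate path).
abstract class DeclarativeAggregateSketch {
  val initialValues: Seq[Expression]      // buffer initialization
  val updateExpressions: Seq[Expression]  // per-input-row update
  val mergeExpressions: Seq[Expression]   // merge of two partial buffers
  val evaluateExpression: Expression      // final result
}

// Imperative: the aggregate manipulates the aggregation buffer directly in
// JVM code (at this point executed through SortBasedAggregate).
abstract class ImperativeAggregateSketch {
  def initialize(buffer: MutableRow): Unit
  def update(buffer: MutableRow, input: InternalRow): Unit
  def merge(buffer: MutableRow, inputBuffer: InternalRow): Unit
  def eval(buffer: InternalRow): Any
}
{code}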


[jira] [Updated] (SPARK-10641) skewness and kurtosis support

2015-10-09 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10641:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10384

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics






[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954281#comment-14954281
 ] 

Jihong MA commented on SPARK-10953:
---

[~mengxr] As Yin indicated in the comment, we would like to merge the pull request 
enabling TungstenAggregate support for ImperativeAggregate (PR 9038), to make the 
comparison between ImperativeAggregate and DeclarativeAggregate fair and to eliminate 
the perf impact of the runtime difference (ImperativeAggregate used to go through 
SortBasedAggregate). I noticed there were issues with the pull request, and Josh 
merged a couple more commits later this afternoon. [~yhuai] Would you say it is OK now 
to merge that code for perf testing? Thanks!


[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447
 ] 

Jihong MA edited comment on SPARK-10953 at 10/13/15 5:59 AM:
-

Had a quick run on my laptop with a stddev implementation based on ImperativeAggregate 
vs. DeclarativeAggregate, where ImperativeAggregate still uses SortBasedAggregate at 
runtime and DeclarativeAggregate uses TungstenAggregate.

With a single double column of a cached DataFrame:

#rows   ImperativeAggregate   DeclarativeAggregate
100     58ms                  0.1s
1000    0.4s                  0.6s
1       4s                    7s

Overall, ImperativeAggregate seems to perform better. Once enabling TungstenAggregate 
support for ImperativeAggregate (PR 9038) is in good shape, I will merge it in and 
have another try.
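The measurement can be approximated with a simple harness like the one below (a sketch: single run, no warm-up or averaging; assumes a stddev aggregate is callable via org.apache.spark.sql.functions, which is what this umbrella is adding):

{code}
import org.apache.spark.sql.functions._

val df = sqlContext.range(10000000L)
  .selectExpr("cast(id as double) as v")
  .cache()
df.count()  // materialize the cache so the scan is excluded from timing

val t0 = System.nanoTime()
df.select(stddev_pop(col("v"))).collect()
println(s"stddev_pop took ${(System.nanoTime() - t0) / 1e6} ms")
{code}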



[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447
 ] 

Jihong MA commented on SPARK-10953:
---

Had a quick run on my laptop with a stddev implementation based on ImperativeAggregate 
vs. DeclarativeAggregate, where ImperativeAggregate still uses SortBasedAggregate at 
runtime and DeclarativeAggregate uses TungstenAggregate.

With a single double column of a cached DataFrame:

#rows   ImperativeAggregate   DeclarativeAggregate
100     58ms                  0.1s
1000    0.4s                  0.6s
1       4s                    7s

Overall, ImperativeAggregate seems to perform better. Once enabling TungstenAggregate 
support for ImperativeAggregate (PR 9038) is in good shape, I will merge it in and 
have another try.


[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447
 ] 

Jihong MA edited comment on SPARK-10953 at 10/13/15 6:05 AM:
-

[~mengxr] Had a quick run on my laptop with a stddev implementation based on 
ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still uses 
SortBasedAggregate at runtime and DeclarativeAggregate uses TungstenAggregate.

With a single double column of a cached DataFrame:

#rows   ImperativeAggregate   DeclarativeAggregate
100     58ms                  0.1s
1000    0.4s                  0.6s
1       4s                    7s

Overall, ImperativeAggregate seems to perform better. Once enabling TungstenAggregate 
support for ImperativeAggregate (PR 9038) is in good shape, I will merge it in and 
have another try.



[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954447#comment-14954447
 ] 

Jihong MA edited comment on SPARK-10953 at 10/13/15 6:05 AM:
-

[~mengxr] I had a quick run on my laptop with a stddev implementation based on 
ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still uses 
SortBasedAggregate at runtime and DeclarativeAggregate uses TungstenAggregate.

With a single double column of a cached DataFrame:

#rows   ImperativeAggregate   DeclarativeAggregate
100     58ms                  0.1s
1000    0.4s                  0.6s
1       4s                    7s

Overall, ImperativeAggregate seems to perform better. Once enabling TungstenAggregate 
support for ImperativeAggregate (PR 9038) is in good shape, I will merge it in and 
have another try.



[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-14 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958386#comment-14958386
 ] 

Jihong MA commented on SPARK-10953:
---

[~mengxr] merged PR 9038. Below are the average elapsed times collected, which 
are in line with what we observed earlier. Seth has started preparing an 
ImperativeAggregate implementation for SPARK-10641 (skewness, kurtosis)

#rows   ImperativeAggregate   DeclarativeAggregate
100     90ms                  0.2s
1000    0.4s                  0.8s
1       4s                    7s
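
For reference, the numbers were collected with a harness of roughly this shape 
(a minimal sketch assuming the stddev_pop call from the issue description; the 
actual benchmark script is not attached to this JIRA):

{code}
import org.apache.spark.sql.functions._

// Wall-clock timing of a single aggregate over n generated rows.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2fs")
  result
}

for (n <- Seq(100L, 1000L)) {          // row counts from the table above
  val df = sqlContext.range(n)         // assumes a live sqlContext
  time(s"stddev_pop over $n rows") {
    df.select(stddev_pop(col("id"))).collect()
  }
}
{code}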

> Benchmark codegen vs. hand-written code for univariate statistics
> -
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Jihong MA
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and 
> complex. I'm not sure whether we can get any benefit from code generation for 
> univariate statistics. We should benchmark it against a Scala implementation.
> {code}
> 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if 
> (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if 
> (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, 
> DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, 
> DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else 
> input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] 
> else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, 
> DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if 
> (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, 
> DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, 
> DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, 
> DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))):
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
>   return new SpecificMutableProjection(expr);
> }
> class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow;
>   public 
> SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expr) {
> expressions = expr;
> mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5);
>   }
>   public 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
> target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
> mutableRow = row;
> return this;
>   }
>   /* Provide immutable access to the last projected row. */
>   public InternalRow currentValue() {
> return (InternalRow) mutableRow;
>   }
>   public Object apply(Object _i) {
> InternalRow i = (InternalRow) _i;
> /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, 
> DoubleType] */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull4 = i.isNullAt(1);
> double primitive5 = isNull4 ? -1.0 : (i.getDouble(1));
> boolean isNull0 = false;
> double primitive1 = -1.0;
> if (!false && isNull4) {
>   /* cast(0 as double) */
>   /* 0 */
>   boolean isNull6 = false;
>   double primitive7 = -1.0;
>   if (!false) {
> primitive7 = (double) 0;
>   }
>   isNull0 = isNull6;
>   primitive1 = primitive7;
> } else {
>   /* input[1, DoubleType] */
>   boolean isNull10 = i.isNullAt(1);
>   double primitive11 = isNull10 ? -1.0 : (i.getDouble(1));
>   isNull0 = isNull10;
>   primitive1 = primitive11;
> }
> if (isNull0) {
>   mutableRow.setNullAt(0);
> } else {
>   mutableRow.setDouble(0, primitive1);
> }
> /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if 
> (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, 
> DoubleType] + input[6, DoubleType]) */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull16 = i.isNullAt(1);
> double primitive17 = isNull16 ? -1.0 : (i.getDouble(1));
> boolean isNull12 = fal

[jira] [Commented] (SPARK-9297) covar_pop and covar_samp aggregate functions

2015-10-19 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964370#comment-14964370
 ] 

Jihong MA commented on SPARK-9297:
--

I will work on this and prepare a PR using the ImperativeAggregate interface. 

> covar_pop and covar_samp aggregate functions
> 
>
> Key: SPARK-9297
> URL: https://issues.apache.org/jira/browse/SPARK-9297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test

2015-10-20 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965298#comment-14965298
 ] 

Jihong MA commented on SPARK-10646:
---

[~mengxr] to add chi-squared test support through the UDAF framework, I will 
need to keep around a HashMap tracking the count of each category encountered. 
I prototyped it using the ImperativeAggregate interface and realized the 
current UDAF infrastructure doesn't support variable-length aggregation 
attribute buffers, and it causes GC pressure even with that support in place. I 
discussed this with [~yhuai] offline some time back. It looks like adding it 
through a UDAF is not feasible at this point; please kindly let me know how we 
should proceed. Thanks!
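
For context, the shape of the state such an aggregate needs (a plain-Scala 
sketch of the prototype idea, deliberately not tied to the ImperativeAggregate 
API):

{code}
import scala.collection.mutable

// Per-partition buffer: an observed count per category. Its size is
// data-dependent, which is exactly what the fixed-width aggregation buffer
// in the current UDAF infrastructure cannot express.
def update(buf: mutable.Map[String, Long], category: String): Unit =
  buf(category) = buf.getOrElse(category, 0L) + 1L

def merge(a: mutable.Map[String, Long], b: mutable.Map[String, Long]): Unit =
  b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0L) + v }

// Final chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E.
def chiSquared(observed: Map[String, Long], expected: Map[String, Double]): Double =
  observed.map { case (k, o) =>
    val e = expected(k)
    (o - e) * (o - e) / e
  }.sum
{code}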

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9297) covar_pop and covar_samp aggregate functions

2015-10-20 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965662#comment-14965662
 ] 

Jihong MA commented on SPARK-9297:
--

[~viirya] I just noticed your initial PR for Pearson's correlation is based on 
ImperativeAggregate and calculates covariance. Not sure if you would like to 
take this JIRA? If not, I will work on it, as we would like to have this one 
merged in for 1.6 if possible. 

> covar_pop and covar_samp aggregate functions
> 
>
> Key: SPARK-9297
> URL: https://issues.apache.org/jira/browse/SPARK-9297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11420) Changing Stddev support with Imperative Aggregate

2015-10-29 Thread Jihong MA (JIRA)
Jihong MA created SPARK-11420:
-

 Summary: Changing Stddev support with Imperative Aggregate
 Key: SPARK-11420
 URL: https://issues.apache.org/jira/browse/SPARK-11420
 Project: Spark
  Issue Type: Improvement
  Components: ML, SQL
Reporter: Jihong MA


based on the performance comparison of Declarative vs. Imperative Aggregate 
(SPARK-10953), switching to Imperative Aggregate for stddev. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11420) Changing Stddev support with Imperative Aggregate

2015-10-29 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11420:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-10384

> Changing Stddev support with Imperative Aggregate
> -
>
> Key: SPARK-11420
> URL: https://issues.apache.org/jira/browse/SPARK-11420
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> based on the performance comparison of Declarative vs. Imperative Aggregate 
> (SPARK-10953), switching to Imperative Aggregate for stddev. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11420) Updating Stddev support with Imperative Aggregate

2015-10-30 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11420:
--
Summary: Updating Stddev support with Imperative Aggregate  (was: Changing 
Stddev support with Imperative Aggregate)

> Updating Stddev support with Imperative Aggregate
> -
>
> Key: SPARK-11420
> URL: https://issues.apache.org/jira/browse/SPARK-11420
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> based on the performance comparison of Declarative vs. Imperative Aggregate 
> (SPARK-10953), switching to Imperative Aggregate for stddev. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication

2015-06-27 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604041#comment-14604041
 ] 

Jihong MA commented on SPARK-8359:
--

This fix is causing an issue with divide over the Decimal.Unlimited type when 
precision and scale are not defined, as shown below:

java.lang.ArithmeticException: Non-terminating decimal expansion; no exact 
representable decimal result.
at java.math.BigDecimal.divide(BigDecimal.java:1616)
at java.math.BigDecimal.divide(BigDecimal.java:1650)
at scala.math.BigDecimal.$div(BigDecimal.scala:256)
at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:269)
at 
org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:333)
at 
org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:332)
at 
org.apache.spark.sql.catalyst.expressions.Divide$$anonfun$div$1.apply(arithmetic.scala:193)
at 
org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:206)
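
For context, the root cause is reproducible with plain java.math.BigDecimal, no 
Spark involved (a minimal sketch): an exact divide under MathContext.UNLIMITED 
throws on any non-terminating quotient, while bounding the scale with a 
rounding mode succeeds.

{code}
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

val two = new JBigDecimal(2)
val three = new JBigDecimal(3)

// Throws ArithmeticException: 2/3 has no terminating decimal expansion.
try {
  two.divide(three, MathContext.UNLIMITED)
} catch {
  case e: ArithmeticException => println(s"exact divide failed: ${e.getMessage}")
}

// Bounding the scale and supplying a rounding mode is well-defined.
println(two.divide(three, 38, RoundingMode.HALF_UP))
{code}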


> Spark SQL Decimal type precision loss on multiplication
> ---
>
> Key: SPARK-8359
> URL: https://issues.apache.org/jira/browse/SPARK-8359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rene Treffer
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> It looks like the precision of decimal cannot be raised beyond ~2^112 
> without causing full value truncation.
> The following code computes the power of two up to a specific point
> {code}
> import org.apache.spark.sql.types.Decimal
> val one = Decimal(1)
> val two = Decimal(2)
> def pow(n : Int) :  Decimal = if (n <= 0) { one } else { 
>   val a = pow(n - 1)
>   a.changePrecision(n,0)
>   two.changePrecision(n,0)
>   a * two
> }
> (109 to 120).foreach(n => 
> println(pow(n).toJavaBigDecimal.unscaledValue.toString))
> 649037107316853453566312041152512
> 1298074214633706907132624082305024
> 2596148429267413814265248164610048
> 5192296858534827628530496329220096
> 1038459371706965525706099265844019
> 2076918743413931051412198531688038
> 4153837486827862102824397063376076
> 8307674973655724205648794126752152
> 1661534994731144841129758825350430
> 3323069989462289682259517650700860
> 6646139978924579364519035301401720
> 1329227995784915872903807060280344
> {code}
> Beyond ~2^112 the precision is truncated even if the precision was set to n, 
> and should thus handle 10^n without problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8677) Decimal divide operation throws ArithmeticException

2015-07-01 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610831#comment-14610831
 ] 

Jihong MA commented on SPARK-8677:
--

Thanks for fixing the division problem, but this fix introduces one more issue 
w.r.t. the accuracy of Decimal computation. 

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1

When a Decimal is defined as Decimal.Unlimited, we do not expect the division 
result's scale to be inherited from its parent; this is causing a big accuracy 
issue once we go a couple of rounds of division over decimal data vs. double 
data. Below is a sample output from my run:

10:27:46.042 WARN 
org.apache.spark.sql.catalyst.expressions.CombinePartialStdFunction: COMBINE 
STDDEV DOUBLE---4.0 , 0.8VALUE

10:27:46.137 WARN 
org.apache.spark.sql.catalyst.expressions.CombinePartialStdFunction: COMBINE 
STDDEV DECIMAL---4.29000 , 0.858VALUE
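
For context, the rounding behind Decimal(2) / Decimal(3) = 1 can be reproduced 
with plain BigDecimal (a sketch of the mechanism, assuming the quotient keeps 
the operands' scale of 0):

{code}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Decimal(2) and Decimal(3) carry scale 0; if the quotient inherits scale 0,
// 2/3 = 0.666... is rounded to a whole number.
println(new JBigDecimal(2).divide(new JBigDecimal(3), 0, RoundingMode.HALF_UP))
// 1

// With a wider result scale the accuracy is preserved.
println(new JBigDecimal(2).divide(new JBigDecimal(3), 16, RoundingMode.HALF_UP))
// 0.6666666666666667
{code}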


> Decimal divide operation throws ArithmeticException
> ---
>
> Key: SPARK-8677
> URL: https://issues.apache.org/jira/browse/SPARK-8677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> Please refer to [BigDecimal 
> doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]:
> {quote}
> ... the rounding mode setting of a MathContext object with a precision 
> setting of 0 is not used and thus irrelevant. In the case of divide, the 
> exact quotient could have an infinitely long decimal expansion; for example, 
> 1 divided by 3.
> {quote}
> Because we provide a MathContext.UNLIMITED in toBigDecimal, Decimal divide 
> operation will throw the following exception:
> {code}
> val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3)
> [info]   java.lang.ArithmeticException: Non-terminating decimal expansion; no 
> exact representable decimal result.
> [info]   at java.math.BigDecimal.divide(BigDecimal.java:1690)
> [info]   at java.math.BigDecimal.divide(BigDecimal.java:1723)
> [info]   at scala.math.BigDecimal.$div(BigDecimal.scala:256)
> [info]   at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8677) Decimal divide operation throws ArithmeticException

2015-07-01 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610865#comment-14610865
 ] 

Jihong MA commented on SPARK-8677:
--

I am not sure if there is a guideline for DecimalType.Unlimited; can we go for 
an accuracy at least equivalent to Double? 

> Decimal divide operation throws ArithmeticException
> ---
>
> Key: SPARK-8677
> URL: https://issues.apache.org/jira/browse/SPARK-8677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> Please refer to [BigDecimal 
> doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]:
> {quote}
> ... the rounding mode setting of a MathContext object with a precision 
> setting of 0 is not used and thus irrelevant. In the case of divide, the 
> exact quotient could have an infinitely long decimal expansion; for example, 
> 1 divided by 3.
> {quote}
> Because we provide a MathContext.UNLIMITED in toBigDecimal, Decimal divide 
> operation will throw the following exception:
> {code}
> val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3)
> [info]   java.lang.ArithmeticException: Non-terminating decimal expansion; no 
> exact representable decimal result.
> [info]   at java.math.BigDecimal.divide(BigDecimal.java:1690)
> [info]   at java.math.BigDecimal.divide(BigDecimal.java:1723)
> [info]   at scala.math.BigDecimal.$div(BigDecimal.scala:256)
> [info]   at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited

2015-07-02 Thread Jihong MA (JIRA)
Jihong MA created SPARK-8800:


 Summary: Spark SQL Decimal Division operation loss of 
precision/scale when type is defined as DecimalType.Unlimited
 Key: SPARK-8800
 URL: https://issues.apache.org/jira/browse/SPARK-8800
 Project: Spark
  Issue Type: Bug
Reporter: Jihong MA


According to the specification defined in the Java doc for BigDecimal:

http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

When a MathContext object is supplied with a precision setting of 0 (for 
example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
arithmetic methods which take no MathContext object. (This is the only behavior 
that was supported in releases prior to 5.) As a corollary of computing the 
exact result, the rounding mode setting of a MathContext object with a 
precision setting of 0 is not used and thus irrelevant. In the case of divide, 
the exact quotient could have an infinitely long decimal expansion; for 
example, 1 divided by 3. If the quotient has a nonterminating decimal expansion 
and the operation is specified to return an exact result, an 
ArithmeticException is thrown. Otherwise, the exact result of the division is 
returned, as done for other operations.

when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
result of the division should be returned, or truncated to precision = 38, 
which is in line with what Hive supports. The current behavior is shown below, 
which causes us to lose the accuracy of the Decimal division operation. 

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited

2015-07-02 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612524#comment-14612524
 ] 

Jihong MA commented on SPARK-8800:
--

This is an issue noticed after we opened up the precision limit for decimal 
multiplication in SPARK-8359; SPARK-8677 only solves part of the issue. 

> Spark SQL Decimal Division operation loss of precision/scale when type is 
> defined as DecimalType.Unlimited
> --
>
> Key: SPARK-8800
> URL: https://issues.apache.org/jira/browse/SPARK-8800
> Project: Spark
>  Issue Type: Bug
>Reporter: Jihong MA
>
> According to the specification defined in the Java doc for BigDecimal:
> http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html
> When a MathContext object is supplied with a precision setting of 0 (for 
> example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
> arithmetic methods which take no MathContext object. (This is the only 
> behavior that was supported in releases prior to 5.) As a corollary of 
> computing the exact result, the rounding mode setting of a MathContext object 
> with a precision setting of 0 is not used and thus irrelevant. In the case of 
> divide, the exact quotient could have an infinitely long decimal expansion; 
> for example, 1 divided by 3. If the quotient has a nonterminating decimal 
> expansion and the operation is specified to return an exact result, an 
> ArithmeticException is thrown. Otherwise, the exact result of the division is 
> returned, as done for other operations.
> when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
> result of the division should be returned, or truncated to precision = 38, 
> which is in line with what Hive supports. The current behavior is shown below, 
> which causes us to lose the accuracy of the Decimal division operation. 
> scala> val aa = Decimal(2) / Decimal(3);
> aa: org.apache.spark.sql.types.Decimal = 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited

2015-07-02 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-8800:
-
Description: 
According to the specification defined in the Java doc for BigDecimal:

http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

When a MathContext object is supplied with a precision setting of 0 (for 
example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
arithmetic methods which take no MathContext object. (This is the only behavior 
that was supported in releases prior to 5.) As a corollary of computing the 
exact result, the rounding mode setting of a MathContext object with a 
precision setting of 0 is not used and thus irrelevant. In the case of divide, 
the exact quotient could have an infinitely long decimal expansion; for 
example, 1 divided by 3. If the quotient has a nonterminating decimal expansion 
and the operation is specified to return an exact result, an 
ArithmeticException is thrown. Otherwise, the exact result of the division is 
returned, as done for other operations.

when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
result of the division should be returned, or truncated to precision = 38, 
which is in line with what Hive supports. The current behavior is shown below, 
which causes us to lose the accuracy of the Decimal division operation. 

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1

Here is another example, where we should return 0.125 instead of 0:

scala> val aa = Decimal(1) / Decimal(8)
aa: org.apache.spark.sql.types.Decimal = 0

  was:
According to the specification defined in the Java doc for BigDecimal:

http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

When a MathContext object is supplied with a precision setting of 0 (for 
example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
arithmetic methods which take no MathContext object. (This is the only behavior 
that was supported in releases prior to 5.) As a corollary of computing the 
exact result, the rounding mode setting of a MathContext object with a 
precision setting of 0 is not used and thus irrelevant. In the case of divide, 
the exact quotient could have an infinitely long decimal expansion; for 
example, 1 divided by 3. If the quotient has a nonterminating decimal expansion 
and the operation is specified to return an exact result, an 
ArithmeticException is thrown. Otherwise, the exact result of the division is 
returned, as done for other operations.

when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
result of the division should be returned, or truncated to precision = 38, 
which is in line with what Hive supports. The current behavior is shown below, 
which causes us to lose the accuracy of the Decimal division operation. 

scala> val aa = Decimal(2) / Decimal(3);
aa: org.apache.spark.sql.types.Decimal = 1


> Spark SQL Decimal Division operation loss of precision/scale when type is 
> defined as DecimalType.Unlimited
> --
>
> Key: SPARK-8800
> URL: https://issues.apache.org/jira/browse/SPARK-8800
> Project: Spark
>  Issue Type: Bug
>Reporter: Jihong MA
>
> According to the specification defined in the Java doc for BigDecimal:
> http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html
> When a MathContext object is supplied with a precision setting of 0 (for 
> example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
> arithmetic methods which take no MathContext object. (This is the only 
> behavior that was supported in releases prior to 5.) As a corollary of 
> computing the exact result, the rounding mode setting of a MathContext object 
> with a precision setting of 0 is not used and thus irrelevant. In the case of 
> divide, the exact quotient could have an infinitely long decimal expansion; 
> for example, 1 divided by 3. If the quotient has a nonterminating decimal 
> expansion and the operation is specified to return an exact result, an 
> ArithmeticException is thrown. Otherwise, the exact result of the division is 
> returned, as done for other operations.
> when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
> result of the division should be returned, or truncated to precision = 38, 
> which is in line with what Hive supports. The current behavior is shown below, 
> which causes us to lose the accuracy of the Decimal division operation. 
> scala> val aa = Decimal(2) / Decimal(3);
> aa: org.apache.spark.sql.types.Decimal = 1
> Here is another example, where we should return 0.125 instead of 0:
> scala> val aa = Decimal(1) / Decimal(8)
> aa: org.apache.spark.sql.types.Decimal = 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-12 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11720:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-10384

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> change the default behavior of mean in case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-12 Thread Jihong MA (JIRA)
Jihong MA created SPARK-11720:
-

 Summary: Return Double.NaN instead of null for Mean and Average 
when count = 0
 Key: SPARK-11720
 URL: https://issues.apache.org/jira/browse/SPARK-11720
 Project: Spark
  Issue Type: Improvement
Reporter: Jihong MA


change the default behavior of mean in case of count = 0 from null to 
Double.NaN, to make it in line with all other univariate stats functions. 
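
A minimal illustration of the inconsistency, assuming the 1.6-era DataFrame API 
(a sketch, not a test from any PR):

{code}
import org.apache.spark.sql.functions._

// An empty input: per this JIRA, avg currently yields null while the other
// univariate stats (stddev, variance, ...) yield Double.NaN.
val empty = sqlContext.range(10).filter(col("id") < 0)
empty.select(avg(col("id")), stddev_pop(col("id"))).show()
// today: null for avg, NaN for stddev_pop; after this change: NaN for both
{code}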



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003558#comment-15003558
 ] 

Jihong MA commented on SPARK-11720:
---

[~mengxr] the implementation of average is not numerically stable; should we 
update it to leverage CentralMomentAgg as part of this JIRA?  
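
For reference, the standard numerically stable update/merge recurrences such an 
implementation would use (a sketch, not the CentralMomentAgg source):

{code}
// One-pass mean that avoids accumulating a large raw sum.
final class StableMean {
  private var n = 0L
  private var mean = 0.0

  def update(x: Double): Unit = {
    n += 1
    mean += (x - mean) / n                  // incremental update
  }

  def merge(other: StableMean): Unit = {
    val total = n + other.n
    if (total > 0) mean += (other.mean - mean) * other.n / total
    n = total
  }

  def value: Double = if (n == 0) Double.NaN else mean
}
{code}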

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> change the default behavior of mean in case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-12 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003589#comment-15003589
 ] 

Jihong MA commented on SPARK-11720:
---

Also, for mean we treat Decimal differently from other numeric types: all 
other numeric types are converted to Double, while for Decimal we keep its type 
but increase its precision/scale to avoid precision loss. So for the types 
converted to Double we can return Double.NaN, but for Decimal we can't return a 
double value; do we still return null? 

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> change the default behavior of mean in case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11720) Handle edge cases when count = 0 or 1 for Stats function

2015-11-18 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11720:
--
Summary: Handle edge cases when count = 0 or 1 for Stats function  (was: 
Return Double.NaN instead of null for Mean and Average when count = 0)

> Handle edge cases when count = 0 or 1 for Stats function
> 
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Jihong MA
>Priority: Minor
>
> change the default behavior of mean in case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11720) Handle edge cases when count = 0 or 1 for Stats function

2015-11-18 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11720:
--
Description: update the behavior of stats functions when count = 0 or 1 to 
make it consistent across Spark SQL  (was: change the default behavior of 
mean in case of count = 0 from null to Double.NaN, to make it in line with all 
other univariate stats functions. )

> Handle edge cases when count = 0 or 1 for Stats function
> 
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Jihong MA
>Priority: Minor
>
> update the behavior of stats functions when count = 0 or 1 to make it 
> consistent across Spark SQL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731140#comment-14731140
 ] 

Jihong MA commented on SPARK-8951:
--

This commit causes an R style check failure. 


Running R style checks

Loading required package: methods

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

filter, na.omit

The following objects are masked from 'package:base':

intersect, rbind, sample, subset, summary, table, transform


Attaching package: 'testthat'

The following object is masked from 'package:SparkR':

describe

R/deserialize.R:63:9: style: Trailing whitespace is superfluous.
  string 
^
lintr checks failed.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; 
received return code 1
Archiving unit tests logs...
> No log files found.
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report 
files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

> support CJK characters in collect()
> ---
>
> Key: SPARK-8951
> URL: https://issues.apache.org/jira/browse/SPARK-8951
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Jaehong Choi
>Assignee: Jaehong Choi
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the 
> result DataFrame contains characters in CJK.
> I found out that SerDe in the R API only supports the ASCII format for strings 
> right now, as noted in a source code comment.  
> So, I fixed SerDe.scala a little to support CJK, as in the attached file. 
> I did not care about efficiency, but just wanted to see if it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawtochar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
>   // NOTE: Only works for ASCII right now
>   def writeString(out: DataOutputStream, value: String): Unit = {
> val len = value.length
> out.writeInt(len + 1) // For the \0
> out.writeBytes(value)
> out.writeByte(0)
> {code}
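
For reference, a UTF-8-aware variant of the quoted writeString (a sketch along 
the lines of the attached diff, not necessarily the merged patch) computes the 
length from the encoded bytes so multi-byte CJK characters survive the round 
trip:

{code}
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

def writeString(out: DataOutputStream, value: String): Unit = {
  val utf8 = value.getBytes(StandardCharsets.UTF_8)
  out.writeInt(utf8.length + 1) // +1 for the trailing \0
  out.write(utf8, 0, utf8.length)
  out.writeByte(0)
}
{code}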



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-09-14 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743959#comment-14743959
 ] 

Jihong MA commented on SPARK-6548:
--

[~davies] please fix the assignee to Jihong, thanks!

> stddev_pop and stddev_samp aggregate functions
> --
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.6.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10641) skewness and kurtosis support

2015-09-16 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10641:
-

 Summary: skewness and kurtosis support
 Key: SPARK-10641
 URL: https://issues.apache.org/jira/browse/SPARK-10641
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Reporter: Jihong MA


Implementing skewness and kurtosis support based on the following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
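
For reference, a Scala sketch of the one-pass recurrences from that page (not 
the eventual Spark implementation):

{code}
// One-pass central moments M2..M4; skewness and kurtosis fall out at the end.
final class Moments {
  private var n = 0L
  private var mean, m2, m3, m4 = 0.0

  def update(x: Double): Unit = {
    val n1 = n
    n += 1
    val delta = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    m4 += term1 * deltaN2 * (n * n - 3 * n + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0   // excess kurtosis
}
{code}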



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10641) skewness and kurtosis support

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10641:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10384

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Implementing skewness and kurtosis support based on the following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-16 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816
 ] 

Jihong MA commented on SPARK-10602:
---

I went ahead and created SPARK-10641. Since this JIRA is not listed as an 
umbrella, I couldn't link to it directly and instead linked to SPARK-10384. 
@Joseph, can you assign SPARK-10641 to Seth and help fix the link? Thanks! 

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10645) Bivariate Statistics for continuous vs. continuous

2015-09-16 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10645:
-

 Summary: Bivariate Statistics for continuous vs. continuous
 Key: SPARK-10645
 URL: https://issues.apache.org/jira/browse/SPARK-10645
 Project: Spark
  Issue Type: New Feature
Reporter: Jihong MA


This is an umbrella JIRA, which covers bivariate statistics for continuous vs. 
continuous columns, including covariance, Pearson's correlation, and Spearman's 
correlation (for both continuous & categorical).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10645:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10385

> Bivariate Statistics for continuous vs. continuous
> --
>
> Key: SPARK-10645
> URL: https://issues.apache.org/jira/browse/SPARK-10645
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> This is an umbrella JIRA, which covers bivariate statistics for continuous 
> vs. continuous columns, including covariance, Pearson's correlation, and 
> Spearman's correlation (for both continuous & categorical).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-16 Thread Jihong MA (JIRA)
Jihong MA created SPARK-10646:
-

 Summary: Bivariate Statistics: Pearson's Chi-Squared Test for 
categorical vs. categorical
 Key: SPARK-10646
 URL: https://issues.apache.org/jira/browse/SPARK-10646
 Project: Spark
  Issue Type: New Feature
Reporter: Jihong MA






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10646:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-10385

> Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. 
> categorical
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10646:
--
Description: Pearson's chi-squared goodness of fit test for observed 
against the expected distribution.

> Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. 
> categorical
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10646:
--
Component/s: SQL
 ML

> Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. 
> categorical
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10645) Bivariate Statistics for continuous vs. continuous

2015-09-16 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10645:
--
Component/s: SQL
 ML

> Bivariate Statistics for continuous vs. continuous
> --
>
> Key: SPARK-10645
> URL: https://issues.apache.org/jira/browse/SPARK-10645
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> This is an umbrella JIRA, which covers bivariate statistics for continuous 
> vs. continuous columns, including covariance, Pearson's correlation, and 
> Spearman's correlation (for both continuous & categorical).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-17 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-10646:
--
Description: Pearson's chi-squared goodness of fit test for observed 
against the expected distribution & independence test.   (was: Pearson's 
chi-squared goodness of fit test for observed against the expected 
distribution.)

> Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. 
> categorical
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution & independence test. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. categorical

2015-09-17 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803378#comment-14803378
 ] 

Jihong MA commented on SPARK-10646:
---

[~josephkb] please assign this JIRA to me; I will start working on it. Thanks!

> Bivariate Statistics: Pearson's Chi-Squared Test for categorical vs. 
> categorical
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution & independence test. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7265) Improving documentation for Spark SQL Hive support

2015-04-29 Thread Jihong MA (JIRA)
Jihong MA created SPARK-7265:


 Summary: Improving documentation for Spark SQL Hive support 
 Key: SPARK-7265
 URL: https://issues.apache.org/jira/browse/SPARK-7265
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Jihong MA
Priority: Trivial
 Fix For: 1.4.0


Miscellaneous documentation improvements for Spark SQL Hive support and YARN 
cluster deployment. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7265) Improving documentation for Spark SQL Hive support

2015-04-29 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-7265:
-
Priority: Minor  (was: Trivial)

> Improving documentation for Spark SQL Hive support 
> ---
>
> Key: SPARK-7265
> URL: https://issues.apache.org/jira/browse/SPARK-7265
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.1
>Reporter: Jihong MA
>Priority: Minor
> Fix For: 1.4.0
>
>
> Miscellaneous documentation improvements for Spark SQL Hive support and YARN 
> cluster deployment. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7265) Improving documentation for Spark SQL Hive support

2015-04-30 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522720#comment-14522720
 ] 

Jihong MA commented on SPARK-7265:
--

This is a placeholder for the changes I am planning to contribute; I will make 
a PR very soon. 

> Improving documentation for Spark SQL Hive support 
> ---
>
> Key: SPARK-7265
> URL: https://issues.apache.org/jira/browse/SPARK-7265
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.1
>Reporter: Jihong MA
>Priority: Minor
>
> Miscellaneous documentation improvements for Spark SQL Hive support and YARN 
> cluster deployment. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7357) Improving HBaseTest example

2015-05-04 Thread Jihong MA (JIRA)
Jihong MA created SPARK-7357:


 Summary: Improving HBaseTest example
 Key: SPARK-7357
 URL: https://issues.apache.org/jira/browse/SPARK-7357
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.3.1
Reporter: Jihong MA
Priority: Minor
 Fix For: 1.4.0


Minor improvement to the HBaseTest example: when HBase-related configurations 
(e.g. ZooKeeper quorum, ZooKeeper client port, or zookeeper.znode.parent) are 
not set to the default (localhost:2181), the connection to ZooKeeper might 
hang, as shown in the following stack:

15/03/26 18:31:20 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=xxx.xxx.xxx:2181 sessionTimeout=9 
watcher=hconnection-0x322a4437, quorum=xxx.xxx.xxx:2181, baseZNode=/hbase
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Opening socket connection to 
server 9.30.94.121:2181. Will not attempt to authenticate using SASL (unknown 
error)
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Socket connection established to 
xxx.xxx.xxx/9.30.94.121:2181, initiating session
15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Session establishment complete on 
server xxx.xxx.xxx/9.30.94.121:2181, sessionid = 0x14c53cd311e004b, negotiated 
timeout = 4
15/03/26 18:31:21 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is 
null

This is because hbase-site.xml is not placed on the Spark classpath. 
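
For context, a workaround sketch until hbase-site.xml is on the classpath (the 
host names below are placeholders): set the ZooKeeper properties explicitly on 
the HBaseConfiguration.

{code}
import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// Explicit ZooKeeper connection details, normally read from hbase-site.xml.
conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("zookeeper.znode.parent", "/hbase")
{code}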



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions

2015-05-27 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562298#comment-14562298
 ] 

Jihong MA commented on SPARK-6548:
--

Hi sdfox,

I thought you were no longer working on this, so I submitted a pull request a 
week ago which uses a one-pass online algorithm to calculate the standard 
deviation. You are welcome to take a look; sorry, hoping we are not duplicating 
the effort. 

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited

2015-07-14 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627306#comment-14627306
 ] 

Jihong MA commented on SPARK-8800:
--

I applied the fix and noticed the same issue:

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1198.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1198.0 
(TID 3847, localhost): java.lang.ArithmeticException: Non-terminating decimal 
expansion; no exact representable decimal result.
  at java.math.BigDecimal.divide(BigDecimal.java:1616)
  at java.math.BigDecimal.divide(BigDecimal.java:1650)
  at scala.math.BigDecimal.$div(BigDecimal.scala:256)
  at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:282)
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:348)
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.div(Decimal.scala:347)
  at org.apache.spark.sql.catalyst.expressions.Divide$$anonfun$div$1.apply(arithmetic.scala:193)
  at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:206)


> Spark SQL Decimal Division operation loss of precision/scale when type is 
> defined as DecimalType.Unlimited
> --
>
> Key: SPARK-8800
> URL: https://issues.apache.org/jira/browse/SPARK-8800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jihong MA
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 1.5.0
>
>
> According to the specification defined in the Java doc for BigDecimal:
> http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html
> When a MathContext object is supplied with a precision setting of 0 (for 
> example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
> arithmetic methods which take no MathContext object. (This is the only 
> behavior that was supported in releases prior to 5.) As a corollary of 
> computing the exact result, the rounding mode setting of a MathContext object 
> with a precision setting of 0 is not used and thus irrelevant. In the case of 
> divide, the exact quotient could have an infinitely long decimal expansion; 
> for example, 1 divided by 3. If the quotient has a nonterminating decimal 
> expansion and the operation is specified to return an exact result, an 
> ArithmeticException is thrown. Otherwise, the exact result of the division is 
> returned, as done for other operations.
> when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
> result of the division should be returned, or truncated to precision = 38, 
> which is in line with what Hive supports. The current behavior is shown below, 
> which causes us to lose the accuracy of the Decimal division operation. 
> scala> val aa = Decimal(2) / Decimal(3);
> aa: org.apache.spark.sql.types.Decimal = 1
> Here is another example, where we should return 0.125 instead of 0:
> scala> val aa = Decimal(1) / Decimal(8)
> aa: org.apache.spark.sql.types.Decimal = 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8800) Spark SQL Decimal Division operation loss of precision/scale when type is defined as DecimalType.Unlimited

2015-07-14 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627317#comment-14627317
 ] 

Jihong MA commented on SPARK-8800:
--

I would like to suggest reverting the initial code changes for SPARK-8359, 
SPARK-8677 and SPARK-8800, and thinking through whether supporting 
unlimited-precision decimal multiplication is the right way to fix this, as 
Hive only supports limited-precision decimal multiplication. 

> Spark SQL Decimal Division operation loss of precision/scale when type is 
> defined as DecimalType.Unlimited
> --
>
> Key: SPARK-8800
> URL: https://issues.apache.org/jira/browse/SPARK-8800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jihong MA
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 1.5.0
>
>
> According to the specification defined in the Java doc for BigDecimal:
> http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html
> When a MathContext object is supplied with a precision setting of 0 (for 
> example, MathContext.UNLIMITED), arithmetic operations are exact, as are the 
> arithmetic methods which take no MathContext object. (This is the only 
> behavior that was supported in releases prior to 5.) As a corollary of 
> computing the exact result, the rounding mode setting of a MathContext object 
> with a precision setting of 0 is not used and thus irrelevant. In the case of 
> divide, the exact quotient could have an infinitely long decimal expansion; 
> for example, 1 divided by 3. If the quotient has a nonterminating decimal 
> expansion and the operation is specified to return an exact result, an 
> ArithmeticException is thrown. Otherwise, the exact result of the division is 
> returned, as done for other operations.
> when Decimal data is defined as DecimalType.Unlimited in Spark SQL, the exact 
> result of the division should be returned, or truncated to precision = 38, 
> which is in line with what Hive supports. The current behavior is shown below, 
> which causes us to lose the accuracy of the Decimal division operation. 
> scala> val aa = Decimal(2) / Decimal(3);
> aa: org.apache.spark.sql.types.Decimal = 1
> Here is another example, where we should return 0.125 instead of 0:
> scala> val aa = Decimal(1) / Decimal(8)
> aa: org.apache.spark.sql.types.Decimal = 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org