[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083330#comment-14083330
 ] 

Apache Spark commented on SPARK-2515:
-

User 'dorx' has created a pull request for this issue:
https://github.com/apache/spark/pull/1733

> Hypothesis testing
> --
>
> Key: SPARK-2515
> URL: https://issues.apache.org/jira/browse/SPARK-2515
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Doris Xin
>
> Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-25 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074802#comment-14074802
 ] 

Doris Xin commented on SPARK-2515:
--

A toString method sounds like a really good idea here. I think we originally 
planned for the Summary object to hold anything that isn't standard across 
tests, and in the case of chi-squared, I can't think of anything else to put 
in there. Using toString instead would also let us have a single TestResult 
class across tests.

Sure, we can go with degreesOfFreedom. 
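
For illustration, a single result class with a toString method and a degreesOfFreedom field might look roughly like the sketch below. This is only a sketch under assumptions: the name TestResultSketch, the testName field, and the output format are hypothetical and are not taken from the pull request.

```scala
// Hypothetical sketch of a single result class shared across tests.
// Field names other than pValue, statistic and degreesOfFreedom are assumptions.
case class TestResultSketch(
    testName: String,
    statistic: Double,
    degreesOfFreedom: Array[Int], // an array, as in the proposal, so ANOVA can report several
    pValue: Double) {

  // Human-readable summary, standing in for a per-test Summary object.
  override def toString: String =
    s"$testName: statistic = $statistic, " +
      s"degrees of freedom = ${degreesOfFreedom.mkString("[", ", ", "]")}, " +
      s"p-value = $pValue"
}
```

With this shape, anything test-specific goes into the string rendering rather than into a separate Summary subclass per test.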



[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074003#comment-14074003
 ] 

Hossein Falaki commented on SPARK-2515:
---

If we really have to implement another chi-squared test method, I think the 
likelihood-ratio test would be a good candidate.

On the return type: 
* What is left for the Summary field? Why can't this be the toString method?
* I am not sure, but "df" may be too cryptic for non-experts. How about 
degreesOfFreedom? 
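
For context, the likelihood-ratio (G) statistic is G = 2 * sum(O * ln(O / E)) over observed counts O and expected counts E. Below is a minimal local sketch, assuming plain in-memory arrays rather than RDDs; GTestSketch and gStatistic are hypothetical names and not part of any proposed API.

```scala
// Hypothetical helper illustrating the likelihood-ratio (G) statistic.
object GTestSketch {
  /** G = 2 * sum(O * ln(O / E)); cells with O == 0 contribute 0 by convention. */
  def gStatistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "observed and expected must have equal length")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    2.0 * observed.zip(expected).map { case (o, e) =>
      if (o == 0.0) 0.0 else o * math.log(o / e)
    }.sum
  }
}
```

Like the Pearson statistic, G is compared against a chi-squared distribution with the same degrees of freedom, which is why it could plausibly share the same result type.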




[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-24 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879
 ] 

Doris Xin commented on SPARK-2515:
--

Here's the proposed API for chi-squared tests (lives in 
org.apache.spark.mllib.stat.Statistics):

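{code}
// Note: Scala permits default arguments on only one overloaded alternative,
// so one of these defaults would need to become an explicit overload.
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}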

where ChiSquareTestResult <: TestResult looks like:

{code}
pValue: Double
df: Array[Int] // normally a single value, but ANOVA needs more than one
statistic: Double
ChiSquareSummary <: Summary
{code}

A couple of points for discussion:
1. Of the many variants of the chi-squared test, which methods in addition to 
"pearson" do we want to support (ideally based on popular demand)? 
http://en.wikipedia.org/wiki/Chi-squared_test
2. What special fields should ChiSquareSummary have?
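
As a reference point for the "pearson" method, the statistic sums (observed - expected)^2 / expected over the cells of a contingency table, with expected counts derived from the row and column totals. Below is a minimal local sketch, assuming a small in-memory table rather than an RDD; PearsonSketch and pearson are hypothetical names and not the proposed Statistics API.

```scala
// Hypothetical helper illustrating Pearson's chi-squared test of independence.
object PearsonSketch {
  /**
   * Returns (statistic, degreesOfFreedom) for a rectangular contingency table.
   * Assumes every row and every column has a positive total.
   */
  def pearson(observed: Array[Array[Double]]): (Double, Int) = {
    require(observed.nonEmpty && observed.forall(_.length == observed.head.length),
      "table must be non-empty and rectangular")
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val total = rowSums.sum
    var statistic = 0.0
    for (i <- observed.indices; j <- observed(i).indices) {
      // Expected count under independence of rows and columns.
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      statistic += diff * diff / expected
    }
    (statistic, (rowSums.length - 1) * (colSums.length - 1))
  }
}
```

The p-value would then come from the chi-squared distribution with the returned degrees of freedom; that step is left out here to keep the sketch dependency-free.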
