[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879
 ] 

Doris Xin commented on SPARK-2515:
----------------------------------

Here's the proposed API for chi-squared tests (lives in 
org.apache.spark.mllib.stat.Statistics):

{code}
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}
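
For reference, here is a minimal local sketch of what the "pearson" method would compute for a test of independence. It is only an illustration under stated assumptions: the object and method names (PearsonSketch, pearson) are hypothetical and not part of the proposed API, the input is a plain in-memory contingency table rather than an RDD, every row and column total is assumed to be positive, and the p-value is left out because it needs a chi-squared CDF.

```scala
// Hypothetical helper, not part of the proposed API: Pearson's chi-squared
// statistic and degrees of freedom for a local contingency table.
object PearsonSketch {
  def pearson(observed: Array[Array[Double]]): (Double, Int) = {
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val total = rowSums.sum

    var statistic = 0.0
    for (i <- observed.indices; j <- observed(i).indices) {
      // Expected count under independence of the row and column variables.
      // Assumes all row and column totals are positive.
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      statistic += diff * diff / expected
    }

    // Degrees of freedom for an r x c table: (r - 1) * (c - 1).
    val df = (rowSums.length - 1) * (colSums.length - 1)
    (statistic, df)
  }
}
```

A distributed version would presumably build the contingency counts with an aggregation over the RDD first and then apply the same arithmetic on the driver.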

where ChiSquareTestResult <: TestResult looks like:

{code}
pValue: Double
df: Array[Int] // normally a single value, but more than one is needed for ANOVA
statistic: Double
ChiSquareSummary <: Summary
{code}
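
One possible way to write these types in Scala, purely as a sketch. The field name summary and the empty ChiSquareSummary body are assumptions made for illustration, and the Summary and TestResult traits are stubbed here only so the snippet compiles on its own.

```scala
// Hypothetical shapes only; none of these definitions are finalized.
trait Summary

trait TestResult {
  def pValue: Double
  def df: Array[Int]
  def statistic: Double
}

// Left empty on purpose: which fields belong here is still undecided.
case class ChiSquareSummary() extends Summary

case class ChiSquareTestResult(
    pValue: Double,
    df: Array[Int],
    statistic: Double,
    summary: ChiSquareSummary) extends TestResult
```

Keeping df as an array in the base trait lets a future ANOVA result reuse TestResult without changing the interface.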

A couple of points for discussion:
1. Of the many variants of the chi-squared test, what methods in addition to 
"pearson" do we want to support (hopefully based on popular demand)? 
http://en.wikipedia.org/wiki/Chi-squared_test
2. What special fields should ChiSquareSummary have?

> Hypothesis testing
> ------------------
>
>                 Key: SPARK-2515
>                 URL: https://issues.apache.org/jira/browse/SPARK-2515
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Doris Xin
>
> Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
