[jira] [Commented] (SPARK-2515) Hypothesis testing
    [ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083330#comment-14083330 ]

Apache Spark commented on SPARK-2515:
-------------------------------------

User 'dorx' has created a pull request for this issue:
https://github.com/apache/spark/pull/1733

> Hypothesis testing
> ------------------
>
>                 Key: SPARK-2515
>                 URL: https://issues.apache.org/jira/browse/SPARK-2515
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Doris Xin
>
> Support common statistical tests in Spark MLlib.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-2515) Hypothesis testing
    [ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074802#comment-14074802 ]

Doris Xin commented on SPARK-2515:
----------------------------------

A toString method sounds like a really good idea here, actually. I think we originally planned for the Summary object to hold anything that isn't standard across tests, and in the case of chi-squared, I can't think of anything else to put in there. Having the toString method instead would also allow us to have a single TestResult class across tests.

Sure, we can go with degreesOfFreedom.
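The idea discussed above, a single result class whose toString replaces a per-test Summary object, could look roughly like the sketch below. This is only an illustration under stated assumptions: pValue, degreesOfFreedom and statistic are the fields named in the thread, while the `method` field, the constructor shape, and the report layout are hypothetical and not taken from the pull request.

```scala
// Hypothetical sketch: only pValue, degreesOfFreedom and statistic come from the thread.
class TestResult(
    val pValue: Double,
    val degreesOfFreedom: Array[Int], // an array, since some tests may need more than one value
    val statistic: Double,
    val method: String) { // assumed extra field, used only to label the report

  // Human-readable report standing in for a per-test Summary object.
  override def toString: String =
    s"""$method test summary:
       |degrees of freedom = ${degreesOfFreedom.mkString(", ")}
       |statistic = $statistic
       |pValue = $pValue""".stripMargin
}
```

With this shape, a caller would print the result directly, for example `println(new TestResult(0.0067, Array(2), 10.0, "pearson"))`, where the numbers are illustrative only.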
[jira] [Commented] (SPARK-2515) Hypothesis testing
    [ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074003#comment-14074003 ]

Hossein Falaki commented on SPARK-2515:
---------------------------------------

If we really have to implement another chi-square test method, I think the likelihood-ratio test would be a good candidate.

On the return type:
* What is left for the Summary field? Why can't this be the toString method?
* I am not sure, but maybe df is too cryptic for non-experts. How about degreesOfFreedom?
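For reference, the likelihood-ratio statistic suggested above (often called the G statistic) is G = 2 * sum(O_i * ln(O_i / E_i)) over observed counts O_i and expected counts E_i. The following is a minimal local sketch of that formula on plain arrays; the object and method names are hypothetical, and it is not the MLlib implementation.

```scala
// Hypothetical helper for illustration; not part of the proposed MLlib API.
object LikelihoodRatioSketch {
  /** G statistic: 2 * sum(O * ln(O / E)); cells with O == 0 contribute 0. */
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "observed and expected must have the same length")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    val terms = observed.zip(expected).map { case (o, e) =>
      if (o == 0.0) 0.0 else o * math.log(o / e)
    }
    2.0 * terms.sum
  }
}
```

Like Pearson's statistic, G is compared against a chi-squared distribution, so the same result fields (statistic, degrees of freedom, p-value) would apply.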
[jira] [Commented] (SPARK-2515) Hypothesis testing
    [ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879 ]

Doris Xin commented on SPARK-2515:
----------------------------------

Here's the proposed API for chi-squared tests (lives in org.apache.spark.mllib.stat.Statistics):

{code}
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}

where ChiSquareTestResult <: TestResult looks like:

{code}
pValue: Double
df: Array[Int] // normally a single value, but ANOVA needs more than one
statistic: Double
ChiSquareSummary <: Summary
{code}

So a couple of points for discussion:

1. Of the many variants of the chi-squared test, what methods in addition to "pearson" do we want to support (hopefully based on popular demand)? http://en.wikipedia.org/wiki/Chi-squared_test
2. What special fields should ChiSquareSummary have?
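To make the "pearson" method concrete: Pearson's statistic is the sum over categories of (observed - expected)^2 / expected. The sketch below computes it for a goodness-of-fit test on local arrays rather than RDDs. The object name and signatures are hypothetical and are not the proposed chiSquare API; the p-value is left out because it needs a chi-squared distribution function that the Scala standard library does not provide.

```scala
// Hypothetical helper for illustration; not part of the proposed MLlib API.
object PearsonSketch {
  /** Pearson statistic: sum over cells of (observed - expected)^2 / expected. */
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "observed and expected must have the same length")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum
  }

  /** Degrees of freedom for a goodness-of-fit test: number of categories minus one. */
  def degreesOfFreedom(observed: Array[Double]): Int = observed.length - 1
}
```

For example, observed counts (10, 20, 30) against uniform expected counts (20, 20, 20) give a statistic of 5 + 0 + 5 = 10 with 2 degrees of freedom.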