[ https://issues.apache.org/jira/browse/SPARK-25911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-25911.
-------------------------------
    Resolution: Won't Fix

I don't think we'd add all of those. Some of these are already in JIRA as ideas. While I don't think we'll add much more like this to ML, if you have one you can argue is widely used, and you can implement it, then I'd create (or find) a JIRA for that one to discuss first.

> [spark-ml] Hypothesis testing module
> ------------------------------------
>
>                 Key: SPARK-25911
>                 URL: https://issues.apache.org/jira/browse/SPARK-25911
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Uday Babbar
>            Priority: Minor
>
> h2. Why this ticket was created
> To determine the feasibility of some subset of a hypothesis testing module,
> mainly on the value-proposition front, and to get a preliminary opinion on
> how it generally sounds. I can work on a more comprehensive proposal if,
> say, it's generally agreed that including a DataFrame API for the t-test
> makes sense in the o.a.s.ml package.
> h2. Current state
> There are some streaming implementations in the o.a.s.mllib module, but
> there are no DataFrame APIs for some standard tests (t-test).
> ||Test||Current state||Proposed state||
> |t-test (Welch's, Student's)|only streaming|DataFrame API|
> |chi-squared|streaming, DataFrame/RDD API present|-|
> |ANOVA|-|DataFrame API|
> |Mann-Whitney U test|-|RDD API (in maintenance mode so probably doesn't make sense to include this)|
> h2. Rationale
> Experimentation platforms are pervasive, and most of those that operate at
> scale (a large portion of them use Spark for offline computation) require
> distributed implementations of hypothesis tests to calculate p-values for
> different metrics/features. These APIs would enable distributed
> computation of the relevant stats and avoid the overhead of moving data (or
> some downstream view of it) to a framework where such stats computation is
> available (R, SciPy).
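
For readers of this thread, the "distributed computation of the relevant stats" in the rationale can be illustrated with a minimal sketch. It is plain Python, not Spark code and not any existing or proposed Spark API; the names Moments and welch_t are hypothetical. The assumption is that each partition reduces its rows to a small mergeable summary (count, mean, sum of squared deviations), so only the summaries are combined before Welch's t statistic is computed.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Moments:
    """Hypothetical mergeable summary: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @staticmethod
    def of(values):
        # Fold one partition's values into a summary, one value at a time.
        acc = Moments()
        for v in values:
            acc = acc.merge(Moments(1, float(v), 0.0))
        return acc

    def merge(self, other):
        # Pairwise combination of two summaries (parallel variance update).
        n = self.n + other.n
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self):
        # Unbiased sample variance; requires at least two observations.
        return self.m2 / (self.n - 1)


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = a.variance / a.n
    vb = b.variance / b.n
    t = (a.mean - b.mean) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Two groups, each split across two "partitions" (made-up numbers).
control = Moments.of([1, 2]).merge(Moments.of([3, 4, 5]))
treatment = Moments.of([2, 4]).merge(Moments.of([6, 8, 10]))
t_stat, dof = welch_t(control, treatment)
```

The p-value would then come from the Student's t distribution with dof degrees of freedom; that step is left out here because the Python standard library has no t-distribution CDF.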