[ 
https://issues.apache.org/jira/browse/SPARK-25911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25911.
-------------------------------
    Resolution: Won't Fix

I don't think we'd add all of those. Some of these are already in JIRA as 
ideas. While I don't think we'll add much more like this to ML, if you have one 
you can argue is widely used, and you can implement it, then I'd create (or 
find) a JIRA for that one to discuss first.

> [spark-ml] Hypothesis testing module
> ------------------------------------
>
>                 Key: SPARK-25911
>                 URL: https://issues.apache.org/jira/browse/SPARK-25911
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Uday Babbar
>            Priority: Minor
>
> h2. Why this ticket was created
> To determine whether adding some subset of a hypothesis testing module is 
> feasible, mainly in terms of its value proposition, and to get a preliminary 
> opinion on the idea. I can work on a more comprehensive proposal if, say, it 
> is generally agreed that a DataFrame API for the t-test makes sense in the 
> o.a.s.ml package. 
> h2. Current state
> There are some streaming implementations in the o.a.s.mllib module, but 
> there are no DataFrame APIs for some standard tests (e.g. the t-test). 
> ||Test||Current state||Proposed state||
> |t-test (Welch's, Student's)|only streaming|DataFrame API|
> |chi-squared|streaming, DataFrame/RDD API present|-|
> |ANOVA|-|DataFrame API|
> |Mann-Whitney U test|-|RDD API (in maintenance mode, so it probably does not make sense to include this)|
> h2. Rationale 
> Experimentation platforms are widely used, and most of those that operate 
> at scale (a large portion of which use Spark for offline computation) 
> require distributed implementations of hypothesis tests to calculate 
> p-values for different metrics/features. These APIs would enable 
> distributed computation of the relevant statistics and avoid the overhead 
> of moving the data (or some downstream view of it) to a framework where 
> such computations are available (R, SciPy). 
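
To illustrate the kind of distributed computation the rationale has in mind, here is a minimal Python sketch of Welch's t-test built from mergeable per-partition summaries (count, mean, sum of squared deviations). It is not Spark code and not a proposed API: the names `Summary`, `summarize`, `merge`, and `welch_t` are hypothetical, and the sketch assumes each partition can be reduced independently before the partial results are combined. It stops at the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs a Student's t CDF (for example from SciPy), which is left out to keep the example standard-library only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Sufficient statistics for one sample (hypothetical helper type)."""
    n: int        # number of observations
    mean: float   # running mean
    m2: float     # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Unbiased sample variance; needs at least two observations.
        if self.n < 2:
            raise ValueError("variance needs at least two observations")
        return self.m2 / (self.n - 1)


def summarize(values) -> Summary:
    """Reduce one partition to a Summary using Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries (pairwise update of mean and M2)."""
    n = a.n + b.n
    if n == 0:
        return Summary(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def welch_t(a: Summary, b: Summary):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = a.variance / a.n
    vb = b.variance / b.n
    t = (a.mean - b.mean) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Two "partitions" of the control group, merged as a reducer would.
control = merge(summarize([1.0, 2.0]), summarize([3.0, 4.0, 5.0]))
treatment = summarize([2.0, 4.0, 6.0, 8.0, 10.0])
t_stat, dof = welch_t(control, treatment)
```

Because `merge` is associative, the same reduction could in principle run as a tree aggregation over partitions, so only three numbers per group would need to reach the driver rather than the raw data.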

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)