[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-8971:
---------------------------------
    Description: 
{{CrossValidator}} and the proposed {{TrainValidatorSplit}} are Spark classes 
which partition data into training and evaluation sets for performing 
hyperparameter selection via cross validation.

Both classes currently split the data by random sampling. However, when class 
probabilities are highly imbalanced (e.g. detection of extremely low-frequency 
events), random sampling may produce cross validation sets that are not 
representative of actual out-of-sample performance (e.g. a training set could 
contain no positive examples at all).
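
To make the risk concrete, here is a rough Python sketch (not Spark code; the 
function name and the example numbers are illustrative assumptions). It computes 
the chance that a uniformly random subset of a given size contains no positive 
examples, using the hypergeometric formula for sampling without replacement.

```python
import math


def prob_no_positives(n_total, n_positive, sample_size):
    """Probability that a uniformly random subset of `sample_size` rows,
    drawn without replacement from `n_total` rows of which `n_positive`
    are positive, contains zero positive rows (hypergeometric, k = 0)."""
    if not 0 <= n_positive <= n_total:
        raise ValueError("n_positive must be between 0 and n_total")
    if not 0 <= sample_size <= n_total:
        raise ValueError("sample_size must be between 0 and n_total")
    # Ways to pick only negatives, divided by ways to pick any subset.
    # math.comb returns 0 when sample_size exceeds the number of negatives.
    return math.comb(n_total - n_positive, sample_size) / math.comb(n_total, sample_size)


if __name__ == "__main__":
    # Hypothetical dataset: 10,000 rows, 5 positives, 10 folds of 1,000 rows.
    p = prob_no_positives(n_total=10_000, n_positive=5, sample_size=1_000)
    print(f"P(a given fold has no positives) = {p:.3f}")
```

With these made-up numbers, a given fold misses every positive example roughly 
59% of the time, so metrics computed on that fold say little about the rare 
class.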

Mainstream R packages like 
[caret|http://topepo.github.io/caret/splitting.html] already support splitting 
the data based upon the class labels.
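
A label-aware split could look roughly like the sketch below. This is plain 
Python for illustration only, not an existing or proposed Spark API; 
{{stratified_folds}} and its parameters are hypothetical names. Each class is 
shuffled and dealt round-robin across the folds, so per-fold class counts 
differ by at most one.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each example a fold index in [0, k) such that every class
    is spread as evenly as possible across the k folds."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)        # randomize order within the class
        offset = rng.randrange(k)   # avoid always starting at fold 0
        for pos, idx in enumerate(indices):
            folds[idx] = (pos + offset) % k  # deal round-robin
    return folds


if __name__ == "__main__":
    # Hypothetical imbalanced dataset: 10 positives among 1,000 rows.
    labels = [1] * 10 + [0] * 990
    folds = stratified_folds(labels, k=5, seed=42)
    positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
    print(sorted(positives_per_fold.items()))
```

Here every one of the 5 folds receives exactly 2 of the 10 positives, which a 
purely random split does not guarantee.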

  was:
{{CrossValidator}} and the proposed {{TrainValidatorSplit}} are Spark classes 
which partition data into training and evaluation sets for performing 
hyperparameter selection via cross validation.

Both methods currently perform the split by randomly sampling the datasets. 
However, when class probabilities are highly imbalanced (e.g. detection of 
extremely low-frequency events), random sampling may result in cross validation 
sets not representative of actual out-of-training performance (e.g. no positive 
training examples could be included).

Mainstream R packages like 
[caret](http://topepo.github.io/caret/splitting.html) already support splitting 
the data based upon the class labels.


> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
>                 Key: SPARK-8971
>                 URL: https://issues.apache.org/jira/browse/SPARK-8971
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Feynman Liang
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} are Spark classes 
> which partition data into training and evaluation sets for performing 
> hyperparameter selection via cross validation.
> Both classes currently split the data by random sampling. However, when 
> class probabilities are highly imbalanced (e.g. detection of extremely 
> low-frequency events), random sampling may produce cross validation sets 
> that are not representative of actual out-of-sample performance (e.g. a 
> training set could contain no positive examples at all).
> Mainstream R packages like 
> [caret|http://topepo.github.io/caret/splitting.html] already support 
> splitting the data based upon the class labels.



