[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278060#comment-16278060 ]

Ashish Chopra commented on SPARK-8971:
--------------------------------------

When can we expect this in the DataFrame API?

> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
>                 Key: SPARK-8971
>                 URL: https://issues.apache.org/jira/browse/SPARK-8971
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Feynman Liang
>            Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are
> Spark classes which partition data into training and evaluation sets for
> performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets.
> However, when class probabilities are highly imbalanced (e.g. detection of
> extremely low-frequency events), random sampling may result in cross
> validation sets not representative of actual out-of-training performance
> (e.g. no positive training examples could be included).
> Mainstream R packages like
> [caret|http://topepo.github.io/caret/splitting.html] already support splitting the
> data based upon the class labels.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976027#comment-15976027 ]

Tiago Albineli Motta commented on SPARK-8971:
---------------------------------------------

Why not a variation of TrainValidatorSplit to stratify the split of training and test?

We just need to extract this code in TrainValidationSplit.scala as a new method:

{code}
// Random (non-stratified) split into training and validation sets
val Array(trainingDataset, validationDataset) =
  dataset.randomSplit(Array($(trainRatio), 1 - $(trainRatio)), $(seed))
trainingDataset.cache()
validationDataset.cache()
{code}

And then create a subclass like TrainValidatorBalancedSplit overriding this method.
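A minimal sketch of the extract-and-override idea from the comment above, written against plain Scala collections rather than Spark Datasets so that it runs on its own. `LabeledRow`, `TrainValidationSplitSketch`, `TrainValidationBalancedSplitSketch`, and `trainValidationSets` are hypothetical names used only for illustration; they are not Spark classes or methods. Rounding each class's training count with `math.round` is also an assumption of this sketch.

```scala
import scala.util.Random

// Hypothetical stand-in for a dataset row; not a Spark type.
final case class LabeledRow(label: Int, feature: Double)

// Hypothetical stand-in for TrainValidationSplit, with the split extracted into a hook.
class TrainValidationSplitSketch(val trainRatio: Double, val seed: Long) {
  // Default behaviour: each row goes to training with probability trainRatio.
  protected def split(data: Seq[LabeledRow]): (Seq[LabeledRow], Seq[LabeledRow]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainRatio)
  }

  // Public entry point; a real estimator would go on to fit and evaluate models.
  def trainValidationSets(data: Seq[LabeledRow]): (Seq[LabeledRow], Seq[LabeledRow]) =
    split(data)
}

// Hypothetical subclass that overrides only the split.
class TrainValidationBalancedSplitSketch(ratio: Double, seedValue: Long)
    extends TrainValidationSplitSketch(ratio, seedValue) {
  // Split every class separately so that each keeps roughly trainRatio of its rows.
  override protected def split(data: Seq[LabeledRow]): (Seq[LabeledRow], Seq[LabeledRow]) = {
    val rng = new Random(seed)
    val perClass = data.groupBy(_.label).toSeq.sortBy(_._1).map { case (_, rows) =>
      val nTrain = math.round(rows.size * trainRatio).toInt
      rng.shuffle(rows).splitAt(nTrain)
    }
    (perClass.flatMap(_._1), perClass.flatMap(_._2))
  }
}
```

With 95 rows of label 0, 5 rows of label 1, and a train ratio of 0.8, the balanced variant always places 4 positive rows in training and 1 in validation, whereas the default split can leave either side without any positives.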
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390242#comment-15390242 ]

Apache Spark commented on SPARK-8971:
-------------------------------------

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/14321
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264224#comment-15264224 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------

I meant label column. Sorry for the confusion!
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261655#comment-15261655 ]

Nick Pentreath commented on SPARK-8971:
---------------------------------------

I think it would be good to have something implemented, so if that means doing it with RDD initially that's fine by me.

For your questions:

1 - I'd still like to see if this same approach could be used for recommendation/ranking style settings, so allowing the user to specify the column would be good.
2 / 3 - I agree it makes most sense to respect trainRatio. The idea is to maintain the class distribution rather than effectively allow different trainRatios between strata. So I vote for exact sampling as you suggest.
4 - For now no, but I would imagine the main use case for this is class labels, in which case we can use column metadata (now or in the future) to get the labels?

As for API design, I'm not sure what you mean by "output column" in your first example? I would go for the `stratifiedCol` approach personally.
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260399#comment-15260399 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------

I've got an improved version of the original PR which requires only a single pass through the data for computing multiple splits at once, but it utilizes {{PairRDDFunctions}} and is not implemented for the DataFrame API. If this is ok (i.e. we implement for RDDs initially), then what remains is the semantics of the API for {{TrainValidationSplit}} and {{CrossValidator}}.

Specific questions I have:

* Should users be able to specify which column to stratify on or should it default to the output column always?
* Should the stratified splitting always be exact? If all the key weights are the same and we don't do exact sampling, then the problem of having some strata without a given class still exists. However, if we expose a way for users to specify different key weights per stratum, then there is added functionality.
* Should we expose a way for users to specify key weights per stratum? For example, with labels 0 and 1 should a user be able to say I want these splits: split0 = (0 -> 0.2, 1 -> 0.4), split1 = (0 -> 0.8, 1 -> 0.6)? I don't think it makes sense, since this would override the {{trainRatio}} parameter. For this reason, I think we should always use exact stratified sampling.
* Should users have a way to specify the keys in the stratified column? If not, we require a pass through the data to collect distinct values.

Some example API designs:

* Have a {{useStratifiedSampling}} boolean parameter that calls stratified sampling using the output column when true.
* Have a {{stratifiedCol}} string parameter that calls stratified sampling using the specified column when set.
* Similar to above, but add a way to specify the stratified key values.

I'd really appreciate any feedback about the design and if we want to continue this PR in the RDD API.

cc [~mlnick] [~josephkb]
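To make the exact-versus-approximate question above concrete, here is a small sketch on plain Scala collections. It is not the RDD implementation discussed in the comment; `StratifiedSamplingSketch`, `uniformFractions`, `bernoulliSample`, and `exactSample` are hypothetical names. Taking `math.ceil(fraction * stratumSize)` rows per stratum in the exact variant is an assumption chosen for this sketch.

```scala
import scala.util.Random

object StratifiedSamplingSketch {
  // One pass over the data to collect the distinct keys, then the same
  // fraction (e.g. trainRatio) for every stratum.
  def uniformFractions[K, V](data: Seq[(K, V)], ratio: Double): Map[K, Double] =
    data.map(_._1).distinct.map(k => k -> ratio).toMap

  // Approximate: every row is kept independently with its stratum's probability,
  // so a small stratum can end up with no sampled rows at all.
  def bernoulliSample[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (k, _) => rng.nextDouble() < fractions.getOrElse(k, 0.0) }
  }

  // Exact: every stratum contributes ceil(fraction * stratumSize) rows.
  def exactSample[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.groupBy(_._1).toSeq.flatMap { case (k, rows) =>
      val n = math.ceil(rows.size * fractions.getOrElse(k, 0.0)).toInt
      rng.shuffle(rows).take(n)
    }
  }
}
```

With 100 rows of key 0, 3 rows of key 1, and a fraction of 0.5 for both, `exactSample` always returns 50 and 2 rows respectively. `bernoulliSample` returns no key-1 rows with probability 0.5^3 = 0.125, which is the failure mode that motivates exact sampling.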
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692312#comment-14692312 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------

I went ahead and created the PR for this issue, even though some of the design choices still merit discussion. This way, others can at least see the code and make comments. I did not mark as WIP but I can do that if needed.
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692309#comment-14692309 ]

Apache Spark commented on SPARK-8971:
-------------------------------------

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8112
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658376#comment-14658376 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------

[~mengxr] You mentioned that the solution should call {{sampleByKeyExact}}, which is a function that takes a stratified subsample of m < N elements from a dataset. One problem with train/test splitting and k-fold creation (which are fundamentally the same as far as sampling goes) is that we actually need to take random splits of the dataset. That is, we need not only the subsample, but also its complement. For k-fold sampling, we need to split the dataset into k unique, non-overlapping subsamples, which isn't possible with {{sampleByKeyExact}} in its current state.

I have a pretty coarse prototype which essentially uses the [efficient, parallel sampling routine|http://jmlr.org/proceedings/papers/v28/meng13a.html] to find the exact k thresholds needed to split the dataset into k subsamples. I had to modify the sampling function in {{org.apache.spark.util.random.StratifiedSamplingUtils}} to compare the random keys to a range (e.g. lb < x <= ub), rather than simply comparing to one number (x < threshold), which only allows for a bisection of the data. Once you know the exact k-1 thresholds that provide even splits for each stratum, and you have a sampling function that can compare the random key to a range, you have what you need for stratified k-fold and train/test split.

Is there a way to implement this without touching the {{org.apache.spark.util.random}} package that I'm missing?
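The range-comparison idea in the comment above can be illustrated on plain Scala collections. This is a sketch under stated assumptions and not the prototype itself: `StratifiedKFoldSketch` and `assignFolds` are hypothetical names, the thresholds are found by sorting each stratum's random keys locally (the comment refers to a parallel routine for this step), and the sketch uses the half-open range lb <= x < ub.

```scala
import scala.util.Random

object StratifiedKFoldSketch {
  // Returns, for each input label (in order), the fold index in [0, k) of that row.
  def assignFolds[K](labels: Seq[K], k: Int, seed: Long): Seq[Int] = {
    require(k >= 2, "need at least two folds")
    val rng = new Random(seed)
    // Attach one uniform random key in [0, 1) to every row.
    val keyed: Seq[(K, Double)] = labels.map(label => (label, rng.nextDouble()))

    // Per stratum, the k-1 thresholds are order statistics of its random keys,
    // chosen so that consecutive ranges hold (almost) equal numbers of rows.
    val thresholds: Map[K, Vector[Double]] =
      keyed.groupBy(_._1).map { case (label, rows) =>
        val sortedKeys = rows.map(_._2).toArray
        java.util.Arrays.sort(sortedKeys)
        val cuts = (1 until k).map(i => sortedKeys(i * sortedKeys.length / k))
        label -> cuts.toVector
      }

    // A row falls in fold f when threshold(f) <= x < threshold(f + 1); counting the
    // thresholds that are <= x gives f directly. Ties between random keys are
    // assumed not to occur.
    keyed.map { case (label, x) => thresholds(label).count(_ <= x) }
  }
}
```

A train/test split is the special case with a single threshold per stratum. With k folds, the complement of fold f is simply every row whose fold index differs from f, so one pass yields all k train/validation pairs.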
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644691#comment-14644691 ]

Xiangrui Meng commented on SPARK-8971:
--------------------------------------

Assigned. Note that this should call `sampleByKeyExact` to maintain the ratio.
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644608#comment-14644608 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------

I'd like to work on this JIRA if it's still unassigned.
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644670#comment-14644670 ]

Feynman Liang commented on SPARK-8971:
--------------------------------------

I don't think anyone's working on it. [~mengxr] can assign it to you.