[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2017-12-04 Thread Ashish Chopra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278060#comment-16278060
 ] 

Ashish Chopra commented on SPARK-8971:
--

When can we expect this in the DataFrame API?

> Support balanced class labels when splitting train/cross validation sets
> 
>
> Key: SPARK-8971
> URL: https://issues.apache.org/jira/browse/SPARK-8971
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
> Spark classes which partition data into training and evaluation sets for 
> performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets. 
> However, when class probabilities are highly imbalanced (e.g. detection of 
> extremely low-frequency events), random sampling may result in cross 
> validation sets not representative of actual out-of-training performance 
> (e.g. no positive training examples could be included).
> Mainstream R packages like 
> [caret|http://topepo.github.io/caret/splitting.html] already support 
> splitting the data based upon the class labels.
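
To make the risk described in the issue concrete, here is a rough back-of-the-envelope sketch in plain Scala (no Spark needed). It assumes each row is assigned to a split independently at random, which is my understanding of how randomSplit-style sampling behaves; the object and method names are hypothetical. With only 5 positive rows, a given fold of a 3-fold split misses every positive roughly 13% of the time, and a validation set holding 25% of the data misses them about 24% of the time.

```scala
object EmptyClassProbability {
  // Probability that none of `numPositives` rows lands in a split that
  // receives each row independently with probability `splitFraction`.
  // Assumption: rows are assigned independently of each other.
  def probNoPositives(numPositives: Int, splitFraction: Double): Double =
    math.pow(1.0 - splitFraction, numPositives)

  def main(args: Array[String]): Unit = {
    // 5 positive rows, 3 folds: each fold receives a row with probability 1/3.
    println(probNoPositives(5, 1.0 / 3.0))
    // 5 positive rows, trainRatio = 0.75: the validation set receives a row
    // with probability 0.25.
    println(probNoPositives(5, 0.25))
  }
}
```

A stratified split removes this failure mode as long as each class has at least as many rows as there are splits.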



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2017-04-19 Thread Tiago Albineli Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976027#comment-15976027
 ] 

Tiago Albineli Motta commented on SPARK-8971:
-

Why not add a variation of TrainValidationSplit that stratifies the split of 
training and test?

We just need to extract this code in TrainValidationSplit.scala as a new method:

{code}
val Array(trainingDataset, validationDataset) =
  dataset.randomSplit(Array($(trainRatio), 1 - $(trainRatio)), $(seed))
trainingDataset.cache()
validationDataset.cache()
{code}

And then create a subclass like TrainValidatorBalancedSplit that overrides 
this method.
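
The shape of that suggestion can be sketched without Spark. This is only an illustration under stated assumptions: `SimpleSplit`, `BalancedSplit` and the in-memory `Seq[(label, payload)]` representation are hypothetical stand-ins, not the actual TrainValidationSplit code. The base class exposes the split as an overridable method, and the subclass splits each label group separately so every class keeps the train ratio.

```scala
import scala.util.Random

// Hypothetical, Spark-free stand-in for the splitting step.
// Rows are (label, payload) pairs held in memory.
class SimpleSplit(val trainRatio: Double, val seed: Long) {
  // The extracted method: shuffle everything, then cut once at trainRatio.
  def split[A](rows: Seq[(Int, A)]): (Seq[(Int, A)], Seq[(Int, A)]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt(math.round(rows.size * trainRatio).toInt)
  }
}

// Subclass that overrides only the splitting strategy.
class BalancedSplit(ratio: Double, s: Long) extends SimpleSplit(ratio, s) {
  override def split[A](rows: Seq[(Int, A)]): (Seq[(Int, A)], Seq[(Int, A)]) = {
    val rng = new Random(seed)
    // Shuffle and cut each label group on its own, in label order so the
    // result is reproducible for a given seed.
    val perLabel = rows.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, group) =>
      rng.shuffle(group).splitAt(math.round(group.size * trainRatio).toInt)
    }
    (perLabel.flatMap(_._1), perLabel.flatMap(_._2))
  }
}
```

For example, `new BalancedSplit(0.75, 42L).split(rows)` on 8 rows of label 0 and 4 rows of label 1 puts 6 and 3 of them in the training set and 2 and 1 in the validation set.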




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2016-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390242#comment-15390242
 ] 

Apache Spark commented on SPARK-8971:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/14321




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2016-04-29 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264224#comment-15264224
 ] 

Seth Hendrickson commented on SPARK-8971:
-

I meant label column. Sorry for the confusion!




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2016-04-28 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261655#comment-15261655
 ] 

Nick Pentreath commented on SPARK-8971:
---

I think it would be good to have something implemented, so if that means doing 
it with RDD initially that's fine by me.

For your questions:
1 - I'd still like to see if this same approach could be used for 
recommendation/ranking style settings, so allowing the user to specify the 
column would be good.
2 / 3 - I agree it makes most sense to respect trainRatio. The idea is to 
maintain the class distribution rather than effectively allow different 
trainRatios between strata. So I vote for exact sampling, as you suggest.
4 - For now no, but I would imagine the main use case for this is class 
labels, in which case we can use column metadata (now or in the future) to get 
the labels?

As for API design, I'm not sure what you mean by "output column" in your first 
example?

I would go for the `stratifiedCol` approach personally.




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2016-04-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260399#comment-15260399
 ] 

Seth Hendrickson commented on SPARK-8971:
-

I've got an improved version of the original PR which requires only a single 
pass through the data for computing multiple splits at once, but it utilizes 
{{PairRDDFunctions}} and is not implemented for the DataFrame API. If this is 
ok (i.e. we implement for RDDs initially), then what remains is the semantics 
of the API for {{TrainValidationSplit}} and {{CrossValidator}}. Specific 
questions I have:

* Should users be able to specify which column to stratify on or should it 
default to the output column always?
* Should the stratified splitting always be exact? If all the key weights are 
the same and we don't do exact sampling, then the problem of having some 
splits without a given class still exists. However, if we expose a way for 
users to specify different key weights per stratum, then there is added 
functionality.
* Should we expose a way for users to specify key weights per stratum? For 
example, with labels 0 and 1 should a user be able to say I want these splits: 
split0 = (0 -> 0.2, 1 -> 0.4), split1 = (0 -> 0.8, 1 -> 0.6) ? I don't think it 
makes sense, since this would override the {{trainRatio}} parameter. For this 
reason, I think we should always use exact stratified sampling. 
* Should users have a way to specify the keys in the stratified column? If not, 
we require a pass through the data to collect distinct values.

Some example API designs:

* have a {{useStratifiedSampling}} boolean parameter that calls stratified 
sampling using the output column when true.
* have a {{stratifiedCol}} string parameter that calls stratified sampling 
using the specified column when set.
* similar to above, but add a way to specify the stratified key values

I'd really appreciate any feedback about the design and if we want to continue 
this PR in the RDD API. cc [~mlnick] [~josephkb]
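
As a small worked example of why per-stratum key weights conflict with {{trainRatio}}, the sketch below takes the weights from the question above (split0 = (0 -> 0.2, 1 -> 0.4)) and applies them to an assumed dataset of 900 rows with label 0 and 100 rows with label 1. The counts, object and method names are illustrative assumptions only. The split ends up with about 22% of the rows, which matches neither weight, and the share of label 1 inside it rises from 10% to about 18%.

```scala
object PerStratumWeights {
  // Overall fraction of rows a split receives, given per-label row counts
  // and per-label sampling weights.
  def overallFraction(counts: Map[Int, Long], weights: Map[Int, Double]): Double = {
    val taken = counts.map { case (label, n) => n * weights(label) }.sum
    taken / counts.values.sum
  }

  // Share of one label inside the resulting split.
  def labelShare(counts: Map[Int, Long], weights: Map[Int, Double], label: Int): Double = {
    val takenPerLabel = counts.map { case (l, n) => l -> n * weights(l) }
    takenPerLabel(label) / takenPerLabel.values.sum
  }

  def main(args: Array[String]): Unit = {
    val counts = Map(0 -> 900L, 1 -> 100L)       // assumed class counts
    val split0 = Map(0 -> 0.2, 1 -> 0.4)         // weights from the example
    println(overallFraction(counts, split0))     // about 0.22
    println(labelShare(counts, split0, 1))       // about 0.18, up from 0.10
    // Equal weights reproduce a single train ratio and keep the class mix.
    val equal = Map(0 -> 0.75, 1 -> 0.75)
    println(overallFraction(counts, equal))      // about 0.75
    println(labelShare(counts, equal, 1))        // about 0.10
  }
}
```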




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692312#comment-14692312
 ] 

Seth Hendrickson commented on SPARK-8971:
-

I went ahead and created the PR for this issue, even though some of the design 
choices still merit discussion. This way, others can at least see the code and 
make comments. I did not mark as WIP but I can do that if needed. 




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692309#comment-14692309
 ] 

Apache Spark commented on SPARK-8971:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8112




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658376#comment-14658376
 ] 

Seth Hendrickson commented on SPARK-8971:
-

[~mengxr] You mentioned that the solution should call {{sampleByKeyExact}}, 
which is a function that takes a stratified subsample of m < N elements from a 
dataset. One problem is that when doing things like train/test split and k-fold 
creation (which are fundamentally the same as far as sampling goes), we 
actually need to take random splits of the dataset. That is, we need not only 
the subsample, but its complement. For k-fold sampling, we need to split the 
dataset into k unique, non-overlapping subsamples, which isn't possible with 
{{sampleByKeyExact}} in its current state.

I have a pretty coarse prototype which essentially uses the [efficient, 
parallel sampling routine|http://jmlr.org/proceedings/papers/v28/meng13a.html] 
to find the exact k thresholds needed to split the dataset into k subsamples. I 
had to modify the sampling function in 
{{org.apache.spark.util.random.StratifiedSamplingUtils}} to compare the random 
keys to a range (e.g. x > lb && x <= ub), rather than simply comparing to one 
number (x < threshold), which only allows for a bisection of the data. Once you 
know the exact k-1 thresholds that provide even splits for each stratum, and 
you have a sampling function that can compare the random key to a range, you 
have what you need for stratified k-fold and train/test split. Is there a 
way to implement this without touching the {{org.apache.spark.util.random}} 
package that I'm missing?
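
The range-comparison idea can be illustrated with a small in-memory sketch. It is not the prototype and not the distributed routine from the linked paper: here the thresholds are found by simply sorting each stratum's random keys locally, and `StratifiedKFoldSketch` and `assignFolds` are hypothetical names. Each row gets a uniform random key, each stratum gets its own k-1 thresholds, and a row falls into the fold whose range contains its key (lb < x <= ub).

```scala
import scala.util.Random

object StratifiedKFoldSketch {
  // Returns (fold, (label, payload)) for every input row. Within each label,
  // fold sizes differ by at most one row (barring ties between random keys).
  def assignFolds[A](rows: Seq[(Int, A)], k: Int, seed: Long): Seq[(Int, (Int, A))] = {
    val rng = new Random(seed)
    // Tag each row with a uniform random key in [0, 1).
    val keyed = rows.map(row => (rng.nextDouble(), row))
    keyed.groupBy(_._2._1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val sortedKeys = group.map(_._1).sorted
      val n = sortedKeys.size
      // Upper bound of each fold's key range. The first k-1 entries are the
      // per-stratum thresholds; the last fold is bounded by 1.0.
      val upper = (1 to k).map { i =>
        if (i == k) 1.0
        else {
          val cut = i * n / k
          // An empty fold gets a bound that no key can satisfy.
          if (cut == 0) Double.NegativeInfinity else sortedKeys(cut - 1)
        }
      }
      group.map { case (x, row) =>
        // Range comparison: the first fold whose upper bound is >= x, which
        // means the previous fold's upper bound acts as the lower bound.
        (upper.indexWhere(ub => x <= ub), row)
      }
    }
  }
}
```

Because every row lands in exactly one fold, fold i can serve as the validation set while the remaining folds form its complement, which is the property a plain subsample does not provide.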




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-07-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644691#comment-14644691
 ] 

Xiangrui Meng commented on SPARK-8971:
--

Assigned. Note that this should call `sampleByKeyExact` to maintain the ratio.




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-07-28 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644608#comment-14644608
 ] 

Seth Hendrickson commented on SPARK-8971:
-

I'd like to work on this JIRA if it's still unassigned.




[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-07-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644670#comment-14644670
 ] 

Feynman Liang commented on SPARK-8971:
--

I don't think anyone's working on it. [~mengxr] can assign it to you.
