[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463 ]

Vincent edited comment on SPARK-17055 at 8/23/16 10:34 AM:
-----------------------------------------------------------

Sorry for the late reply. Yes, I just learned that they intend to rename it to GroupKFold, or something like that. Personally I think it's fine to keep it the way it is, though it could still be somewhat confusing when someone first uses it before understanding the idea behind it.

As for applications, take face recognition as an example. The features are, say, eyes, nose, lips, etc. The training data are obtained from a number of different people. This method can create subject-independent folds, so we can train the model with features from one group of people and use the data from the remaining people for validation. This should improve the model's ability to generalize and help avoid over-fitting. It's a useful method, available in sklearn, and caret is currently working on adding this feature.

was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way trying to add this feature.
> add labelKFold to CrossValidator
>
>                 Key: SPARK-17055
>                 URL: https://issues.apache.org/jira/browse/SPARK-17055
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent
>            Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and
> we want to avoid over-fitting, we want to hold out samples with certain labels
> from the training data and put them into the validation fold, i.e. we want to
> ensure that the same label is not in both the testing and training sets.
> Mainstream packages like sklearn already support this cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463 ]

Vincent edited comment on SPARK-17055 at 8/23/16 9:18 AM:
----------------------------------------------------------

Sorry for the late reply. Yes, I just learned that they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep it the way it is, it could still be somewhat confusing when someone first uses it before understanding the idea behind it.

As for applications, take face recognition as an example. The features are, say, eyes, nose, lips, etc. The training data are obtained from a number of different people. This method can create subject-independent folds, so we can train the model with features from one group of people and use the data from the remaining people for validation. This should improve the model's ability to generalize and help avoid over-fitting. It's a useful method, available in sklearn, and caret is currently working on adding this feature.

was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way add this feature.
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ]

Vincent edited comment on SPARK-17055 at 8/22/16 9:14 AM:
----------------------------------------------------------

Well, a better model will have better CV performance on validation data with unseen labels, so the final selected model will be relatively better at predicting samples with unseen categories/labels in real cases.

was (Author: vincexie):
well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case.
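
For readers following the thread, here is a minimal sketch of the label-wise (group) k-fold split described in the issue, written in plain Python rather than against Spark. The function name `group_k_fold`, the `groups` list format, and the greedy balancing rule are assumptions made for illustration; this is not the proposed CrossValidator API, nor a copy of sklearn's implementation.

```python
from collections import Counter


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides.

    `groups[i]` is the subject/label id of sample i (hypothetical input format).
    """
    counts = Counter(groups)
    if k < 2 or k > len(counts):
        raise ValueError("k must be between 2 and the number of distinct groups")

    # Greedy balancing (an assumption of this sketch): place the largest
    # remaining group into the currently smallest fold so that fold sizes
    # stay roughly even.
    fold_sizes = [0] * k
    fold_of_group = {}
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for group, size in ordered:
        target = min(range(k), key=lambda f: fold_sizes[f])
        fold_of_group[group] = target
        fold_sizes[target] += size

    # Each fold serves once as the validation set; all other folds train.
    for fold in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


# Six face images from four people; each person stays on one side of a split.
subjects = ["ann", "ann", "bob", "cat", "cat", "dan"]
for train_idx, test_idx in group_k_fold(subjects, k=2):
    print("train:", train_idx, "validation:", test_idx)
```

Because whole subjects are assigned to folds, the validation score in each round is measured only on people the model has not seen during training, which is the property the face recognition example relies on.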