[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13894 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r71392299 --- Diff: python/pyspark/ml/tuning.py --- @@ -264,10 +264,10 @@ def copy(self, extra=None): class CrossValidatorModel(Model, ValidatorParams): """ -.. note:: Experimental - -Model from k-fold cross validation. - +CrossValidatorModel contains the model with the highest average cross-validation +metric across folds and uses this model to transform input data. CrossValidatorModel +also tracks the metrics for each param map evaluated. + --- End diff -- You've got some extra whitespace here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r70367155 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly --- End diff -- I like the improved description, but can we please keep the phrase "k-fold cross validation?" It's a very common phrase and will be useful for people using keyword search. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69892610 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly + * partitioned folds as separate training and test datasets e.g., with k=3 folds, --- End diff -- I think we can bring back the "folds, which are used as ..." part --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69809083 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] { /** * :: Experimental :: - * Model from k-fold cross validation. + * CrossValidatorModel contains the model that achieved the highest average cross-validation --- End diff -- I think we could maybe make this a bit clearer. Maybe something like: "CrossValidatorModel contains the model with the highest average cross validation metric across folds and uses this model to transform input data. CrossValidatorModel also tracks the metrics for each param map evaluated." This way its clear that it contains both the best model and the metrics, as well as uses the best model if asked to predict/transform any data. What do you think? (Of course if we change it here we would want to have the corresponding change occur in Python). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69808182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly --- End diff -- +1 - is good to mention why its splitting the dataset/what its for. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user krishnakalyan3 commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69567260 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -57,7 +57,7 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly - * partitioned folds which are used as separate training and test datasets e.g., with k=3 folds, + * partitioned folds as separate training and test datasets e.g., with k=3 folds, --- End diff -- @MLnick is the current description okay?. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69524371 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly --- End diff -- sorry about this, but looking at it again I think we can clean up the description a bit more: `CrossValidator performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, ...` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69281183 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly + * partitioned folds which are used as separate training and test datasets e.g., with k=3 folds, + * CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of + * the data for training and 1/3 for testing. Each fold is used in the testing set exactly once. --- End diff -- "used in the testing set" -> "used as the test set" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69133294 --- Diff: python/pyspark/ml/tuning.py --- @@ -266,7 +269,7 @@ class CrossValidatorModel(Model, ValidatorParams): """ .. note:: Experimental -Model from k-fold cross validation. +CrossValidatorModel model returns the best set of estimators for your model. --- End diff -- Java docs will be generated from the scala doc, so no need for anything special here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69133212 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of folds which are + * used as separate training and test datasets e.g., with k=3 folds, CrossValidator + * will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data + * for training and 1/3 for testing. --- End diff -- Also, we can mention that the final cross-validation metric is the average of the test metric achieved for each (training, test) pair. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69132912 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -191,7 +194,7 @@ object CrossValidator extends MLReadable[CrossValidator] { /** * :: Experimental :: - * Model from k-fold cross validation. + * CrossValidatorModel model returns the best set of estimators for your model. --- End diff -- Perhaps something like `CrossValidatorModel contains the model that achieved the highest cross-validation metric ( averaged across the number of folds).`, or something like that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69132570 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of folds which are + * used as separate training and test datasets e.g., with k=3 folds, CrossValidator + * will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data + * for training and 1/3 for testing. --- End diff -- It may be worth mentioning that each fold is non-overlapping, and that each fold will be used in test exactly once. See [wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user krishnakalyan3 commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69103201 --- Diff: python/pyspark/ml/tuning.py --- @@ -266,7 +269,7 @@ class CrossValidatorModel(Model, ValidatorParams): """ .. note:: Experimental -Model from k-fold cross validation. +CrossValidatorModel model returns the best set of estimators for your model. --- End diff -- @holdenk thanks for the review, ran lint-python. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69089084 --- Diff: python/pyspark/ml/tuning.py --- @@ -266,8 +269,8 @@ class CrossValidatorModel(Model, ValidatorParams): """ .. note:: Experimental -Model from k-fold cross validation. - +CrossValidatorModel model returns the best set of estimators for your model. + --- End diff -- Seems like a possible formatting issue here - have you run ./dev/lint-python? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user krishnakalyan3 commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69086192 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -194,8 +194,7 @@ object CrossValidator extends MLReadable[CrossValidator] { /** * :: Experimental :: - * Pipelines facilitate model selection by making it easy to tune an entire - * Pipeline at once, rather than tuning each element in the Pipeline separately. + * CrossValidatorModel model returns the best set of estimators for your model. --- End diff -- @holdenk I have updated doc. Could you please review it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r68857452 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] { /** * :: Experimental :: - * Model from k-fold cross validation. + * Pipelines facilitate model selection by making it easy to tune an entire --- End diff -- I'm not sure if this is the best place to mention that - it seems like it doesn't belong in the CrossValidatorModel scaladoc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
GitHub user krishnakalyan3 opened a pull request: https://github.com/apache/spark/pull/13894 [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc ## What changes were proposed in this pull request? Updated ML pipeline Cross Validation Scaladoc & PyDoc. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/krishnakalyan3/spark kfold-cv Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13894.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13894 commit d7ffe7a442a6e29b5f3afdf32894d108eb04bb35 Author: krishnakalyan3Date: 2016-06-24T14:11:37Z Improve ML pipeline Cross Validation Scaladoc & PyDoc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org