[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13894


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r71392299
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -264,10 +264,10 @@ def copy(self, extra=None):
 
 class CrossValidatorModel(Model, ValidatorParams):
 """
-.. note:: Experimental
-
-Model from k-fold cross validation.
-
+CrossValidatorModel contains the model with the highest average 
cross-validation
+metric across folds and uses this model to transform input data. 
CrossValidatorModel
+also tracks the metrics for each param map evaluated.
+
--- End diff --

You've got some extra whitespace here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r70367155
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
--- End diff --

I like the improved description, but can we please keep the phrase "k-fold 
cross validation?"  It's a very common phrase and will be useful for people 
using keyword search.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-07 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69892610
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
+ * partitioned folds as separate training and test datasets e.g., with k=3 
folds,
--- End diff --

I think we can bring back the "folds, which are used as ..." part


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69809083
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -191,7 +194,8 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 
 /**
  * :: Experimental ::
- * Model from k-fold cross validation.
+ * CrossValidatorModel contains the model that achieved the highest 
average cross-validation
--- End diff --

I think we could maybe make this a bit clearer. Maybe something like:

"CrossValidatorModel contains the model with the highest average cross 
validation metric across folds and uses this model to transform input data. 
CrossValidatorModel also tracks the metrics for each param map evaluated."

This way its clear that it contains both the best model and the metrics, as 
well as uses the best model if asked to predict/transform any data. What do you 
think?
(Of course if we change it here we would want to have the corresponding 
change occur in Python).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69808182
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
--- End diff --

+1 - is good to mention why its splitting the dataset/what its for.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-05 Thread krishnakalyan3
Github user krishnakalyan3 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69567260
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -57,7 +57,7 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 /**
  * :: Experimental ::
  * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
- * partitioned folds which are used as separate training and test datasets 
e.g., with k=3 folds,
+ * partitioned folds as separate training and test datasets e.g., with k=3 
folds,
--- End diff --

@MLnick is the current description okay?.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-05 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69524371
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
--- End diff --

sorry about this, but looking at it again I think we can clean up the 
description a bit more:

`CrossValidator performs model selection by splitting the dataset into a 
set of non-overlapping randomly partitioned folds, which are used as separate 
training and test datasets. For example, with k=3 folds, ...`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-07-01 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69281183
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of 
non-overlapping randomly
+ * partitioned folds which are used as separate training and test datasets 
e.g., with k=3 folds,
+ * CrossValidator will generate 3 (training, test) dataset pairs, each of 
which uses 2/3 of
+ * the data for training and 1/3 for testing. Each fold is used in the 
testing set exactly once.
--- End diff --

"used in the testing set" -> "used as the test set"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69133294
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -266,7 +269,7 @@ class CrossValidatorModel(Model, ValidatorParams):
 """
 .. note:: Experimental
 
-Model from k-fold cross validation.
+CrossValidatorModel model returns the best set of estimators for your 
model.
--- End diff --

Java docs will be generated from the scala doc, so no need for anything 
special here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69133212
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of folds 
which are 
+ * used as separate training and test datasets e.g., with k=3 folds, 
CrossValidator 
+ * will generate 3 (training, test) dataset pairs, each of which uses 2/3 
of the data 
+ * for training and 1/3 for testing.
--- End diff --

Also, we can mention that the final cross-validation metric is the average 
of the test metric achieved for each (training, test) pair.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69132912
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -191,7 +194,7 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 
 /**
  * :: Experimental ::
- * Model from k-fold cross validation.
+ * CrossValidatorModel model returns the best set of estimators for your 
model.
--- End diff --

Perhaps something like `CrossValidatorModel contains the model that 
achieved the highest cross-validation metric ( averaged across the number of 
folds).`, or something like that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69132570
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends 
ValidatorParams {
 
 /**
  * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of folds 
which are 
+ * used as separate training and test datasets e.g., with k=3 folds, 
CrossValidator 
+ * will generate 3 (training, test) dataset pairs, each of which uses 2/3 
of the data 
+ * for training and 1/3 for testing.
--- End diff --

It may be worth mentioning that each fold is non-overlapping, and that each 
fold will be used in test exactly once. See 
[wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread krishnakalyan3
Github user krishnakalyan3 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69103201
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -266,7 +269,7 @@ class CrossValidatorModel(Model, ValidatorParams):
 """
 .. note:: Experimental
 
-Model from k-fold cross validation.
+CrossValidatorModel model returns the best set of estimators for your 
model.
--- End diff --

@holdenk thanks for the review, ran lint-python.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69089084
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -266,8 +269,8 @@ class CrossValidatorModel(Model, ValidatorParams):
 """
 .. note:: Experimental
 
-Model from k-fold cross validation.
-
+CrossValidatorModel model returns the best set of estimators for your 
model.
+   
--- End diff --

Seems like a possible formatting issue here - have you run 
./dev/lint-python?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-30 Thread krishnakalyan3
Github user krishnakalyan3 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r69086192
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -194,8 +194,7 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 
 /**
  * :: Experimental ::
- * Pipelines facilitate model selection by making it easy to tune an 
entire 
- * Pipeline at once, rather than tuning each element in the Pipeline 
separately.
+ * CrossValidatorModel model returns the best set of estimators for your 
model.
--- End diff --

@holdenk I have updated doc. Could you please review it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-28 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13894#discussion_r68857452
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -191,7 +194,8 @@ object CrossValidator extends 
MLReadable[CrossValidator] {
 
 /**
  * :: Experimental ::
- * Model from k-fold cross validation.
+ * Pipelines facilitate model selection by making it easy to tune an 
entire 
--- End diff --

I'm not sure if this is the best place to mention that - it seems like it 
doesn't belong in the CrossValidatorModel scaladoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...

2016-06-24 Thread krishnakalyan3
GitHub user krishnakalyan3 opened a pull request:

https://github.com/apache/spark/pull/13894

[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc

## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.


## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/krishnakalyan3/spark kfold-cv

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13894


commit d7ffe7a442a6e29b5f3afdf32894d108eb04bb35
Author: krishnakalyan3 
Date:   2016-06-24T14:11:37Z

Improve ML pipeline Cross Validation Scaladoc & PyDoc




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org