[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-14 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/18313
  
That's all right. Please just proceed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18313
  
@hhbyyh I apologize to you that your PR is valuable (in the case model list 
is very big).
But now your PR is stale and I integrate it into my new PR #19208 
Would you mind to take a look?
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-23 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18313
  
The idea with best/all/k is to to allow use cases with fairly large models 
(say large enough that all 100 or 1000 or whatever param combinations is not 
feasible to collect to driver) to still store more than just the best model.

So it's a way to satisfy both the small-to-medium use case of storing 
"all", the default use case of "best" and a part solution to the large-model 
use case using "k". So the idea with the "k" version is to not do a full 
"collect then top k" but instead keep a running top-k (PriorityQueue perhaps) 
in order to limit memory consumption.

But I agree if we have both solutions (memory and file-based) then it's not 
necessary (though if one did want to do a top-k on the file-based scenario it 
would be quite clunky to do). So if that is the end goal then let's do the 
2-step process suggested above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-22 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18313
  
+1 for merging https://github.com/apache/spark/pull/16774 before proceeding 
with the other work since it will affect everything else.

@MLnick I'd be Ok with adding options for best/all/k models in the future, 
but I'm not convinced it's needed since the topK could be computed post-hoc and 
since the 2 approaches for keeping all models should handle big & small use 
cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-03 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18313
  
I commented on the 
[JIRA](https://issues.apache.org/jira/browse/SPARK-21086?focusedCommentId=16112623&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16112623).
 

I would like to better understand the use case more of keeping all the 
models (as per my JIRA comment). I suspect that #16158 may be the more useful 
approach in practice.

But overall, if there is a use case for keeping the models, I would agree 
with @jkbradley's suggestion that we offer a simple "keep all sub-models as a 
field in the model" approach, as well as consider the large-scale case with 
possibly the "dump to file" option.

In addition, we could have an option to keep "best", "all" or "_k_" models 
(user-specified as a number or %)?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18313
  
@jkbradley 
I think the thing is simple.
When persist model list param is `false`, just keep the code logic the same 
and **it won't increase the memory cost** (This is the default case)
When persist model list param is `true`, in the `fit` method, we can 
collect all the models and pass them to the 
`CrossValidatorModel`/`TrainValidationSplit`. This will increase memory cost 
but it doesn't matter because it is not the default case.

@hhbyyh Your new PR #18733 is meaningful I think, which make the memory 
cost to be O(1), but we need to consider the case of parallelism, in #16774 , 
maybe we should wait the #16774 merged, then consider how to optimize the 
memory usage.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-01 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/18313
  
@jkbradley Thanks for the suggestion. After the discussion, I found that 
actually we can reduce the memory requirement for the tuning process. Check 
https://issues.apache.org/jira/browse/SPARK-21535 if you have some time. 

And I agree to your suggestion overall, 
About 2, if we want to use indices rather that the param combination, we 
probably need to save a summary table with the models. perhaps like 
https://github.com/apache/spark/pull/16158 ?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-08-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/18313
  
Oh, you're right; I overlooked that it only holds all of the models for a 
single split.  In that case, I agree it could be problematic to keep all in 
memory by default.  How does this sound then: We can do 2 separate PRs, each of 
which adds a separate option:
1. Add an option for keeping the models in memory so that they are 
available as a data field in the Model object after training.  This caters to 
smaller use cases, focusing on ease-of-use.
2. (your PR) Add an option to dump models to a directory, not keeping them 
in memory.  This caters to big use cases, focusing on scalability.

What do you think?  If this sounds good, feel free to reopen your current 
PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-07-24 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/18313
  
Current implementation of `CrossValidator` (with or without this PR) 
**NEVER** holds all the trained models in the driver memory at the same time. 
It collects models sequentially and allows GC to clear previous trained models. 
 

With all due respect, for the record, I strongly vote against keeping all 
the models in the CrossValidatorModel in any way, which makes it impractical 
for any production usage.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18313
  
**[Test build #78647 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78647/testReport)**
 for PR 18313 at commit 
[`39c025f`](https://github.com/apache/spark/commit/39c025f53b0830f1980a5c425dbb3724d6639ed4).
 * This patch **fails PySpark pip packaging tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78647/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18313
  
**[Test build #78647 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78647/testReport)**
 for PR 18313 at commit 
[`39c025f`](https://github.com/apache/spark/commit/39c025f53b0830f1980a5c425dbb3724d6639ed4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78554/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18313
  
**[Test build #78554 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78554/testReport)**
 for PR 18313 at commit 
[`c7e0bcd`](https://github.com/apache/spark/commit/c7e0bcd1152c3245c1c190b71f11a22e61fd3ac5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18313
  
**[Test build #78554 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78554/testReport)**
 for PR 18313 at commit 
[`c7e0bcd`](https://github.com/apache/spark/commit/c7e0bcd1152c3245c1c190b71f11a22e61fd3ac5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78085/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18313
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-06-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18313
  
**[Test build #78085 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78085/testReport)**
 for PR 18313 at commit 
[`0fc43e1`](https://github.com/apache/spark/commit/0fc43e19c29f67c847749fb7ea0cf21ac47eb69f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org