[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18313 That's all right. Please just proceed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18313 @hhbyyh I apologize to you that your PR is valuable (in the case model list is very big). But now your PR is stale and I integrate it into my new PR #19208 Would you mind to take a look? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18313 The idea with best/all/k is to to allow use cases with fairly large models (say large enough that all 100 or 1000 or whatever param combinations is not feasible to collect to driver) to still store more than just the best model. So it's a way to satisfy both the small-to-medium use case of storing "all", the default use case of "best" and a part solution to the large-model use case using "k". So the idea with the "k" version is to not do a full "collect then top k" but instead keep a running top-k (PriorityQueue perhaps) in order to limit memory consumption. But I agree if we have both solutions (memory and file-based) then it's not necessary (though if one did want to do a top-k on the file-based scenario it would be quite clunky to do). So if that is the end goal then let's do the 2-step process suggested above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/18313 +1 for merging https://github.com/apache/spark/pull/16774 before proceeding with the other work since it will affect everything else. @MLnick I'd be Ok with adding options for best/all/k models in the future, but I'm not convinced it's needed since the topK could be computed post-hoc and since the 2 approaches for keeping all models should handle big & small use cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18313 I commented on the [JIRA](https://issues.apache.org/jira/browse/SPARK-21086?focusedCommentId=16112623&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16112623). I would like to better understand the use case more of keeping all the models (as per my JIRA comment). I suspect that #16158 may be the more useful approach in practice. But overall, if there is a use case for keeping the models, I would agree with @jkbradley's suggestion that we offer a simple "keep all sub-models as a field in the model" approach, as well as consider the large-scale case with possibly the "dump to file" option. In addition, we could have an option to keep "best", "all" or "_k_" models (user-specified as a number or %)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18313 @jkbradley I think the thing is simple. When persist model list param is `false`, just keep the code logic the same and **it won't increase the memory cost** (This is the default case) When persist model list param is `true`, in the `fit` method, we can collect all the models and pass them to the `CrossValidatorModel`/`TrainValidationSplit`. This will increase memory cost but it doesn't matter because it is not the default case. @hhbyyh Your new PR #18733 is meaningful I think, which make the memory cost to be O(1), but we need to consider the case of parallelism, in #16774 , maybe we should wait the #16774 merged, then consider how to optimize the memory usage. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18313 @jkbradley Thanks for the suggestion. After the discussion, I found that actually we can reduce the memory requirement for the tuning process. Check https://issues.apache.org/jira/browse/SPARK-21535 if you have some time. And I agree to your suggestion overall, About 2, if we want to use indices rather that the param combination, we probably need to save a summary table with the models. perhaps like https://github.com/apache/spark/pull/16158 ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/18313 Oh, you're right; I overlooked that it only holds all of the models for a single split. In that case, I agree it could be problematic to keep all in memory by default. How does this sound then: We can do 2 separate PRs, each of which adds a separate option: 1. Add an option for keeping the models in memory so that they are available as a data field in the Model object after training. This caters to smaller use cases, focusing on ease-of-use. 2. (your PR) Add an option to dump models to a directory, not keeping them in memory. This caters to big use cases, focusing on scalability. What do you think? If this sounds good, feel free to reopen your current PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18313 Current implementation of `CrossValidator` (with or without this PR) **NEVER** holds all the trained models in the driver memory at the same time. It collects models sequentially and allows GC to clear previous trained models. With all due respect, for the record, I strongly vote against keeping all the models in the CrossValidatorModel in any way, which makes it impractical for any production usage. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18313 **[Test build #78647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78647/testReport)** for PR 18313 at commit [`39c025f`](https://github.com/apache/spark/commit/39c025f53b0830f1980a5c425dbb3724d6639ed4). * This patch **fails PySpark pip packaging tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78647/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18313 **[Test build #78647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78647/testReport)** for PR 18313 at commit [`39c025f`](https://github.com/apache/spark/commit/39c025f53b0830f1980a5c425dbb3724d6639ed4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78554/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18313 **[Test build #78554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78554/testReport)** for PR 18313 at commit [`c7e0bcd`](https://github.com/apache/spark/commit/c7e0bcd1152c3245c1c190b71f11a22e61fd3ac5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18313 **[Test build #78554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78554/testReport)** for PR 18313 at commit [`c7e0bcd`](https://github.com/apache/spark/commit/c7e0bcd1152c3245c1c190b71f11a22e61fd3ac5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78085/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18313 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18313 **[Test build #78085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78085/testReport)** for PR 18313 at commit [`0fc43e1`](https://github.com/apache/spark/commit/0fc43e19c29f67c847749fb7ea0cf21ac47eb69f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org