[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19904 LGTM Sorry for the delay & thanks for the PR! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #4024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)** for PR 19904 at commit [`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #4024 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)** for PR 19904 at commit [`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85183/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85183/testReport)** for PR 19904 at commit [`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85183/testReport)** for PR 19904 at commit [`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85086/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85086 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)** for PR 19904 at commit [`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #85086 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)** for PR 19904 at commit [`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19904 Strong +1 for unpersisting the data at the end. In the long-term, I don't think we'll even cache the training and validation datasets. Our caching of the training & validation datasets is a temporary hack to get around the issue that we don't have a DataFrame k-fold splitting method. Our current workaround is to go down to RDDs: DataFrame -> RDD -> k-fold split -> DataFrame, and as I recall, we cache to lower the SerDe costs in these conversions. Once we have a k-fold split method for DataFrames, we can just cache the original (full) dataset and not the k splits. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19904 I discussed with @MrBago offline, I make a summary for what I thought now: I give 3 approaches which we can compare, after discussion I realized none of them is ideal, we have to make a trade-off. ## Approach 1 The approach proposed by @MrBago at https://github.com/apache/spark/pull/19904#discussion_r156751569 This approach resolve the model objects referenced issue, allow the model objects to be GCed in time. **BUT, in some cases, it still do not resolve the O(N) model memory occupation issue**. Let me use an extreme case to describe it: suppose we set `parallelism = 1`, and there're 100 paramMaps. So we have 100 fitting & evaluation tasks. In this approach, because of `parallelism = 1`, the code have to wait 100 fitting tasks complete, **(at this time the memory occupation by models already reach 100 * sizeof(model) )** and then it will unpersist training dataset and then do 100 evaluation tasks. ## Approach 2 This approach is my current PR code. This approach can make sure at any case, the peak memory occupation by models to be `O(numParallelism * sizeof(model))`, but, it exists an issue that, in some extreme case, the "unpersist training dataset" will be delayed until most of the evaluation tasks complete. Suppose the case `parallelism = 1`, and there're 100 fitting & evaluation tasks, each fitting task have to be executed one by one, so only after the first 99 fitting tasks and the 100th fitting task complete, the "unpersist training dataset" will be triggered. ## Approach 3 After I compared approach 1 and approach 2, I realized that, in the case which parallelism is low but there're many fitting & evaluation tasks, we cannot achieve both of the following two goals: - Make the peak memory occupation by models to be O(parallelism * sizeof(model)) - unpersist training dataset before most of the evaluation tasks started. So I vote for a simpler approach, move the unpersist training dataset to the end (Does this really matters ?) Because the goal 1 is more important, we must make sure the peak memory occupation by models to be O(parallelism * sizeof(model)), otherwise it will bring high risk of OOM. Like following code: ``` val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) => Future[Double] { val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]] //...other minor codes val metric = eval.evaluate(model.transform(validationDataset, paramMap)) logDebug(s"Got metric metricformodeltrainedwithparamMap.") metric } (executionContext) } val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) trainingDataset.unpersist() // <--- unpersist at the end validationDataset.unpersist() ``` Gentle ping @jkbradley @MrBago @sethah @BryanCutler @holdenk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19904 @MrBago Your code https://github.com/apache/spark/pull/19904#discussion_r156751569 also works fine, I think. Although it is more complicated. @BryanCutler >the unpersist of training data is not async anymore, but this also changes the order in which fit and evaluate are called so that training data is not unpersisted until all but the last models are also evaluated. Before, all modelFutures would be executed first before metricFutures and so training data could be unpersisted as soon as possible. - My current PR code also unpersist the trainingdataset once all fitting finished (before evaluation). and, the calling `df.unpersist()` won't block (see df.unpersist(block = false) param). So move it into training thread won't cause some issue I think. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user MrBago commented on the issue: https://github.com/apache/spark/pull/19904 @BryanCutler Thanks for the response, I think I understand the situation a little better here. I think there is a fundamental tradeoff we cannot avoid. Specifically there is a tradeoff between distributed memory usage and driver memory usage. A model cannot be garbage collected until its evaluation future is done running and the training data cannot be unpersisted until all of the training tasks are finished. Let's say we have 3 models to train and evaluate with `parallelism=1`, and let's call the training & evaluation tasks T1-3 and E1-3. If we schedule the tasks T1, T2, T3, unpersist, E1, E2, E3 then we can use less distributed memory but we cannot avoid holding all 3 models in memory. If we schedule the tasks T1, E1, T2, E2, T3, unpersist, E3 then we must use more distributed memory but we only ever need to hold 1 model in memory. I'm not sure how scala futures work, when T1 is done I don't know whether T2 or E1 get priority in the executor pool. Can we guarantee that the jvm will not schedule T1, T2, T3, E1, E2, E3, unpersist? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19904 Thanks for looking into this @WeichenXu123, this does change the behavior in a couple ways though. Like @sethah said, the unpersist of training data is not async anymore, but this also changes the order in which `fit` and `evaluate` are called so that training data is not unpersisted until all but the last models are also evaluated. Before, all `modelFutures` would be executed first before `metricFutures` and so training data could be unpersisted as soon as possible. I believe this is how it worked before adding the parallelism too. I did some local testing where I put `modelFutures` in an inner function so that they are out of scope before `awaitResult` is called, and also mapped the `Future.sequence` similar to https://github.com/apache/spark/pull/19904#discussion_r156751569, and this seemed to be enough to allow the models to be GC'd. I think this approach would be a little better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19904 @sethah To verify the memory issue, you can add one line test code against current master at here: ``` val modelFutures = ... // Unpersist training data only when all models have trained Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext) .onComplete { _ => trainingDataset.unpersist() } (executionContext) // Evaluate models in a Future that will calulate a metric and allow model to be cleaned up val foldMetricFutures = // Wait for metrics to be calculated before unpersisting validation dataset val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) validationDataset.unpersist() //add test code here, fetch all models val models = modelFutures.map(_.value.get.get) foldMetrics ``` The test code I add here is **val models = modelFutures.map(_.value.get.get)** So it can prove that these models are still in memory, we can get them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/19904 Can you share your test/results with us? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84639/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)** for PR 19904 at commit [`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)** for PR 19904 at commit [`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84523/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84523/testReport)** for PR 19904 at commit [`8bfcc94`](https://github.com/apache/spark/commit/8bfcc948696dd88a55d808322079abae994dbaa6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84522/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84522/testReport)** for PR 19904 at commit [`7725fd8`](https://github.com/apache/spark/commit/7725fd8a86dddba6c61c7d053dfa510a114bebb8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org