[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19904
  
LGTM
Sorry for the delay & thanks for the PR!
Merging with master



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #4024 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #4024 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4024/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85183/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #85183 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85183/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #85183 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85183/testReport)**
 for PR 19904 at commit 
[`cad2104`](https://github.com/apache/spark/commit/cad210439b7a0bc3eb870f1d68fd96fbd0763aa8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85086/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #85086 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)**
 for PR 19904 at commit 
[`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #85086 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85086/testReport)**
 for PR 19904 at commit 
[`ccd2689`](https://github.com/apache/spark/commit/ccd26895e749cce1acbe65a5507cba6b0e6c42d3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19904
  
Strong +1 for unpersisting the data at the end.  In the long-term, I don't 
think we'll even cache the training and validation datasets.  Our caching of 
the training & validation datasets is a temporary hack to get around the issue 
that we don't have a DataFrame k-fold splitting method.  Our current workaround 
is to go down to RDDs: DataFrame -> RDD -> k-fold split -> DataFrame, and as I 
recall, we cache to lower the SerDe costs in these conversions.  Once we have a 
k-fold split method for DataFrames, we can just cache the original (full) 
dataset and not the k splits.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19904
  
I discussed with @MrBago offline, I make a summary for what I thought now:
I give 3 approaches which we can compare, after discussion I realized none 
of them is ideal, we have to make a trade-off.

## Approach 1
The approach proposed by @MrBago at 
https://github.com/apache/spark/pull/19904#discussion_r156751569
This approach resolve the model objects referenced issue, allow the model 
objects to be GCed in time. **BUT, in some cases, it still do not resolve the 
O(N) model memory occupation issue**. Let me use an extreme case to describe it:
suppose we set `parallelism = 1`, and there're 100 paramMaps. So we have 
100 fitting & evaluation tasks. In this approach, because of `parallelism = 1`, 
the code have to wait 100 fitting tasks complete, **(at this time the memory 
occupation by models already reach 100 * sizeof(model) )** and then it will 
unpersist training dataset and then do 100 evaluation tasks.

## Approach 2
This approach is my current PR code.
This approach can make sure at any case, the peak memory occupation by 
models to be `O(numParallelism * sizeof(model))`, but, it exists an issue that, 
in some extreme case, the "unpersist training dataset" will be delayed until 
most of the evaluation tasks complete. Suppose the case 
 `parallelism = 1`, and there're 100 fitting & evaluation tasks, each 
fitting task have to be executed one by one, so only after the first 
99 fitting tasks and the 100th fitting task complete, the "unpersist 
training dataset" will be triggered.

## Approach 3
After I compared approach 1 and approach 2, I realized that, in the case 
which parallelism is low but there're many fitting & evaluation tasks, we 
cannot achieve both of the following two goals:
- Make the peak memory occupation by models to be O(parallelism * 
sizeof(model))
- unpersist training dataset before most of the evaluation tasks started.

So I vote for a simpler approach, move the unpersist training dataset to 
the end (Does this really matters ?)
Because the goal 1 is more important, we must make sure the peak memory 
occupation by models to be O(parallelism * sizeof(model)), otherwise it will 
bring high risk of OOM.
Like following code:
```
  val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, 
paramIndex) =>
Future[Double] {
  val model = est.fit(trainingDataset, 
paramMap).asInstanceOf[Model[_]]
  //...other minor codes
  val metric = eval.evaluate(model.transform(validationDataset, 
paramMap))
  logDebug(s"Got metric metricformodeltrainedwithparamMap.")
  metric
} (executionContext)
  }
  val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, 
Duration.Inf))
  trainingDataset.unpersist() // <--- unpersist at the end
  validationDataset.unpersist()
```

Gentle ping @jkbradley @MrBago @sethah @BryanCutler @holdenk 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19904
  
@MrBago 
Your code https://github.com/apache/spark/pull/19904#discussion_r156751569 
also works fine, I think. Although it is more complicated.

@BryanCutler 
>the unpersist of training data is not async anymore, but this also changes 
the order in which fit and evaluate are called so that training data is not 
unpersisted until all but the last models are also evaluated. Before, all 
modelFutures would be executed first before metricFutures and so training data 
could be unpersisted as soon as possible.

- My current PR code also unpersist the trainingdataset once all fitting 
finished (before evaluation). and, the calling `df.unpersist()` won't block 
(see df.unpersist(block = false) param). So move it into training thread won't 
cause some issue I think.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-13 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/19904
  
@BryanCutler Thanks for the response, I think I understand the situation a 
little better here.

I think there is a fundamental tradeoff we cannot avoid. Specifically there 
is a tradeoff between distributed memory usage and driver memory usage. A model 
cannot be garbage collected until its evaluation future is done running and the 
training data cannot be unpersisted until all of the training tasks are 
finished.

Let's say we have 3 models to train and evaluate with `parallelism=1`, and 
let's call the training & evaluation tasks T1-3 and E1-3. If we schedule the 
tasks T1, T2, T3, unpersist, E1, E2, E3 then we can use less distributed memory 
but we cannot avoid holding all 3 models in memory. If we schedule the tasks 
T1, E1, T2, E2, T3, unpersist, E3 then we must use more distributed memory but 
we only ever need to hold 1 model in memory.

I'm not sure how scala futures work, when T1 is done I don't know whether 
T2 or E1 get priority in the executor pool. Can we guarantee that the jvm will 
not schedule T1, T2, T3, E1, E2, E3, unpersist?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-13 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19904
  
Thanks for looking into this @WeichenXu123, this does change the behavior 
in a couple ways though.  Like @sethah said, the unpersist of training data is 
not async anymore, but this also changes the order in which `fit` and 
`evaluate` are called so that training data is not unpersisted until all but 
the last models are also evaluated.  Before, all `modelFutures` would be 
executed first before `metricFutures` and so training data could be unpersisted 
as soon as possible.  I believe this is how it worked before adding the 
parallelism too.

I did some local testing where I put `modelFutures` in an inner function so 
that they are out of scope before `awaitResult` is called, and also mapped the 
`Future.sequence` similar to 
https://github.com/apache/spark/pull/19904#discussion_r156751569, and this 
seemed to be enough to allow the models to be GC'd.  I think this approach 
would be a little better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19904
  
@sethah To verify the memory issue, you can add one line test code against 
current master at here:

```
  val modelFutures = ...
  // Unpersist training data only when all models have trained
  Future.sequence[Model[_], Iterable](modelFutures)(implicitly, 
executionContext)
.onComplete { _ => trainingDataset.unpersist() } (executionContext)
  // Evaluate models in a Future that will calulate a metric and allow 
model to be cleaned up
  val foldMetricFutures = 
  // Wait for metrics to be calculated before unpersisting validation 
dataset
  val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, 
Duration.Inf))
  validationDataset.unpersist()

  //add test code here, fetch all models
  val models = modelFutures.map(_.value.get.get)

  foldMetrics
```
The test code I add here is **val models = 
modelFutures.map(_.value.get.get)** So it can prove that these models are still 
in memory, we can get them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/19904
  
Can you share your test/results with us?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84639/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)**
 for PR 19904 at commit 
[`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)**
 for PR 19904 at commit 
[`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84523/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84523 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84523/testReport)**
 for PR 19904 at commit 
[`8bfcc94`](https://github.com/apache/spark/commit/8bfcc948696dd88a55d808322079abae994dbaa6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84522/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19904
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...

2017-12-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19904
  
**[Test build #84522 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84522/testReport)**
 for PR 19904 at commit 
[`7725fd8`](https://github.com/apache/spark/commit/7725fd8a86dddba6c61c7d053dfa510a114bebb8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org