Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19904

Thanks for looking into this @WeichenXu123. This does change the behavior in a couple of ways, though. As @sethah said, the unpersist of the training data is no longer async. It also changes the order in which `fit` and `evaluate` are called, so the training data is not unpersisted until all but the last model have also been evaluated. Before, all `modelFutures` would be executed before the `metricFutures`, so the training data could be unpersisted as soon as possible. I believe this is also how it worked before parallelism was added.

I did some local testing where I put `modelFutures` in an inner function so that they are out of scope before `awaitResult` is called, and also mapped the `Future.sequence` similar to https://github.com/apache/spark/pull/19904#discussion_r156751569. This seemed to be enough to allow the models to be GC'd. I think this approach would be a little better.
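For illustration, here is a rough sketch of the shape of that approach. It is not the actual `CrossValidator` code: `Model`, `fit`, `evaluate`, `startFutures`, and the `trainingUnpersisted` flag are all hypothetical stand-ins, and plain `Await.result` is used in place of `awaitResult`. Whether the models really become collectable depends on what else references them, so treat this as a sketch of the scoping only.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object UnpersistSketch {
  // Hypothetical stand-ins for the real estimator, model, and evaluator.
  final case class Model(param: Double)
  def fit(param: Double): Model = Model(param)
  def evaluate(model: Model): Double = model.param * 2.0

  def run(params: Seq[Double]): (Seq[Double], Boolean) = {
    // Stands in for the cached training dataset being unpersisted.
    val trainingUnpersisted = new AtomicBoolean(false)

    // Inner function: modelFutures is local to it, so the outer frame holds
    // no reference to the model futures while it blocks on the metrics.
    def startFutures(): (Seq[Future[Double]], Future[Unit]) = {
      // Start every fit first, so training data is needed for as short a time as possible.
      val modelFutures = params.map(p => Future(fit(p)))

      // Map the sequenced future to Unit: the callback ignores the models and
      // only releases the training data once all fits have completed.
      val unpersistFuture = Future.sequence(modelFutures).map { _ =>
        trainingUnpersisted.set(true)
      }

      // Each metric future depends only on its own model.
      val metricFutures = modelFutures.map(_.map(evaluate))
      (metricFutures, unpersistFuture)
    }

    val (metricFutures, unpersistFuture) = startFutures()
    val metrics = metricFutures.map(f => Await.result(f, Duration.Inf))
    // Awaited here only so the sketch returns a deterministic flag.
    Await.result(unpersistFuture, Duration.Inf)
    (metrics, trainingUnpersisted.get())
  }
}
```

The point of the inner function is that only the metric futures and the `Future[Unit]` escape it, so nothing in the waiting frame holds the sequence of fitted models.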