[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuhao yang updated SPARK-21535:
-------------------------------
    Description: 
CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm){code} to fit the models, where epm is an Array[ParamMap].

Even though the training process is sequential, the current implementation consumes extra driver memory to hold all of the trained models at once, which is unnecessary and often leads to out-of-memory exceptions for both CrossValidator and TrainValidationSplit. My proposal is to optimize the training implementation so that each used model can be collected by GC, avoiding these unnecessary OOM exceptions.

E.g., when the grid search space has 12 parameter combinations, the old implementation needs to hold all 12 trained models in driver memory at the same time, while the new implementation only needs to hold one trained model at a time; each previous model can then be cleared by GC.
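
A minimal sketch of the proposed one-model-at-a-time loop (for illustration only: the helper name evaluateParamMaps is hypothetical, and est, eval, epm and the dataset names stand for the members CrossValidator/TrainValidationSplit already hold):

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Sketch only: fit and evaluate one ParamMap at a time so that at most one
// trained model is referenced on the driver; earlier models become
// unreachable and can be reclaimed by GC.
def evaluateParamMaps(
    est: Estimator[_],
    eval: Evaluator,
    epm: Array[ParamMap],
    trainingDataset: Dataset[_],
    validationDataset: Dataset[_]): Array[Double] = {
  epm.map { params =>
    val model = est.fit(trainingDataset, params).asInstanceOf[Model[_]]
    eval.evaluate(model.transform(validationDataset, params))
  }
}
{code}

Compared with {code}est.fit(trainingDataset, epm){code}, which materializes all epm.length models before any evaluation happens, a loop like this keeps at most one model alive on the driver at any time.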

  was:
CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm){code} to fit the models, where epm is an Array[ParamMap].

Even though the training process is sequential, the current implementation consumes extra driver memory to hold all of the trained models at once, which is unnecessary and often leads to out-of-memory exceptions for both CrossValidator and TrainValidationSplit. My proposal is to change the training implementation to train one model at a time, so that each used local model can be collected by GC, avoiding these unnecessary OOM exceptions.


> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> ----------------------------------------------------------------------
>
>                 Key: SPARK-21535
>                 URL: https://issues.apache.org/jira/browse/SPARK-21535
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm){code} to fit the models, where epm is an Array[ParamMap].
> Even though the training process is sequential, the current implementation consumes extra driver memory to hold all of the trained models at once, which is unnecessary and often leads to out-of-memory exceptions for both CrossValidator and TrainValidationSplit. My proposal is to optimize the training implementation so that each used model can be collected by GC, avoiding these unnecessary OOM exceptions.
> E.g., when the grid search space has 12 parameter combinations, the old implementation needs to hold all 12 trained models in driver memory at the same time, while the new implementation only needs to hold one trained model at a time; each previous model can then be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
