[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139023#comment-16139023 ]

yuhao yang commented on SPARK-21535:
------------------------------------

Thanks for the comments.

> Reduce memory requirement for CrossValidator and TrainValidationSplit
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21535
>                 URL: https://issues.apache.org/jira/browse/SPARK-21535
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use
> {code}models = est.fit(trainingDataset, epm){code}
> to fit the models, where epm is Array[ParamMap].
> Even though the training process is sequential, the current implementation
> consumes extra driver memory by holding all of the trained models, which is
> not necessary and often leads to OOM exceptions in both CrossValidator and
> TrainValidationSplit. My proposal is to optimize the training implementation
> so that a used model can be collected by GC, avoiding the unnecessary OOM
> exceptions.
> E.g. with a grid search space of 12, the old implementation has to hold all
> 12 trained models in driver memory at the same time, while the new
> implementation only needs to hold 1 trained model at a time; each previous
> model can be cleared by GC.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
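The proposal in the description can be sketched in a few lines of plain Scala. This is a toy illustration of the memory pattern only: `SimpleEstimator`, `SimpleModel`, and their methods are hypothetical stand-ins, not Spark's `Estimator`/`Model` API.

```scala
// Sketch of the proposed change: instead of materializing every trained model
// at once (models = est.fit(trainingDataset, epm)), fit and evaluate one
// ParamMap at a time so each model becomes garbage-collectible before the
// next fit starts.
object SequentialFitSketch {
  type ParamMap = Map[String, Double]

  // Stand-in for a trained model; `score` plays the role of
  // evaluator.evaluate(model.transform(validationDataset)).
  final case class SimpleModel(params: ParamMap) {
    def score: Double = params.getOrElse("regParam", 0.0)
  }

  final case class SimpleEstimator() {
    def fit(epm: ParamMap): SimpleModel = SimpleModel(epm)
  }

  /** Fit one model per ParamMap, keep only its metric, and let the model
    * go out of scope so the driver never holds more than one at a time. */
  def fitSequentially(est: SimpleEstimator, epms: Seq[ParamMap]): Array[Double] = {
    val metrics = new Array[Double](epms.length)
    var i = 0
    while (i < epms.length) {
      val model = est.fit(epms(i)) // the only live model at this point
      metrics(i) = model.score     // evaluate immediately, keep just the Double
      i += 1                       // `model` is now unreachable -> GC-eligible
    }
    metrics
  }

  def main(args: Array[String]): Unit = {
    val grid = Seq(Map("regParam" -> 0.1), Map("regParam" -> 0.01), Map("regParam" -> 1.0))
    println(fitSequentially(SimpleEstimator(), grid).mkString(","))
  }
}
```

The key point is that only the per-model metric (a `Double`) survives the loop iteration; with a grid of 12, peak driver memory holds 1 model instead of 12.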
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137697#comment-16137697 ]

Joseph K. Bradley commented on SPARK-21535:
-------------------------------------------

[~yuhaoyan] Parallel training of models can be beneficial; we've done tests showing decent speedups (2-3x). But the benefits are generally limited to small models or small data, where training a single model does not produce enough work to keep the whole cluster busy. For larger problems, parallel training does not help as much.

I agree with you that parallel training and this fix should not conflict much: the memory efficiency issue is a problem for big models, while parallel training is more useful with small models.
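The comment above suggests the two ideas compose: capping the number of in-flight fits bounds peak driver memory to at most k live models while still getting some speedup for small models. A self-contained sketch of that pattern, again with a toy stand-in for training (the `parallelism` cap and the "model" array are illustrative assumptions, not Spark's implementation):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Bounded-parallelism fitting: a fixed thread pool of size `parallelism`
// means at most `parallelism` "models" are allocated at any moment; each
// task evaluates its model and keeps only the metric, as in the sequential fix.
object BoundedParallelFit {
  def fitBounded(grid: Seq[Int], parallelism: Int): Seq[Double] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = grid.map { p =>
        Future {
          val model = Array.fill(4)(p.toDouble) // stand-in "trained model"
          model.sum                             // evaluate; model is then GC-eligible
        }
      }
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(fitBounded(Seq(1, 2, 3), parallelism = 2).mkString(","))
}
```

With `parallelism = 1` this degenerates to the sequential, minimal-memory behavior; larger values trade driver memory for throughput on small models.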
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111556#comment-16111556 ]

Hyukjin Kwon commented on SPARK-21535:
--------------------------------------

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/18733
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103547#comment-16103547 ]

yuhao yang commented on SPARK-21535:
------------------------------------

Not in my opinion. https://issues.apache.org/jira/browse/SPARK-21086 is about storing all the trained models in the TrainValidationSplitModel or CrossValidatorModel and, per the discussion there, doing so behind a control parameter that is turned off by default. Either way, changing the training process hardly has an impact on that.
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103152#comment-16103152 ]

Nick Pentreath commented on SPARK-21535:
----------------------------------------

Isn't this in direct opposition to https://issues.apache.org/jira/browse/SPARK-21086?
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101870#comment-16101870 ]

yuhao yang commented on SPARK-21535:
------------------------------------

The basic idea is that we should release the driver memory as soon as a trained model is evaluated. I don't think there's any conflict, but let me know if there is and I'll revert the jira.

I'm not a big fan of the parallel CV idea. Personally I cannot see how it improves overall performance or ease of use, but maybe I just haven't run into the right scenarios.
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101452#comment-16101452 ]

Nick Pentreath commented on SPARK-21535:
----------------------------------------

Parallel CV is in progress: https://github.com/apache/spark/pull/16774. How will this work with that?
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100860#comment-16100860 ]

yuhao yang commented on SPARK-21535:
------------------------------------

https://github.com/apache/spark/pulls