[ https://issues.apache.org/jira/browse/SPARK-32271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156417#comment-17156417 ]
Apache Spark commented on SPARK-32271:
--------------------------------------

User 'adjordan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29080

> Update CrossValidator to parallelize fit method across folds
> ------------------------------------------------------------
>
>                 Key: SPARK-32271
>                 URL: https://issues.apache.org/jira/browse/SPARK-32271
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Austin Jordan
>            Priority: Minor
>
> Currently, fitting a CrossValidator is only parallelized across models. This
> means that a CrossValidator will only fit as quickly as the slowest-to-train
> model would fit by itself.
> If a 2x2x3 parameter grid is provided for 10-fold cross validation, all 12
> models will begin training on the first fold. However, if 6 of these models
> train for 1 hour/fold and the other 6 train for 3 hours/fold (e.g. when
> tuning the number of early stopping rounds in XGBoost), the first 6 models
> will not move on to the second fold until the last 6 are finished.
> If fitting were parallelized across folds, the first 6 models would finish
> after 10 hours, freeing up cluster resources to run multiple folds for the
> last 6 models in parallel.
> Changes to be made:
> * Instead of splitting the data into multiple training and validation sets,
> split it into the folds.
> * Cache each of the folds (so each fold is cached only once, instead of
> 10 times as it is now).
> * For each fold index, form the training and validation sets by selecting
> the current fold as the validation set and unioning the rest into the
> training set.
> * Make the associated changes to calculate fold metrics, now that folds are
> being parallelized as well.
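
For illustration only, the proposed changes can be sketched in plain Python
without Spark. This is not the code from the pull request: the function names
(split_into_folds, train_validation_pairs, fit_and_score, cross_validate), the
round-robin split, the toy "model", and the thread pool are all assumptions
made for the sketch. It shows the data being split into folds once, each
training set being formed as the union of the other folds, and one independent
task being submitted per (model, fold) pair.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_folds(rows, num_folds):
    """Split rows into num_folds disjoint folds (round-robin, for illustration only)."""
    return [rows[i::num_folds] for i in range(num_folds)]


def train_validation_pairs(folds):
    """For each fold index, use that fold for validation and union the rest for training."""
    pairs = []
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((i, training, validation))
    return pairs


def fit_and_score(params, training, validation):
    """Hypothetical stand-in for fitting one model and evaluating it on one fold.

    Toy "model": predict the training mean plus params["bias"]; the metric is MSE.
    """
    prediction = sum(training) / len(training) + params["bias"]
    return sum((y - prediction) ** 2 for y in validation) / len(validation)


def cross_validate(rows, param_grid, num_folds, parallelism):
    """Return the average metric per parameter setting, one task per (model, fold)."""
    folds = split_into_folds(rows, num_folds)  # folds are built once and reused
    metrics = [[0.0] * num_folds for _ in param_grid]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {}
        for fold_idx, training, validation in train_validation_pairs(folds):
            for param_idx, params in enumerate(param_grid):
                future = pool.submit(fit_and_score, params, training, validation)
                futures[future] = (param_idx, fold_idx)
        # Collect each fold metric into its (model, fold) slot.
        for future, (param_idx, fold_idx) in futures.items():
            metrics[param_idx][fold_idx] = future.result()
    return [sum(per_fold) / num_folds for per_fold in metrics]


if __name__ == "__main__":
    rows = [float(x) for x in range(10)]
    grid = [{"bias": 0.0}, {"bias": 5.0}]
    print(cross_validate(rows, grid, num_folds=5, parallelism=4))
```

Because every (model, fold) pair is its own task, a fast model can advance to
later folds while slower models are still training on earlier ones, which is
the behavior the issue asks for.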