[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley resolved SPARK-7132.
--------------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21129
[https://github.com/apache/spark/pull/21129]

> Add fit with validation set to spark.ml GBT
> -------------------------------------------
>
>                 Key: SPARK-7132
>                 URL: https://issues.apache.org/jira/browse/SPARK-7132
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Weichen Xu
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which
> takes a validation set. We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle
> a validation set (since Transformers and Estimators only take 1 input
> DataFrame). The current plan is to include an extra column in the input
> DataFrame which indicates whether the row is for training, validation, etc.
>
> Goals
>   A [P0] Support efficient validation during training
>   B [P1] Support early stopping based on validation metrics
>   C [P0] Ensure validation data are preprocessed identically to training data
>   D [P1] Support complex Pipelines with multiple models using validation data
>
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the
> row is for training or validation. Add a Param “validationFlagCol” used to
> specify the extra column name.
>   A, B, C are easy.
>   D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same
> column.
> Complication: It would be ideal if we could prevent different estimators from
> using different validation sets. (Joseph: There is not an obvious way IMO.
> Maybe we can address this later by, e.g., having Pipelines take a
> validationFlagCol Param and pass that to the sub-models in the Pipeline.
> Let’s not worry about this for now.)
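
For illustration only, the indicator-column proposal quoted above can be sketched in plain Python. This is not the Spark API: the function names (split_by_flag, fit_stump, fit_with_validation), the column names "x" and "label", the toy stump learner, and the tol and step values are all assumptions made for this example. It shows a single input dataset carrying a boolean flag column, with flagged rows held out of training and used only to decide when to stop boosting (goals A and B).

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def split_by_flag(rows: List[Row], flag_col: str) -> Tuple[List[Row], List[Row]]:
    """Split one input dataset into (train, validation) using a boolean column."""
    train = [r for r in rows if not r[flag_col]]
    valid = [r for r in rows if r[flag_col]]
    return train, valid


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression stump minimizing squared error on residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]


def predict(stumps: List[Stump], step: float, x: float) -> float:
    """Sum the shrunken contributions of all stumps for one feature value."""
    return sum(step * (lv if x <= t else rv) for t, lv, rv in stumps)


def mse(stumps: List[Stump], step: float, rows: List[Row]) -> float:
    return sum((r["label"] - predict(stumps, step, r["x"])) ** 2 for r in rows) / len(rows)


def fit_with_validation(rows: List[Row], flag_col: str, max_iter: int = 50,
                        step: float = 0.5, tol: float = 1e-3):
    """Boost stumps on training rows; stop early when validation MSE stalls."""
    train, valid = split_by_flag(rows, flag_col)
    xs = [r["x"] for r in train]
    stumps: List[Stump] = []
    best_err = mse(stumps, step, valid)
    for _ in range(max_iter):
        residuals = [r["label"] - predict(stumps, step, r["x"]) for r in train]
        candidate = stumps + [fit_stump(xs, residuals)]
        err = mse(candidate, step, valid)
        if best_err - err < tol:  # validation gain too small: stop early
            break
        stumps, best_err = candidate, err
    return stumps, best_err


# Toy data: a step function; every fourth row is flagged as validation.
rows = [{"x": float(i), "label": 1.0 if i >= 10 else 0.0, "validationFlag": i % 4 == 0}
        for i in range(20)]
stumps, valid_err = fit_with_validation(rows, "validationFlag")
print(len(stumps), valid_err)
```

Because the flag travels with each row, any preprocessing applied to the dataset before fitting is applied identically to training and validation rows, which is the property goal C asks for.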