Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21129#discussion_r187112582

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
    @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
      *
      * Note: Marked as private and DeveloperApi since this may be made public in the future.
      */
    -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
    +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
    +  with HasValidationIndicatorCol {

    -  /* TODO: Add this doc when we add this param. SPARK-7132
    -   * Threshold for stopping early when runWithValidation is used.
    +  /**
    +   * Threshold for stopping early when fit with validation is used.
        * If the error rate on the validation input changes by less than the validationTol,
    -   * then learning will stop early (before [[numIterations]]).
    -   * This parameter is ignored when run is used.
    +   * then learning will stop early (before [[maxIter]]).
    +   * This parameter is ignored when fit without validation is used.
        * (default = 1e-5)
    --- End diff --

    I forget why we chose 1e-5 (which is different from spark.mllib). What do you think about using 0.01 to match the sklearn docs here? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html

    (I also checked xgboost, but they use a different approach based on x number of steps without improvement. We may want to add that at some point since it sounds more robust.)
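    To make the difference between the two stopping rules concrete, here is a minimal Scala sketch. It is only an illustration under stated assumptions: `EarlyStoppingSketch`, `stopByTolerance`, and `stopByPatience` are hypothetical names, the tolerance check uses a plain absolute difference, and neither function is the actual Spark or xgboost implementation. Each takes the validation error recorded after every boosting iteration and returns how many iterations would have been run.

```scala
object EarlyStoppingSketch {

  /** Tolerance rule (hypothetical helper): stop at the first iteration whose
    * improvement over the best validation error seen so far is below `tol`.
    * Assumption: a plain absolute difference is used for the comparison.
    */
  def stopByTolerance(errors: Seq[Double], tol: Double): Int = {
    require(errors.nonEmpty, "need at least one validation error")
    var best = errors.head
    var i = 1
    while (i < errors.length) {
      if (best - errors(i) < tol) return i + 1 // improvement too small: stop here
      best = math.min(best, errors(i))
      i += 1
    }
    errors.length // never triggered: all iterations were run
  }

  /** Patience rule (hypothetical helper): stop once `patience` consecutive
    * iterations have failed to beat the best validation error seen so far.
    */
  def stopByPatience(errors: Seq[Double], patience: Int): Int = {
    require(errors.nonEmpty, "need at least one validation error")
    var best = errors.head
    var sinceBest = 0
    var i = 1
    while (i < errors.length) {
      if (errors(i) < best) {
        best = errors(i)
        sinceBest = 0
      } else {
        sinceBest += 1
      }
      if (sinceBest >= patience) return i + 1 // no improvement for `patience` steps
      i += 1
    }
    errors.length
  }
}
```

    On a made-up error curve such as `Seq(0.50, 0.40, 0.395, 0.30, 0.31, 0.32, 0.33)`, the tolerance rule with `tol = 0.01` stops at the small step from 0.40 to 0.395 and never reaches the later drop to 0.30, while the patience rule with `patience = 2` continues through that plateau and stops only after two iterations without improvement. That sensitivity to a single flat step is one plausible reading of why the patience approach "sounds more robust".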