Github user sethah commented on the issue: https://github.com/apache/spark/pull/14547 So, taking a look at the current patch, the API for this "loss-based" impurity feels clunky and a bit confusing. To enumerate, we have the following scenarios: **1. Completely decoupled loss and impurity** - Spark 2.0 and before **2. Terminal node updates** - trees trained with _some_ impurity, but predictions are optimal (this patch), aka "TreeBoost" **3. Train trees with loss directly**- this incorporates both terminal node refinements and minimizing the loss directly during tree training. This could be done in the future and is how XGBoost works. I don't exactly see why we need to maintain 1 at all. Terminal node refinements should be a strict improvement over 1. By attempting to maintain both options, we expose an "impurity" parameter for GBTs, which IMO does not need to be public. To me, it is confusing also. e.g. ````scala scala> gbtc.getImpurity res1: String = loss-based ```` What exactly does that mean to the user? I think that, ideally in the future, we would support 2 and 3 and users could select between the two. I can envision there being trade-offs between 2 and 3, but in theory 2 should be a strict superset of 1. AFAIK, scikit does not expose the option to perform terminal-node updates to the user and I am not even sure there is a paper documenting gradient tree boosting without explicitly performing the terminal node updates. For instance, we have previously referred to this paper https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (please correct me if I'm wrong here) in the code as justification for describing the current approach as "Stochastic Gradient Boosting." But in the paper both sections 1 and 2 use terminal node refinements. Silently changing the algorithm could be tricky, but if we can be certain that this change is a strict improvement then I'm not sure it's a problem. I'm curious to hear others' thoughts on this.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org