Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/14547
  
    So, taking a look at the current patch, the API for this "loss-based" 
impurity feels clunky and a bit confusing. To enumerate, we have the following 
scenarios:
    
    **1. Completely decoupled loss and impurity** - Spark 2.0 and before
    **2. Terminal node updates** - trees trained with _some_ impurity, but 
predictions are optimal (this patch), aka "TreeBoost"
    **3. Train trees with loss directly**-  this incorporates both terminal 
node refinements and minimizing the loss directly during tree training. This 
could be done in the future and is how XGBoost works.
    
    I don't exactly see why we need to maintain 1 at all. Terminal node 
refinements should be a strict improvement over 1. By attempting to maintain 
both options, we expose an "impurity" parameter for GBTs, which IMO does not 
need to be public. To me, it is confusing also. e.g.
    
    ````scala
    scala> gbtc.getImpurity
    res1: String = loss-based
    ````
    What exactly does that mean to the user? I think that, ideally in the 
future, we would support 2 and 3 and users could select between the two. I can 
envision there being trade-offs between 2 and 3, but in theory 2 should be a 
strict superset of 1. 
    
    AFAIK, scikit does not expose the option to perform terminal-node updates 
to the user and I am not even sure there is a paper documenting gradient tree 
boosting without explicitly performing the terminal node updates. For instance, 
we have previously referred to this paper 
https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (please correct me if I'm 
wrong here) in the code as justification for describing the current approach as 
"Stochastic Gradient Boosting." But in the paper both sections 1 and 2 use 
terminal node refinements.
    
    Silently changing the algorithm could be tricky, but if we can be certain 
that this change is a strict improvement then I'm not sure it's a problem. I'm 
curious to hear others' thoughts on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to