Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/17556
  
    I scanned the split criteria of sklearn and xgboost.
    
    1. sklearn
        It considers all continuous values and splits at the mean of each pair of adjacent sorted values.
    
        commit 5147fd09c6a063188efde444f47bd006fa5f95f0
        sklearn/tree/_splitter.pyx: 484:
        ```python
        current.threshold = (Xf[p - 1] + Xf[p]) / 2.0
        ```
    
    2. xgboost: 
        commit 49bdb5c97fccd81b1fdf032eab4599a065c6c4f6
    
        + If all continuous values are used as candidates, it uses the mean of adjacent values.
    
           src/tree/updater_colmaker.cc: 555:
           ```c++
           e.best.Update(loss_chg, fid, (fvalue + e.last_fvalue) * 0.5f, d_step == -1);
           ```
        + If continuous features are quantized, it uses `cut`. I'm not familiar with C++ and update_histmaker.cc is a little complicated, so I don't know exactly what `cut` is. However, I guess it is the same as Spark's current split criterion (see the sketch after the snippet below).
    
           src/tree/updater_histmaker.cc: 194:
           ```c++
           if (best->Update(static_cast<bst_float>(loss_chg), fid, hist.cut[i], false)) {
           ```
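
           For reference, here is a minimal sketch of what a quantized split could look like, assuming `cut` holds pre-computed bin boundaries from a quantile sketch; the names below are illustrative and not taken from xgboost:

           ```python
           # Sketch only: assumes `cut` means pre-computed bin boundaries.
           import numpy as np

           values = np.sort(np.random.rand(1000))
           # e.g. 4 equal-frequency bins -> 3 interior cut points
           cuts = np.quantile(values, [0.25, 0.5, 0.75])

           # every candidate threshold is simply one of the cut points
           for threshold in cuts:
               n_left = int(np.sum(values <= threshold))
               n_right = values.size - n_left
               # ... evaluate the loss change of this partition and keep the best ...
               print(threshold, n_left, n_right)
           ```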
    
    Anyway, in my opinion a weighted mean is more reasonable than the plain mean or the cut value. The PR is a trivial enhancement to the tree module, and it isn't worth spending much time on, since the conclusion is obvious.
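
    Here is a minimal sketch of the comparison (hypothetical helper names, not the PR's actual code), reading "weighted mean" as weighting each of the two adjacent values by its instance count:

    ```python
    # Sketch only: midpoint vs. count-weighted mean of two adjacent distinct values.

    def midpoint_threshold(left_value, right_value):
        # sklearn / xgboost exact-split style: plain average of the two values
        return (left_value + right_value) / 2.0

    def weighted_mean_threshold(left_value, left_count, right_value, right_count):
        # weight each value by the number of instances carrying it, so the
        # threshold is pulled toward the denser side
        total = left_count + right_count
        return (left_value * left_count + right_value * right_count) / total

    # value 1.0 occurs 99 times, value 3.0 occurs once
    print(midpoint_threshold(1.0, 3.0))              # 2.0
    print(weighted_mean_threshold(1.0, 99, 3.0, 1))  # 1.02
    ```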
    
    However, we would be more confident if more feedback from experts were collected.
            

