Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

I looked at the split criteria of sklearn and xgboost.

1. sklearn enumerates all continuous values and splits at the midpoint of adjacent values (commit 5147fd09c6a063188efde444f47bd006fa5f95f0, sklearn/tree/_splitter.pyx:484):

   ```python
   current.threshold = (Xf[p - 1] + Xf[p]) / 2.0
   ```

2. xgboost (commit 49bdb5c97fccd81b1fdf032eab4599a065c6c4f6):

   + If all continuous values are used as candidates, it also uses the midpoint (src/tree/updater_colmaker.cc:555):

     ```c++
     e.best.Update(loss_chg, fid, (fvalue + e.last_fvalue) * 0.5f, d_step == -1);
     ```

   + If continuous features are quantized, it uses `cut`. I'm not familiar with C++, and updater_histmaker.cc is a little complicated, so I don't know what `cut` really is. However, I guess it is the same as Spark's current split criterion (src/tree/updater_histmaker.cc:194):

     ```c++
     if (best->Update(static_cast<bst_float>(loss_chg), fid, hist.cut[i], false)) {
     ```

Anyway, in my opinion a weighted mean is more reasonable than either the midpoint or the cut value. The PR is a trivial enhancement to the tree module, and its conclusion is obvious enough that it isn't worth spending much time on. Still, we would be more confident if we collected more feedback from experts.
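To make the comparison concrete, here is a minimal sketch (function names are illustrative, not Spark's or sklearn's actual API) of the plain midpoint threshold used by sklearn and xgboost's exact split finder versus the count-weighted mean this PR argues for:

```python
def midpoint(left_value, right_value):
    # Threshold used by sklearn and by xgboost's exact (non-quantized) split finder:
    # the unweighted mean of the two adjacent candidate values.
    return (left_value + right_value) / 2.0

def weighted_mean(left_value, left_count, right_value, right_count):
    # Proposed alternative: weight each adjacent value by how many samples
    # carry it, pulling the threshold toward the denser value.
    total = left_count + right_count
    return (left_value * left_count + right_value * right_count) / total

# Example: value 1.0 appears 3 times, value 5.0 appears once.
print(midpoint(1.0, 5.0))             # 3.0
print(weighted_mean(1.0, 3, 5.0, 1))  # 2.0
```

With skewed counts the weighted mean lands closer to the heavily-populated value, which is the behavior the PR considers more reasonable than a plain midpoint.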