Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19433

I made a rough pass. I have only a few issues for now; I haven't gone into the code details yet:

- `colStoreInit` currently ignores `subsampleWeights`; shouldn't it use them? From your doc, at the higher level local training will be used to train sub-trees as parts of the global distributed training, so `subsampleWeights` should be important information. Since here we train only a single tree, `subsampleWeights` contains just one element. Do we still need the `BaggedPoint` structure in that case?
- The training logic for regression and for classification will be the same, I think; only the impurity differs, and that does not affect the code logic.
- The key idea is to use a columnar storage format for the features. Is the purpose to reduce memory cost and improve cache locality when finding the best splits? I see the code does some reordering on feature values and uses indices, but I haven't gone into the details yet. It's a complex part and I need more time to review it.
- Maybe we can support multithreading in local training. What do you think about that?
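To check my understanding of the columnar-storage point above, here is a minimal sketch (in Python rather than the PR's Scala, and not the PR's actual code) of why a pre-sorted, per-feature column helps best-split search: one contiguous pass over the sorted values accumulates label statistics and evaluates every candidate threshold, instead of re-scanning row-oriented records. The function name and signature are hypothetical.

```python
def best_split_for_feature(sorted_vals, sorted_row_idx, labels):
    """Return (threshold, variance_reduction) for one continuous feature.

    sorted_vals: feature values sorted ascending (columnar layout)
    sorted_row_idx: original row index of each sorted value (the
        "indices" that reordering produces), used to look up labels
    labels: regression targets, indexed by original row
    """
    n = len(sorted_vals)
    total_sum = sum(labels[i] for i in sorted_row_idx)
    total_sq = sum(labels[i] ** 2 for i in sorted_row_idx)

    def variance(s, sq, cnt):
        # Variance from running sums: E[y^2] - E[y]^2.
        return sq / cnt - (s / cnt) ** 2

    parent_imp = variance(total_sum, total_sq, n)
    left_sum = left_sq = 0.0
    best = (None, 0.0)
    # Single sequential scan over the sorted column: cache-friendly,
    # and each prefix gives the left-child statistics for free.
    for k in range(n - 1):
        y = labels[sorted_row_idx[k]]
        left_sum += y
        left_sq += y * y
        # Only split between distinct feature values.
        if sorted_vals[k] == sorted_vals[k + 1]:
            continue
        cnt_l, cnt_r = k + 1, n - (k + 1)
        imp_l = variance(left_sum, left_sq, cnt_l)
        imp_r = variance(total_sum - left_sum, total_sq - left_sq, cnt_r)
        gain = parent_imp - (cnt_l * imp_l + cnt_r * imp_r) / n
        if gain > best[1]:
            best = ((sorted_vals[k] + sorted_vals[k + 1]) / 2, gain)
    return best


# Toy example: feature splits the labels perfectly between 2.0 and 3.0.
thr, gain = best_split_for_feature(
    [1.0, 2.0, 3.0, 4.0], [0, 1, 2, 3], [0.0, 0.0, 10.0, 10.0])
# -> thr == 2.5, gain == 25.0
```

The same prefix-statistics scan works for classification by swapping variance for Gini or entropy over running class counts, which is why I'd expect the two code paths to differ only in the impurity calculator.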