Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    I made a rough pass. I have only a few issues for now; I haven't gone into 
the code details yet:
    
    - The `colStoreInit` currently ignores the `subsampleWeights`; it should be 
used, shouldn't it? From your doc, at the higher level, local training will be 
used to train sub-trees as parts of the global distributed training, so 
`subsampleWeights` is important information. But here we train only a single 
tree, so `subsampleWeights` contains just one element; do we still need the 
`BaggedPoint` structure?
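To illustrate the point, here is a minimal Python sketch (not the actual Scala code; the class below only mirrors the idea of Spark's `BaggedPoint`, and all names are illustrative) of why a per-tree weight array degenerates to a single weight in local training:

```python
# Illustrative sketch only: a BaggedPoint-like wrapper carries one
# subsample weight per tree being trained in parallel.
class BaggedPoint:
    def __init__(self, datum, subsample_weights):
        self.datum = datum
        # One weight per tree; for local single-tree training this
        # array always has exactly one element.
        self.subsample_weights = subsample_weights

points = [BaggedPoint((1.0, 2.0), [0.5]),
          BaggedPoint((3.0, 4.0), [2.0])]

# Split finding should use the weighted count, not the raw count,
# which is why ignoring subsampleWeights would change the result.
weighted_count = sum(p.subsample_weights[0] for p in points)
```

With a one-element weight array, a plain `(datum, weight)` pair would carry the same information with less indirection, which is the crux of the question above.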
    
    - The training logic for regression and for classification will be the 
same, I think; only the impurity differs, and that does not affect the code 
logic.
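A small Python sketch of that observation (illustrative only, not Spark's impurity API): if the impurity measure is abstracted behind one interface, the split-gain computation is identical for classification (Gini) and regression (variance):

```python
from abc import ABC, abstractmethod
from collections import Counter

class Impurity(ABC):
    @abstractmethod
    def calculate(self, labels):
        ...

class Gini(Impurity):
    # Classification impurity: 1 - sum of squared class frequencies.
    def calculate(self, labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

class Variance(Impurity):
    # Regression impurity: mean squared deviation from the label mean.
    def calculate(self, labels):
        n = len(labels)
        mean = sum(labels) / n
        return sum((y - mean) ** 2 for y in labels) / n

def split_gain(impurity, left, right):
    # The shared training logic: parent impurity minus the weighted
    # child impurities. Nothing here depends on the task type.
    n = len(left) + len(right)
    parent = impurity.calculate(left + right)
    return (parent
            - (len(left) / n) * impurity.calculate(left)
            - (len(right) / n) * impurity.calculate(right))
```

The same `split_gain` works with either impurity, which is why the regression and classification code paths can share one implementation.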
    
    - The key idea is to use a columnar storage format for the features. Is 
the purpose to reduce memory cost and improve cache locality when finding the 
best splits? I see the code does some reordering of feature values and uses 
index arrays, but I haven't gone into the details yet. It's a complex part and 
I need more time to review it.
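For reference, a hedged Python sketch of the column-store idea as I understand it (all function names are hypothetical, not the PR's actual code): each feature is transposed into a contiguous array of `(value, row_index)` pairs, sorted once up front, so best-split search is a single cache-friendly linear pass per feature using running sums:

```python
def to_columns(rows):
    """Transpose row-major points into per-feature columns of
    (value, original_row_index) pairs, each sorted by value once."""
    n_features = len(rows[0])
    return [sorted((row[f], i) for i, row in enumerate(rows))
            for f in range(n_features)]

def best_split(column, labels):
    """One linear pass over a sorted column, finding the threshold that
    minimizes total squared error of a regression split via prefix sums."""
    n = len(column)
    total = sum(labels)
    total_sq = sum(y * y for y in labels)
    left_sum = left_sq = 0.0
    best = (float("inf"), None)  # (sum of squared errors, threshold)
    for k in range(n - 1):
        value, row = column[k]
        y = labels[row]          # index array maps back to the label
        left_sum += y
        left_sq += y * y
        if value == column[k + 1][0]:
            continue             # no valid threshold between equal values
        left_sse = left_sq - left_sum ** 2 / (k + 1)
        right_sse = (total_sq - left_sq) - (total - left_sum) ** 2 / (n - k - 1)
        sse = left_sse + right_sse
        if sse < best[0]:
            best = (sse, (value + column[k + 1][0]) / 2)
    return best

rows = [[4.0], [1.0], [3.0], [2.0]]
labels = [1.0, 0.0, 1.0, 0.0]    # indexed by original row
column = to_columns(rows)[0]
sse, threshold = best_split(column, labels)
```

The one-time sort plus index arrays is what makes the per-node scan sequential in memory, which would explain the reordering operations in the code.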
    
    - Maybe we can support multithreading in local training. What do you 
think?
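As a sketch of what I have in mind (Python for illustration only; the names and partitioning scheme are assumptions, not a proposal for the actual Scala implementation): candidate splits for different features are independent, so they can be evaluated on a thread pool and the global best picked afterwards:

```python
from concurrent.futures import ThreadPoolExecutor

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_gain_for_feature(args):
    """Best variance reduction over all thresholds of one feature.
    Independent per feature, so it parallelizes trivially."""
    rows, labels, f = args
    parent = variance(labels)
    n = len(labels)
    best = 0.0
    for t in sorted({r[f] for r in rows})[:-1]:
        left = [labels[i] for i, r in enumerate(rows) if r[f] <= t]
        right = [labels[i] for i, r in enumerate(rows) if r[f] > t]
        gain = (parent
                - len(left) / n * variance(left)
                - len(right) / n * variance(right))
        best = max(best, gain)
    return f, best

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
labels = [0.0, 0.0, 1.0, 1.0]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(best_gain_for_feature,
                            [(rows, labels, f) for f in range(2)]))
best_feature = max(results, key=lambda fg: fg[1])[0]
```

Per-feature parallelism is just one possible partitioning; parallelizing over tree nodes at the same depth would be another option.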
    


