Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Thanks for the comments!
    
    - Yep, feature subsampling is necessary before local tree training can be 
used within distributed training. I was thinking of adding subsampling in a 
follow-up PR. You're right that we don't need to pass an array of 
`BaggedPoints` to local tree training; we can just pass an array of 
`subsampleWeights` (the weights for the current tree) and an array of 
`TreePoints`. I'll push an update for this (see the first sketch below).
    
    - Agreed that the logic for classification will be the same but with a 
different impurity metric. I can add support for classification & associated 
tests in a follow-up PR.
    
    - IMO the primary advantage of the columnar storage format is that it'll 
eventually enable improvements to best-split calculations; specifically, for 
continuous features we could sort the unbinned feature values and consider 
every possible threshold. There are also the locality & memory advantages 
described in the design doc. In brief, `DTStatsAggregator` stores a flat array 
partitioned by (feature x bin). If we can iterate through all values for a 
single feature at once, most updates to `DTStatsAggregator` will occur within 
the same subarray (see the second sketch below).
    
    - Multithreading could be a nice way to increase parallelism since we don't 
use Spark during local tree training (see the last sketch below). I think we 
could add it in a follow-up PR.
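
    To make the first point concrete, here's a minimal sketch of passing an 
array of `subsampleWeights` plus an array of `TreePoints` (rather than 
`BaggedPoints`) into local tree training. The `TreePoint` stand-in and the 
`fitLocalTree` signature below are simplified assumptions for illustration, 
not the actual code in this PR:

```scala
// Simplified stand-in for the real (package-private) TreePoint; fields are assumptions.
case class TreePoint(label: Double, binnedFeatures: Array[Int])

object LocalTreeSketch {
  // Local training only needs the points plus the per-point weights for the
  // single tree being trained, instead of Array[BaggedPoint[TreePoint]].
  def fitLocalTree(points: Array[TreePoint], subsampleWeights: Array[Double]): Unit = {
    require(points.length == subsampleWeights.length,
      "each TreePoint needs a weight for the current tree")
    // The weights would feed the weighted impurity statistics during split search.
    val totalWeight = subsampleWeights.sum
    println(s"training on ${points.length} points with total weight $totalWeight")
  }

  def main(args: Array[String]): Unit = {
    val points = Array(
      TreePoint(label = 0.0, binnedFeatures = Array(1, 3)),
      TreePoint(label = 1.0, binnedFeatures = Array(2, 0)))
    // Weights for the current tree, e.g. Poisson counts from bagging.
    val subsampleWeights = Array(1.0, 2.0)
    fitLocalTree(points, subsampleWeights)
  }
}
```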
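
    To illustrate the locality point: assuming the flat stats array is laid 
out as one contiguous block of `numBins * statsSize` doubles per feature (an 
assumption based on the description above, not the real `DTStatsAggregator` 
API), iterating over one feature's column at a time keeps every update inside 
that feature's subarray:

```scala
object ColumnarStatsSketch {
  val numFeatures = 3
  val numBins = 4
  val statsSize = 2 // e.g. (weighted count, weighted label sum) for variance stats

  // Flat array partitioned by (feature x bin), as described above.
  val allStats = new Array[Double](numFeatures * numBins * statsSize)

  def update(feature: Int, bin: Int, label: Double, weight: Double): Unit = {
    val offset = (feature * numBins + bin) * statsSize
    allStats(offset) += weight             // weighted count
    allStats(offset + 1) += weight * label // weighted label sum
  }

  def main(args: Array[String]): Unit = {
    // Columnar storage: binnedColumns(f) holds feature f's binned value for every
    // row, so a pass over one column only touches feature f's slice of allStats.
    val labels = Array(0.5, 1.5, 2.0)
    val binnedColumns: Array[Array[Int]] =
      Array(Array(0, 1, 3), Array(2, 2, 0), Array(1, 0, 3))

    for (f <- 0 until numFeatures; row <- labels.indices) {
      update(f, binnedColumns(f)(row), labels(row), weight = 1.0)
    }
    println(allStats.mkString(", "))
  }
}
```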
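
    Finally, a rough sketch of the multithreading idea: since local training 
doesn't use Spark, the per-feature best-split search could fan out over a 
plain thread pool. `SplitCandidate` and `bestSplitForFeature` are hypothetical 
placeholders, not code from this PR:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelSplitSketch {
  // Hypothetical per-feature result: (feature index, threshold, impurity gain).
  case class SplitCandidate(feature: Int, threshold: Double, gain: Double)

  // Stand-in for the real best-split computation over one feature's column.
  def bestSplitForFeature(f: Int, column: Array[Double], labels: Array[Double]): SplitCandidate =
    SplitCandidate(f, threshold = column.max / 2, gain = f.toDouble) // dummy gain

  def main(args: Array[String]): Unit = {
    val labels = Array(0.0, 1.0, 1.0, 0.0)
    val columns = Array(
      Array(0.1, 0.4, 0.9, 0.2),
      Array(5.0, 3.0, 1.0, 2.0))

    // One task per feature on a fixed-size pool; keep the best candidate.
    val pool = Executors.newFixedThreadPool(math.min(columns.length, 4))
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val candidates = Future.sequence(
      columns.indices.map(f => Future(bestSplitForFeature(f, columns(f), labels))))
    val best = Await.result(candidates, Duration.Inf).maxBy(_.gain)
    pool.shutdown()

    println(s"best split: feature=${best.feature}, threshold=${best.threshold}")
  }
}
```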

