Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433

Thanks for the comments!

- Yep, feature subsampling is necessary for using local tree training in distributed training; I was thinking of adding subsampling in a follow-up PR. You're right that we don't need to pass an array of `BaggedPoint`s to local tree training; we can just pass an array of `TreePoint`s along with an array of `subsampleWeights` (the weights for the current tree). I'll push an update for this (see the first sketch below).
- Agreed that the logic for classification will be the same but with a different impurity metric. I can add support for classification & associated tests in a follow-up PR.
- IMO the primary advantage of the columnar storage format is that it'll eventually enable improvements to best-split calculations; specifically, for continuous features we could sort the unbinned feature values and consider every possible threshold (see the second sketch below). There are also the locality & memory advantages described in the design doc. In brief, `DTStatsAggregator` stores a flat array partitioned by (feature x bin). If we can iterate through all values for a single feature at once, most updates to `DTStatsAggregator` will occur within the same subarray (see the third sketch below).
- Multithreading could be a nice way to increase parallelism, since we don't use Spark during local tree training. I think we could add it in a follow-up PR.
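
Here's a rough, self-contained sketch of the shape I have in mind for the interface change (the type and method names below are illustrative stand-ins, not the actual PR code):

```scala
// Stand-in for Spark's internal TreePoint: binned feature values plus label.
case class TreePoint(binnedFeatures: Array[Int], label: Double)

object LocalTreeTraining {
  // Hypothetical signature: rather than Array[BaggedPoint[TreePoint]]
  // (which carries subsample weights for *all* trees), pass the rows once
  // plus the weights for just the one tree being trained locally.
  def trainLocalTree(
      points: Array[TreePoint],
      subsampleWeights: Array[Double]): Unit = {
    require(points.length == subsampleWeights.length,
      "expected one weight per training point")
    // split search etc. omitted from this sketch
  }
}
```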
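
To illustrate the exact-split idea for continuous features, a minimal sketch (illustrative only, not PR code): sort the raw values, then take every midpoint between distinct consecutive values as a candidate threshold:

```scala
object ExactThresholds {
  // All candidate split thresholds for one continuous feature: the
  // midpoints between consecutive distinct sorted values.
  def candidateThresholds(values: Array[Double]): Array[Double] = {
    val sorted = values.distinct.sorted
    sorted.sliding(2).collect { case Array(lo, hi) => (lo + hi) / 2.0 }.toArray
  }

  def main(args: Array[String]): Unit = {
    // prints: 1.25, 2.55, 3.9
    println(candidateThresholds(Array(3.1, 0.5, 2.0, 2.0, 4.7)).mkString(", "))
  }
}
```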
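
And a tiny illustration of the locality point (simplified to one Double of stats per (feature, bin); the numbers are made up): with the flat feature-major layout, iterating over all rows' values for one feature keeps every update inside a single contiguous subarray:

```scala
object FlatStatsLayout {
  val numFeatures = 3
  val numBins = 4

  // Flat stats array laid out feature-major, like DTStatsAggregator's
  // flat array (simplified: one Double of stats per (feature, bin)).
  val allStats = new Array[Double](numFeatures * numBins)

  def offset(feature: Int, bin: Int): Int = feature * numBins + bin

  def main(args: Array[String]): Unit = {
    // Columnar iteration: all binned values of feature 1, one per row.
    val feature = 1
    val binnedValues = Array(0, 2, 2, 3, 1)
    // Every write lands in indices [4, 8), the contiguous subarray for
    // feature 1, rather than striding across the whole stats array.
    binnedValues.foreach(bin => allStats(offset(feature, bin)) += 1.0)
    println(allStats.mkString(", "))
  }
}
```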