Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19433

@smurching I found some issues and have some thoughts on the columnar features format:

- In your doc you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature". However, the current implementation still allocates space for all features when computing:
  - See the `DTStatsAggregator` implementation: you pass `featureSubset = None`, so `DTStatsAggregator` allocates space for every feature. To match your stated design, you should pass `featureSubset = Some(Array(currentFeatureIndex))`.
- The current implementation still uses binnedFeatures. You said it will be improved in the future to sort feature values for continuous features (for more precise tree training). If you want to consider every possible threshold, you need to hold rawFeatures instead of binnedFeatures in the columnar feature array, and inside each split range offset you need to sort every continuous feature. Is this what you plan to do in the future? It will increase the amount of computation.
- With the current implementation (using binnedFeatures), there is no need to sort continuous features inside each split offset, so the `indices` array for each feature is exactly the same. To save memory, I think these `indices` should be shared; there is no need to create a separate indices array for each feature. Even if you add the continuous-feature improvement mentioned above, you can create separate `indices` arrays for **only** the continuous features; the categorical features can still share the same `indices` array.
- About the locality advantage of the columnar format, I have some doubts. In the current implementation you do not reorder the `label` and `weight` arrays, so accessing a `label` or `weight` value has to go through `indices`; when calculating `DTStat`, this breaks locality. (But I'm not sure how much impact this has on performance.)
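To illustrate the sharing idea above, here is a minimal standalone sketch (all names are illustrative, not Spark's actual internal API): categorical feature columns reference one shared `indices` array, while each continuous feature gets its own clone so it could later be sorted independently.

```scala
// Hypothetical sketch of shared vs. per-feature indices arrays.
// `binned(featureIdx)(rowIdx)` holds binned feature values; only
// continuous features receive a private, independently sortable copy.
object SharedIndicesSketch {
  final case class FeatureColumn(values: Array[Int], indices: Array[Int])

  def buildColumns(
      binned: Array[Array[Int]],
      isContinuous: Array[Boolean]): Array[FeatureColumn] = {
    val numRows = binned(0).length
    // One identity permutation shared by all categorical features.
    val shared = Array.tabulate(numRows)(identity)
    binned.indices.map { f =>
      // Continuous features: private copy; categorical: share the same array.
      val idx = if (isContinuous(f)) shared.clone() else shared
      FeatureColumn(binned(f), idx)
    }.toArray
  }
}
```

With this layout the memory cost of `indices` is one array plus one extra copy per continuous feature, instead of one copy per feature.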
- About the overhead of the columnar format: when a new split is found, we need to reorder the left sub-tree samples to the front. This reordering has to be done on each column, and the `indices` array must be updated at the same time. If we used a row format instead, e.g. `Array[(features, label, weight)]`, reordering would be much easier and would not need `indices` at all. So I am considering whether we can use a row format, and only when we need the `DTStatsAggregator` computation copy the data we need from the row format into a columnar array (we only need to copy the rows between the sub-node offsets, and only the sampled features if feature subsampling is used).
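The row-to-columnar copy described above could look roughly like this minimal sketch (again, names are illustrative assumptions, not Spark's actual code): only the rows in a node's `[from, until)` offset range and only the sampled feature subset are materialized in columnar form.

```scala
// Hypothetical sketch: keep training data in row format, and copy an
// on-demand columnar slice just before stats aggregation.
object RowToColumnarSketch {
  final case class Row(features: Array[Int], label: Double, weight: Double)

  // Copy rows [from, until) and only the features in `featureSubset`
  // into a columnar layout: result(i) is the column for featureSubset(i).
  def toColumnar(
      rows: Array[Row],
      from: Int,
      until: Int,
      featureSubset: Array[Int]): Array[Array[Int]] = {
    featureSubset.map { f =>
      Array.tabulate(until - from)(i => rows(from + i).features(f))
    }
  }
}
```

The trade-off is an extra copy per node, in exchange for cheap row-wise reordering and no `indices` bookkeeping.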