Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433

@WeichenXu123 Thanks for the comments! I'll respond inline:

> In your doc, you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature." But in the current implementation you still allocate space for all features when computing -- see the `DTStatsAggregator` implementation: you pass `featureSubset = None`, so `DTStatsAggregator` allocates space for every feature. To match your stated purpose, you should pass `featureSubset = Some(Array(currentFeatureIndex))`.

I like your proposed solution (passing `featureSubset = Some(Array(currentFeatureIndex))`); I'll go ahead and implement it.

> The current implementation still uses `binnedFeatures`. You said this will be improved in the future to sort feature values for continuous features (for more precise tree training). If you want to consider every possible threshold, you need to hold `rawFeatures` instead of `binnedFeatures` in the columnar feature array, and within each split range offset you need to sort every continuous feature. Is this what you plan to do in the future? It will increase the amount of computation.

Yep, we'll have to pass raw (`Double`) continuous features to the local tree training methods, which will require them to accept an `Array[LabeledPoint]` instead of an `Array[TreePoint]` as input and will increase memory usage (along with requiring us to store additional indices). However, we'll only have to run an `O(n log n)` sort on continuous feature values once (in the `FeatureVector` constructor): once the continuous features are sorted, we can update them during node splits the same way we do for categorical features (in `O(n)` time) and they'll remain sorted.

> For the current implementation (using `binnedFeatures`) there is no need to sort continuous features inside each split offset, so the indices for each feature are exactly the same.
> In order to save memory, I think these indices should be shared; there is no need to create a separate indices array for each feature. Even if you add the improvements for continuous features mentioned above, you can create separate indices arrays for only the continuous features; the categorical features can still share the same indices array.

Agreed, I'll make this change.

> About the locality advantage of the columnar format, I have some doubts. In the current implementation you do not reorder the label and weight arrays, so accessing label and weight values requires going through the indices; when calculating `DTStatsAggregator` stats this breaks locality. (But I'm not sure how much impact on perf this will have.)

Yeah, I'm not sure whether it'd be better to reorder the labels/weights arrays to achieve improved locality. I think we could experiment with both, but I'd prefer to save that for a follow-up PR unless you or another reviewer think it'll make a big perf difference.

> About the overhead of the columnar format: when reordering (when we get a new split, we need to move the left subtree's samples to the front), you need to reorder each column and at the same time update the indices array. If we used a row format instead, e.g. `Array[(features, label, weight)]`, reordering would be much easier and would not need indices. So I am considering whether we can use a row format, and only when we need the `DTStatsAggregator` computation copy the data we need from the row format into a columnar array (we only need to copy the rows between the sub-node offsets, and only the sampled features if feature subsampling is used).

This is an interesting idea; my main concern is that on the first iteration of local tree training we'd need to copy the entire training data matrix from row to columnar format, which negates any memory savings we get from not using indices. I'm also concerned about the overhead of repeatedly copying data from row to columnar format.
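To make the reasoning above concrete, here's a toy sketch (in Python, not Spark code; all names are illustrative, not from the PR) of two of the claims: a single index array shared by all columns can be updated with one `O(n)` stable partition per split, and a continuous feature column that was sorted once up front stays sorted within each child node after such a partition, so no re-sorting is needed:

```python
# Toy illustration (not the PR's implementation): columnar features with
# one shared row-index array. A stable partition of the index array is
# O(n), serves every column, and preserves relative order, so a column
# sorted once (as proposed for the FeatureVector constructor) remains
# sorted within each child node.

categorical = [2, 0, 1, 0, 2, 1]              # original row order
continuous = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # sorted once up front

indices = list(range(6))                       # shared by all columns

def stable_partition(indices, goes_left):
    """O(n) stable partition: left-child rows first, order preserved."""
    left = [i for i in indices if goes_left[i]]
    right = [i for i in indices if not goes_left[i]]
    return left + right, len(left)

# Per-row split decision (e.g. derived from some chosen split).
goes_left = [True, False, True, True, False, False]
indices, num_left = stable_partition(indices, goes_left)

# Any column can be read in node-major order through the shared indices;
# the continuous column is still sorted within each child:
left_cont = [continuous[i] for i in indices[:num_left]]
right_cont = [continuous[i] for i in indices[num_left:]]
assert left_cont == sorted(left_cont)
assert right_cont == sorted(right_cont)
```

A subsequence of a sorted sequence is itself sorted, which is why the stable partition never disturbs the ordering; this is the property that lets the `O(n log n)` sort happen only once.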