GitHub user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @WeichenXu123 Thanks for the comments! I'll respond inline:
    
    > In your doc, you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature". But in the current implementation you still allocate space for all features when computing stats -- see the `DTStatsAggregator` implementation: you pass `featureSubset = None`, so `DTStatsAggregator` allocates space for every feature. To match your stated goal, you should pass `featureSubset = Some(Array(currentFeatureIndex))`.
    
    I like your proposed solution (pass `featureSubset = 
Some(Array(currentFeatureIndex))`). I'll go ahead & implement it.
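    
    To make the memory difference concrete, here's a rough back-of-the-envelope sketch (the bin counts and per-bin stats size below are made up for illustration, not taken from the actual tree metadata):
    
```scala
// Illustrative only: compare allocating sufficient stats for every feature
// (featureSubset = None) vs. for a single feature (featureSubset = Some(Array(f))).
val numBins = Array(32, 32, 8, 4)  // hypothetical bin counts for 4 features
val statsSize = 3                  // hypothetical per-bin stats size (e.g. count, sum, sum of squares)

// featureSubset = None: the stats array covers every bin of every feature.
val allFeaturesAllocSize = numBins.map(_ * statsSize).sum              // 228 doubles

// featureSubset = Some(Array(currentFeatureIndex)): stats only for one feature's bins.
val currentFeatureIndex = 0
val singleFeatureAllocSize = numBins(currentFeatureIndex) * statsSize  // 96 doubles
```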
    
    > The current implementation still uses binnedFeatures. You said it will be improved in the future to sort feature values for continuous features (for more precise tree training). If you want to consider every possible threshold, you need to hold rawFeatures instead of binnedFeatures in the columnar feature array, and within each split range offset you need to sort every continuous feature. Is this what you want to do in the future? This will increase the amount of computation.
    
    Yep, we'll have to pass the raw (`Double`) continuous features to the local tree training methods, which will require them to accept an `Array[LabeledPoint]` instead of an `Array[TreePoint]` as input and will increase memory usage (along with requiring us to store additional indices).
    
    We'll actually only have to run an `O(n log n)` sort on the continuous feature values once (i.e. in the `FeatureVector` constructor): once the continuous features are sorted, we can update them during node splits the same way we do for categorical features (in `O(n)` time), and they'll remain sorted.
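    
    To make the "sort once, stays sorted" claim concrete, here's a small standalone sketch (the `values`/`indices` names are illustrative, not the actual `FeatureVector` fields):
    
```scala
// Sketch: a continuous feature column is sorted once up front; when a node is split,
// a stable partition by left/right keeps each side's values in sorted order,
// so no re-sort is needed at the child nodes.
val rawValues = Array(5.0, 1.0, 4.0, 2.0, 3.0)           // raw continuous feature values
val sortedWithIdx = rawValues.zipWithIndex.sortBy(_._1)  // O(n log n), done once
val values = sortedWithIdx.map(_._1)                     // Array(1.0, 2.0, 3.0, 4.0, 5.0)
val indices = sortedWithIdx.map(_._2)                    // original row indices

// Suppose the chosen split sends rows {1, 3} (by original index) to the left child.
val goesLeft: Int => Boolean = Set(1, 3)

// Stable partition in O(n): left-child rows come first, then right-child rows.
// Because the partition is stable, both halves remain sorted.
val (left, right) = (values zip indices).partition { case (_, idx) => goesLeft(idx) }
// left:  (1.0, 1), (2.0, 3)            -- still in sorted order
// right: (3.0, 4), (4.0, 2), (5.0, 0)  -- still in sorted order
```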
    
    > For the current implementation (using binnedFeatures), there is no need to sort continuous features inside each split offset, so the indices for every feature are exactly the same. To save memory, I think these indices should be shared; there is no need to create a separate indices array for each feature. Even if you add the improvements for continuous features mentioned above, you can create separate indices arrays for only the continuous features, and the categorical features can still share the same indices array.
    
    Agreed, I'll make this change.
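    
    Concretely, I'm picturing a layout roughly like this (hypothetical names, not the actual `FeatureVector` fields):
    
```scala
// Sketch: one shared row-index array for all categorical features, plus a
// per-feature index array only for continuous features (whose sorted order differs).
case class LocalColumnStore(
    categoricalCols: Array[Array[Int]],    // binned values, one column per categorical feature
    sharedIndices: Array[Int],             // single row-index array shared by all categorical columns
    continuousCols: Array[Array[Double]],  // raw values, one column per continuous feature
    continuousIndices: Array[Array[Int]])  // one row-index array per continuous feature
```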
    
    > About the locality advantage of the columnar format, I have some doubts. In the current implementation you do not reorder the label and weight arrays, so accessing label and weight values requires going through the indices, which breaks locality when calculating the DTStats. (But I'm not sure how much impact this will have on perf.)
    
    Yeah, I'm not sure whether it'd be better to reorder the labels/weights arrays to get improved locality. I think we could experiment with both, but I'd prefer to save that for a follow-up PR unless you or another reviewer thinks it'll make a big perf difference.
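    
    For reference, these are the two access patterns we'd be comparing (illustrative sketch, not the actual aggregation code):
    
```scala
// Current approach: labels stay in their original row order, so stats aggregation
// reads them indirectly through the per-node indices (poor locality).
def aggregateIndirect(labels: Array[Double], indices: Array[Int], from: Int, to: Int): Double = {
  var sum = 0.0
  var i = from
  while (i < to) { sum += labels(indices(i)); i += 1 }
  sum
}

// Alternative: reorder labels alongside the feature columns at each split so the
// aggregation scans them sequentially (better locality, extra copying per split).
def aggregateDirect(labels: Array[Double], from: Int, to: Int): Double = {
  var sum = 0.0
  var i = from
  while (i < to) { sum += labels(i); i += 1 }
  sum
}
```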
    
    > About the overhead of the columnar format: when reordering (when we get a new split, we need to reorder the left sub-tree's samples to the front), you need to reorder each column and at the same time update the indices array. But if we use a row format, like `Array[(features, label, weight)]`, reordering is much easier and does not need indices.
    > So I am considering whether we can use a row format, and then, when we need the DTStatsAggregator computation, copy the data we need from the row format into a columnar-format array (we only need to copy the rows between the sub-node offsets, and only the sampled features if using feature subsampling).
    
    This is an interesting idea; my main concern is that on the first iteration of local tree training we'd need to copy the entire training data matrix from row -> columnar format, which negates any memory savings we get from not using indices. I'm also concerned about the overhead of repeatedly copying data from row -> columnar format.
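    
    Just to spell out the copy step you're describing, a rough sketch with made-up types (not the actual `TreePoint` / `DTStatsAggregator` plumbing); on the first iteration `from = 0` and `to = numRows`, which is the full-dataset copy I'm worried about:
    
```scala
// Sketch: rows stored as (features, label, weight); before computing stats for a node,
// copy just that node's row range (and only the sampled features) into columnar arrays.
case class Row(features: Array[Double], label: Double, weight: Double)

def toColumnar(
    rows: Array[Row],
    from: Int,
    to: Int,
    sampledFeatures: Array[Int]): (Array[Array[Double]], Array[Double], Array[Double]) = {
  val numRows = to - from
  // One column per sampled feature, holding only this node's rows.
  val cols = Array.fill(sampledFeatures.length)(new Array[Double](numRows))
  val labels = new Array[Double](numRows)
  val weights = new Array[Double](numRows)
  var i = 0
  while (i < numRows) {
    val row = rows(from + i)
    var j = 0
    while (j < sampledFeatures.length) {
      cols(j)(i) = row.features(sampledFeatures(j))
      j += 1
    }
    labels(i) = row.label
    weights(i) = row.weight
    i += 1
  }
  (cols, labels, weights)
}
```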


