[ 
https://issues.apache.org/jira/browse/SPARK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-3163:
--------------------------------
    Labels: bulk-closed  (was: )

> Separate continuous and categorical features in DecisionTree
> ------------------------------------------------------------
>
>                 Key: SPARK-3163
>                 URL: https://issues.apache.org/jira/browse/SPARK-3163
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>              Labels: bulk-closed
>
> Improvement: code clarity, memory usage
> Currently, during DecisionTree training, some internal data structures have 
> overloaded meanings and unused values.  These data structures are shared for 
> all types of features, but they are used differently for different types of 
> features.
> Data structures: Split, Bins, aggregates
> Feature types: continuous, ordered categorical, and unordered categorical
> This causes a couple of issues:
> (1) Overloading the meaning of these data (for different types of features) 
> makes the code difficult to understand.
> (2) This leads to extra storage (e.g., unused lowSplit for some categorical 
> features), and extra computation (e.g., 
> findAggForUnorderedFeatureClassification simply reshapes data).
> Proposed fix: Use different storage formats to save space and separate out 
> these semantically different types.
> A related issue which could be fixed simultaneously is that multiple copies 
> of splits (about 3) are kept.
> Currently: Each split is stored both on its own and inside bins.  I.e., there 
> are separate splits and bins arrays, but bins also store copies of splits. 
> (Total: 3 copies of each split.)
> Possible fix: Keep separate arrays of splits, bins.  Do not store splits in 
> bins.  There is a simple correspondence, so it would be easy to match splits 
> to bins.
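
The proposed fix above can be illustrated with a minimal sketch. This is not Spark's implementation: the class names (ContinuousSplits, UnorderedCategoricalSplit), their fields, and the convention that a value equal to a threshold falls in the lower bin are all assumptions made for illustration. It shows one storage format per feature type, and bins that are derived from split indices instead of holding copies of splits.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ContinuousSplits:
    """Hypothetical storage for one continuous feature: sorted thresholds only."""
    thresholds: Tuple[float, ...]

    @property
    def num_bins(self) -> int:
        # n thresholds partition the real line into n + 1 bins.
        return len(self.thresholds) + 1

    def bin_index(self, value: float) -> int:
        # Assumed convention: bin i holds thresholds[i-1] < value <= thresholds[i].
        return bisect_left(self.thresholds, value)

    def bin_bounds(self, i: int) -> Tuple[float, float]:
        # Bounds are looked up from the single thresholds array on demand,
        # so a bin never needs to store its own copies of the splits.
        low = self.thresholds[i - 1] if i > 0 else float("-inf")
        high = self.thresholds[i] if i < len(self.thresholds) else float("inf")
        return low, high


@dataclass(frozen=True)
class UnorderedCategoricalSplit:
    """Hypothetical storage for an unordered categorical split: a category subset.

    There is no threshold and no low/high split, so nothing is left unused.
    """
    left_categories: FrozenSet[int]

    def goes_left(self, category: int) -> bool:
        return category in self.left_categories


if __name__ == "__main__":
    splits = ContinuousSplits(thresholds=(1.0, 2.5, 4.0))
    print(splits.num_bins, splits.bin_index(3.0), splits.bin_bounds(2))
    cat_split = UnorderedCategoricalSplit(left_categories=frozenset({0, 3}))
    print(cat_split.goes_left(3), cat_split.goes_left(1))
```

With this layout each threshold is stored once, and bin i is bounded by thresholds i-1 and i, which is the kind of simple correspondence the description refers to.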



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)