Hyukjin Kwon updated SPARK-3163:
--------------------------------
Labels: bulk-closed (was: )
> Separate continuous and categorical features in DecisionTree
> ------------------------------------------------------------
>
> Key: SPARK-3163
> URL: https://issues.apache.org/jira/browse/SPARK-3163
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Minor
> Labels: bulk-closed
>
> Improvement: code clarity, memory usage
> Currently, during DecisionTree training, some internal data structures have
> overloaded meanings and unused values. These data structures are shared for
> all types of features, but they are used differently for different types of
> features.
> Data structures: Split, Bins, aggregates
> Feature types: continuous, ordered categorical, and unordered categorical
> This causes a couple of issues:
> (1) Overloading the meaning of these data (for different types of features)
> makes the code difficult to understand.
> (2) This leads to extra storage (e.g., an unused lowSplit for some
> categorical features) and extra computation (e.g.,
> findAggForUnorderedFeatureClassification simply reshapes data).
> Proposed fix: Use different storage formats to save space and separate out
> these semantically different types.
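
To make the proposal concrete, here is a minimal sketch of what "different storage formats" per feature type could look like. It is written in Python for brevity and is not the MLlib code. The class names, field names, and the "value <= threshold goes left" convention are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class ContinuousSplit:
    """Hypothetical split on a continuous feature: stores only a threshold."""
    feature: int
    threshold: float

    def goes_left(self, value: float) -> bool:
        # Assumed convention: values at or below the threshold go left.
        return value <= self.threshold


@dataclass(frozen=True)
class CategoricalSplit:
    """Hypothetical split on a categorical feature: stores only a category set."""
    feature: int
    left_categories: FrozenSet[int]

    def goes_left(self, value: float) -> bool:
        # Category values are assumed to be encoded as integer-valued floats.
        return int(value) in self.left_categories


# A split is one of the two shapes; neither carries fields it does not use.
Split = Union[ContinuousSplit, CategoricalSplit]
```

With this layout a continuous split has no category field and a categorical split has no threshold, so no field has an overloaded meaning.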
> A related issue which could be fixed simultaneously is that multiple copies
> of splits (about 3) are kept.
> Currently: Splits are stored both on their own and inside bins. I.e., there
> are separate splits and bins arrays, but bins also store copies of splits.
> (Total: 3 copies of each split.)
> Possible fix: Keep separate arrays of splits, bins. Do not store splits in
> bins. There is a simple correspondence, so it would be easy to match splits
> to bins.
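
The "simple correspondence" for a continuous feature can be sketched as follows. This is an illustration under assumptions, not the MLlib implementation: it assumes a sorted list of k thresholds defines k + 1 bins, that bin i covers the half-open interval (thresholds[i-1], thresholds[i]], and that the outer bins extend to infinity. The function names are hypothetical.

```python
import math
from bisect import bisect_left
from typing import List, Tuple


def bin_index(thresholds: List[float], value: float) -> int:
    """Return the bin holding `value`, given sorted split thresholds.

    bisect_left finds the first threshold >= value, which is the upper
    bound of the bin under the assumed (low, high] convention.
    """
    return bisect_left(thresholds, value)


def bin_bounds(thresholds: List[float], i: int) -> Tuple[float, float]:
    """Recover the (low, high] bounds of bin i from the thresholds alone."""
    low = thresholds[i - 1] if i > 0 else -math.inf
    high = thresholds[i] if i < len(thresholds) else math.inf
    return (low, high)
```

Under these assumptions the bounds of any bin are derived from the splits array by index, so a bin does not need to hold its own copies of the neighbouring splits.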