[ https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-3156. ---------------------------------- Resolution: Fixed Fix Version/s: 1.2.0 > DecisionTree: Order categorical features adaptively > --------------------------------------------------- > > Key: SPARK-3156 > URL: https://issues.apache.org/jira/browse/SPARK-3156 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Joseph K. Bradley > Assignee: Joseph K. Bradley > Fix For: 1.2.0 > > > Improvement: accuracy > Currently, ordered categorical features use a fixed bin ordering chosen > before training based on a subsample of the data. (See the code using > centroids in findSplitsBins().) > Proposal: Choose the ordering adaptively for every split. This would require > a bit more computation on the master, but could improve results by splitting > more intelligently. > Required changes: The result of aggregation is used in > findAggForOrderedFeatureClassification() to compute running totals over the > pre-set ordering of categorical feature values. The stats should instead be > used to choose a new ordering of categories, before computing running totals. > Clarification: > It is actually more accurate to choose a new ordering at every node (and is > required to make this have guarantees and not be a heuristic for regression > and binary classification). It does mean a different set of splits may be > considered at each node, but that split should be tailored specifically for > that node and should give better results. > As far as computation, it does require a sort, but that should be cheap as > long as the number of categories for any feature is not too large. In my > tests, much more (10x - 100x) time is spent on the aggregation than on the > master, so it is not an issue for categorical features with a smallish number > of categories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org