[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312403#comment-14312403 ]
Apache Spark commented on SPARK-5688:
-------------------------------------

User 'edenovit' has created a pull request for this issue:
https://github.com/apache/spark/pull/4475

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>              Labels: categorical, decisiontree
>             Fix For: 1.2.0
>
>
> The categories in each subset chosen to build a split on a categorical
> variable are not chosen at random. The categories for a subset are taken
> from the binary representation of a number from 1 to (2^(number of
> categories)) - 2 (which excludes the empty and the full subset). In the
> current implementation, the integers used for the subsets are 1..numSplits.
> They should be random instead, so that the splits are not biased towards
> the categories with the lower indexes.
> Another problem is that if numBins/2 is bigger than the number of possible
> subsets for a given category set, numSplits is still set to numBins/2. It
> should be the min of numBins/2 and (2^(number of categories)) - 2
> (otherwise the same subsets might be considered more than once when
> choosing the splits).
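
To illustrate the two points in the description, here is a minimal Python sketch of the proposed behaviour. It is not the MLlib code and not the contents of the pull request. The function names (max_categorical_splits, subset_from_code, choose_splits) and the use of Python's random module are assumptions made for this example only. Each integer code in 1..(2^k - 2) is decoded bit by bit into a subset of category indexes, and the codes are sampled without replacement after capping numSplits at the number of available subsets.

```python
import random
from typing import List


def max_categorical_splits(num_categories: int) -> int:
    """Count of subsets that are neither empty nor full: 2^k - 2."""
    return (1 << num_categories) - 2


def subset_from_code(code: int, num_categories: int) -> List[int]:
    """Decode an integer into the category indexes whose bits are set."""
    return [c for c in range(num_categories) if (code >> c) & 1]


def choose_splits(num_categories: int, num_bins: int,
                  rng: random.Random) -> List[List[int]]:
    """Pick distinct random subsets (hypothetical helper, not MLlib API)."""
    total = max_categorical_splits(num_categories)
    # Cap the number of splits so that no subset has to be reused.
    num_splits = min(num_bins // 2, total)
    # Sample codes without replacement instead of taking 1..num_splits.
    codes = rng.sample(range(1, total + 1), num_splits)
    return [subset_from_code(code, num_categories) for code in codes]


if __name__ == "__main__":
    # Sequential codes 1..4 with 4 categories never include category 3.
    print([subset_from_code(code, 4) for code in range(1, 5)])
    # Random codes draw from all 14 available subsets.
    print(choose_splits(4, 8, random.Random(0)))
```

With 4 categories and numSplits = 4, the sequential codes 1..4 decode to [0], [1], [0, 1] and [2], so the highest-indexed category is never used, which is the bias the report describes. With 3 categories and numBins = 32, the cap reduces numSplits from 16 to 6, so each subset is considered at most once.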