[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323311#comment-14323311 ]

Joseph K. Bradley commented on SPARK-5688:
------------------------------------------

This actually does not happen. You're describing unordered categorical features. Categorical features are only treated as unordered if the number of subsets we need to test is <= numSplits. You can see this choice being made here:
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala#L144]

If DecisionTree cannot make the feature unordered, then it treats it as an ordered categorical feature, in which case we do the bounds check here:
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala#L128]

Does this make sense? If you ran into a bug or error message, please let me know. Otherwise, I'll close this JIRA.

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>            Priority: Minor
>              Labels: categorical, decisiontree
>
> The categories on each subset chosen to build a split on a categorical
> variable was not random. The categories for the subset are chosen based on
> the binary representation of a number from 1 to (2^(number of categories)) -
> 2 (excludes empty and full subset). On the current implementation, the
> integers used for the subsets are 1..numSplits. This should be random instead
> of biasing towards the categories with the lower indexes.
> Another problem is that if numBins/2 is bigger than the possible subsets
> given a category set, it still considered the numSplits to be numBins/2. This
> should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise
> the same subsets might be considered more than once when choosing the splits).
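
For readers following the thread, the rule described in the comment (unordered only when every subset split fits within numSplits, otherwise ordered with a bounds check) can be sketched roughly as below. This is an illustration only, not the code in DecisionTreeMetadata.scala: the function names are hypothetical, the split count treats a subset and its complement as one split (the issue text counts 2^k - 2 subsets instead), and the specific bound checked for ordered features is an assumption.

```python
# Illustrative sketch only; NOT the Spark source. All names are hypothetical.


def num_subset_splits(num_categories: int) -> int:
    """Count the distinct two-way partitions of a set of categories.

    A subset and its complement describe the same split, so each pair is
    counted once: 2^(k-1) - 1. (The issue text counts 2^k - 2 subsets.)
    """
    return (1 << (num_categories - 1)) - 1


def classify_feature(num_categories: int, num_splits: int, max_bins: int) -> str:
    """Decide how a categorical feature would be treated under the rule above."""
    # Unordered only if every possible subset split can be tested.
    if num_subset_splits(num_categories) <= num_splits:
        return "unordered"
    # Otherwise fall back to ordered. The bound checked here (one bin per
    # category must fit within max_bins) is an assumption of this sketch.
    if num_categories > max_bins:
        raise ValueError(
            f"{num_categories} categories do not fit in {max_bins} bins"
        )
    return "ordered"


if __name__ == "__main__":
    for k in (3, 10):
        print(k, classify_feature(k, num_splits=16, max_bins=32))
```

Under this rule a feature with few categories never has more candidate subsets than numSplits, so no subset would be skipped or tested twice; features with many categories never enumerate subsets at all.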