[ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323311#comment-14323311
 ] 

Joseph K. Bradley commented on SPARK-5688:
------------------------------------------

This actually does not happen.  You're describing unordered categorical 
features.  Categorical features are only treated as unordered if the number of 
subsets we need to test is <= numSplits.  You can see this choice being made 
here: 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala#L144]

If DecisionTree cannot make the feature unordered, then it treats it as an 
ordered categorical feature, in which case we do the bounds check here: 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala#L128]
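
As a rough illustration of that decision, here is a simplified sketch. It is 
not the code in DecisionTreeMetadata.scala: the object and method names are 
hypothetical, and it assumes "the number of subsets we need to test" means the 
2^(k-1) - 1 distinct two-way partitions of k categories.

```scala
object CategoricalSplitSketch {
  sealed trait Treatment
  case object Unordered extends Treatment
  case object Ordered extends Treatment

  /** Number of distinct two-way partitions of k categories: 2^(k-1) - 1.
    * BigInt avoids overflow when the number of categories is large. */
  def numSubsetSplits(numCategories: Int): BigInt =
    if (numCategories < 2) BigInt(0) else BigInt(2).pow(numCategories - 1) - 1

  /** Assumed rule: a feature is unordered only if every subset split fits
    * within the split budget; otherwise it falls back to ordered. */
  def chooseTreatment(numCategories: Int, maxSplits: Int): Treatment =
    if (numSubsetSplits(numCategories) <= BigInt(maxSplits)) Unordered else Ordered
}
```

Under this assumption, a feature is made unordered only when every subset 
split fits, so no subset has to be sampled or left out.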

Does this make sense?  If you ran into a bug or error message, please let me 
know.  Otherwise, I'll close this JIRA.

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>            Priority: Minor
>              Labels: categorical, decisiontree
>
> The categories in each subset chosen to build a split on a categorical 
> variable are not random. The categories for the subset are chosen based on 
> the binary representation of a number from 1 to (2^(number of categories)) - 
> 2 (which excludes the empty and full subsets). In the current implementation, 
> the integers used for the subsets are 1..numSplits. They should be chosen at 
> random instead, to avoid biasing towards the categories with the lower indexes. 
> Another problem is that if numBins/2 is bigger than the number of possible 
> subsets for a given category set, numSplits is still taken to be numBins/2. It 
> should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise 
> the same subsets might be considered more than once when choosing the splits).
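
The encoding described in the report above can be sketched as follows. This 
only illustrates the description and is not MLlib code: the names are 
hypothetical, and the mapping of bit i to category i is an assumption.

```scala
object SubsetEncodingSketch {
  /** Categories selected by `code`, assuming bit i stands for category i. */
  def categoriesFor(code: Int, numCategories: Int): List[Int] =
    (0 until numCategories).filter(i => ((code >> i) & 1) == 1).toList

  /** Cap proposed in the report: min(numBins / 2, 2^k - 2). */
  def cappedNumSplits(numBins: Int, numCategories: Int): Int = {
    // Int arithmetic only; this sketch does not handle larger category counts.
    require(numCategories >= 1 && numCategories <= 30, "supports 1 to 30 categories")
    math.min(numBins / 2, (1 << numCategories) - 2)
  }
}
```

With four categories, the codes 1, 2 and 3 select only categories 0 and 1, 
which is the bias towards low indexes that the report describes.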



