[jira] [Commented] (SPARK-5688) Splits for Categorical Variables in DecisionTrees

Apache Spark (JIRA) Mon, 09 Feb 2015 08:23:26 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312403#comment-14312403
 ]


Apache Spark commented on SPARK-5688:
-------------------------------------

User 'edenovit' has created a pull request for this issue:
https://github.com/apache/spark/pull/4475

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>              Labels: categorical, decisiontree
>             Fix For: 1.2.0
>
>
> The categories in each subset chosen to build a split on a categorical
> variable are not random. The categories for a subset are chosen based on
> the binary representation of a number from 1 to (2^(number of categories)) -
> 2 (which excludes the empty and full subsets). In the current implementation,
> the integers used for the subsets are 1..numSplits. These integers should be
> chosen at random instead, to avoid biasing towards the categories with the
> lower indexes.
> Another problem is that if numBins/2 is bigger than the number of possible
> subsets for a given category set, numSplits is still taken to be numBins/2.
> It should be the min of numBins/2 and (2^(number of categories)) - 2
> (otherwise the same subsets might be considered more than once when choosing
> the splits).
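
The behaviour the description asks for can be sketched roughly as follows. This is only an illustration in Python, not the MLlib code or the code in the pull request. The function names (subset_from_index, choose_categorical_splits) and the use of a seeded random.Random are assumptions made for the example. It decodes an integer into a category subset via its binary representation, caps the number of splits at min(numBins/2, 2^(number of categories) - 2), and samples the integers at random without replacement.

```python
import random


def subset_from_index(index, num_categories):
    """Decode an integer into a category subset using its binary representation.

    Category c is in the subset when bit c of `index` is set.
    """
    return [c for c in range(num_categories) if (index >> c) & 1]


def choose_categorical_splits(num_categories, num_bins, rng):
    """Pick candidate category subsets for a split (hypothetical helper).

    The number of splits is capped at the number of non-trivial subsets,
    and subset indices are sampled at random without replacement instead
    of always taking 1..numSplits.
    """
    # Indices 1 .. 2^k - 2 skip the empty subset (0) and the full subset (2^k - 1).
    num_possible = (1 << num_categories) - 2
    num_splits = min(num_bins // 2, num_possible)
    indices = rng.sample(range(1, num_possible + 1), num_splits)
    return [subset_from_index(i, num_categories) for i in indices]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for subset in choose_categorical_splits(num_categories=10, num_bins=8, rng=rng):
        print(subset)
```

For comparison, with 10 categories and numBins = 8, using the integers 1..4 directly would only ever produce subsets of categories 0, 1 and 2, which is the bias towards low indexes described above.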



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
