[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240274#comment-15240274
 ] 

Apache Spark commented on SPARK-14610:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/12374

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: (1, 2, 3), then the possible splits generated by this method 
> are:
> * {1|2,3}
> * {1,2|3} 
> * {1,2,3|}
> The following unit test is quite clearly incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
>   val splits = 
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
>   assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240105#comment-15240105
 ] 

Seth Hendrickson commented on SPARK-14610:
--

One thing to note, is that fixing this actually uncovers a bug of sorts. There 
is an assertion in this method to verify that there are more than zero splits. 
However, due to the extra split being returned previously, this assertion did 
nothing. Now, the training will fail if there is a constant continuous feature. 
So, this PR will also remove this assertion and handle constant continuous 
features appropriately.

I can submit a PR for this soon.

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: {1, 2, 3}, then the possible splits generated by this method 
> are:
> {1|2,3}, {1,2|3} and {1,2,3|}. The following unit test is quite clearly 
> incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
>   val splits = 
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
>   assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org