[
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240105#comment-15240105
]
Seth Hendrickson commented on SPARK-14610:
--
One thing to note is that fixing this actually uncovers a bug of sorts. There
is an assertion in this method to verify that there are more than zero splits.
However, because an extra split was always returned previously, this assertion
could never fail. With the fix, training will fail if there is a constant
continuous feature. So, the PR will also remove this assertion and handle
constant continuous features appropriately.
I can submit a PR for this soon.
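To illustrate the constant-feature case, here is a minimal sketch. It is not
the Spark implementation: the object SplitSketch and the method
candidateSplits are hypothetical names, and using midpoints between
consecutive distinct values as thresholds is an assumption made only for
illustration.

```scala
// Hypothetical simplification, not the code in RandomForest.
object SplitSketch {
  // Returns one candidate threshold between each pair of consecutive
  // distinct values (midpoints are an assumption for this sketch).
  def candidateSplits(featureSamples: Array[Double]): Array[Double] = {
    val distinct = featureSamples.distinct.sorted
    if (distinct.length < 2) {
      // A constant feature has no boundary to split on.
      Array.empty[Double]
    } else {
      distinct.sliding(2).map(pair => (pair(0) + pair(1)) / 2.0).toArray
    }
  }
}
```

With the distinct values {1, 2, 3} this yields two thresholds, one for
{1|2,3} and one for {1,2|3}. With a constant feature it yields an empty
array, which is why an assertion requiring more than zero splits would make
training fail once the extra split is gone.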
> Remove superfluous split from random forest findSplitsForContinuousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest
> produces an unnecessary split. For example, if a continuous feature has
> unique values: {1, 2, 3}, then the possible splits generated by this method
> are:
> {1|2,3}, {1,2|3} and {1,2,3|}. The following unit test is quite clearly
> incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
> val splits =
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
> assert(splits.length === 3)
> {code}