[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960177#comment-15960177 ]
Yan Facai (颜发才) commented on SPARK-16957:
-----------------------------------------

I think this would help on small datasets, while making little difference on large ones. The task is easy. However, is it needed? If someone is willing to shepherd the issue, I'd like to work on it.

> Use weighted midpoints for split values.
> ----------------------------------------
>
>                 Key: SPARK-16957
>                 URL: https://issues.apache.org/jira/browse/SPARK-16957
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>            Priority: Trivial
>
> Just like R's gbm, we should be using weighted split points rather than the
> actual continuous binned feature values. For instance, in a dataset
> containing binary features (that are fed in as continuous ones), our splits
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some
> smoothness qualities, this is asymptotically bad compared to GBM's approach.
> The split point should be a weighted split point of the two values of the
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}},
> the above split should be at {{0.75}}.
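
The issue text does not spell out the formula. One reading that reproduces the {{0.75}} figure weights each of the two adjacent values by the sample count of the opposite bin, so the threshold sits closer to the less populated value. A minimal Python sketch under that assumption (weighted_split_point is a hypothetical helper for illustration, not a Spark or gbm API):

```python
def weighted_split_point(lo, n_lo, hi, n_hi):
    """Hypothetical helper: threshold between two adjacent feature values.

    Assumption (not confirmed by the issue text): each value is weighted by
    the sample count of the *other* bin, which pushes the threshold toward
    the less populated value.
    """
    if n_lo <= 0 or n_hi <= 0:
        raise ValueError("both bins must contain at least one sample")
    if not lo < hi:
        raise ValueError("lo must be strictly less than hi")
    return (n_lo * hi + n_hi * lo) / (n_lo + n_hi)


# 30 samples at x = 0 and 10 samples at x = 1, as in the issue description
threshold = weighted_split_point(0.0, 30, 1.0, 10)
print(threshold)  # 0.75
```

With equal counts on both sides, this reduces to the plain midpoint, e.g. 0.5 for the values 0.0 and 1.0.
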
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333333333333333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333333333333333
> {code}