[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990195#comment-15990195 ]
Yan Facai (颜发才) commented on SPARK-16957:
-----------------------------------------

To match the other libraries, we use the mean value for now and will decide later whether to make it weighted.

[~srowen] [~sethah]

> Use weighted midpoints for split values.
> ----------------------------------------
>
>                 Key: SPARK-16957
>                 URL: https://issues.apache.org/jira/browse/SPARK-16957
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Vladimir Feinberg
>            Priority: Trivial
>
> We should be using weighted split points rather than the actual continuous
> binned feature values. For instance, in a dataset containing binary features
> (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}}
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is
> asymptotically bad compared to GBM's approach. The split point should be a
> weighted split point of the two values of the "innermost" feature bins; e.g.,
> if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at
> {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333333333333333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333333333333333
> {code}
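
For reference, a minimal sketch of the two options discussed here: the plain mean of the two adjacent feature values (what is used for now) and a count-weighted split point. The function names mean_midpoint and weighted_midpoint are hypothetical and are not Spark APIs. The weighting convention is an assumption, chosen only because it reproduces the {{0.75}} figure from the 30-versus-10 example in the issue description; other libraries may weight differently.

```python
def mean_midpoint(lo, hi):
    """Unweighted midpoint between two adjacent distinct feature values."""
    return (lo + hi) / 2.0


def weighted_midpoint(lo, n_lo, hi, n_hi):
    """Count-weighted split point between two adjacent feature values.

    Assumed convention (not taken from Spark): the threshold moves away
    from the value with more samples, so 30 samples at 0 and 10 samples
    at 1 give 0.75, as in the issue description.
    """
    total = n_lo + n_hi
    if total <= 0:
        raise ValueError("at least one sample is required")
    return lo + (hi - lo) * n_lo / total


if __name__ == "__main__":
    # Binary feature fed in as continuous: 30 samples at 0.0, 10 at 1.0.
    print(mean_midpoint(0.0, 1.0))
    print(weighted_midpoint(0.0, 30, 1.0, 10))
```

With equal counts on both sides, the weighted version reduces to the plain mean, so the two approaches only differ when the neighbouring bins are unbalanced.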