Hello,

There is a compelling general reason for binary splits in trees: a
split is made only if the difference between the two resulting
branches is "significant". You also want to compare the significance of
this candidate split against all the other candidate splits. There are
many statistical tests for comparing two groups. You can even generate
something like p-values that, according to some, allow you to compare
different candidate splits.
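
To make the idea concrete, here is a minimal sketch in plain Python (not
Spark's implementation). It scores every candidate binary split of one
categorical feature with a Pearson chi-squared statistic on the 2x2 table
(branch x class) and keeps the largest. The function names and the toy
data are made up for illustration, and the chi-squared test is just one
of the many two-group tests one could plug in.

```python
from itertools import combinations


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no evidence
    return n * (a * d - b * c) ** 2 / denom


def candidate_binary_splits(levels):
    """Yield the 'left' level set of each binary partition, once per partition."""
    levels = sorted(levels)
    first, rest = levels[0], levels[1:]
    # Pinning the first level to the left side avoids mirrored duplicates.
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            left = {first, *combo}
            if len(left) < len(levels):  # the right side must be non-empty
                yield left


def best_binary_split(rows):
    """rows: list of (level, label) pairs with label in {0, 1}."""
    levels = {lvl for lvl, _ in rows}
    scored = []
    for left in candidate_binary_splits(levels):
        a = sum(1 for lvl, y in rows if lvl in left and y == 1)
        b = sum(1 for lvl, y in rows if lvl in left and y == 0)
        c = sum(1 for lvl, y in rows if lvl not in left and y == 1)
        d = sum(1 for lvl, y in rows if lvl not in left and y == 0)
        scored.append((chi2_2x2(a, b, c, d), sorted(left)))
    return max(scored)


# Hypothetical toy data: levels A and B behave alike, C is different.
rows = (
    [("A", 1)] * 8 + [("A", 0)] * 2
    + [("B", 1)] * 7 + [("B", 0)] * 3
    + [("C", 1)] * 1 + [("C", 0)] * 9
)

score, left = best_binary_split(rows)
print(left, round(score, 3))
```

With K levels there are 2^(K-1) - 1 candidate partitions, so the exhaustive
enumeration above is only reasonable for a handful of levels.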

If you introduce multibranch splits... things become much messier.

Also, mind that breaking a categorical variable into as many groups as
there are levels would in some cases separate subgroups of observations
that are not "that different". Successive binary splits could
potentially provide you with the required "homogeneous subsets".
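
A small numeric sketch of this point, again in plain Python with made-up
data and helper names: levels A and B have similar class rates and C does
not. Gini impurity is used here as the split criterion purely as an
assumption for the illustration. The first binary split {A, B} vs {C}
captures nearly all of the gain, while a further split of A from B adds
almost nothing, so a tree with a minimum-gain threshold could simply leave
A and B together.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gain(parent, children):
    """Impurity decrease when `parent` is partitioned into `children`."""
    n = len(parent)
    return gini(parent) - sum(len(c) / n * gini(c) for c in children)


# Hypothetical labels per level of one categorical feature.
by_level = {
    "A": [1] * 8 + [0] * 2,
    "B": [1] * 7 + [0] * 3,
    "C": [1] * 1 + [0] * 9,
}
everything = by_level["A"] + by_level["B"] + by_level["C"]
a_and_b = by_level["A"] + by_level["B"]

# One N-way split: every level gets its own child.
n_way_gain = gain(everything, list(by_level.values()))

# Successive binary splits: first {A, B} vs {C}, then A vs B.
first_gain = gain(everything, [a_and_b, by_level["C"]])
second_gain = gain(a_and_b, [by_level["A"], by_level["B"]])

print(round(n_way_gain, 4), round(first_gain, 4), round(second_gain, 4))
```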

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com



2014-11-06 10:46 GMT+01:00 Sean Owen <so...@cloudera.com>:
> I haven't seen that done before, which may be most of the reason - I am not
> sure that is common practice.
>
> I can see upsides - you need not pick candidate splits to test since there
> is only one N-way rule possible. The equivalent with binary splits takes N
> levels of the tree instead of 1.
>
> The big problem is that you are always segregating the data set entirely,
> and making the equivalent of those N binary rules, even when you would not
> otherwise bother because they don't add information about the target. The
> subsets matching each child are therefore unnecessarily small and this makes
> learning on each independent subset weaker.
>
> On Nov 6, 2014 9:36 AM, "jamborta" <jambo...@gmail.com> wrote:
>>
>> I meant above, that in the case of categorical variables it might be more
>> efficient to create a node on each categorical value. Is there a reason
>> why Spark went down the binary route?
>>
>> thanks,
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
