Re: why decision trees do binary split?

2014-11-06 Thread jamborta
I meant above that in the case of categorical variables it might be more
efficient to create a child node for each categorical value. Is there a reason
why Spark went down the binary route?

thanks,






Re: why decision trees do binary split?

2014-11-06 Thread Sean Owen
I haven't seen that done before, which may be most of the reason: I am
not sure it is common practice.

I can see upsides: you need not pick candidate splits to test, since
there is only one N-way rule possible. The binary-split equivalent is N
levels of splits instead of 1.

The big problem is that you always segregate the data set entirely,
making the equivalent of all N binary rules, even when you would not
otherwise bother because they add no information about the target. The
subsets matching each child are therefore unnecessarily small, and this
makes learning on each child's subset weaker.
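
To make that trade-off concrete, here is a minimal sketch in plain
Python (the toy data and category names are made up for illustration).
An N-way split always separates every category, while a binary split
can leave indistinguishable categories pooled:

from collections import Counter

# Toy (category, label) pairs. Categories 'a' and 'b' behave alike, so
# separating them adds nothing; only 'c' differs on the target.
data = [('a', 0), ('a', 0), ('a', 1),
        ('b', 0), ('b', 0), ('b', 1),
        ('c', 1), ('c', 1), ('c', 1)]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(children):
    n = sum(len(g) for g in children)
    return sum(len(g) / n * gini(g) for g in children if g)

cats = sorted({c for c, _ in data})

# N-way split: one child per category, the data set fully segregated.
nway = [[y for c, y in data if c == k] for k in cats]
print('N-way impurity: %.3f' % weighted_gini(nway))

# Binary splits on single categories: {k} vs the rest.
for k in cats:
    left = [y for c, y in data if c == k]
    right = [y for c, y in data if c != k]
    print('{%s} vs rest: %.3f' % (k, weighted_gini([left, right])))

# Here the N-way split scores 0.296, and so does the binary split
# {c} vs rest: the same impurity reduction, but 'a' and 'b' stay
# pooled, so each child keeps more data to learn from.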


Re: why decision trees do binary split?

2014-11-06 Thread Carlos J. Gil Bellosta
Hello,

There is a compelling reason for binary splits in trees generally: a
split is made if the difference between the two resulting branches is
significant. You also want to compare the significance of each
candidate split against all the other candidate splits. There are many
statistical tests for comparing two groups, and you can even generate
something like p-values that, according to some, allow you to compare
different candidate splits.

If you introduce multibranch splits... things become much messier.

Also, mind that breaking a categorical variable into as many groups as
it has levels would in some cases separate subgroups of levels which
are not that different. Successive binary splits could still provide
you with the required homogeneous subsets.
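
For illustration, a minimal sketch of that two-group comparison,
assuming scipy is available; the contingency tables below are made-up
numbers for two candidate splits of the same node:

from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency tables (rows: branch, columns: target
# class) for two candidate binary splits of the same 80-point node.
split_a = [[30, 10],   # left branch:  30 class-0, 10 class-1
           [10, 30]]   # right branch: 10 class-0, 30 class-1
split_b = [[22, 18],
           [18, 22]]

for name, table in [('A', split_a), ('B', split_b)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print('split %s: chi2 = %.2f, p = %.4f' % (name, chi2, p))

# The lower p-value (split A) marks the candidate whose branches differ
# more sharply, giving one common scale on which to rank binary splits.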

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com






Re: why decision trees do binary split?

2014-11-06 Thread Tamas Jambor
Thanks for the reply, Sean.

I can see that splitting on all the categories would probably overfit
the tree; on the other hand, it might give more insight into the
subcategories (it would probably only work if the data is uniformly
distributed across the categories).

I haven't really found any comparison between the two methods in terms
of performance and interpretability.

thanks,




Re: why decision trees do binary split?

2014-11-06 Thread Evan R. Sparks
You can imagine this same logic applying to the continuous case. E.g.,
what if all the quartiles or deciles of a particular feature have
different behavior? An n-ary split could capture that too. Or what if
some combination of features was highly discriminative, but only across
n buckets rather than two? You can see there are lots of different
options here.

In general in MLlib, we're trying to support widely accepted and
frequently used ML models, and simply offer a platform to efficiently
train them with Spark. While decision trees with n-ary splits might be
a sensible thing to explore, they are not widely used in practice, and
I'd want to see some compelling results from proper ML/stats
researchers before shipping them as a default feature.

If you're looking for a way to control variance and pick up nuance in
your dataset that's not covered by plain decision trees, I recommend
looking at Random Forests - a well-studied extension of decision trees
that's also widely used in practice - and coming to MLlib soon!
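
For reference, a minimal PySpark sketch against the RDD-based MLlib
tree API (assuming a Spark 1.1-era deployment; the toy data is made
up), showing how a categorical feature is declared and still split into
binary category subsets:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext(appName='binary-split-demo')

# Toy data: feature 0 is categorical with 3 values (0.0, 1.0, 2.0).
points = sc.parallelize([
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [2.0]),
])

# categoricalFeaturesInfo={0: 3} declares feature 0 as categorical with
# 3 categories; MLlib still splits it into binary subsets of categories,
# e.g. {2.0} vs {0.0, 1.0}, rather than one child per category.
model = DecisionTree.trainClassifier(
    points, numClasses=2, categoricalFeaturesInfo={0: 3},
    impurity='gini', maxDepth=3, maxBins=32)

print(model.toDebugString())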
