Re: why decision trees do binary split?
You can imagine this same logic applying to the continuous case. E.g., what if all the quartiles or deciles of a particular feature have different behavior? An n-ary split could capture that too. Or what if some combination of features were highly discriminative, but only into n buckets rather than two? You can see there are lots of different options here.

In general, in MLlib we're trying to support widely accepted and frequently used ML models, and simply offer a platform to train them efficiently with Spark. While decision trees with n-ary splits might be a sensible thing to explore, they are not widely used in practice, and I'd want to see some compelling results from proper ML/stats researchers before shipping them as a default feature.

If you're looking for a way to control variance and pick up nuance in your dataset that plain decision trees miss, I recommend looking at Random Forests - a well-studied extension of decision trees that is also widely used in practice - and coming to MLlib soon!
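For reference, training one would look roughly like this - a sketch against the RandomForest API being developed for org.apache.spark.mllib.tree, so treat the exact signature as tentative; the data path and parameter values are placeholders, and sc is the usual spark-shell SparkContext:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    // LabeledPoint data in LIBSVM format; the path is a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Say feature 0 is categorical with 4 distinct values ("arity");
    // all other features are treated as continuous.
    val model = RandomForest.trainClassifier(
      input = data,
      numClasses = 2,
      categoricalFeaturesInfo = Map(0 -> 4),
      numTrees = 100,                  // more trees -> lower variance
      featureSubsetStrategy = "auto",  // sqrt(#features) for classification
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32,
      seed = 42)

Each tree still splits binarily; the variance control comes from averaging many trees trained on bootstrapped samples and random feature subsets.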
Re: why decision trees do binary split?
Thanks for the reply, Sean.

I can see that splitting on all the categories would probably overfit the tree; on the other hand, it might give more insight into the subcategories (though that would probably only work if the data is uniformly distributed across the categories).

I haven't really found any comparison between the two methods in terms of performance and interpretability.

Thanks,
Tamas
Re: why decision trees do binary split?
Hello,

There is a big compelling reason for binary splits in trees in general: a split is made if the difference between the two resulting branches is "significant". You also want to compare the significance of this candidate split against all the other candidate splits. There are many statistical tests for comparing two groups, and you can even generate something like p-values that, according to some, allow you to compare different candidate splits. If you introduce multi-way splits, things become much messier.

Also, mind that breaking a categorical variable into as many groups as it has levels would in some cases separate subgroups that are not "that different". Successive binary splits can still give you the required "homogeneous subsets" where they exist.

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
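To make the "compare two groups" idea concrete, here is a toy sketch in plain Scala (invented for illustration - not anything from MLlib): it scores a candidate binary split on a binary target using the Pearson chi-squared statistic of the 2x2 branch-by-class table, so candidate splits can be ranked by how strongly their branches differ. All counts are made up.

    // Counts of (target = 0, target = 1) within one branch of a split.
    case class Branch(neg: Double, pos: Double)

    // Pearson chi-squared statistic for the 2x2 table
    // (left, right) x (neg, pos): sum of (observed - expected)^2 / expected.
    def chiSquared(left: Branch, right: Branch): Double = {
      val total = left.neg + left.pos + right.neg + right.pos
      val colTotals = Seq(left.neg + right.neg, left.pos + right.pos)
      val cells = for {
        row <- Seq(left, right)
        (obs, colTotal) <- Seq(row.neg, row.pos).zip(colTotals)
      } yield {
        val expected = (row.neg + row.pos) * colTotal / total
        math.pow(obs - expected, 2) / expected
      }
      cells.sum
    }

    // A split that separates the classes scores far higher (~49.5)
    // than a near-random one (~0.6), so we would prefer the first.
    val informative   = chiSquared(Branch(10, 40), Branch(45, 5))
    val uninformative = chiSquared(Branch(25, 24), Branch(30, 21))

With multi-way splits you would instead need an r x 2 table, and raw statistics with different degrees of freedom are not directly comparable across candidate splits - which is part of the mess.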
Re: why decision trees do binary split?
I haven't seen that done before, which may be most of the reason - I am not sure it is common practice.

I can see upsides: you need not pick candidate splits to test, since there is only one N-way rule possible. The binary-split equivalent is N levels instead of 1.

The big problem is that you are always segregating the data set entirely, making the equivalent of all N binary rules, even when you would not otherwise bother because they don't add information about the target. The subsets matching each child are therefore unnecessarily small, and this makes learning on each independent subset weaker.

Sean
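A toy illustration of that equivalence (an invented example, not MLlib code): on a 3-valued categorical feature, an N-way split produces the same partition as a two-level chain of one-vs-rest binary rules - except that the binary tree is free to stop after the first level if the second rule adds no information about the target, which keeps that subset larger.

    sealed trait Color
    case object Red extends Color
    case object Green extends Color
    case object Blue extends Color

    // N-way split: one child per category, unconditionally.
    def nWaySplit(xs: Seq[Color]): Map[Color, Seq[Color]] =
      xs.groupBy(identity)

    // The same partition as a chain of one-vs-rest binary rules.
    // A binary tree could skip level 2 if Green vs Blue does not
    // matter for the target, instead of always fragmenting.
    def binaryChain(xs: Seq[Color]): Map[Color, Seq[Color]] = {
      val (red, rest)   = xs.partition(_ == Red)     // level 1: Red vs rest
      val (green, blue) = rest.partition(_ == Green) // level 2: Green vs Blue
      Map(Red -> red, Green -> green, Blue -> blue)
    }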
Re: why decision trees do binary split?
I meant above that, in the case of categorical variables, it might be more efficient to create a node for each categorical value. Is there a reason why Spark went down the binary route?

Thanks,
why decision trees do binary split?
Hi,

Just wondering, what is the reason that the decision tree implementation in Spark always does binary splits?

Thanks,