I believe they show one example comparing a depth-1 ensemble against a depth-3 ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Evan,

Wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper? Does that paper also propose growing shallow trees?

Thanks,
Deb

On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Additionally, 'random features per node' (mtry in R) is a very important feature for Random Forest. The variance reduction comes from the trees being decorrelated from each other, and the per-node feature sampling often does more for that decorrelation than the bootstrap samples do. This is something that would have to be supported at the tree level.

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."

The Elements of Statistical Learning (page 598) says that a fully grown regression random forest can overfit; however, the effect is very slight and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, drastic pruning made performance much worse. This makes intuitive sense, IMO, because a decision tree is a non-parametric model: the expressiveness of a tree depends on its number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate for building high-accuracy models.

On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly via minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees didn't overfit to the training data. I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much about bias (although some people claim it does have some bias-reducing effect). Because shallow trees typically have higher bias than fully grown trees, people don't often use shallow trees with RF.

You can confirm this through experiments with R's random forest implementation as well; it lets you set limits on depth and/or pruning.

In contrast, boosting is a bias-reduction technique (and increases variance), so people typically use shallow trees there.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners and then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the decision tree gets deep, but I'd recommend a lot more memory than 2-4 GB per worker for most big data workloads anyway.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to hit heap exceptions, even with 2~4 GB of memory per worker.

With RF, it's very easy to get to depth 100+ with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.

On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree... there is an RF PR which is not active now... take that and swap in the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper...

I am sure people on the dev list are already working on it... send an email there to ask about the status...

There is also an RF in Cloudera Oryx, but we could not run it on our data yet...

Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but it does not scale...

On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that MLlib currently does not have an RF implementation. Which other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.
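To make the 'random features per node' point concrete, here is a minimal, self-contained Scala sketch of a forest whose trees re-draw mtry candidate features independently at every node. It is plain Scala with no Spark dependency, and every name and parameter in it is illustrative for the example, not MLlib's API:

import scala.util.Random

object MtryForestSketch {

  case class Row(features: Array[Double], label: Double) // binary label: 0.0 or 1.0

  sealed trait Node
  case class Leaf(prediction: Double) extends Node
  case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Gini impurity of a set of binary labels.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else { val p = labels.sum / labels.size; 2.0 * p * (1.0 - p) }

  private def majority(labels: Seq[Double]): Leaf =
    Leaf(if (labels.sum * 2 >= labels.size) 1.0 else 0.0)

  // Grow a tree to `maxDepth`; at EVERY node, only `mtry` randomly drawn
  // features are considered for the split. Re-drawing per node (rather than
  // once per tree) is what decorrelates the trees in a random forest.
  def grow(rows: Seq[Row], mtry: Int, maxDepth: Int, rng: Random): Node = {
    val labels = rows.map(_.label)
    if (maxDepth == 0 || gini(labels) == 0.0) return majority(labels)

    val numFeatures = rows.head.features.length
    val candidates = rng.shuffle((0 until numFeatures).toList).take(mtry)

    // Evaluate every (candidate feature, observed threshold) split and keep
    // the one with the lowest weighted child impurity.
    val splits = for {
      f <- candidates
      t <- rows.map(_.features(f)).distinct
      (l, r) = rows.partition(_.features(f) <= t)
      if l.nonEmpty && r.nonEmpty
    } yield {
      val score = (gini(l.map(_.label)) * l.size + gini(r.map(_.label)) * r.size) / rows.size
      (score, f, t, l, r)
    }

    if (splits.isEmpty) majority(labels)
    else {
      val (_, f, t, l, r) = splits.minBy(_._1)
      Split(f, t, grow(l, mtry, maxDepth - 1, rng), grow(r, mtry, maxDepth - 1, rng))
    }
  }

  def predict(node: Node, x: Array[Double]): Double = node match {
    case Leaf(p)           => p
    case Split(f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Toy data: label is 1 exactly when the first feature exceeds 0.5.
    val data = Seq.fill(200) {
      val x = Array.fill(5)(rng.nextDouble())
      Row(x, if (x(0) > 0.5) 1.0 else 0.0)
    }
    // Forest: each tree gets a bootstrap sample AND per-node feature sampling.
    val trees = Seq.fill(25) {
      val boot = Seq.fill(data.size)(data(rng.nextInt(data.size)))
      grow(boot, mtry = 2, maxDepth = 10, rng) // mtry ~ sqrt(numFeatures)
    }
    val x = Array(0.9, 0.1, 0.2, 0.3, 0.4)
    val vote = trees.map(predict(_, x)).sum / trees.size
    println(f"P(label = 1 | x) ~= $vote%.2f") // should be near 1.0
  }
}

The design point to notice is that the candidate feature set is sampled again at every node; sampling once per tree would leave the trees far more correlated, and averaging them would then buy much less variance reduction.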
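For intuition on the level-wise histogram memory point: in a PLANET-style builder, each pass over the data aggregates one histogram per (node at the current level) x feature x bin, and a binary tree can have up to 2^d nodes at level d, so per-pass state grows exponentially with depth. A back-of-the-envelope sketch, in which all sizes (feature count, bin count, bytes per counter) are illustrative assumptions rather than MLlib's actual data structures:

object HistogramMemorySketch {
  def main(args: Array[String]): Unit = {
    val numFeatures     = 500 // illustrative
    val numBins         = 32  // histogram bins per feature (illustrative)
    val numClasses      = 2
    val bytesPerCounter = 8L  // one Double per (bin, class) counter

    // Level d of a binary tree holds up to 2^d open nodes, and each open node
    // needs numFeatures * numBins * numClasses counters for its histograms.
    for (depth <- Seq(5, 8, 10, 12, 15)) {
      val nodesAtLevel = 1L << depth
      val bytes = nodesAtLevel * numFeatures * numBins * numClasses * bytesPerCounter
      println(f"level $depth%2d: up to $nodesAtLevel%,d nodes -> ~${bytes / (1L << 20)}%,d MB of histograms")
    }
  }
}

With these modest numbers the per-pass state crosses 1 GB around level 12, and sooner with more features, bins, or classes, which is consistent with the heap pressure reported above on 2-4 GB workers at depth 8~9.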