I believe they show one example comparing a depth-1 ensemble against a depth-3 ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Evan,

Wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper? Does that paper also propose growing shallow trees?

Thanks,
Deb

On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Additionally, 'random features per node' (mtry in R) is a very important feature for Random Forest. The variance reduction comes from the trees being decorrelated from each other, and the per-node feature sampling often does more for that decorrelation than the bootstrap samples do. This is something that would have to be supported at the tree level.

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."

The Elements of Statistical Learning (page 598) says that a fully grown regression random forest can overfit; however, the effect is very slight and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, drastic pruning made performance much worse. This makes intuitive sense, IMO, because a decision tree is a non-parametric model: the expressiveness of a tree depends on its number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate for building high-accuracy models.

On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly via minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees didn't overfit to the training data. I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much about bias (although some people claim it does have some bias-reducing effect). Because shallow trees typically have higher bias than fully grown trees, people don't often use shallow trees with RF.

You can confirm this through experiments with R's random forest implementation as well; it lets you set limits on depth and/or pruning.

In contrast, boosting is a bias-reduction technique (and increases variance), so people typically use shallow trees there.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners and then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the decision tree gets deep, but I'd recommend a lot more memory than 2-4 GB per worker for most big data workloads anyway.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to hit heap exceptions, even with 2~4 GB of memory per worker.

With RF, it's very easy to get to depth 100+ with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.

On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree... there is an RF PR which is not active now... take that and swap in the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper...

I am sure people on the dev list are already working on it... send an email there to ask about the status...

There is also an RF in Cloudera Oryx, but we could not run it on our data yet...

Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but it does not scale...

On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that MLlib currently does not have an RF implementation. Which other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.
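To make the 'random features per node' point concrete, here is a minimal, self-contained Scala sketch of a forest whose trees re-draw mtry candidate features independently at every node. It is plain Scala with no Spark dependency, and every name and parameter in it is illustrative for the example, not MLlib's API:

import scala.util.Random

object MtryForestSketch {

  case class Row(features: Array[Double], label: Double) // binary label: 0.0 or 1.0

  sealed trait Node
  case class Leaf(prediction: Double) extends Node
  case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Gini impurity of a set of binary labels.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else { val p = labels.sum / labels.size; 2.0 * p * (1.0 - p) }

  private def majority(labels: Seq[Double]): Leaf =
    Leaf(if (labels.sum * 2 >= labels.size) 1.0 else 0.0)

  // Grow a tree to `maxDepth`; at EVERY node, only `mtry` randomly drawn
  // features are considered for the split. Re-drawing per node (rather than
  // once per tree) is what decorrelates the trees in a random forest.
  def grow(rows: Seq[Row], mtry: Int, maxDepth: Int, rng: Random): Node = {
    val labels = rows.map(_.label)
    if (maxDepth == 0 || gini(labels) == 0.0) return majority(labels)

    val numFeatures = rows.head.features.length
    val candidates = rng.shuffle((0 until numFeatures).toList).take(mtry)

    // Evaluate every (candidate feature, observed threshold) split and keep
    // the one with the lowest weighted child impurity.
    val splits = for {
      f <- candidates
      t <- rows.map(_.features(f)).distinct
      (l, r) = rows.partition(_.features(f) <= t)
      if l.nonEmpty && r.nonEmpty
    } yield {
      val score = (gini(l.map(_.label)) * l.size + gini(r.map(_.label)) * r.size) / rows.size
      (score, f, t, l, r)
    }

    if (splits.isEmpty) majority(labels)
    else {
      val (_, f, t, l, r) = splits.minBy(_._1)
      Split(f, t, grow(l, mtry, maxDepth - 1, rng), grow(r, mtry, maxDepth - 1, rng))
    }
  }

  def predict(node: Node, x: Array[Double]): Double = node match {
    case Leaf(p)           => p
    case Split(f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Toy data: label is 1 exactly when the first feature exceeds 0.5.
    val data = Seq.fill(200) {
      val x = Array.fill(5)(rng.nextDouble())
      Row(x, if (x(0) > 0.5) 1.0 else 0.0)
    }
    // Forest: each tree gets a bootstrap sample AND per-node feature sampling.
    val trees = Seq.fill(25) {
      val boot = Seq.fill(data.size)(data(rng.nextInt(data.size)))
      grow(boot, mtry = 2, maxDepth = 10, rng) // mtry ~ sqrt(numFeatures)
    }
    val x = Array(0.9, 0.1, 0.2, 0.3, 0.4)
    val vote = trees.map(predict(_, x)).sum / trees.size
    println(f"P(label = 1 | x) ~= $vote%.2f") // should be near 1.0
  }
}

The design point to notice is that the candidate feature set is sampled again at every node; sampling once per tree would leave the trees far more correlated, and averaging them would then buy much less variance reduction.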
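For intuition on the level-wise histogram memory point: in a PLANET-style builder, each pass over the data aggregates one histogram per (node at the current level) x feature x bin, and a binary tree can have up to 2^d nodes at level d, so per-pass state grows exponentially with depth. A back-of-the-envelope sketch, in which all sizes (feature count, bin count, bytes per counter) are illustrative assumptions rather than MLlib's actual data structures:

object HistogramMemorySketch {
  def main(args: Array[String]): Unit = {
    val numFeatures     = 500 // illustrative
    val numBins         = 32  // histogram bins per feature (illustrative)
    val numClasses      = 2
    val bytesPerCounter = 8L  // one Double per (bin, class) counter

    // Level d of a binary tree holds up to 2^d open nodes, and each open node
    // needs numFeatures * numBins * numClasses counters for its histograms.
    for (depth <- Seq(5, 8, 10, 12, 15)) {
      val nodesAtLevel = 1L << depth
      val bytes = nodesAtLevel * numFeatures * numBins * numClasses * bytesPerCounter
      println(f"level $depth%2d: up to $nodesAtLevel%,d nodes -> ~${bytes / (1L << 20)}%,d MB of histograms")
    }
  }
}

With these modest numbers the per-pass state crosses 1 GB around level 12, and sooner with more features, bins, or classes, which is consistent with the heap pressure reported above on 2-4 GB workers at depth 8~9.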