Hmm... can you provide some pointers to examples where deep trees are
helpful?

Typically with Decision Trees you limit depth (either directly or
indirectly with minimum node size and minimum improvement criteria) to
avoid overfitting. I agree with the assessment that forests are a variance
reduction technique, but I'd be a little surprised if a bunch of hugely
deep trees don't overfit to training data. I guess I view limiting tree
depth as an analogue to regularization in linear models.
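
To make the analogy concrete, here's a quick sketch (scikit-learn rather
than MLlib, synthetic data, so illustrative only) of how capping max_depth
trades training fit for generalization, the same way a regularization
penalty would:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (3, 8, None):  # None = grow until pure, i.e. no depth cap
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_tr, y_tr)
        # expect train accuracy to climb toward 1.0 as the cap is lifted,
        # while test accuracy lags behind once the tree starts overfitting
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))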


On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

> Evan,
>
> I actually haven't heard of 'shallow' random forests. I think the only
> scenario where shallow trees are useful is boosting.
>
> AFAIK, Random Forest is a variance-reducing technique and doesn't do much
> about bias (although some people claim that it does have some bias-reducing
> effect). Because shallow trees typically have higher bias than fully-grown
> trees, people don't often use shallow trees with RF.
>
> You can confirm this through some experiments with R's random forest
> implementation as well; it lets you set limits on depth and/or apply
> pruning.
>
> In contrast, boosting is a bias-reducing technique (and increases
> variance), so people typically use shallow trees.
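>
> (As a rough illustration, assuming scikit-learn since it exposes both
> knobs; in current scikit-learn these are in fact the defaults:)
>
>     from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
>
>     # shallow trees for the bias-reducing ensemble...
>     gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3)
>     # ...fully-grown (depth-unlimited) trees for the variance-reducing one
>     rf = RandomForestClassifier(n_estimators=100, max_depth=None)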
>
> Our own experiments also confirmed that shallow trees resulted in
> drastically lower accuracy for random forests.
>
> There are some papers that mix a boosting-like technique with bootstrap
> averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could
> potentially use shallow trees to build boosted learners, but then average
> the results of many boosted learners.
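>
> A sketch of that idea (a hypothetical scikit-learn version, not the
> paper's exact algorithm): boost shallow trees on each bootstrap sample,
> then average the boosted models:
>
>     import numpy as np
>     from sklearn.ensemble import GradientBoostingClassifier
>
>     def bagged_boosters(X, y, n_bags=10, seed=0):
>         # X, y assumed to be numpy arrays
>         rng = np.random.default_rng(seed)
>         models = []
>         for _ in range(n_bags):
>             idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
>             models.append(
>                 GradientBoostingClassifier(max_depth=3).fit(X[idx], y[idx]))
>         return models
>
>     def predict(models, X):
>         # average the boosted learners' class probabilities, then vote
>         return np.mean([m.predict_proba(X) for m in models],
>                        axis=0).argmax(axis=1)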
>
>
> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>
>> Multiclass classification, Gradient Boosting, and Random Forest support
>> based on the recent Decision Tree implementation in MLlib.
>>
>> Sung - I'd be curious to hear about your use of decision trees (and
>> forests) where you want to go to 100+ depth. My experience with random
>> forests has been that people typically build hundreds of shallow trees
>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>
>> Generally speaking, we save passes over the data by computing histograms
>> per variable per candidate split at each *level* of a decision tree. This
>> can blow up as the tree gets deep, but I'd recommend a lot more memory
>> than 2-4GB per worker for most big data workloads anyway.
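>>
>> Back-of-the-envelope on why it blows up (my own made-up sizes, not
>> MLlib's actual layout): level-wise training keeps histogram state for
>> every node at the current level, and a binary tree has 2^depth of them:
>>
>>     # nodes at level d of a binary tree = 2**d; each keeps one histogram
>>     # entry per feature per candidate bin (sizes here are illustrative)
>>     features, bins, stats_bytes = 500, 32, 8
>>     for depth in (8, 12, 16):
>>         nodes = 2 ** depth
>>         gb = nodes * features * bins * stats_bytes / 1e9
>>         print(f"level {depth}: ~{gb:.2f} GB of histograms")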
>>
>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>>
>>> Debasish, we've tested the MLlib decision tree a bit and it eats up too
>>> much memory for RF purposes. Once the tree got to depth 8~9, it was easy
>>> to hit heap exceptions, even with 2~4 GB of memory per worker.
>>>
>>> With RF, it's very easy to reach depth 100+ even with only 100,000+ rows
>>> (because the trees usually are not balanced). Additionally, the lack of
>>> multi-class classification support limits its applicability.
>>>
>>> Also, RF requires a random subset of features at each tree node to be
>>> effective (not just bootstrap samples), and the MLlib decision tree
>>> doesn't support that.
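>>>
>>> (For reference, that's the mtry / max_features knob; e.g. in
>>> scikit-learn the subsampling happens at every split:)
>>>
>>>     from sklearn.ensemble import RandomForestClassifier
>>>
>>>     # max_features is applied per node/split, not once per tree: each
>>>     # split considers a fresh random subset of sqrt(n_features) features
>>>     rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")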
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
>>> <debasish.da...@gmail.com> wrote:
>>>
>>>> MLlib has a decision tree. There is an RF PR which is not active now;
>>>> take that and swap its tree builder with the fast tree builder that's
>>>> in MLlib. Search for the Spark JIRA; the code is based on Google's
>>>> PLANET paper.
>>>>
>>>> I am sure people on the dev list are already working on it; send an
>>>> email there to check the status.
>>>>
>>>> There is also an RF in Cloudera Oryx, but we could not run it on our
>>>> data yet.
>>>>
>>>> Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs,
>>>> but it does not scale.
>>>> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For one of my applications, I want to use Random Forests (RF) on top of
>>>>> Spark. I see that currently MLlib does not have an implementation of RF.
>>>>> What other open-source RF implementations would be good to use with
>>>>> Spark in terms of speed?
>>>>>
>>>>> Regards,
>>>>> Laeeq Ahmed,
>>>>> KTH, Sweden.
