Multiclass classification, Gradient Boosting, and Random Forest support
based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that people typically build hundreds of shallow trees
(maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms
per variable per split at each *level* of a decision tree. This can blow up
as the level of the decision tree gets deep, but I'd recommend a lot more
memory than 2-4GB per worker for most big data workloads.
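
To make the blow-up concrete, here is a back-of-the-envelope sketch in
Python. It is not MLlib's actual memory accounting: the function name, the
full-binary-tree assumption, and every parameter value (100 features, 32 bins,
2 statistics per bin, 8 bytes per statistic) are illustrative assumptions only.

```python
def level_histogram_bytes(depth, num_features, num_bins,
                          stats_per_bin=2, bytes_per_stat=8):
    """Rough upper bound on histogram memory for a single tree level.

    Assumes a full binary tree, so level `depth` (root = 0) holds 2**depth
    nodes, with one fixed-size histogram per (node, feature) pair.
    """
    num_nodes = 2 ** depth
    return num_nodes * num_features * num_bins * stats_per_bin * bytes_per_stat


# Hypothetical workload: 100 features, 32 bins per feature.
for depth in (5, 10, 20):
    mb = level_histogram_bytes(depth, num_features=100, num_bins=32) / 1e6
    print(f"depth {depth:>2}: ~{mb:,.1f} MB")
```

Real trees are usually unbalanced, so a deep level has far fewer than
2**depth nodes, but the bound shows why the per-level aggregate grows
quickly with depth.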

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

> Debasish, we've tested the MLlib decision tree a bit and it eats up too
> much memory for RF purposes.
> Once the tree got to depth 8~9, it was easy to hit a heap exception, even
> with 2~4 GB of memory per worker.
>
> With RF, it's very easy to get 100+ depth with even only 100,000+
> rows (because trees usually are not balanced). Additionally, the lack of
> multi-class classification limits its applicability.
>
> Also, RF requires random features per tree node to be effective (not just
> bootstrap samples), and the MLlib decision tree doesn't support that.
>
>
> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
> <debasish.da...@gmail.com> wrote:
>
>> MLlib has a decision tree... there is an RF PR which is not active
>> now... take that and swap the tree builder with the fast tree builder
>> that's in MLlib... search for the Spark JIRA... the code is based on the
>> Google PLANET paper...
>>
>> I am sure people on the dev list are already working on it... send an
>> email over there to find out the status...
>>
>> There is also an RF in Cloudera Oryx, but we could not run it on our data
>> yet...
>>
>> Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but
>> it does not scale...
>> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> For one of my applications, I want to use Random Forests (RF) on top of
>>> Spark. I see that MLlib currently does not have an implementation of RF.
>>> What other open-source RF implementations would work well with Spark in
>>> terms of speed?
>>>
>>> Regards,
>>> Laeeq Ahmed,
>>> KTH, Sweden.
>>>
>>>
>
