Is there a PR or issue where GBT / RF progress in MLlib is tracked?

2014-04-17 21:11 GMT+02:00 Evan R. Sparks <evan.spa...@gmail.com>:

> Sorry - I meant to say that "Multiclass classification, Gradient
> Boosting, and Random Forest support based on the recent Decision Tree
> implementation in MLlib is planned and coming soon."
>
>
> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>
>> Multiclass classification, Gradient Boosting, and Random Forest support
>> for based on the recent Decision Tree implementation in MLlib.
>>
>> Sung - I'd be curious to hear about your use of decision trees (and
>> forests) where you want to go to 100+ depth. My experience with random
>> forests has been that people typically build hundreds of shallow trees
>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>
>> Generally speaking, we save passes over the data by computing histograms
>> per variable per split at each *level* of a decision tree. This can blow up
>> as the level of the decision tree gets deep, but I'd recommend a lot more
>> memory than 2-4GB per worker for most big data workloads.
>>
>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
>>
>>> Debasish, we've tested the MLlib decision tree a bit and it eats up too
>>> much memory for RF purposes.
>>> Once the tree got to depth 8~9, it was easy to get a heap exception, even
>>> with 2~4 GB of memory per worker.
>>>
>>> With RF, it's very easy to reach depth 100+ with only 100,000+ rows
>>> (because trees are usually not balanced). Additionally, the lack of
>>> multi-class classification limits its applicability.
>>>
>>> Also, RF requires a random subset of features at each tree node to be
>>> effective (not just bootstrap samples), and the MLlib decision tree
>>> doesn't support that.
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>
>>>> MLlib has a decision tree. There is an RF PR which is not active now;
>>>> take that and swap its tree builder with the fast tree builder that's
>>>> in MLlib. Search the Spark JIRA for it. The code is based on the Google
>>>> PLANET paper.
>>>>
>>>> I am sure people on the dev list are already working on it. Send an
>>>> email there to find out the status.
>>>>
>>>> There is also an RF in Cloudera Oryx, but we could not run it on our
>>>> data yet.
>>>>
>>>> Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs,
>>>> but it does not scale.
>>>> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For one of my applications, I want to use random forests (RF) on top
>>>>> of Spark. I see that MLlib currently does not have an implementation
>>>>> of RF. What other open-source RF implementations would be good to use
>>>>> with Spark in terms of speed?
>>>>>
>>>>> Regards,
>>>>> Laeeq Ahmed,
>>>>> KTH, Sweden.
>>>>>
>>>>>
>>>
>>
>
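
To put rough numbers on the point quoted above about computing histograms per variable per split at each level, here is a back-of-the-envelope sketch. It is not MLlib's actual memory accounting. The function name, the number of statistics kept per bin, and the 8-byte value size are all assumptions made for illustration, and it assumes every node at a level is present (a fully balanced level), so it is an upper bound for that level.

```python
def level_histogram_bytes(depth, num_features, num_bins,
                          stats_per_bin=3, bytes_per_stat=8):
    """Rough upper bound on histogram memory for one tree level.

    Hypothetical model: every node at the level keeps one histogram per
    feature, each histogram has `num_bins` bins, and each bin stores
    `stats_per_bin` numeric values of `bytes_per_stat` bytes each.
    """
    nodes_at_level = 2 ** depth  # a full binary tree has 2^depth nodes at this depth
    return nodes_at_level * num_features * num_bins * stats_per_bin * bytes_per_stat


if __name__ == "__main__":
    # Illustrative inputs only: 100 features, 100 bins per feature.
    for depth in (4, 8, 12, 16):
        mib = level_histogram_bytes(depth, num_features=100, num_bins=100) / 2 ** 20
        print(f"depth {depth:2d}: about {mib:,.1f} MiB of histograms")
```

Under this model the cost doubles with every additional level, which is consistent with memory becoming a problem as trees get deeper. Real trees are usually unbalanced, as noted in the thread, so the number of live nodes at a deep level can be far below 2^depth.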
