Is there a PR or issue that tracks GBT / RF progress in MLlib?

2014-04-17 21:11 GMT+02:00 Evan R. Sparks <[email protected]>:

> Sorry - I meant to say that "Multiclass classification, Gradient
> Boosting, and Random Forest support based on the recent Decision Tree
> implementation in MLlib is planned and coming soon."
>
>
> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <[email protected]> wrote:
>
>> Multiclass classification, Gradient Boosting, and Random Forest support
>> for based on the recent Decision Tree implementation in MLlib.
>>
>> Sung - I'd be curious to hear about your use of decision trees (and
>> forests) where you want to go to 100+ depth. My experience with random
>> forests has been that people typically build hundreds of shallow trees
>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>
>> Generally speaking, we save passes over the data by computing histograms
>> per variable per split at each *level* of a decision tree. This can blow up
>> as the level of the decision tree gets deep, but I'd recommend a lot more
>> memory than 2-4GB per worker for most big data workloads.
>>
>>
>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <[email protected]> wrote:
>>
>>> Debasish, we've tested the MLLib decision tree a bit and it eats up too
>>> much memory for RF purposes.
>>> Once the tree got to depth 8~9, it was easy to get heap exception, even
>>> with 2~4 GB of memory per worker.
>>>
>>> With RF, it's very easy to get 100+ depth in RF with even only 100,000+
>>> rows (because trees usually are not balanced). Additionally, the lack of
>>> multi-class classification limits its applicability.
>>>
>>> Also, RF requires random features per tree node to be effective (not
>>> just bootstrap samples), and MLLib decision tree doesn't support that.
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <[email protected]> wrote:
>>>
>>>> Mllib has decision tree....there is a rf pr which is not active
>>>> now....take that and swap the tree builder with the fast tree builder
>>>> that's in mllib...search for the spark jira...the code is based on google
>>>> planet paper. ..
>>>>
>>>> I am sure people in devlist are already working on it...send an email
>>>> to know the status over there...
>>>>
>>>> There is also a rf in cloudera oryx but we could not run it on our data
>>>> yet....
>>>>
>>>> Weka 3.7.10 has a multi thread rf that is good to do some adhoc runs
>>>> but it does not scale...
>>>> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For one of my application, I want to use Random forests(RF) on top of
>>>>> spark. I see that currenlty MLLib does not have implementation for RF.
>>>>> What other opensource RF implementations will be great to use with
>>>>> spark in terms of speed?
>>>>>
>>>>> Regards,
>>>>> Laeeq Ahmed,
>>>>> KTH, Sweden.
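
The memory growth discussed in the quoted thread (histograms per variable per split at each level of the tree) can be illustrated with some rough arithmetic. The sketch below is only a back-of-the-envelope upper bound and not MLlib's actual accounting: the function name, the feature count, the bin count, the three statistics per bin, and the 8 bytes per statistic are all assumptions chosen for illustration. It also assumes a full binary tree, whereas real trees are often unbalanced.

```python
def histogram_bytes_per_level(depth, num_features, num_bins,
                              stats_per_bin=3, bytes_per_stat=8):
    """Rough upper bound on histogram memory for one level of a binary tree.

    All parameters are hypothetical; this is not how MLlib sizes its buffers.
    """
    # A full binary tree has at most 2^depth nodes at the given depth.
    nodes_at_level = 2 ** depth
    # One histogram per node per feature, each with num_bins bins.
    return nodes_at_level * num_features * num_bins * stats_per_bin * bytes_per_stat


if __name__ == "__main__":
    # Assumed workload: 1000 features, 100 bins per feature.
    for depth in (4, 8, 10, 12):
        mib = histogram_bytes_per_level(depth, num_features=1000, num_bins=100) / 2**20
        print(f"depth {depth:2d}: ~{mib:,.0f} MiB")
```

Under these assumptions the bound doubles with every extra level, which is consistent with heap pressure appearing around depth 8~9 on workers with only a few GB of memory.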
