Multiclass classification, Gradient Boosting, and Random Forest support, based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that people typically build hundreds of shallow trees
(maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms
per variable per split at each *level* of a decision tree. This can blow up
as the level of the decision tree gets deep, but I'd recommend a lot more
memory than 2-4GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

> Debasish, we've tested the MLLib decision tree a bit and it eats up too
> much memory for RF purposes.
> Once the tree got to depth 8~9, it was easy to get heap exception, even
> with 2~4 GB of memory per worker.
>
> With RF, it's very easy to get 100+ depth in RF with even only 100,000+
> rows (because trees usually are not balanced). Additionally, the lack of
> multi-class classification limits its applicability.
>
> Also, RF requires random features per tree node to be effective (not just
> bootstrap samples), and MLLib decision tree doesn't support that.
>
>
> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
> <debasish.da...@gmail.com> wrote:
>
>> Mllib has decision tree....there is a rf pr which is not active
>> now....take that and swap the tree builder with the fast tree builder
>> that's in mllib...search for the spark jira...the code is based on google
>> planet paper. ..
>>
>> I am sure people in devlist are already working on it...send an email to
>> know the status over there...
>>
>> There is also a rf in cloudera oryx but we could not run it on our data
>> yet....
>>
>> Weka 3.7.10 has a multi thread rf that is good to do some adhoc runs but
>> it does not scale...
>> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> For one of my applications, I want to use Random forests (RF) on top of
>>> spark. I see that currently MLLib does not have an implementation for RF.
>>> What other opensource RF implementations will be great to use with spark
>>> in terms of speed?
>>>
>>> Regards,
>>> Laeeq Ahmed,
>>> KTH, Sweden.
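
To put rough numbers on the per-level histogram growth mentioned in the reply at the top of this thread, here is a back-of-the-envelope sketch in Python. It is not MLlib code. The function name, the feature and bin counts, and the three statistics per bin are all assumptions chosen for illustration, and the real aggregate layout in MLlib may differ. It also assumes a full binary tree, so it is an upper bound; an unbalanced tree of the kind Sung describes has far fewer nodes per level.

```python
def level_histogram_bytes(depth, num_features, num_bins,
                          stats_per_bin=3, bytes_per_stat=8):
    """Upper-bound estimate of histogram memory for one tree level.

    Hypothetical model: one histogram per (node, feature), each with
    num_bins bins, each bin holding stats_per_bin 8-byte numbers.
    """
    # A full binary tree has 2**depth nodes at the given depth (root = 0).
    nodes_at_level = 2 ** depth
    return (nodes_at_level * num_features * num_bins
            * stats_per_bin * bytes_per_stat)


if __name__ == "__main__":
    # Illustrative settings only: 100 features, 100 bins per feature.
    for depth in (4, 8, 12, 16):
        mib = level_histogram_bytes(depth, 100, 100) / 2 ** 20
        print(f"depth {depth:2d}: ~{mib:,.1f} MiB of histograms")
```

Under this model the estimate doubles with each extra level, which is consistent with heap pressure appearing within a few levels once the per-level total approaches worker memory.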