Mahout RDF (random decision forest) is fairly old code. If you try it, use 1.0-SNAPSHOT, as you will almost certainly need this patch to make it run reasonably fast: https://issues.apache.org/jira/browse/MAHOUT-1419

I have not tried Stratosphere here. Since we are on the subject of RDF on Hadoop, possibly on M/R, I don't feel too bad advertising this: Oryx also does classification/regression via RDF: https://github.com/cloudera/oryx#classification--regression-example

This is a fairly different design choice than, say, what's in the PLANET paper. The one big negative is that trees are built only over a sub-sample of the data. But given that big simplifying assumption, a lot of other things work well. It's not iterative, so it is not handicapped by being M/R-based. May be of interest if you're building / benchmarking stuff on Hadoop.

Personally, going forward, I'm interested in something smarter (like what I see going into the new Spark impl), although there really are some big design tradeoffs here, yes.

--
Sean Owen | Director, Data Science | London

On Fri, Apr 18, 2014 at 9:21 AM, Laeeq Ahmed <laeeqsp...@yahoo.com> wrote:
> Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?
>
> Regards,
> Laeeq
>
> On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung
> <coded...@cs.stanford.edu> wrote:
> Yes, it should be data-specific, and perhaps we're biased toward the
> data sets that we are playing with. To put things in perspective, we're
> highly interested in (and I believe our customers are):
>
> 1. large data (hundreds of millions of rows)
> 2. multi-class classification - nowadays, dozens of target categories
> are common, and even thousands in some cases - you can imagine that
> this is a big reason for us requiring more 'complex' models
> 3. high-dimensional data, with thousands of descriptive and
> sort-of-independent features
>
> From a theoretical perspective, I would argue that it's usually in
> one's best interest to prune as little as possible. I believe that
> pruning inherently increases the bias of an individual tree, which RF
> can't do anything about, whereas decreasing variance is exactly what RF
> is for.
>
> The default stopping criterion for R's reference implementation is a
> minimum node size of 1 (meaning fully-grown trees) for classification,
> and 5 for regression. I'd imagine they did at least some empirical
> testing to justify these values at the time - although at a time of
> small datasets :).
>
> FYI, we are also considering the MLlib decision tree for our Gradient
> Boosting implementation; however, the memory requirement is still a bit
> too steep (we were getting heap exceptions at a depth limit of 5, with
> 2GB per worker and approximately 1000 features). Now, 2GB per worker is
> about what we expect our typical customers would tolerate, and I don't
> think that it's unreasonable for shallow trees.
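Those heap numbers are easier to interpret with a rough back-of-the-envelope estimate. The sketch below shows how the per-level histogram state of a PLANET-style, level-wise tree builder (the approach Evan describes later in the thread) grows exponentially with depth; the bin count and per-stat size here are assumptions for illustration, not MLlib's actual internals:

```scala
// Rough, illustrative estimate of the histogram footprint a level-wise
// tree builder aggregates at a given depth. All constants are assumed.
object HistogramFootprint {
  def bytesAtLevel(level: Int, numFeatures: Int, numBins: Int,
                   numClasses: Int, bytesPerStat: Int = 8): Long = {
    val nodesAtLevel = 1L << level  // 2^level nodes in a binary tree
    nodesAtLevel * numFeatures * numBins * numClasses * bytesPerStat
  }

  def main(args: Array[String]): Unit = {
    // ~1000 features, 100 bins, 2 classes:
    println(bytesAtLevel(5, 1000, 100, 2))   // ~51 MB at level 5
    println(bytesAtLevel(10, 1000, 100, 2))  // ~1.6 GB at level 10
  }
}
```

Each extra level doubles the footprint, and real usage is higher still once copies made during shuffling and reduction are counted, which is consistent with shallow depth limits already pressing against a 2GB heap.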
> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
> What kind of data are you training on? These effects are *highly* data
> dependent, and while saying "a depth of 10 is simply not adequate to
> build high-accuracy models" may be accurate for the particular problem
> you're modeling, it is not true in general. From a statistical
> perspective, I consider each node in each tree an additional degree of
> freedom for the model, and all else equal I'd expect a model with fewer
> degrees of freedom to generalize better. Regardless, if there are lots
> of use cases for really deep trees, we'd like to hear about them so
> that we can decide how important they are to support!
>
> In the context of CART, pruning very specifically refers to a step,
> *after* a tree has been constructed to some depth, that uses
> cross-validation. This was a variance reduction technique in the
> original tree work that is unnecessary and computationally expensive in
> the context of forests. In the original Random Forests paper, there are
> still stopping criteria - usually either minimum leaf size or minimum
> split improvement (or both) - so "training to maximum depth" doesn't
> mean "train until you've completely divided your dataset and there's
> one point per leaf." My point is that if you set the minimum leaf size
> to something like 0.2% of the dataset, then you're not going to get
> deeper than 10 or 12 levels with a reasonably balanced tree.
>
> With respect to PLANET - our implementation is very much in the spirit
> of PLANET, but it has some key differences. Good documentation on
> exactly what the differences are is forthcoming, so I won't belabor
> them here. The differences are designed to 1) avoid data shuffling, and
> 2) minimize the number of passes over the training data. Of course,
> there are tradeoffs involved, and there is at least one really good
> trick in the PLANET work that we should leverage but aren't yet -
> namely, once the nodes get small enough for their data to fit easily on
> a single machine, the data can be shuffled and the remainder of the
> tree trained in parallel, from each lower node, on a single machine.
> This would actually help with the memory overheads in model training
> when trees get deep - if someone wants to modify the current
> implementation of trees in MLlib and contribute this optimization as a
> pull request, it would be welcome!
>
> At any rate, we'll take this feedback into account with respect to
> improving the tree implementation, but if anyone can send over use
> cases or (even better) datasets where really deep trees are necessary,
> that would be great!
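The minimum-leaf-size arithmetic above is worth making concrete. Here is a minimal sketch in plain Scala (no particular library's API) of the depth bound a leaf-size fraction implies for a balanced binary tree:

```scala
// A balanced binary tree roughly halves the data at each split, so
// leaves hit a minimum size of (fraction * n) rows after about
// log2(1 / fraction) levels, independent of n.
object DepthBound {
  def maxBalancedDepth(minLeafFraction: Double): Int =
    math.floor(math.log(1.0 / minLeafFraction) / math.log(2.0)).toInt

  def main(args: Array[String]): Unit = {
    // 0.2% of the dataset per leaf: log2(500) ~ 8.97, so depth ~ 9 when
    // balanced; unbalanced branches run a few levels deeper, hence the
    // "10 or 12 levels" figure above.
    println(maxBalancedDepth(0.002)) // prints 8
  }
}
```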
> On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung
> <coded...@cs.stanford.edu> wrote:
> Well, if you read the original paper,
> http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> "Grow the tree using CART methodology to maximum size and do not prune."
>
> Now, The Elements of Statistical Learning (page 598) says that you
> could potentially overfit a fully-grown regression random forest.
> However, this effect is very slight, and likely negligible for
> classification.
> http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>
> In our experiments, however, if the pruning is "drastic", the
> performance actually becomes much worse. This makes intuitive sense,
> IMO, because a decision tree is a non-parametric model, and the
> expressiveness of a tree depends on its number of nodes.
>
> With a huge amount of data (millions or even billions of rows), we
> found that a depth of 10 is simply not adequate to build high-accuracy
> models.
>
> On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks
> <evan.spa...@gmail.com> wrote:
> Hmm... can you provide some pointers to examples where deep trees are
> helpful?
>
> Typically with decision trees you limit depth (either directly, or
> indirectly with minimum node size and minimum improvement criteria) to
> avoid overfitting. I agree with the assessment that forests are a
> variance reduction technique, but I'd be a little surprised if a bunch
> of hugely deep trees didn't overfit to training data. I guess I view
> limiting tree depth as an analogue to regularization in linear models.
>
> On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung
> <coded...@cs.stanford.edu> wrote:
> Evan,
>
> I actually haven't heard of 'shallow' random forests. I think that the
> only scenarios where shallow trees are useful are boosting scenarios.
>
> AFAIK, Random Forest is a variance-reducing technique and doesn't do
> much about bias (although some people claim that it does have some
> bias-reducing effect). Because shallow trees typically have higher bias
> than fully-grown trees, people don't often use shallow trees with RF.
>
> You can confirm this through some experiments with R's random forest
> implementation as well. It allows you to set limits on depth and/or
> pruning.
>
> In contrast, boosting is a bias reduction technique (and it increases
> variance), so people typically use shallow trees.
>
> Our empirical experiments also confirmed that shallow trees resulted in
> drastically lower accuracy for random forests.
>
> There are some papers that mix a boosting-like technique with bootstrap
> averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could
> potentially use shallow trees to build boosted learners, but then
> average the results of many boosted learners.
>
> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks
> <evan.spa...@gmail.com> wrote:
> Multiclass classification, Gradient Boosting, and Random Forest
> support based on the recent Decision Tree implementation in MLlib.
>
> Sung - I'd be curious to hear about your use of decision trees (and
> forests) where you want to go to 100+ depth. My experience with random
> forests has been that people typically build hundreds of shallow trees
> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>
> Generally speaking, we save passes over the data by computing
> histograms per variable per split at each *level* of a decision tree.
> This can blow up as the level of the decision tree gets deep, but I'd
> recommend a lot more memory than 2-4GB per worker for most big data
> workloads.
>
> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung
> <coded...@cs.stanford.edu> wrote:
> Debasish, we've tested the MLlib decision tree a bit, and it eats up
> too much memory for RF purposes. Once the tree got to depth 8~9, it was
> easy to get a heap exception, even with 2~4 GB of memory per worker.
>
> With RF, it's very easy to get to depths of 100+ with even only
> 100,000+ rows (because trees usually are not balanced). Additionally,
> the lack of multi-class classification limits its applicability.
>
> Also, RF requires random features per tree node to be effective (not
> just bootstrap samples), and the MLlib decision tree doesn't support
> that.
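Since "random features per tree node" is the crux here, a minimal standalone sketch may help. It shows the two sampling ingredients that separate a random forest from plain bagged trees, plus the majority vote that combines tree outputs; all names are hypothetical helpers, not any library's API:

```scala
import scala.util.Random

object ForestSampling {
  val rng = new Random(42)

  // One bootstrap sample of row indices per tree (sampling with
  // replacement; about 37% of rows are left out of each tree's sample).
  def bootstrapIndices(n: Int): Array[Int] =
    Array.fill(n)(rng.nextInt(n))

  // At *each node*, not once per tree, consider only a random subset of
  // features; sqrt(p) is a common default for classification.
  def featureSubset(numFeatures: Int): Array[Int] = {
    val mtry = math.max(1, math.sqrt(numFeatures).round.toInt)
    rng.shuffle((0 until numFeatures).toList).take(mtry).toArray
  }

  // Majority vote across the per-tree class predictions for one example.
  def vote(predictions: Seq[Int]): Int =
    predictions.groupBy(identity).maxBy(_._2.size)._1

  def main(args: Array[String]): Unit = {
    println(bootstrapIndices(10).mkString(","))  // rows for one tree
    println(featureSubset(1000).length)          // 32 of 1000 features
    println(vote(Seq(1, 0, 1, 1, 2)))            // prints 1
  }
}
```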
> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
> <debasish.da...@gmail.com> wrote:
> MLlib has a decision tree... there is an RF PR which is not active
> now... take that and swap the tree builder with the fast tree builder
> that's in MLlib... search for the Spark JIRA... the code is based on
> the Google PLANET paper...
>
> I am sure people on the dev list are already working on it... send an
> email there to find out the status...
>
> There is also an RF in Cloudera Oryx, but we could not run it on our
> data yet...
>
> Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc runs,
> but it does not scale...
>
> On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
> Hi,
>
> For one of my applications, I want to use Random Forests (RF) on top of
> Spark. I see that MLlib currently does not have an implementation of
> RF. What other open-source RF implementations would be good to use with
> Spark in terms of speed?
>
> Regards,
> Laeeq Ahmed,
> KTH, Sweden.
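For reference, a minimal sketch of driving the MLlib decision tree the thread keeps pointing at. This assumes the Spark 1.0-era API (the DecisionTree.train(input, algo, impurity, maxDepth) overload shown here matches that release's documentation; later versions differ), and the toy data is made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.Classification
import org.apache.spark.mllib.tree.impurity.Gini

object TreeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tree-sketch"))

    // Toy two-class data; real use would load hundreds of millions of
    // rows, as discussed in the thread.
    val data = sc.parallelize(0 until 200).map { i =>
      val label = (i % 2).toDouble
      LabeledPoint(label, Vectors.dense(label, 1.0 - label))
    }

    // Shallow depth limit: as discussed above, level-wise training
    // multiplies the histogram memory footprint at every extra level.
    val maxDepth = 5
    val model = DecisionTree.train(data, Classification, Gini, maxDepth)

    println(model.predict(Vectors.dense(1.0, 0.0))) // expect ~1.0
    sc.stop()
  }
}
```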