Sorry, I mismatched the link; it should be

https://gist.github.com/wpm/6454814

and the algorithm is not ExtraTrees but a basic ensemble of boosted trees.


2014-04-18 10:31 GMT+02:00 Eustache DIEMERT <eusta...@diemert.fr>:

> Another option is to use ExtraTrees as provided by scikit-learn with
> pyspark:
>
> https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
>
> This is a proof of concept right now and should be adapted to what you
> need, but the core decision tree implementation is highly optimized and
> could solve the memory issue mentioned by the OP.
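>
> For illustration, here is a rough sketch of the idea (not the pyrallel code
> itself), assuming an existing RDD of (feature_vector, label) tuples; names
> like labeled_rdd are placeholders, not part of any existing API:
>
> import numpy as np
> from sklearn.ensemble import ExtraTreesClassifier
>
> def fit_partition(iterator):
>     # Fit one small ExtraTrees model on the rows of a single partition.
>     rows = list(iterator)
>     if not rows:
>         return []
>     X = np.array([features for features, label in rows])
>     y = np.array([label for features, label in rows])
>     model = ExtraTreesClassifier(n_estimators=10)
>     model.fit(X, y)
>     return [model]
>
> # labeled_rdd is a placeholder RDD of (feature_vector, label) tuples.
> sub_models = labeled_rdd.mapPartitions(fit_partition).collect()
>
> # Pool the per-partition trees into one forest on the driver
> # (assumes every partition saw all classes).
> forest = sub_models[0]
> for m in sub_models[1:]:
>     forest.estimators_ += m.estimators_
>     forest.n_estimators += m.n_estimators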
>
> Also, for scaling ensembles of decision trees there is the LambdaMART
> paper [1], which is more modern/optimized in its approach, albeit using an
> MPI implementation.
>
> Finally, here [2] is a blog post of mine explaining the PLANET paper and
> its limitations.
>
> [1] http://jmlr.org/proceedings/papers/v14/burges11a/burges11a.pdf
>
> [2]
> http://stochastics.komodo.re/implementing-distributed-gradient-boosted-trees-part-2.html
>
> HTH,
>
> Eustache
>
> 2014-04-18 10:21 GMT+02:00 Laeeq Ahmed <laeeqsp...@yahoo.com>:
>
>> Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?
>>
>> Regards,
>> Laeeq
>>   On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>> Yes, it should be data-specific, and perhaps we're biased toward the
>> data sets that we are playing with. To put things in perspective, we're
>> highly interested in (and I believe our customers are):
>>
>> 1. large (hundreds of millions of rows)
>> 2. multi-class classification - nowadays, dozens of target categories are
>> common and even thousands in some cases - you could imagine that this is a
>> big reason for us requiring more 'complex' models
>> 3. high dimensional with thousands of descriptive and sort-of-independent
>> features
>>
>> From a theoretical perspective, I would argue that it's usually in our
>> best interest to prune as little as possible. I believe that pruning
>> inherently increases the bias of an individual tree, which RF can't do
>> anything about, since RF only decreases variance - that is what RF is for.
>>
>> The default pruning criterion for R's reference implementation is a
>> minimum node size of 1 (meaning a fully-grown tree) for classification, and
>> 5 for regression. I'd imagine they did at least some empirical testing to
>> justify these values at the time - although at a time of small datasets :).
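>>
>> For illustration only, here are the same values expressed with
>> scikit-learn's parameter names (scikit-learn's own defaults differ; this
>> just mirrors the R defaults described above):
>>
>> from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
>>
>> # min_samples_leaf=1 grows trees essentially to full size (classification);
>> # min_samples_leaf=5 mirrors R's default minimum node size for regression.
>> clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=1)
>> reg = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)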
>>
>> FYI, we are also considering the MLlib decision tree for our Gradient
>> Boosting implementation; however, the memory requirement is still a bit too
>> steep (we were getting heap exceptions at a depth limit of 5 with 2 GB per
>> worker and approximately 1000 features). Now, 2 GB per worker is about what
>> we expect our typical customers would tolerate, and I don't think that's
>> unreasonable for shallow trees.
>>
>>
>>
>> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>> What kind of data are you training on? These effects are *highly* data
>> dependent, and while saying "the depth of 10 is simply not adequate to
>> build high-accuracy models" may be accurate for the particular problem
>> you're modeling, it is not true in general. From a statistical perspective,
>> I consider each node in each tree an additional degree of freedom for the
>> model, and all else equal I'd expect a model with fewer degrees of freedom
>> to generalize better. Regardless, if there are lots of use cases for really
>> deep trees, we'd like to hear about them so that we can decide how
>> important they are to support!
>>
>> In the context of CART, pruning very specifically refers to a step
>> performed *after* a tree has been constructed to some depth, using
>> cross-validation to cut it back. This was a variance reduction technique in
>> the original tree work that is unnecessary and computationally expensive in
>> the context of forests. In the original Random Forests paper, there are
>> still stopping criteria - usually either minimum leaf size or minimum split
>> improvement (or both), so "training to maximum depth" doesn't mean "train
>> until you've completely divided your dataset and there's one point per
>> leaf." My point is that if you set the minimum leaf size to something like
>> 0.2% of the dataset, then you're not going to get deeper than 10 or 12
>> levels with a reasonably balanced tree.
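>>
>> A quick back-of-the-envelope check of that depth bound, assuming each
>> level of a perfectly balanced binary tree roughly halves the node size:
>>
>> import math
>>
>> min_leaf_fraction = 0.002               # minimum leaf size of 0.2% of the data
>> max_depth = math.log2(1.0 / min_leaf_fraction)
>> # ~9 levels for a perfectly balanced tree; a reasonably balanced one
>> # lands in the 10-12 range mentioned above.
>> print(round(max_depth, 1))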
>>
>> With respect to PLANET - our implementation is very much in the spirit of
>> PLANET, but has some key differences; good documentation on exactly what
>> the differences are is forthcoming, so I won't belabor them here. The
>> differences are designed to 1) avoid data shuffling, and 2) minimize the
>> number of passes over the training data. Of course, there are tradeoffs
>> involved, and there is at least one really good trick in the PLANET work
>> that we should leverage but aren't yet - namely, once the nodes get small
>> enough for their data to fit easily on a single machine, the data can be
>> shuffled and the remainder of the tree can be trained in parallel from each
>> lower node on a single machine. This would actually help with the memory
>> overheads in model training when trees get deep - if someone wants to
>> modify the current implementation of trees in MLlib and contribute this
>> optimization as a pull request, it would be welcome!
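>>
>> To make that local-training trick concrete, here is a minimal sketch of
>> the idea (not MLlib code; node_rdd and the threshold are hypothetical
>> placeholders):
>>
>> import numpy as np
>> from sklearn.tree import DecisionTreeClassifier
>>
>> LOCAL_TRAINING_THRESHOLD = 100000   # rows that comfortably fit on one machine
>>
>> def maybe_train_locally(node_rdd):
>>     # node_rdd: hypothetical RDD of (feature_vector, label) rows reaching one node.
>>     if node_rdd.count() > LOCAL_TRAINING_THRESHOLD:
>>         return None                 # keep training this node distributedly
>>     rows = node_rdd.collect()       # small enough to pull onto a single machine
>>     X = np.array([f for f, label in rows])
>>     y = np.array([label for f, label in rows])
>>     subtree = DecisionTreeClassifier()
>>     subtree.fit(X, y)               # finish the whole subtree locally
>>     return subtree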
>>
>> At any rate, we'll take this feedback into account with respect to
>> improving the tree implementation, but if anyone can send over use cases or
>> (even better) datasets where really deep trees are necessary, that would be
>> great!
>>
>>
>>
>>
>> On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>> Well, if you read the original paper,
>> http://oz.berkeley.edu/~breiman/randomforest2001.pdf
>> "Grow the tree using CART methodology to maximum size and do not prune."
>>
>> Now, The Elements of Statistical Learning (page 598) says that you could
>> potentially overfit a fully-grown regression random forest. However, this
>> effect is very slight, and likely negligible for classification.
>> http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>>
>> In our experiments, however, if the pruning is "drastic", the performance
>> actually becomes much worse. This makes intuitive sense IMO, because a
>> decision tree is a non-parametric model and the expressiveness of a tree
>> depends on the number of nodes.
>>
>> With a huge amount of data (millions or even billions of rows), we found
>> that a depth of 10 is simply not adequate to build high-accuracy models.
>>
>>
>> On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>> Hmm... can you provide some pointers to examples where deep trees are
>> helpful?
>>
>> Typically with Decision Trees you limit depth (either directly or
>> indirectly with minimum node size and minimum improvement criteria) to
>> avoid overfitting. I agree with the assessment that forests are a variance
>> reduction technique, but I'd be a little surprised if a bunch of hugely
>> deep trees don't overfit to training data. I guess I view limiting tree
>> depth as an analogue to regularization in linear models.
>>
>>
>> On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>> Evan,
>>
>> I actually haven't heard of 'shallow' random forests. I think that the
>> only scenarios where shallow trees are useful are boosting scenarios.
>>
>> AFAIK, Random Forest is a variance-reducing technique and doesn't do much
>> about bias (although some people claim that it does have some bias-reducing
>> effect). Because shallow trees typically have higher bias than fully-grown
>> trees, people don't often use shallow trees with RF.
>>
>> You can confirm this through some experiments with R's random forest
>> implementation as well. It allows you to set limits on depth and/or
>> pruning.
>>
>> In contrast, boosting is a bias reduction technique (and increases
>> variance), so people typically use shallow trees.
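>>
>> As an illustration of that convention (using scikit-learn purely as an
>> example here, not the R implementation discussed above):
>>
>> from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
>>
>> boosted = GradientBoostingClassifier(n_estimators=500, max_depth=3)  # shallow trees, reduces bias
>> forest = RandomForestClassifier(n_estimators=500, max_depth=None)    # deep trees, reduces variance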
>>
>> Our empirical experiments also confirmed that shallow trees resulted in
>> drastically lower accuracy for random forests.
>>
>> There are some papers that mix a boosting-like technique with bootstrap
>> averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could
>> potentially use shallow trees to build boosted learners, but then average
>> the results of many boosted learners.
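>>
>> A minimal sketch of that "average many boosted learners" scheme (an
>> illustration of the general idea, not the algorithm from the linked paper;
>> assumes numpy arrays X and y are already loaded):
>>
>> import numpy as np
>> from sklearn.ensemble import GradientBoostingClassifier
>>
>> def bagged_boosting(X, y, n_bags=10, seed=0):
>>     rng = np.random.RandomState(seed)
>>     models = []
>>     for _ in range(n_bags):
>>         idx = rng.randint(0, len(X), size=len(X))   # bootstrap sample
>>         m = GradientBoostingClassifier(n_estimators=100, max_depth=3)
>>         m.fit(X[idx], y[idx])
>>         models.append(m)
>>     return models
>>
>> def predict_average(models, X_new):
>>     # Average class-probability predictions over the boosted learners
>>     # (assumes each bootstrap sample contained every class).
>>     probs = np.mean([m.predict_proba(X_new) for m in models], axis=0)
>>     return probs.argmax(axis=1)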
>>
>>
>> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>> Multiclass classification, Gradient Boosting, and Random Forest support
>> based on the recent Decision Tree implementation in MLlib.
>>
>> Sung - I'd be curious to hear about your use of decision trees (and
>> forests) where you want to go to 100+ depth. My experience with random
>> forests has been that people typically build hundreds of shallow trees
>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>
>> Generally speaking, we save passes over the data by computing histograms
>> per variable per split at each *level* of a decision tree. This can blow up
>> as the decision tree gets deep, but I'd recommend a lot more memory than
>> 2-4 GB per worker for most big data workloads.
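>>
>> Rough arithmetic behind that blow-up (illustrative numbers only, not
>> MLlib's actual layout or constants; real per-node overhead from split
>> candidates, per-partition copies, and aggregation buffers is larger):
>>
>> features = 1000
>> bins = 32
>> classes = 2
>> bytes_per_counter = 8
>>
>> for level in range(5, 11):
>>     open_nodes = 2 ** level          # nodes at this level of a balanced tree
>>     size_gb = open_nodes * features * bins * classes * bytes_per_counter / 1e9
>>     print("level %2d: ~%.2f GB of histogram state" % (level, size_gb))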
>>
>>
>>
>>
>>
>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>> Debasish, we've tested the MLlib decision tree a bit and it eats up too
>> much memory for RF purposes. Once the tree got to depth 8~9, it was easy to
>> get heap exceptions, even with 2~4 GB of memory per worker.
>>
>> With RF, it's very easy to get 100+ depth with only 100,000+ rows
>> (because trees usually are not balanced). Additionally, the lack of
>> multi-class classification limits its applicability.
>>
>> Also, RF requires random features per tree node (not just bootstrap
>> samples) to be effective, and the MLlib decision tree doesn't support that.
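>>
>> For reference, a minimal sketch of that per-node feature subsampling (a
>> hypothetical helper, not MLlib code): at each split only about sqrt(p)
>> randomly chosen features are considered.
>>
>> import math
>> import numpy as np
>>
>> def candidate_features(n_features, rng):
>>     k = max(1, int(math.sqrt(n_features)))
>>     return rng.choice(n_features, size=k, replace=False)
>>
>> rng = np.random.RandomState(42)
>> print(candidate_features(1000, rng))   # ~31 feature indices drawn for a single node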
>>
>>
>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>> MLlib has a decision tree... there is an RF PR which is not active
>> now... take that and swap the tree builder with the fast tree builder
>> that's in MLlib... search for the Spark JIRA... the code is based on the
>> Google PLANET paper.
>> I am sure people on the dev list are already working on it... send an
>> email to check the status over there.
>> There is also an RF in Cloudera Oryx, but we could not run it on our data
>> yet.
>> Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but
>> it does not scale.
>>  On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>
>> Hi,
>>
>> For one of my applications, I want to use Random Forests (RF) on top of
>> Spark. I see that currently MLlib does not have an implementation of RF.
>> What other open-source RF implementations would be good to use with Spark
>> in terms of speed?
>>
>> Regards,
>> Laeeq Ahmed,
>> KTH, Sweden.
>>
>>
>
