Extra Trees are even more random than random forests. Have a look at
the referenced papers.
To choose one vs the other you can evaluate the generalization power
via cross-validation on your data (you might also want to grid-search
the optimal values for max_features and min_samples_split).
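A minimal sketch of that comparison, using the modern sklearn.model_selection API (the dataset and parameter grid here are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for your real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Grid over the two parameters mentioned above.
param_grid = {"max_features": ["sqrt", "log2", None],
              "min_samples_split": [2, 10, 50]}

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    search = GridSearchCV(Model(n_estimators=50, random_state=0),
                          param_grid, cv=5)
    search.fit(X, y)
    print(Model.__name__, search.best_params_, round(search.best_score_, 3))
```

Whichever model cross-validates better on your own data is the one to keep.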
This looks perfect. I’m pretty new to ensemble methods, so please forgive this
ignorant question: what’s the difference between ExtraTrees and RandomForests?
From http://scikit-learn.org/stable/modules/ensemble.html it looks like
ExtraTrees is an extension of RandomForests. Examples of when one
should be preferred over the other would be appreciated.
2014-02-07 15:09 GMT-08:00 Peter Prettenhofer :
> Hi Alessandro,
>
> you might want to look into this presentation by Olivier
> https://speakerdeck.com/ogrisel/growing-randomized-trees-in-the-cloud-1 --
> it should be pretty much what you need. Code is here
> https://github.com/pydata/pyrallel.
There is no support for multi-machine parallel computing in scikit-learn.
You'll have to write your own code mimicking the code of the random
forest.
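One common workaround (a sketch of the do-it-yourself approach Gaël describes, not an official scikit-learn distribution API) is to fit smaller forests independently, e.g. one per machine with a different random seed, and then merge them by pooling their fitted trees. This relies on the internal estimators_ attribute, so it is version-sensitive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend each sub-forest was fit on a different machine; here we just
# fit them sequentially with distinct seeds so the trees differ.
sub_forests = []
for seed in range(4):
    clf = RandomForestClassifier(n_estimators=25, random_state=seed)
    clf.fit(X, y)
    sub_forests.append(clf)

# Merge: reuse the first forest as the container and pool all trees.
combined = sub_forests[0]
for clf in sub_forests[1:]:
    combined.estimators_ += clf.estimators_
combined.n_estimators = len(combined.estimators_)

# combined now predicts by averaging over all 100 pooled trees.
print(combined.n_estimators)
```

In a real multi-machine setup you would pickle each fitted sub-forest, ship the pickles to one node, and perform the same merge there.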
Gaël
On Fri, Feb 07, 2014 at 10:28:01PM, Alessandro Gagliardi wrote:
> Hi All,
> I want to run a large sklearn.ensemble.RandomForestClassifier
Hi Alessandro,
you might want to look into this presentation by Olivier
https://speakerdeck.com/ogrisel/growing-randomized-trees-in-the-cloud-1 --
it should be pretty much what you need. Code is here
https://github.com/pydata/pyrallel.
best,
Peter
2014-02-07 23:28 GMT+01:00 Alessandro Gagliardi :
Hi All,
I want to run a large sklearn.ensemble.RandomForestClassifier (with maybe
dozens or hundreds of trees and 100,000 samples). My desktop won’t handle
this, so I want to try using StarCluster. RandomForestClassifier seems to
parallelize easily, but I don’t know how I would split it across machines.
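For what it’s worth, the single-machine part of that parallelism is already exposed through the n_jobs parameter, which fans tree building out across local cores (a minimal sketch with toy data; multi-machine distribution is the part scikit-learn does not provide):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real 100,000-sample dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# n_jobs=-1 builds the trees on all available cores; each tree is
# trained independently, which is what makes forests easy to parallelize.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```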