Hi Jia,

Without seeing the actual data, it's difficult to give solid advice. But
it's quite normal that this runs for hours: it has to make a whole lot of
decisions, and it can grow tremendously large trees with that amount of data.
The error is also quite logical: you simply can't store all those huge trees.

Try setting the following options in randomForest:
mtry: the number of variables tried at each split. A smaller number speeds
things up, but the effect won't be very big.
nodesize: the minimum size of terminal nodes. The default is 1 for
classification, which means a tree is grown until every observation sits in
a separate leaf. In your case, this should be set much higher.
maxnodes: the maximum number of terminal nodes. With the amount of data you
have, this number skyrockets and produces huge trees (you can easily end up
with more than 200,000 nodes). There's no need for that, so set it to a
reasonably low value.

Try this, for example:
res <- randomForest(x = sdata1, y = sdata2, ntree = 500,
            mtry = 5, nodesize = 100, maxnodes = 60)

Note that the argument is spelled ntree, not ntrees: a misspelled argument
is silently swallowed by the ... argument, so your ntrees=10 was ignored and
the default of 500 trees was grown. That also explains why it ran for hours.

With these settings, the trees assume that the minimum size of a group of
similar observations is 100. That sounds reasonable, and it still allows
over 2,800 groups in a full tree (286,730 rows / 100 ≈ 2,867). I chose the
maximum number of nodes so that every variable can occur once in the tree,
although it doesn't have to be that way. If you still get errors, play
around a bit more with those numbers.

Actually, you should do that anyway, regardless of memory and computation
time. Random forests with fully grown trees run the risk of overfitting;
restricting the tree size avoids this and gives you a more general fit.
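For illustration only, here is a self-contained sketch with synthetic data
(not your file) showing how these restrictions, together with the sampsize
argument to draw fewer rows per tree, keep the fitted forest small; the
numbers are placeholders, so adapt them to your data:

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)
# synthetic stand-in: 10,000 rows, 10 numeric predictors, binary outcome
x <- data.frame(matrix(rnorm(10000 * 10), ncol = 10))
y <- factor(rbinom(10000, 1, 0.5))

# restricted forest: fewer rows per tree, larger leaves, capped tree size
res <- randomForest(x = x, y = y, ntree = 100,
                    sampsize = 2000,   # rows drawn for each tree
                    nodesize = 100,    # minimum terminal node size
                    maxnodes = 60)     # cap on terminal nodes per tree

print(object.size(res), units = "MB")  # check the forest stays small
```

Checking object.size() on a restricted fit like this lets you see whether
the object will fit in memory before you scale ntree back up.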

Cheers
Joris

On Tue, May 25, 2010 at 11:51 AM, Jia ZJ Zou <jia...@cn.ibm.com> wrote:

> Hi, dears,
>
> I am processing some data with 60 columns, and 286,730 rows.
> Most columns are numerical value, and some columns are categorical value.
>
> It turns out that: when ntree sets to the default value (500), it says "can
> not allocate a vector of 1.1 GB size"; And when I set ntree to be a very
> small number like 10, it will run for hours.
> I use the (x,y) rather than the (formula,data).
>
> My code:
>
> > sdata<-read.csv("D://zSignal Dump//XXXX//XXXX.csv")
> > sdata1<-subset(sdata,select=-38)
> > sdata2<-subset(sdata,select=38)
> > res<-randomForest(x=sdata1,y=sdata2,ntrees=10)
>
>
> Am I doing anything wrong? Or do you have other suggestions? Are there any
> other packages to do the same thing?
> I will appreciate if anyone can help me out, thanks!
>
>
> Thanks and Best regards,
> ------------------------------------------------
> Jia, Zou, Ph.D.
> IBM Research -- China
> Diamond Building, #19 Zhongguancun Software Park, 8 Dongbeiwang West Road,
> Haidian District, Beijing 100193, P.R. China
> Tel: +86 (10) 58748518
> E-mail: jia...@cn.ibm.com
>
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>


-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


