> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Max Kuhn
> Sent: Monday, June 14, 2010 10:19 AM
> To: Matthew OKane
> Cc: r-help@r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> The first thing that I would recommend is to avoid the "formula
> interface" to models. The internals that R uses to create matrices
> from a formula + data set are not efficient. If you had a large
> number of variables, I would have automatically pointed to that as a
> source of issues. cforest and ctree only have formula interfaces,
> though, so you are stuck on that one. The randomForest package has
> both interfaces, so that might be better.
>
> Probably the issue is the depth of the trees. With that many
> observations, you are likely to get extremely deep trees. You might
> try limiting the depth of the trees and see whether that has an
> effect on performance.
>
> We run into these issues with large compound libraries; in those
> cases we do whatever we can to avoid ensembles of trees or kernel
> methods. If you want those, you might need to write your own code
> that is hyper-efficient and tuned to your particular data structure
> (as we did).
>
> On another note... are this many observations really needed? You
> have 40-ish variables; I suspect that >1M points are pretty densely
> packed into 40-dimensional space.

This did not seem right to me: 40-dimensional space is very, very big,
and even a million observations will be thinly spread. There is
probably some analytic result from the theory of coverage processes
about this, but I just did a quick simulation. If a million samples
are independently and uniformly distributed in a 40-dimensional unit
hypercube, then more than 90% of the points in the hypercube are more
than one-quarter of the maximum possible distance (sqrt(40)) from the
nearest sample, and about 40% of the hypercube is more than one-third
of the maximum possible distance from the nearest sample.
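A minimal sketch of such a simulation, reconstructed from the
description above rather than from the original code (n is scaled down
from one million so it runs quickly; a smaller n only makes the
coverage even sparser):

## Monte Carlo check of how well n random points cover a
## 40-dimensional unit hypercube.
set.seed(1)
d <- 40
n <- 1e5    # sample points scattered in the hypercube (1e6 in the claim above)
m <- 500    # random query points at which coverage is assessed
X <- matrix(runif(n * d), nrow = n)
Q <- matrix(runif(m * d), nrow = m)

## distance from each query point to its nearest sample point
nn.dist <- numeric(m)
for (i in seq_len(m)) {
  d2 <- rowSums(sweep(X, 2, Q[i, ])^2)  # squared distances to all samples
  nn.dist[i] <- sqrt(min(d2))
}

dmax <- sqrt(d)             # maximum possible distance in the cube, sqrt(40)
mean(nn.dist > dmax / 4)    # estimated fraction of the cube beyond dmax/4
mean(nn.dist > dmax / 3)    # estimated fraction of the cube beyond dmax/3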
So the samples do not densely cover the space at all. One implication
is that modeling the relation of a response to 40 predictors will
inevitably require a lot of smoothing, even with a million data
points.

Richard Raubertas
Merck & Co.

> Do you lose much by sampling the data set or allocating a large
> portion to a test set? If you have thousands of predictors, I could
> see the need for so many observations, but I'm wondering if many of
> the samples are redundant.
>
> Max
>
> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
> <mlok...@gmail.com> wrote:
> > Answers added below.
> > Thanks again,
> > Matt
> >
> > On 11 June 2010 14:28, Max Kuhn <mxk...@gmail.com> wrote:
> >>
> >> Also, you have not said:
> >>
> >> - your OS: Windows Server 2003 64-bit
> >> - your version of R: 2.11.1 64-bit
> >> - your version of party: 0.9-9995
> >> - your code:
> >>
> >>   test.cf <- cforest(formula = badflag ~ ., data = example,
> >>                      control = cforest_control(teststat = "max",
> >>                        testtype = "Teststatistic", replace = FALSE,
> >>                        ntree = 500, savesplitstats = FALSE,
> >>                        mtry = 10))
> >>
> >> - what "large data set" means: > 1 million observations, 40+
> >>   variables, around 200 MB
> >> - what "very large model objects" means: anything which breaks
> >>
> >> So... how is anyone supposed to help you?
> >>
> >> Max

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
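As a footnote to Max's two suggestions earlier in the thread (use a
non-formula interface, and limit tree depth), here is a minimal sketch
assuming the randomForest package; `example` and `badflag` are the
poster's data frame and response, and the nodesize/maxnodes/sampsize
values are illustrative placeholders, not tuned recommendations:

library(randomForest)

## Split predictors and response instead of using badflag ~ . ;
## the x/y interface avoids the model.frame/model.matrix copies
## that the formula interface would create.
x <- example[, setdiff(names(example), "badflag")]
y <- factor(example$badflag)

fit <- randomForest(
  x, y,
  ntree    = 500,
  mtry     = 10,
  nodesize = 500,     # larger terminal nodes => shallower trees, smaller object
  maxnodes = 128,     # hard cap on terminal nodes per tree
  sampsize = 100000   # grow each tree on a subsample of the rows
)
print(fit)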