Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Jacob Schreiber
Okay this makes sense. Upon reflection, the splitters all only take pointers to the datasets, so that shouldn't have been a problem. On Thu, Oct 8, 2015 at 5:40 PM, Peter Rickwood wrote: > > Found the issue > > It is because I am using warm start. I was using warm start and gradually > adding mo

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Peter Rickwood
Found the issue It is because I am using warm start. I was using warm start and gradually adding models to the GBM, and this causes a memory blowout. If I change this and just run the same number of iterations in one go rather than incrementally, I get no memory issue. Peter -
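
For readers following the thread, the two calling patterns Peter contrasts can be sketched as below. This is a minimal illustration, not his actual code: the synthetic data, the step sizes (10, 20, 30), and the variable names are assumptions chosen for the example. It uses scikit-learn's `GradientBoostingRegressor` with its `warm_start` parameter. It shows only how the two patterns differ in usage and does not reproduce or measure the memory behaviour reported in this thread.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic data set (hypothetical; the thread's data was ~4GB).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.rand(200)

# Pattern 1: warm start, growing the ensemble incrementally.
# Each fit() call keeps the existing trees and adds new ones
# until n_estimators is reached.
incremental = GradientBoostingRegressor(
    n_estimators=10, warm_start=True, random_state=0
)
for n_trees in (10, 20, 30):
    incremental.set_params(n_estimators=n_trees)
    incremental.fit(X, y)

# Pattern 2: the same number of iterations in a single fit() call.
one_shot = GradientBoostingRegressor(n_estimators=30, random_state=0)
one_shot.fit(X, y)

# Both ensembles end up with the same number of boosting stages.
n_incremental = len(incremental.estimators_)
n_one_shot = len(one_shot.estimators_)
print(n_incremental, n_one_shot)
```

Per the message above, the second pattern avoided the memory blowout in the sklearn version being discussed.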

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Peter Rickwood
Jacob: Great, thanks for confirming. Glad I'm not going crazy or doing something silly. What sklearn version would I need to downgrade to to get back to the old setup (one splitter for all trees)? Andreas: yes, it completes just fine if I set the number of iterations low enough (i.e. ~80) Thanks

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Andreas Mueller
On 10/08/2015 06:25 PM, Jacob Schreiber wrote: > Hi > > I think your hypothesis is correct. We recently switched from having > one splitter for all trees, to having one splitter per tree. I can > submit a hotfix tonight to prevent the data from being held multiple times > Hm haven't paid attent

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Andreas Mueller
What I meant was if you set the number of max iterations to like 80, does it run through? On 10/08/2015 06:29 PM, Peter Rickwood wrote: Yes, I can get up to 80-100 trees/iterations and everything works normally (but slow due to thrashing) before the OS kills it. I'll try and look into it w

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Peter Rickwood
Yes, I can get up to 80-100 trees/iterations and everything works normally (but slow due to thrashing) before the OS kills it. I'll try and look into it with the profiler you suggest and if I find anything will get back to the list. It is of course possible I'm doing something else on the side wh

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Jacob Schreiber
Hi I think your hypothesis is correct. We recently switched from having one splitter for all trees, to having one splitter per tree. I can submit a hotfix tonight to prevent the data from being held multiple times Jacob On Thu, Oct 8, 2015 at 3:16 PM, Andreas Mueller wrote: > Hm, that does sou

Re: [Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Andreas Mueller
Hm, that does sound a bit odd. Maybe the memory_profiler will shed light on it? https://pypi.python.org/pypi/memory_profiler So if you use less than 100 trees it runs through? Andy On 10/08/2015 06:12 PM, Peter Rickwood wrote: Hello all, I'm puzzled by the memory use of sklearn's GBM implem
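
As a rough sketch of the kind of measurement Andy suggests, the example below compares peak memory for a small and a large run. It deliberately swaps in Python's standard-library `tracemalloc` module rather than memory_profiler, so it runs without extra packages. The functions `build_models` and `peak_bytes` are hypothetical stand-ins invented for this illustration, not part of sklearn or memory_profiler. `tracemalloc` sees allocations made through Python's allocator and may miss memory allocated inside compiled extensions, so treat the numbers as indicative only.

```python
import tracemalloc


def build_models(n_models, rows_per_model):
    """Hypothetical stand-in for a loop that keeps one buffer per model."""
    return [[0.0] * rows_per_model for _ in range(n_models)]


def peak_bytes(func, *args):
    """Run func(*args) and return (peak traced bytes, result)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, result


# Compare a short run with a longer one; if memory grows with the
# number of models, each model is holding on to its own buffer.
small_peak, _ = peak_bytes(build_models, 10, 10_000)
large_peak, _ = peak_bytes(build_models, 100, 10_000)
print(small_peak, large_peak)
```

If peak memory scales roughly with the number of iterations, that points to something being retained per iteration, which is the pattern hypothesised later in this thread.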

[Scikit-learn-general] memory use of sklearn GBM implementation

2015-10-08 Thread Peter Rickwood
Hello all, I'm puzzled by the memory use of sklearn's GBM implementation. It takes up all available memory and is forced to terminate by the OS, and I can't think of why it is using as much memory as it does. Here is the situation: I have a modest data set of size ~ 4GB (1800 columns, 55 rows,

Re: [Scikit-learn-general] How to optimize a random forest for out of sample prediction

2015-10-08 Thread Andreas Mueller
On 10/07/2015 03:29 AM, Joel Nothman wrote: > RFECV will select features based on scores on a number of validation > sets, as selected by its cv parameter. As opposed to that > StackOverflow query, RFECV should now support RandomForest and its > feature_importances_ attribute. > RFECV is not t