Re: [Scikit-learn-general] Pipeline - Convert to dense

2014-08-21 Thread Anders Aagaard
If X2 doesn't have the same ordering you wouldn't be able to pass that directly either. The data is split before being run through the pipeline, so just using hstack is fine. By the way, the code I use to make this easier is here: https://github.com/andaag/scikit_helpers . On Fri, Aug 22,

Re: [Scikit-learn-general] Pipeline - Convert to dense

2014-08-21 Thread Sebastian Okser
Hey, your tip put me very far in the right direction. I have one further question. It seems that just appending the features in the FeatureUnion as a pipeline step may create havoc: when I use GridSearchCV, the data is going to have some sort of randomness in its order compared to t

Re: [Scikit-learn-general] Final summary for GSoC 2014 posted

2014-08-21 Thread Robert Layton
Great work, and I hope to see it merged soon. Thanks! On 17 August 2014 23:54, Issam wrote: > Hi all, > > I finished writing the final summary of my work for GSoC 2014. It is > posted here: http://issamlaradji.blogspot.com/ > > Thank you! > > Best regards, > --Issam Laradji > > > -

Re: [Scikit-learn-general] [GSoC] Wrap up post

2014-08-21 Thread Robert Layton
Really interesting work, well done in GSoC! On 22 August 2014 09:35, Manoj Kumar wrote: > Hi, > > A quick wrap up post about my Summer of Code > > http://manojbits.wordpress.com/2014/08/21/gsoc-the-end-of-another-journey/ > > > -- > Godspeed, > Manoj Kumar, > Mech Undergrad > http://manojbits.w

[Scikit-learn-general] [GSoC] Wrap up post

2014-08-21 Thread Manoj Kumar
Hi, A quick wrap up post about my Summer of Code http://manojbits.wordpress.com/2014/08/21/gsoc-the-end-of-another-journey/ -- Godspeed, Manoj Kumar, Mech Undergrad http://manojbits.wordpress.com

[Scikit-learn-general] ElasticnetCV crash on 64-bit Linux

2014-08-21 Thread László Sándor
Hi, the OS denied me memory when running CV in the script below. I am still investigating whether it was a mistake of the scheduler on the server. As far as I can tell, the process had access to 240 GB of memory, yet it reproducibly crashes upon using 120035176K, with the error message below. I paste my conda info outp

Re: [Scikit-learn-general] [GSOC] Wrap up blog post

2014-08-21 Thread Robert Layton
Great effort Hamzeh -- your GSoC has dealt with problems everyone has put off until later and helped the community a lot, thanks! On 19 August 2014 11:32, Hamzeh Alsalhi wrote: > Hello, I am wrapping up my final blogpost and I want to say that this was > an awesome summer of code! It has been a

Re: [Scikit-learn-general] Pipeline - Convert to dense

2014-08-21 Thread Sebastian Raschka
Hi, Zoraida, thanks for the follow up! I went with a short, custom ColumnSelector class, but the itemgetter is even nicer. Best, Sebastian On Aug 21, 2014, at 2:57 PM, ZORAIDA HIDALGO SANCHEZ wrote: > Sebastian, > > a few days ago, I asked a very similar question and I got this link as a >

Re: [Scikit-learn-general] Pipeline - Convert to dense

2014-08-21 Thread ZORAIDA HIDALGO SANCHEZ
Sebastian, a few days ago, I asked a very similar question and I got this link as a response: https://github.com/scikit-learn/scikit-learn/issues/2034 I think that you could try something similar. Best, Zoraida.- On 21/08/14 18:48, "Sebastian Okser" wrote: >I am trying to use the pipe

[Scikit-learn-general] Pipeline - Convert to dense

2014-08-21 Thread Sebastian Okser
I am trying to use a Pipeline that combines CountVectorizer, TfidfTransformer and a random forest. However, the output of the second step is a sparse matrix, and the random forest requires a dense one. How can I add a step to allow for a conversion of the matrix from sparse to dense, using something a
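A minimal sketch of one way to answer this question: a small custom pipeline step whose transform() calls toarray() on sparse input. The class name DenseTransformer is hypothetical and does not come from the thread. It relies only on the fit/transform convention that Pipeline steps follow; in real use it would normally also subclass sklearn.base.BaseEstimator and TransformerMixin. Densifying a large text matrix can need a lot of memory.

```python
class DenseTransformer:
    """Hypothetical pipeline step: turn sparse input into a dense array."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        # scipy sparse matrices expose toarray(); dense input passes through.
        return X.toarray() if hasattr(X, "toarray") else X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

Such a step would sit between the TF-IDF step and the classifier in the list of pipeline steps.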

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Arnaud Joly
If you set n_jobs to XXX, it will spawn XXX threads or processes. Thus, you will need to ask for XXX cores. Note that it’s often possible to retrieve XXX in your script using os.environ. If you use fewer than the XXX cores, then you won’t use all the available CPU. If you ask for more than XXX cor
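A rough sketch of the os.environ idea. The variable names are assumptions that depend on the scheduler: SLURM_CPUS_PER_TASK is set by Slurm and NSLOTS by Grid Engine, and other systems use different names. The helper reserved_cores is hypothetical.

```python
import os


def reserved_cores(env=os.environ,
                   names=("SLURM_CPUS_PER_TASK", "NSLOTS"),
                   default=1):
    """Return the core count reserved by the scheduler, or `default`."""
    for name in names:
        value = env.get(name)
        # Accept only positive integers; skip unset or malformed values.
        if value and value.isdigit() and int(value) > 0:
            return int(value)
    return default


# The result can then be passed on, e.g. GridSearchCV(..., n_jobs=n_jobs).
n_jobs = reserved_cores()
```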

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Sheila the angel
I still have the following doubt: I understand that n_jobs "should be depending on the number of cpu cores available on your machine". But I am running code in a grid computing environment where I have to specify the number of CPU cores in advance. Does this mean if I (reserve 64 cores and) specify n_j

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Lars Buitinck
2014-08-21 13:44 GMT+02:00 Joel Nothman : > I think RandomForestClassifier, using multithreading in version 0.15, should > work nested in multiprocessing. It would work, but the p * n threads from p processes using n threads each would still compete for the cores, right? -

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Joel Nothman
On 21 August 2014 21:46, Gael Varoquaux wrote: > On Thu, Aug 21, 2014 at 09:44:37PM +1000, Joel Nothman wrote: > > I think RandomForestClassifier, using multithreading in version 0.15, > should > > work nested in multiprocessing. > > Good point, as it uses threading. Thus, for version 0.15, what

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Gael Varoquaux
On Thu, Aug 21, 2014 at 09:44:37PM +1000, Joel Nothman wrote: > I think RandomForestClassifier, using multithreading in version 0.15, should > work nested in multiprocessing. Good point, as it uses threading. Thus, for version 0.15, what I just said was irrelevant. G

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Joel Nothman
On 21 August 2014 21:39, Gael Varoquaux wrote: > On Thu, Aug 21, 2014 at 12:32:08PM +0200, Sheila the angel wrote: > > 2. If I use the classifier such as RandomForestClassifier where > > 'n_jobs' can be specified, will it make any difference if I specify > > "n_jobs" at the classifier level also-

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Gael Varoquaux
On Thu, Aug 21, 2014 at 12:32:08PM +0200, Sheila the angel wrote: > 2. If I use the classifier such as RandomForestClassifier where > 'n_jobs' can be specified, will it make any difference if I specify > "n_jobs" at the classifier level also- We don't support nested parallelism, unfortunately. G

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Sheila the angel
First, thanks for the reply. @Hames : I understand that n_jobs "should be depending on the number of cpu cores available on your machine". But I am running code in a grid computing environment where I have to specify the number of CPUs in advance. Does this mean if I (reserve 64 cores and) specify n_job

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Lars Buitinck
2014-08-21 12:32 GMT+02:00 Sheila the angel : > 1. What should be the n_jobs value, 8 or (8*4=) 32 ? n_jobs is the number of CPUs you want to use, not the amount of work. (It's a misnomer because the number of jobs/work items is variable; the parameter determines the number of workers performing t
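The distinction between work items and workers can be illustrated with a little arithmetic. This is an idealized sketch, assuming every fit takes the same time and ignoring the final refit on the full data; grid_search_workload is a hypothetical helper, not part of scikit-learn.

```python
import math


def grid_search_workload(n_candidates, n_folds, n_jobs):
    """Return (total fits, idealized rounds of parallel fits)."""
    n_fits = n_candidates * n_folds      # one fit per (candidate, fold) pair
    rounds = math.ceil(n_fits / n_jobs)  # each worker runs one fit at a time
    return n_fits, rounds


# 8 parameter combinations with 4-fold CV on 8 workers
print(grid_search_workload(8, 4, 8))
```

With 8 combinations and 4 folds there are 32 fits, so under these assumptions more than 32 workers would leave some of them idle.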

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Mr Samuel Hames
Hi, 1. The n_jobs parameter controls the number of physical processes started in parallel. It should be set depending on the number of CPU cores available on your machine, independent of the type or size of the CV search you are trying to run. On a typical desktop machine with four cores t

[Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Sheila the angel
Hi, Using GridSearchCV, I am trying to optimize two parameter values. In total, I have 8 parameter combinations and am doing 4-fold cross-validation. I want to run it in a parallel environment. My questions are: 1. What should be the n_jobs value, 8 or (8*4=) 32 ? (I know I can specify n_jobs=-1 but du

Re: [Scikit-learn-general] Optimal Subset Selection Code Contribution

2014-08-21 Thread Mathieu Blondel
There was a thread on the mailing-list a while ago on instance reduction methods. It was decided to not include such methods for the time being as changing n_samples is not supported by transformers or pipelines. It is also not clear yet how such methods would play with grid search, for instance.