Re: [Scikit-learn-general] Distributed RandomForests

2013-04-24 Thread Gilles Louppe
Hi Youssef, Regarding memory usage, you should know that it'll basically blow up if you increase the number of jobs. With the current implementation, you'll need O(n_jobs * |X| * 2) in memory space (where |X| is the size of X, in bytes). That issue stems from the use of joblib which basically forc
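
To make the estimate quoted above concrete, here is a minimal sketch that evaluates n_jobs * |X| * 2 for a few job counts. The function name and the dataset dimensions are hypothetical and chosen only for illustration; the formula itself is the one given in the message, and it is a rough upper-bound estimate rather than a measurement.

```python
def estimated_peak_bytes(x_nbytes: int, n_jobs: int) -> int:
    """Rough peak memory following the O(n_jobs * |X| * 2) estimate above.

    x_nbytes is |X|, the size of the training matrix in bytes.
    """
    if x_nbytes < 0 or n_jobs < 1:
        raise ValueError("x_nbytes must be >= 0 and n_jobs must be >= 1")
    return n_jobs * x_nbytes * 2


# Hypothetical dataset: 1,000,000 samples x 2,000 float32 features.
x_nbytes = 1_000_000 * 2_000 * 4

for n_jobs in (1, 4, 8):
    gib = estimated_peak_bytes(x_nbytes, n_jobs) / 2**30
    print(f"n_jobs={n_jobs}: about {gib:.1f} GiB")
```

Under this estimate, memory grows linearly with n_jobs, so halving the number of jobs halves the expected footprint.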

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-24 Thread Brian Holt
Hi Youssef, You're trying to do exactly what I did. First thing to note is that the Microsoft guys don't precompute the features, rather they compute them on the fly. That means that they only need enough memory to store the depth images, and since they have a 1000 core cluster, computing the feat

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Vlad Niculae
Exactly, I was talking about predict and about the state of the estimator. It seemed much more difficult before I thought about it better :) On Thu, Apr 25, 2013 at 10:54 AM, Mathieu Blondel wrote: > > On Thu, Apr 25, 2013 at 10:26 AM, Vlad Niculae wrote: >> >> If we are talking about the same t

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Mathieu Blondel
On Thu, Apr 25, 2013 at 10:26 AM, Vlad Niculae wrote: > If we are talking about the same thing, you are returning clusters of > samples and features together (ie rows and columns). So if in K-means > we return a 1D array with cluster labels, here the output would be two > arrays, one of (n_sample

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Vlad Niculae
If we are talking about the same thing, you are returning clusters of samples and features together (ie rows and columns). So if in K-means we return a 1D array with cluster labels, here the output would be two arrays, one of (n_samples,) and one of (n_features,). Another alternative would be a li
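
The output shape described above (one label array of length n_samples and one of length n_features) can be sketched in plain Python. This is only an illustration of the data layout, not a proposed scikit-learn API; the helper name and the toy labels are hypothetical.

```python
def bicluster_members(row_labels, col_labels, k):
    """Return (row indices, column indices) belonging to bicluster k."""
    rows = [i for i, label in enumerate(row_labels) if label == k]
    cols = [j for j, label in enumerate(col_labels) if label == k]
    return rows, cols


# Toy example: 4 samples and 3 features, grouped into 2 biclusters.
row_labels = [0, 1, 0, 1]  # shape (n_samples,)
col_labels = [1, 0, 0]     # shape (n_features,)

for k in sorted(set(row_labels)):
    rows, cols = bicluster_members(row_labels, col_labels, k)
    print(f"bicluster {k}: rows {rows}, columns {cols}")
```

In K-means only the first array would exist; the second array is what makes the result a joint clustering of rows and columns.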

[Scikit-learn-general] Distributed RandomForests

2013-04-24 Thread Youssef Barhomi
Hello, I am trying to reproduce the results of this paper: http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with different kinds of data (monkey depth maps instead of humans). So I am generating my depth features and training and classifying data with a random forest with quite s

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Mathieu Blondel
Could you elaborate why it would require a new API? Mathieu On Apr 25, 2013 9:08 AM, "Vlad Niculae" wrote: > The Baader-Meinhof phenomenon in action -- only 2 days ago I saw a > talk about information-theoretic biclustering (aka co-clustering) > applied to opinion mining of video game reviews a

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Vlad Niculae
The Baader-Meinhof phenomenon in action -- only 2 days ago I saw a talk about information-theoretic biclustering (aka co-clustering) applied to opinion mining of video game reviews and the method raised my attention. An efficient implementation would be very nice, but it will definitely require a

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Mathieu Blondel
Hi Kemal, On Thu, Apr 25, 2013 at 6:56 AM, Kemal Eren wrote: > > If you are looking for biclustering algorithms I could certainly do that. > I did my Master's thesis on it and wrote this software: > http://bmi.osu.edu/hpc/software/bibench/. Its biclustering algorithms are > wrappers to existing

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Kemal Eren
Hi Mathieu and team, If you are looking for biclustering algorithms I could certainly do that. I did my Master's thesis on it and wrote this software: http://bmi.osu.edu/hpc/software/bibench/. Its biclustering algorithms are wrappers to existing tools. It would be really nice to have Python/Cython

Re: [Scikit-learn-general] Rotations Code?

2013-04-24 Thread Gael Varoquaux
On Sun, Apr 21, 2013 at 09:36:57PM -0400, Skipper Seabold wrote: > Does anyone have any code for computing rotations of components after > PCA or FactorAnalysis, etc. E.g., varimax? No (apart from ICA that is in scikit-learn), but I would be interested in a varimax code to play with :). G

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Mathieu Blondel
Something I would like to see in the scikit, if someone is looking for an idea, is biclustering: http://en.wikipedia.org/wiki/Biclustering Mathieu

Re: [Scikit-learn-general] question about scikit / sklearn K folds cross validation

2013-04-24 Thread Alexandre Gramfort
hi, I'd use LeaveOneLabelOut where the label contains the sites indices. Basically the question is "do you generalize well to data acquired some place else" Alex On Wed, Apr 24, 2013 at 5:08 PM, Lars Buitinck wrote: > 2013/4/23 John Richey : >> clf.fit(X_train, X_test) > > You should fit on X_t
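
The idea behind the suggestion above, holding out one site at a time and training on all the others, can be sketched without any library. This is a plain-Python illustration of the splitting logic, not scikit-learn's own implementation; the function name and the site assignments are hypothetical. In later scikit-learn releases the class mentioned in the message is available as LeaveOneGroupOut in sklearn.model_selection.

```python
def leave_one_label_out(labels):
    """Yield (held_out_label, train_indices, test_indices) per distinct label."""
    for held_out in sorted(set(labels)):
        train = [i for i, label in enumerate(labels) if label != held_out]
        test = [i for i, label in enumerate(labels) if label == held_out]
        yield held_out, train, test


# Hypothetical setup: six subjects collected at three sites.
sites = [0, 0, 1, 1, 2, 2]
splits = list(leave_one_label_out(sites))

for held_out, train, test in splits:
    print(f"held-out site {held_out}: train {train}, test {test}")
```

A score that drops sharply on the held-out site suggests the model does not generalize to data acquired elsewhere, which is the question posed in the message.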

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Eustache DIEMERT
Hi Alex, If I understand correctly you are using 2 different kinds of features : categorical + ngrams. In a similar situation but in a classification setting a trick that worked reasonably well was to train two different models, one feeding the other. I.e. build a first model out of ngrams/nlp f

Re: [Scikit-learn-general] question about scikit / sklearn K folds cross validation

2013-04-24 Thread Lars Buitinck
2013/4/23 John Richey : > clf.fit(X_train, X_test) You should fit on X_train and y_train, not X_test. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam

[Scikit-learn-general] question about scikit / sklearn K folds cross validation

2013-04-24 Thread John Richey
Hello, I am having difficulty with a cross validation problem, and any help would be much appreciated. I have a large number of research subjects from 15 different data collection sites. I want to assess whether "site" has any influence on the data. It occurred to me that one way to do this

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Olivier Grisel
2013/4/24 Alex Kopp : > Thanks, guys. > > Perhaps I should explain what I am trying to do and then open it up for > suggestions. > > I have 203k training examples each with 457k features. The features are > composed of one-hot encoded categorical values as well as stemmed, TFIDF > weighted unigrams

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
Have you tried tuning the hyper-parameters of the SGDRegressor? You really need to tune the learning rate for SGDRegressor (SGDClassifier has a pretty decent default). E.g. set up a grid search w/ a constant learning rate and try different values of eta0 ([0.1, 0.01, 0.001, 0.0001]). You can also s
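
A minimal sketch of the grid search suggested above, assuming a recent scikit-learn release where GridSearchCV lives in sklearn.model_selection (at the time of this thread it was in a different module). The synthetic data, the number of folds, and the scoring choice are assumptions made for illustration; only the constant learning rate and the eta0 values come from the message.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = 0.5 * rng.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)

# eta0 values taken from the message; the learning rate is held constant.
param_grid = {"eta0": [0.1, 0.01, 0.001, 0.0001]}

search = GridSearchCV(
    SGDRegressor(learning_rate="constant", random_state=0),
    param_grid,
    cv=3,  # assumption: 3-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("best eta0:", search.best_params_["eta0"])
```

The best value depends on the scale of the features, so on real data it is worth standardizing the inputs before searching.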

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Alex Kopp
Thanks, guys. Perhaps I should explain what I am trying to do and then open it up for suggestions. I have 203k training examples each with 457k features. The features are composed of one-hot encoded categorical values as well as stemmed, TFIDF weighted unigrams and bigrams (NLP). As you can proba

Re: [Scikit-learn-general] GSOC idea

2013-04-24 Thread Vlad Niculae
Thank you. Do you have some references prepared? They would be useful. I am not sure if what is in my head is correct, but I think association rule learning is interesting and a kind of method that I would like to see in scikit-learn, as well as finding frequent itemsets. I hope I'm thinking of the

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
2013/4/24 Olivier Grisel > 2013/4/24 Peter Prettenhofer : > > I totally agree with Brian - although I'd suggest you drop option 3) > because > > it will be a lot of work. > > > > I'd suggest you rather should do a) feature extraction or b) feature > > selection. > > > > Personally, I think decisi

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Olivier Grisel
2013/4/24 Peter Prettenhofer : > I totally agree with Brian - although I'd suggest you drop option 3) because > it will be a lot of work. > > I'd suggest you rather should do a) feature extraction or b) feature > selection. > > Personally, I think decision trees in general and random forest in > pa

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather should do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forest in particular are not a good fit for sparse datase

Re: [Scikit-learn-general] GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))

2013-04-24 Thread Mathieu Blondel
On Wed, Apr 24, 2013 at 12:46 PM, Vlad Niculae wrote: > However, I think it would be nice to have some proposals that focus on > internals: consistency, clean up, refactoring of modules that need it > or documentation improvements. As long as the task is measurable, > closed-ended and well-defin

Re: [Scikit-learn-general] GSOC idea

2013-04-24 Thread Şükrü Bezen
Hi Vlad, It looks good to me to focus on the proposal now and look into a mentor later. I am considering collaborative filtering with *user similarity* and *item similarity*, and also *association rule learning* for finding out the general behaviour of a user-item group. I think those two would be