Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-27 Thread Gael Varoquaux
Hi Trevor, This is an interesting question, and I don't have a clear-cut opinion. What you are talking about is, in essence, a trademark issue: the brand "scikit-learn" carries implications about quality and API. We enforce this on the scikit-learn package and would indeed love it if users assoc…

Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-27 Thread Gilles Louppe
Hi Trevor, I am only speaking for myself, not on behalf of the scikit-learn project, but I would be +1 for your project and its use of the -learn suffix. The pros you cite are, in my opinion, more important than the cons. Cheers, Gilles On 28 April 2015 at 05:33, Trevor Stephens wrote: > Hi All, …

[Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-27 Thread Trevor Stephens
Hi All, I've been working for the past month or so on a third-party add-on/plug-in package `gplearn` that uses the scikit-learn API to implement genetic programming for symbolic regression tasks in Python, and maintains compatibility with the sklearn pipeline and gridsearch modules, etc. The reason…
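"Uses the scikit-learn API" has a concrete meaning: an estimator class whose `__init__` only stores parameters, whose `fit` learns state into trailing-underscore attributes and returns `self`, and whose `predict` uses that state. Here is a minimal sketch of such an estimator (the class name `MeanRegressor` and its `shrinkage` parameter are hypothetical, purely for illustration; gplearn's actual estimators are not shown here). The modern `sklearn.model_selection` module is assumed; the 2015-era equivalent was `sklearn.grid_search`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy regressor following the scikit-learn API contract."""

    def __init__(self, shrinkage=0.0):
        # __init__ must only store hyperparameters, no validation or
        # computation, so that get_params/set_params and clone() work.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Learned state gets a trailing underscore by convention.
        self.mean_ = y.mean() * (1.0 - self.shrinkage)
        return self  # returning self enables Pipeline chaining

    def predict(self, X):
        X = np.asarray(X)
        return np.full(X.shape[0], self.mean_)

est = MeanRegressor().fit([[0.0], [1.0]], [2.0, 4.0])
print(est.predict([[5.0]]))  # predicts the (shrunk) training mean
```

Because it follows these conventions, such an estimator drops straight into `Pipeline` and `GridSearchCV` without any extra glue code.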

Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Joel Nothman
I suspect this method is underreported by any particular name, as it's a straightforward greedy search. It is also very close to what I think many researchers do in system development or report in system analysis, albeit with more automation. In the case of KNN, I would think metric learning could…

Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Andreas Mueller
Maybe we would want mRMR first? http://penglab.janelia.org/proj/mRMR/ On 04/27/2015 06:46 PM, Sebastian Raschka wrote: >> I guess that could be done, but has a much higher complexity than RFE. > Oh yes, I agree, the sequential feature algorithms are definitely > computationally more costly. …

Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Sebastian Raschka
> I guess that could be done, but has a much higher complexity than RFE. Oh yes, I agree, the sequential feature algorithms are definitely computationally more costly. > It seems interesting. Is that really used in practice and is there any > literature evaluating it? I am not sure how often…

Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Luca Puggini
I think you can find something more rigorous here: http://orbi.ulg.ac.be/handle/2268/170309 On Mon, Apr 27, 2015 at 11:20 PM, Daniel Homola < daniel.homol...@imperial.ac.uk> wrote: > Hi Luca, > > The reason I asked is because I'm interested in the second problem. Thanks > a lot for the pap…

Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Daniel Homola
Hi Luca, The reason I asked is because I'm interested in the second problem. Thanks a lot for the paper and the suggested params, I'll read it and try them! Has anyone tested these assumptions/parameters rigorously on simulated data, or is this more of a feeling? Thanks again for the quick…

Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Andreas Mueller
That is like one-step look-ahead feature selection? I guess that could be done, but it has much higher complexity than RFE. RFE works for anything that returns "importances", not just linear models. It doesn't really work for KNN, as you say. [I wouldn't say non-parametric models. Trees are prett…
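The point that RFE works for anything exposing importances, not just linear coefficients, can be sketched with a tree ensemble: RFE reads `feature_importances_` from the fitted forest at each elimination step. A minimal sketch (dataset and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic problem with 3 informative features out of 10.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the forest and drops the feature with the
# smallest feature_importances_ until 3 remain.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the 10 input features
```

This is the contrast with KNN: a KNN classifier exposes neither `coef_` nor `feature_importances_`, so RFE has nothing to rank by, which is where wrapper methods like sequential selection come in.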

Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Joel Nothman
I assume you have checked that combine_train_test_dataset produces data of the correct dimensions in both X and y. I would be very surprised if the problem were not in PAA, so check it again: make sure that you test that PAA().fit(X1).transform(X2) gives the transformation of X2. The error seems t…

Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-27 Thread Andreas Mueller
You changed the labels only once, and have a test-set size of 4? I would imagine that is where that comes from. If you repeat over different assignments, you will get 50/50. On 04/27/2015 11:33 AM, Fabrizio Fasano wrote: > Dear Andy, > > Yes, the classes have the same size, 8 and 8 > > this is on…
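The suggestion above, repeating the evaluation over many random label assignments rather than a single one, can be sketched as a small permutation experiment (synthetic data standing in for the poster's 8-vs-8 setup; the modern `sklearn.model_selection` imports are assumed, where the 2015-era code used `sklearn.cross_validation`):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(16, 5)              # 16 samples, like the 8-vs-8 case
y = np.array([0] * 8 + [1] * 8)

cv = StratifiedShuffleSplit(n_splits=10, test_size=4, random_state=0)

# With a single label assignment and test sets of size 4, the chance
# estimate is extremely noisy; averaging over many random relabelings
# converges toward 0.5 for uninformative features.
scores = []
for _ in range(20):
    y_perm = rng.permutation(y)
    scores.append(cross_val_score(LinearSVC(), X, y_perm, cv=cv).mean())
print(np.mean(scores))  # close to 0.5
```

Any single entry of `scores` can sit far from 0.5, which is exactly the bias the original question observed with one assignment and tiny test sets.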

Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Luca Puggini
Hey, I spent quite some time on this problem. 1) If you are interested only in prediction, this is not a big problem. You can preprocess the data with PCA. 2) If you want to understand which variables are important, I suggest you read the paper "Understanding variable importances in forests of r…
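For the prediction-only case (point 1), the PCA preprocessing suggestion amounts to decorrelating the inputs before the forest sees them. A minimal sketch, with an illustrative synthetic dataset whose `n_redundant` features are linear combinations of the informative ones (i.e. highly correlated):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# 10 of the 20 features are redundant, i.e. strongly correlated.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# PCA decorrelates the features; the forest then works on the
# orthogonal components instead of the correlated originals.
model = Pipeline([
    ("pca", PCA(n_components=10)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X, y)
print(model.score(X, y))
```

Note this trades away interpretability: importances are now attached to principal components, not original variables, which is why point 2 needs the different, importance-focused treatment in the cited paper.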

Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
Hi Andreas, Thanks for your response. No, PAA does not change the number of samples; it just reduces the number of features. For example, if the input matrix is X with X.shape = (100, 100) and n_components = 10 in PAA, then the resulting X.shape = (100, 10). Yes, I did try using PAA in the ip…
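The poster's actual PAA code is not shown, but a transformer with the behavior described, (100, 100) in, (100, 10) out, would look roughly like this sketch of Piecewise Aggregate Approximation: average each row over `n_components` equal-length segments. Crucially for the grid-search error above, `transform` must map any number of samples and must never depend on the shape of the data seen in `fit`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PAA(BaseEstimator, TransformerMixin):
    """Sketch of Piecewise Aggregate Approximation:
    (n_samples, n_features) -> (n_samples, n_components)."""

    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from X

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        n_samples, n_features = X.shape
        if n_features % self.n_components:
            raise ValueError("n_features must be divisible by n_components")
        # Split each row into n_components segments and average each one.
        return X.reshape(n_samples, self.n_components, -1).mean(axis=2)
```

Joel's debugging advice maps directly onto this: check that `PAA().fit(X1).transform(X2)` works when `X1` and `X2` have different numbers of rows, which is exactly what `GridSearchCV` does during scoring.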

Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Andreas Mueller
Does PAA by any chance change the number of samples? The error is: ValueError: Found array with dim 37. Expected 19 Interestingly, that happens only in the scoring. Does it work without the grid search? On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote: Hi all, I am trying to use grid search to…

[Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Sebastian Raschka
Hi, I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest I could find was recursive feature elimination (RFE): http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application r…
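Sequential forward selection, the wrapper method being asked about, is the greedy search Joel describes upthread: start from an empty set and, at each step, add the candidate feature that most improves cross-validated accuracy. A minimal sketch (the `forward_select` helper is hypothetical, not a scikit-learn API, and KNN is used precisely because it exposes no coefficients or importances for RFE to rank):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(estimator, X, y, k):
    """Greedy sequential forward selection: grow the feature set one
    feature at a time, always taking the best mean-CV-score addition."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = [(cross_val_score(estimator, X[:, selected + [j]], y,
                                   cv=3).mean(), j) for j in remaining]
        _, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           random_state=0)
print(forward_select(KNeighborsClassifier(), X, y, k=3))
```

The cost Andreas notes is visible in the loop: each of the k steps refits and cross-validates the model once per remaining feature, versus RFE's single importance ranking per elimination step.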

[Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Daniel Homola
Dear all, I've found several articles expressing concerns about using Random Forest with highly correlated features (e.g. http://www.biomedcentral.com/1471-2105/9/307). I was wondering if this drawback of the RF algorithm could somehow be remedied using scikit-learn methods? The above-linked p…

Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-27 Thread Fabrizio Fasano
Dear Andy, Yes, the classes have the same size, 8 and 8. This is one example of the code I used to cross-validate classification (I used StratifiedShuffleSplit here, but I also used other methods such as leave-one-out and simple 4-fold cross-validation, and the result didn't change much). from sklearn.…

[Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
On Mon, Apr 27, 2015 at 4:44 PM, Jitesh Khandelwal wrote: > Hi all, > > I am trying to use grid search to evaluate some decomposition techniques > of my own. I have implemented some custom transformers such as PAA, DFT, > DWT as shown in the code below. > > I am getting a strange "ValueError" whe…

[Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
Hi all, I am trying to use grid search to evaluate some decomposition techniques of my own. I have implemented some custom transformers such as PAA, DFT, and DWT, as shown in the code below. I am getting a strange "ValueError" when I run the code below, and I am unable to figure out the origin of the pro…