[Scikit-learn-general] Scikit-learn mailing list is moving!

2016-05-16 Thread Andreas Mueller
Hi all scikit-learn mailing list subscribers. The scikit-learn mailing list is moving! We say goodbye to scikit-learn-gene...@sourceforge.net and hello to scikit-le...@python.org. Our new home will not feature any advertising and hopefully less down-time. Sorry for the inconvenience. We will clos

Re: [Scikit-learn-general] DPGMM applied to 1-dimensional vector and variance problem

2016-05-12 Thread Andreas Mueller
Hi Johan. Unfortunately there are known problems with DPGMM https://github.com/scikit-learn/scikit-learn/issues/2454 There is a PR to reimplement: https://github.com/scikit-learn/scikit-learn/pull/4802 I didn't know about dpcluster, it seems unmaintained. But maybe something to compare against?

Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Andreas Mueller
How did you evaluate on the development set? You should use "best_score_", not grid_search.score. On 05/12/2016 08:07 AM, A neuman wrote: That's actually what I did, and the difference is way too big. Should I do it without GridSearchCV? I'm just wondering why grid search is giving me overfitted v
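
The two numbers mentioned in this reply measure different things, which a small sketch can make concrete. It assumes synthetic data and an SVC, neither of which comes from the thread, and it imports GridSearchCV from sklearn.model_selection (its location in later releases; at the time of this message it lived in sklearn.grid_search).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for the poster's dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Mean cross-validated score of the best parameter setting,
# computed only on folds of the training data.
cv_score = search.best_score_

# Score of the refitted best estimator on data the search never saw.
dev_score = search.score(X_dev, y_dev)

print(search.best_params_, cv_score, dev_score)
```

A large gap between the two values usually points at how the data was split rather than at the search itself.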

Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-08 Thread Andreas Mueller
Hi Matthias. Can you explain this point again? Is it about the bad __repr__ ? Thanks, Andy On 05/07/2016 08:56 AM, Matthias Feurer wrote: Dear Joel, Thank you for taking the time to answer my email. I didn't see the PR on this topic, thanks for pointing me to that. I can see your points with

Re: [Scikit-learn-general] tSNE assertion errors

2016-04-21 Thread Andreas Mueller
Can you please report this on the issue tracker? Thanks! On 04/18/2016 09:28 AM, leg...@web.de wrote: Thanks for your response Alexander! Here is a simplified version of my script applied to the MNIST data set. It wasn't clear from my first mail but I don't want to train it incrementally but in

Re: [Scikit-learn-general] VotingClassifier

2016-04-21 Thread Andreas Mueller
We could add a "make_voting_classifier" function. Which would at least be consistent. On 04/19/2016 06:25 PM, Sebastian Raschka wrote: > Hi, Saddy, > > the initial implementation did something like that, however, as far as I can > remember, the “majority vote” was in favor of the “tuples” (we di

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-18 Thread Andreas Mueller
int, anything after June 17 works for > me. I was thinking to come hang around during ICML, even if I might > not be able to afford the conference. > > Cheers, > Vlad > > On Tue, Apr 12, 2016 at 11:39 AM, Andreas Mueller wrote: >> So should we pick another or possibly an addit

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-14 Thread Andreas Mueller
doing a quick check that the classifier *wasn't* perfectly accurate as claimed by the grid search. On Thu, Apr 14, 2016 at 3:38 AM, Andreas Mueller <t3k...@gmail.com> wrote: The 280k were the start of the sequence, while the 70k were from a shuffled bit, right?

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-14 Thread Andreas Mueller
accuracy) with 3K and 70K random samples but changes to perfect classification for 280K samples. I don't have the data on this computer so I can't test it right now, though. Juan. On Wed, Apr 13, 2016 at 8:51 AM, And

Re: [Scikit-learn-general] Class Weight Random Forest Classifier

2016-04-14 Thread Andreas Mueller
On 04/14/2016 05:04 AM, Mamun Rashid wrote: But reducing the threshold from 0.5 would simply increase false positives and increasing will give rise to false negative. Right ? Reducing will increase false positives and reduce false negatives. So it's an easy way to trade off false positives and
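
The trade-off described here can be checked by counting errors at a few thresholds. This is a minimal sketch with made-up labels and probabilities; `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, proba, threshold):
    """Return (false_positives, false_negatives) at a probability threshold."""
    fp = sum(1 for t, p in zip(y_true, proba) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, proba) if p < threshold and t == 1)
    return fp, fn

# Made-up labels and positive-class probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
proba = [0.1, 0.2, 0.35, 0.45, 0.4, 0.55, 0.8, 0.6, 0.3, 0.05]

for threshold in (0.7, 0.5, 0.3):
    fp, fn = confusion_counts(y_true, proba, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, lowering the threshold from 0.7 to 0.3 moves the counts from 0 false positives and 3 false negatives to 3 false positives and 0 false negatives.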

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Andreas Mueller
Have you tried to "score" the grid-search on the non-training set? The cross-validation is using stratified k-fold while your confirmation used the beginning of the dataset vs the rest. Your data is probably not IID. On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote: Hi all, TL;DR: when I ru

Re: [Scikit-learn-general] Class Weight Random Forest Classifier

2016-04-12 Thread Andreas Mueller
Another possibility is to threshold the predict_proba differently, such that the decision maximizes whatever metric you have defined. On 03/15/2016 07:44 AM, Mamun Rashid wrote: Hi All, I have asked this question couple of weeks ago on the list. I have a two class problem where my positive cl
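
One way to read this suggestion is to scan candidate cut-offs on held-out data and keep the one that scores best on the chosen metric. The sketch below assumes F1 as the metric and uses made-up validation values; `f1_at` and `best_threshold` are hypothetical helpers.

```python
def f1_at(y_true, proba, threshold):
    """F1 score of the positive class when predicting 1 for proba >= threshold."""
    tp = sum(1 for t, p in zip(y_true, proba) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, proba) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, proba) if p < threshold and t == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, proba, metric=f1_at):
    """Try each observed probability as a cut-off and keep the best one."""
    candidates = sorted(set(proba))
    return max(candidates, key=lambda thr: metric(y_true, proba, thr))

# Made-up validation labels and predict_proba[:, 1] values (illustration only).
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.05, 0.2, 0.3, 0.35, 0.4, 0.7, 0.45, 0.9]

threshold = best_threshold(y_val, p_val)
print(threshold, f1_at(y_val, p_val, threshold))
```

The threshold should be picked on validation data that is separate from the final test set, otherwise the reported metric is optimistic.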

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Andreas Mueller
So should we pick another or possibly an additional date? Will anyone be in NYC for ICML / UAI / COLT? On 04/12/2016 03:56 AM, Alexandre Gramfort wrote: >> Sorry, ICML is at the same dates as the big brain imaging conference, so >> I will not be able to attend (neither the conference, nor a sprint

Re: [Scikit-learn-general] [scikit-learn-general] Why sklearn RandomForest model take a lot of disk space after save?

2016-04-11 Thread Andreas Mueller
Which version of scikit-learn are you using? We recently (0.17) removed storing of data point indices in trees which greatly reduced the size in some cases. On 04/10/2016 09:28 AM, Piotr Płoński wrote: Thanks for comments! I put more details of my problem here http://stackoverflow.com/questio

Re: [Scikit-learn-general] Pickling custom Transformers in a Pipeline

2016-04-05 Thread Andreas Mueller
What's the type of self.custom? Also, you can step into the debugger to see which function it is that can not be pickled. On 04/05/2016 04:14 PM, Fred Mailhot wrote: Hi all, I've got a pipeline with some custom transformers that's not pickling, and I'm not sure why. I've had this previous

Re: [Scikit-learn-general] GraphLab to scikit-learn migration help

2016-04-05 Thread Andreas Mueller
Hi Andre There are no pre-trained neural nets (and no convolutional neural nets at all) in scikit-learn. Check out sklearn-theano, nolearn or keras. The knn is pretty straight-forward from the docs. Cheers, Andy On 04/05/2016 10:54 AM, André Cruz wrote: > Hello all. > > I've been using GraphLab

Re: [Scikit-learn-general] yet another parameter for sklearn.tree.DecisionTreeClassifier()

2016-04-05 Thread Andreas Mueller
pping using decrease in objective would be more helpful btw ;) Andy On 04/05/2016 07:39 AM, Boris Kulchitsky wrote: Hi Andreas, I suggest adding one more parameter to sklearn.tree.DecisionTreeClassifier() *max_feature_reuse* : int or None, optional (default=None) The maximum number of ti

Re: [Scikit-learn-general] DOI for scikit-learn

2016-04-04 Thread Andreas Mueller
g On Apr 1, 2016, at 16:10, Andreas Mueller <t3k...@gmail.com> wrote: Hi all. I realized there is no DOI for scikit-learn releases. Should we create one? I've been talking with some agencies that require projects to have a DOI. The benefits could be easie

[Scikit-learn-general] DOI for scikit-learn

2016-04-01 Thread Andreas Mueller
Hi all. I realized there is no DOI for scikit-learn releases. Should we create one? I've been talking with some agencies that require projects to have a DOI. The benefits could be easier and more accurate citations of the project (not the paper). A possible disadvantage could be splitting citatio

Re: [Scikit-learn-general] using Support vector regression/gaussian regression with a Pearson VII function kernel

2016-04-01 Thread Andreas Mueller
t when I understand correctly, the autocorrelation function is just the “normalized” covariance kernel, thus it may be possible to provide custom kernels here as well? If not, it may be interesting to re-factor it a little bit and borrow the code from SVM to

Re: [Scikit-learn-general] using Support vector regression/gaussian regression with a Pearson VII function kernel

2016-03-31 Thread Andreas Mueller
Hi. What do you mean by Gaussian regression? You can specify your own kernels for SVMs, but it will be a bit slower. Cheers, Andy On 03/28/2016 09:40 PM, Amita Misra wrote: Hi, I was using weka earlier for support vector regression and gaussian regression I am now switching to scikit and wa
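
A custom kernel can be passed to an SVM as a Python callable that returns the Gram matrix. The sketch below is an assumption-laden illustration: the formula is the commonly cited form of the Pearson VII universal kernel and should be checked against your own reference, and the data, `sigma`, `omega` and `C` values are made up.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.svm import SVR

def pearson_vii_kernel(X, Y, sigma=1.0, omega=1.0):
    """Gram matrix for a Pearson VII style kernel (formula is an assumption)."""
    dist = pairwise_distances(X, Y, metric="euclidean")
    scaled = 2.0 * dist * np.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel()

# SVR calls the kernel as kernel(X, Y) and expects an
# (n_samples_X, n_samples_Y) array back.
model = SVR(kernel=pearson_vii_kernel, C=10.0)
model.fit(X, y)
print(model.predict(X[:3]))
```

Because the kernel is evaluated in Python rather than inside libsvm, fitting is slower than with the built-in kernels, which matches the caveat in the reply.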

Re: [Scikit-learn-general] Problem using GridSearch and custom Tokenizer

2016-03-31 Thread Andreas Mueller
Put it in its own file. On 03/29/2016 12:36 PM, Mehdi wrote: I tried this code but it doesn't work, I'm getting the same error. But I'm not explicitly calling pickle.load(something); it happens in the parallelization process. Thanks for trying. I'm looking into this pickling problem in more detail. > From: se.r

Re: [Scikit-learn-general] Pipeline: string categorical data preprocessing

2016-03-28 Thread Andreas Mueller
Hi. In general, please stay on the mailing list. We could make the check_array in FunctionTransformer optional via a parameter. Cheers, Andy On 03/28/2016 01:34 PM, Алексей Драль wrote: Hi Andreas, Nice, I didn't know about make_pipeline before, thank you. I have exactly the situation

Re: [Scikit-learn-general] Speed up Random Forest/ Extra Trees tuning

2016-03-25 Thread Andreas Mueller
On 03/22/2016 03:27 AM, Gilles Louppe wrote: > Unfortunately, the most important parameters to adjust to maximize > accuracy are often those controlling the randomness in the algorithm, > i.e. max_features for which this strategy is not possible. > > That being said, in the case of boosting, I th

Re: [Scikit-learn-general] NMF parallel

2016-03-25 Thread Andreas Mueller
On 03/11/2016 11:25 AM, Roberto Pagliari wrote: Is it possible to use multithreading with non negative matrix factorization? Not currently.

Re: [Scikit-learn-general] Pipeline: string categorical data preprocessing

2016-03-25 Thread Andreas Mueller
This is very common but currently not that easy. There is a fix here: https://github.com/scikit-learn/scikit-learn/pull/6559 In the meantime, I think the easiest way is to use pandas' get_dummies function. On 03/19/2016 02:17 PM, Алексей Драль wrote: Hi there, I have a data set which contain
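
The pandas route mentioned here looks roughly like this. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Toy frame with one string-valued categorical column (made-up data).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
```

One caveat with this approach: encoding training and test frames separately can produce different columns when a category is missing from one of them, so the columns may need to be aligned afterwards.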

Re: [Scikit-learn-general] GSoC suggestions : work on various stalled PRs and issues

2016-03-25 Thread Andreas Mueller
On 03/25/2016 11:11 AM, Raghav R V wrote: > Hey Maniteja, > > I took a look at your proposal. As I said before I feel it is a bit > broad and you should try to narrow it down to a good theme. > > Since you have chosen more than one PRs which are missing value > related, I have a suggestion for

Re: [Scikit-learn-general] Scikit-learn standards for serializing/saving objects

2016-03-24 Thread Andreas Mueller
Can you give a simple example for reproducing this problem? I haven't heard of this particular issue. On 03/23/2016 12:47 PM, Keith Lehman wrote: Hi: I’m fairly new to scikit-learn, python, and machine learning. This community has built a great set of libraries though, and is actually a larg

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-03-19 Thread Andreas Mueller
tions, and mine somewhere in-between ;) > > > On Mar 13, 2016, at 5:02 PM, Andreas Mueller wrote: > > > > Just bought the book on amazon ;) > > It's interesting. Have you read Marslands book by any chance? > > It has a somewhat similar approach. My book will a

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-03-13 Thread Andreas Mueller
erial? > Will do. The deadline is not that far away (March 21, right?). Do you already > have in mind what you’d like to talk about in particular? > > Best, > Sebastian > >> On Mar 10, 2016, at 7:59 PM, Andreas Mueller wrote: >> >> Sebastian: looks like it will b

Re: [Scikit-learn-general] Restrictions on feature names when drawing decision tree

2016-03-13 Thread Andreas Mueller
Try escaping the &. On 03/12/2016 02:57 PM, Raphael C wrote: > The code snippet should have been > > > reg = DecisionTreeRegressor(max_depth=None,min_samples_split=1) > reg.fit(X,Y) > scores = cross_val_score(reg, X, Y) > print scores > dot_data = StringIO() > tree.export_graphviz(reg, out_file=do

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-03-10 Thread Andreas Mueller
> > Hi everyone. > > > > I'll definitely be happy to help on the tutorial! > > > > On Mon, Feb 22, 2016 at 11:41 AM, Andreas Mueller > wrote: > > Who's going? > > I'll definitely be there and am happy to do a tutorial. > > Who

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-03-08 Thread Andreas Mueller
Hey Guillaume. If it is a couple of hours, I'm not sure it is worth adding. You can probably aggressively subsample or just do fewer iterations (like, one pass over the data) How do you run MiniBatchKMeans? Cheers, Andy On 03/08/2016 03:21 PM, Guillaume Lemaître wrote: Hi, I made a pull-requ

Re: [Scikit-learn-general] scikit-learn in Julia

2016-03-08 Thread Andreas Mueller
On 03/07/2016 04:47 PM, Cedric St-Jean wrote: > >> There is already Pandas.jl, Stan.jl, MATLAB.jl and Bokeh.jl following > >> that trend. > >That is interesting. Were they done by people associated with the > >original projects? > > As far as I can tell, no, they weren't. Stan.jl and Bokeh.jl are

Re: [Scikit-learn-general] scikit-learn in Julia

2016-03-07 Thread Andreas Mueller
On 03/07/2016 03:13 PM, Cedric St-Jean wrote: > There is already Pandas.jl, Stan.jl, MATLAB.jl and Bokeh.jl following > that trend. That is interesting. Were they done by people associated with the original projects? MATLAB.jl ? And mathworks was fine with that? > > Maybe... NotScikitLearn.jl?

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-03-07 Thread Andreas Mueller
Are any more core devs planning to attend? Jake? Kyle? Olivier? Gael? Vlad? On 02/22/2016 05:48 PM, Andreas Mueller wrote: Hi Nelson. There will be a scikit-learn sprint :) Not sure how many other core-devs will be there, though. Cheers, Andy On 02/22/2016 05:35 PM, Nelson Liu wrote: Hi

Re: [Scikit-learn-general] drawing ellipses around clusters in mean shift clustering

2016-03-07 Thread Andreas Mueller
Hi. The clusters in mean shift can have arbitrary shapes. So the ellipses would be overlapping. Look at the clusters in this graph: http://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html With the right parameter settings, mean shift could "correctly" cluster the first two

Re: [Scikit-learn-general] "In-bag" for RandomForest*

2016-03-07 Thread Andreas Mueller
Hi Ariel. We are not storing them any more because of memory issues, but you can recover them using the random state of the tree: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/forest.py#L76 > indices = _generate_sample_indices(tree.random_state, n_samples) Hth, Andy
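
The idea of recovering the bootstrap sample from a seed can be sketched with NumPy alone. This mirrors what the private helper is understood to have done at the time (seeded draws with replacement); that reading is an assumption, the helper is private and may differ between versions, and `bootstrap_indices` and the seed value are hypothetical.

```python
import numpy as np

def bootstrap_indices(seed, n_samples):
    """Draw n_samples indices with replacement from a seeded generator."""
    rng = np.random.RandomState(seed)
    return rng.randint(0, n_samples, n_samples)

n_samples = 10
seed = 42  # stands in for the integer seed stored on a fitted tree

in_bag = bootstrap_indices(seed, n_samples)
# Samples never drawn are the out-of-bag samples for that tree.
out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)
print(in_bag, out_of_bag)
```

For exact reproduction of a fitted forest, call the helper shipped with the installed scikit-learn version rather than this stand-in.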

Re: [Scikit-learn-general] Explanation for ordinary least squares formula

2016-03-07 Thread Andreas Mueller
Have a look at the book "Elements of statistical learning". http://statweb.stanford.edu/~tibs/ElemStatLearn/ On 03/06/2016 02:37 PM, Cayman Ag wrote: > Hello, I have a basic question regarding the formula on Ordinary Least > Squares 1.1.1. from this page: > > > scikit-learn.org/stable/modules/lin

Re: [Scikit-learn-general] scikit-learn in Julia

2016-03-07 Thread Andreas Mueller
Hi Cedric. I'm not sure about the naming of the package. As long as there is no active involvement of the scikit-learn developers in this package, I'd rather not have "our" name on it. Gael? The docs should be BSD-licensed (though I'm not sure that is the best license for docs). The inline doc

Re: [Scikit-learn-general] GSoC Project Proposal: Reinforcement Learning Module

2016-03-02 Thread Andreas Mueller
On 03/02/2016 02:21 PM, Michał Koziarski wrote: > As far as I can tell, except PyBrain (which doesn't seem to be > actively developed) there are no reinforcement learning libraries in > Python. I was wondering if community would be interested in using one > and making it a part of scikit-learn

[Scikit-learn-general] circle ci access and setup

2016-03-02 Thread Andreas Mueller
Hi all. Does anyone know how we set up circle ci for scikit-learn? Vighnesh set up circle ci for the scikit-learn template project, but that required authorizing circle ci to access his account. That grants circle ci write access. If you look at the scikit-learn contrib page, you can see the autho

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-02-29 Thread Andreas Mueller
The extraction is in the source. I have to include the part to measure the timing. Let me know if that makes sense. Cheers, On 23 February 2016 at 20:39, Nadim Farhat <nadim.far...@gmail.com> wrote: Hi Andreas, Sorry for jumping into the conversation and getting

Re: [Scikit-learn-general] ValueError: numpy.dtype has the wrong size, try recompiling

2016-02-25 Thread Andreas Mueller
idn't work either. I don't want to move to Anaconda, because I have a number of other packages I use already set up via pip and would prefer to continue on that route. ---- *From:* Andreas Mueller *Sent:* Thursda

Re: [Scikit-learn-general] ValueError: numpy.dtype has the wrong size, try recompiling

2016-02-25 Thread Andreas Mueller
How did you install numpy, scipy and scikit-learn? I guess using wheels for all (and not compiling anything) should work. Or use anaconda. Also make sure all the libraries you are installing work in the same python environment. On 02/25/2016 10:40 AM, Laura Fava wrote: Hi, I have installed

Re: [Scikit-learn-general] annoying ad in scikit-learn mailing list

2016-02-25 Thread Andreas Mueller
Yes there is. We wanted to move to python.org, from which we didn't hear back, or scipy.org, from which I think we didn't hear back either. I suggested moving to google groups, but I think there was some opposition to that. Andy On 02/24/2016 07:20 PM, Hai Nguyen wrote: Hi, Is it only me o

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-02-23 Thread Andreas Mueller
On 02/23/2016 05:41 AM, Guillaume Lemaître wrote: > That's a point :D > What would be the requirement regarding the dataset. Does it need to > be academic dependent? > > Other solution kinda crazy: scikit-learn currently has 555 > contributors. If each of them take picture of some objects prede

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-02-22 Thread Andreas Mueller
Sebastian > On Feb 22, 2016, at 12:11 PM, Manoj Kumar <manojkumarsivaraj...@gmail.com> wrote: > > Hi everyone. > > I'll definitely be happy to help on the tutorial! > > On Mon, Feb 22, 2016 at 11:41 AM, Andreas Mueller

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-02-22 Thread Andreas Mueller
On 02/22/2016 02:04 PM, Guillaume Lemaitre wrote: > Maybe the simplest one should be to have texton (patches 9x9) with a PCA > behind then the clustering. That would be the one without skimage dependences. > yeah... but what dataset? Actually one that has unequal size images would be nice, but

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-02-22 Thread Andreas Mueller
On 02/22/2016 12:41 PM, Gael Varoquaux wrote: >> For any particular application (I did bag of visual words), creating an >> implementation using the kmeans or sparse coding in scikit-learn >> is only a couple of lines (you can find my visual bow for per-superpixel >> descriptors here https://gith

Re: [Scikit-learn-general] How you free up memory or handle it while fitting/cross-validating model in Scikitlearn?

2016-02-22 Thread Andreas Mueller
On 02/17/2016 02:25 PM, muhammad waseem wrote: > @Sebastian: I have tried running it by using n_jobs=2 and you were > right it uses around 27% of the RAM. > Does this mean I can only use max n_jobs=8 for my case (obviously this > will also depend on the number of estimators, more will require m

Re: [Scikit-learn-general] Implementation of Bag-of-Features

2016-02-22 Thread Andreas Mueller
Hi Guillaume. I was a big user of BoW myself, but I don't think it should go into scikit-learn. BoW doesn't really operate on a "flat" dataset, as scikit-learn usually does. It works on groups of data points. Each sample is usually a concatenation of feature vectors, which you summarize as a h

Re: [Scikit-learn-general] Sampling in grid_search randomized_grid_search

2016-02-22 Thread Andreas Mueller
You can just do this via a CV object. For example, use StratifiedShuffleSplit(y, n_iter=5, train_size=.1, test_size=.1) and your training and test sets will be randomly sampled, disjoint 10% subsets of the data, repeated 5 times. On 02/19/2016 11:42 AM, Gael Varoquaux wrote: > That won't work, as it is modi
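
A runnable version of this suggestion, written against the later sklearn.model_selection API, where the size arguments are train_size and test_size and the repetition count is n_splits (n_iter in the older sklearn.cross_validation class). The data is synthetic and only there to make the sketch self-contained.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, balanced two-class labels (an assumption for the demo).
X = np.arange(400).reshape(200, 2)
y = np.array([0, 1] * 100)

splitter = StratifiedShuffleSplit(
    n_splits=5, train_size=0.1, test_size=0.1, random_state=0
)
splits = list(splitter.split(X, y))

for train_idx, test_idx in splits:
    # Within one split the two index sets never overlap.
    assert not set(train_idx) & set(test_idx)
    print(len(train_idx), len(test_idx))
```

The same `splitter` object can be passed as the `cv` argument of GridSearchCV or RandomizedSearchCV, so each candidate is fitted on 10% of the data and scored on a different 10%.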

Re: [Scikit-learn-general] Query about GSoC 2016

2016-02-22 Thread Andreas Mueller
Hi Atharva. I think the consensus among the core people and possible mentors is that we would only accept a small number of students (probably 0 or 1). I don't think we currently have a list of projects, and it will likely depend on the interests of the applicants. We have few people that have e

Re: [Scikit-learn-general] Adding EarthRegressor

2016-02-22 Thread Andreas Mueller
Hi Devashish. I think we're still interested, though it is a bunch of work to include pyearth, and there are probably some non-trivial decisions to make on what to include. Cheers, Andy On 02/20/2016 02:40 AM, Devashish Deshpande wrote: Hi everyone, I was browsing through the projects that

Re: [Scikit-learn-general] [Matplotlib-users] Scipy2016: call for proposals

2016-02-22 Thread Andreas Mueller
Who's going? I'll definitely be there and am happy to do a tutorial. Who's in? On 02/22/2016 04:15 AM, Nelle Varoquaux wrote: Dear all, SciPy 2016, the Fifteenth Annual Conference on Python in Science, takes place in Austin, TX on July, 11th to 17th. The conference features two days of tuto

Re: [Scikit-learn-general] Python BitVector and Scikit-learn data representation

2016-02-11 Thread Andreas Mueller
On 02/11/2016 08:00 AM, Sanjay Rawat wrote: Hi, I have an on-going project, implemented in Python. I am generating vectors (containing 0s, 1s) that are represented as bit-vectors using the Python "BitVector" module. I am considering applying some clustering algorithm to get some idea about these v

Re: [Scikit-learn-general] Using train_test_split with images from my local directory

2016-02-10 Thread Andreas Mueller
On 02/10/2016 12:52 PM, Zeyad Abdelmottaleb wrote: I defined a custom load_func that resize after imread and used it in ImageCollection class, do I need to change to greyscale? No. You need to provide the traceback and your code (ideally with a way to reproduce) to get help. Also, try stacko

[Scikit-learn-general] Fwd: Re: [Numpy-discussion] Numpy 1.11.0b2 released

2016-02-10 Thread Andreas Mueller
There is numpy beta, and I think we haven't tested against it. Apparently pandas tests against numpy master continuously, I think we should do that too. There are new test failures in the beta, which we should fix (and possibly let them know if they are weird). See this issue: https://github

Re: [Scikit-learn-general] Using train_test_split with images from my local directory

2016-02-10 Thread Andreas Mueller
Your images have different sizes. For RandomizedPCA to work, they all need to have the same size. On 02/10/2016 12:00 AM, Zeyad Abdelmottaleb wrote: Stefan, I’ve tried this method and I’m getting this error while implementing RandomizedPCA: "setting an array element with a sequence". Help? R

Re: [Scikit-learn-general] Random forest low score on testing data

2016-02-10 Thread Andreas Mueller
The problem is really how you do cross-validation. On 02/09/2016 11:47 PM, muhammad waseem wrote: Thanks Luca and Andreas, the idea behind this is to predict a weather parameter using some other parameters. You still think it will be difficult to solve with Random Forest as it is not really

Re: [Scikit-learn-general] Starting Contribution

2016-02-10 Thread Andreas Mueller
Please check out the contributor guidelines: http://scikit-learn.org/dev/developers/index.html On 02/10/2016 06:13 AM, jayaganesh.k wrote: > Hi, > > This is Jayaganesh K. I've been using scikit for the past couple of > months in all classifiers and now I think its my duty to contribute back > to t

Re: [Scikit-learn-general] Random forest low score on testing data

2016-02-09 Thread Andreas Mueller
values in two files), I use 3 years' (first file) worth of data for training and one year's worth of data (second file) for testing. Am I doing it correctly? Any ideas? On Tue, Feb 9, 2016 at 9:01 PM, Andreas Mueller <t3k...@gmail.com> wrote: How did you create the hold

Re: [Scikit-learn-general] Random forest low score on testing data

2016-02-09 Thread Andreas Mueller
How did you create the hold-out test data? Before or after shuffling? On 02/09/2016 03:22 PM, muhammad waseem wrote: Hi Andreas, Thanks for your reply. I have already shuffled my data so it is no longer ordered, but still no luck. Any other suggestions? On Tue, Feb 9, 2016 at 8:16 PM

Re: [Scikit-learn-general] Random forest low score on testing data

2016-02-09 Thread Andreas Mueller
You should probably use a different cross-validation strategy if your data is ordered. This will give you more realistic cross-validation results. There was a time series CV object somewhere, and by now I think we should include it (this is the third time this comes up in the last 3 days) --
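
For ordered data, one common strategy is forward chaining: every test block lies strictly after its training block. The generator below is a hand-rolled sketch of that idea, not the CV object referred to in the message; `forward_chaining_splits` is a hypothetical name.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train, test) index lists where test always follows train in time."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))
        test = list(range(k * fold, min((k + 1) * fold, n_samples)))
        yield train, test

for train, test in forward_chaining_splits(n_samples=12, n_splits=3):
    print(train, test)
```

Later scikit-learn releases ship a TimeSeriesSplit class in sklearn.model_selection that produces splits of this kind and can be passed directly as `cv`.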

Re: [Scikit-learn-general] What is the relation between the decision function and the predicted probabilities?

2016-02-04 Thread Andreas Mueller
I suggested to remove the decision_function from MLPClassifier, because I feel it has little semantics. It is the output before it goes through a softmax. So softmax(decision_function) = predict_proba. Andy On 02/04/2016 12:11 PM, Gil Rutter wrote: Dear all, Many classification models in Sci
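
The relation softmax(decision_function) = predict_proba can be illustrated without any library. The raw scores below are hypothetical pre-softmax outputs, not values from a fitted model.

```python
import math

def softmax(scores):
    """Map raw per-class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [2.0, 1.0, -1.0]  # hypothetical pre-softmax outputs for three classes
proba = softmax(raw)
print(proba)
```

Softmax keeps the ordering of the scores, so the predicted class is the same whether it is read from the raw outputs or from the probabilities.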

Re: [Scikit-learn-general] incrementally update LDA model

2016-01-28 Thread Andreas Mueller
Gael thought you meant Latent Dirichlet Allocation. The docs you point to are for Linear Discriminant Analysis. Linear Discriminant Analysis has indeed no partial fit. Latent Dirichlet allocation (introduced in 0.17) has. On 01/28/2016 02:05 PM, Mika S wrote: http://scikit-learn.org/0.16/module

Re: [Scikit-learn-general] Project/PR Idea for Faster Automated Model Search

2016-01-27 Thread Andreas Mueller
On 01/27/2016 03:15 PM, Pedro Rodriguez wrote: Thanks for response Andy, The main thing I wanted to get out of asking was: 1. Is this a reasonable thing to try? Yes. 2. Has it been done before? Not for TuPAQ afaik. I would want to make it scikit-learn compatible, but having it be a PR is

Re: [Scikit-learn-general] Project/PR Idea for Faster Automated Model Search

2016-01-27 Thread Andreas Mueller
Hi. Also check out this: https://github.com/scikit-learn/scikit-learn/pull/5491 auto-sklearn (which uses meta-learning) might also be of interest to you. From your description TuPAQ seems to assume that there is some notion of iterations. That is true only for some models. It might be easier to

Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Andreas Mueller
Hi Christian. Can you provide the data and code to reproduce? Best, Andy On 01/26/2016 08:21 AM, Rockenkamm, Christian wrote: Hallo, I have question concerning the Latent Dirichlet Allocation. The results I get from using it are a bit confusing. At first I use about 3000 documents. In the

Re: [Scikit-learn-general] some points on the documentation

2016-01-26 Thread Andreas Mueller
On 01/26/2016 07:17 AM, Panos Louridas wrote: > Hello, > > A few points on the documentation / examples in the scikit-learn site: > > * In the example that plots the decision surface of a decision tree on the > Iris dataset > (http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#exa

Re: [Scikit-learn-general] Building sklearn for different python versions (development)

2016-01-25 Thread Andreas Mueller
On 01/25/2016 02:56 PM, WENDLINGER Antoine wrote: > Hello Jake, > > Thanks for your answer. I am already using conda, and having to clean > and rebuild everything everytime I want to switch versions is what I > want to avoid, since it's a bit long to accomplish. Using two > different source fo

Re: [Scikit-learn-general] Out of sample extensions

2016-01-22 Thread Andreas Mueller
There actually seems to be a pretty well-cited Bengio 2003 NIPS paper: http://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf Maybe we should implement that? On 01/22/2016 12:06 PM, David Collins wrote: Are there any out of sample extensi

Re: [Scikit-learn-general] ANOVA SVM pipeline and cross validation

2016-01-15 Thread Andreas Mueller
On 01/15/2016 01:16 PM, Fabrizio Fasano wrote: > Dear community, > > I would like to use ANOVA + SVM pipeline to check 2 group classification > performances of neuroimaging datasets, > > My questions are: > > 1) In pipeline approach implemented by Scikit-learn > (http://scikit-learn.org/stable/

Re: [Scikit-learn-general] Dropping Python 2.6 compatibility

2016-01-04 Thread Andreas Mueller
On 01/04/2016 03:31 PM, Joel Nothman wrote: > FWIW: features that I have had to remove include format strings with > implicit arg numbers, set literals, dict comprehensions, perhaps > ordered dicts / counters. We are already clandestinely using argparse > in benchmark code. You probably just w

Re: [Scikit-learn-general] Fine tuning parameters of Multi label classification

2016-01-04 Thread Andreas Mueller
You didn't use a OneVsRestClassifier. SGDClassifier itself can only do multi-class, not multi-label. It needs to be GridSearchCV(OneVsRestClassifier(SGDClassifier()), ...) On 01/04/2016 02:15 AM, Startup Hire wrote: Providing the full StackTrace here:[ code in previous email] # Tuning hyper-pa
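
The nesting suggested here can be sketched as follows. The data is synthetic, the grid values are arbitrary, and the imports use sklearn.model_selection from later releases; the key detail is that parameters of the wrapped SGDClassifier are addressed with the `estimator__` prefix.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: Y is a binary indicator matrix (demo only).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

ovr = OneVsRestClassifier(SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))

# Parameters of the wrapped SGDClassifier are reached via "estimator__".
param_grid = {"estimator__alpha": [1e-4, 1e-3, 1e-2]}

search = GridSearchCV(ovr, param_grid, cv=3)
search.fit(X, Y)
print(search.best_params_)
```

With no `scoring` argument the search uses the wrapper's default score, which for multi-label targets is the strict subset accuracy; a different metric can be passed via `scoring` if that is too harsh.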

Re: [Scikit-learn-general] Hard cases of hyperparameter estimation problem

2016-01-04 Thread Andreas Mueller
Check out auto-sklearn and the datasets they use, as well as tpot and this pr: https://github.com/scikit-learn/scikit-learn/pull/5185 On 12/20/2015 12:37 PM, olologin wrote: > Hi folks. > > I'm seeking for hard examples of hyperparameter search problem. I've > just implemented scikit-learn compat

Re: [Scikit-learn-general] sklearn.cross_decomposition.PLSRegression: would like to fix the scaling issue

2016-01-04 Thread Andreas Mueller
Hi Ola. Sorry for the late reply, I've been offline over the holidays. Unfortunately, there is no owner of the cross_decomposition module at the moment, which is probably the main reason it is in a not-great state. You can also have a look at https://github.com/scikit-learn/scikit-learn/issues/

Re: [Scikit-learn-general] Student looking to contribute toward scikit-learn

2016-01-04 Thread Andreas Mueller
Hi Sonali. Your skill-set seems great for a GSoC with scikit-learn. We have found that in recent years, we were quite limited in terms of mentoring resources. Many of the core-devs are very busy, and we already have many contributions waiting for reviews. If you are interested in working on s

Re: [Scikit-learn-general] interrupted dataset download — implement redownload?

2016-01-04 Thread Andreas Mueller
Yeah, having a continuation of the download or retry would be nice, I think. PR welcome. On 12/31/2015 05:58 AM, Toasted Corn Flakes wrote: > Currently, if you interrupt check_fetch_lfw() (which downloads about 200mb of > data), the incomplete lfw-funneled.tgz stays on disk, and running it again

Re: [Scikit-learn-general] scikit-learn-0.17, atlas/unittest issues...

2016-01-04 Thread Andreas Mueller
That's really odd. scikit-learn 0.17 should definitely work with numpy 1.10. This looks more like a linker failure, so it being version dependent in numpy seems strange.. On 01/04/2016 08:00 AM, Joe Cammisa wrote: folks, just to follow up on this in case anyone else runs into the same proble

Re: [Scikit-learn-general] Dropping Python 2.6 compatibility

2016-01-04 Thread Andreas Mueller
Happy new year! I think it would be cool to hear from matplotlib what their experience was. I'm not sure I'm for dropping 2.6 for the sake of dropping 2.6. What would we actually gain? There are two fixes in sklearn/utils/fixes.py that we could remove, I think. Also: what does dropping 2.6 mean?

Re: [Scikit-learn-general] Stacking Classifier

2015-12-16 Thread Andreas Mueller
I think stacking would be a nice contribution. Are you doing loo / cross validation to get the predictions of the first level? Otherwise this is basically "VotingClassifier". And in the "literature" version, all classifiers get the same data. We need to think about how and if we want to support
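
The distinction drawn above (cross-validated first-level predictions versus plain voting) can be sketched as follows. This is only an illustration, not the contribution under discussion: it assumes a recent scikit-learn where `cross_val_predict` lives in `sklearn.model_selection` and accepts `method="predict_proba"`, and the dataset, base estimators, and variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data; every first-level classifier sees the same X, as in the "literature" version.
X, y = make_classification(n_samples=200, random_state=0)

# Illustrative first-level estimators (an assumption, not taken from the thread).
base_estimators = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=20, random_state=0),
]

# Out-of-fold probabilities: each sample is predicted by a model that never saw it.
meta_features = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method="predict_proba")[:, 1]
    for est in base_estimators
])

# The second-level model is trained on the first-level predictions only.
meta_model = LogisticRegression().fit(meta_features, y)
```

Without the out-of-fold step the second level would be fit on predictions the base models made for their own training data, which is the case the message compares to "VotingClassifier".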

Re: [Scikit-learn-general] how to fetch data from mldata

2015-12-09 Thread Andreas Mueller
On 12/09/2015 01:48 PM, Gael Varoquaux wrote: > On Wed, Dec 09, 2015 at 12:33:55PM -0500, Andreas Mueller wrote: >> I guess we use the matlab data which is not required by mldata. >> We could add code that tries to fetch the matlab, and if that doesn't >> work uses the h

Re: [Scikit-learn-general] how to fetch data from mldata

2015-12-09 Thread Andreas Mueller
5 01:17 PM, Luca Puggini wrote: Yes openml seems a better choice. I would really like to have an easy way to import public datasets. I think that fetch_mldata should throw a warning when it is imported if we think this is not working 100%. Best, Luca On Wed, Dec 9, 2015 at 5:35 PM Andre

Re: [Scikit-learn-general] how to fetch data from mldata

2015-12-09 Thread Andreas Mueller
if we think this is not working 100%. Best, Luca On Wed, Dec 9, 2015 at 5:35 PM Andreas Mueller <t3k...@gmail.com> wrote: I guess we use the matlab data which is not required by mldata. We could add code that tries to fetch the matlab, and if that doesn't work

Re: [Scikit-learn-general] how to fetch data from mldata

2015-12-09 Thread Andreas Mueller
I guess we use the matlab data which is not required by mldata. We could add code that tries to fetch the matlab, and if that doesn't work uses the hdf5, with a soft dependency. Not sure we want that as mldata seems somewhat defunct. Maybe openml would be a better source (maybe once they finish thei

Re: [Scikit-learn-general] [X-Post] Requesting support for Open Source initiative in India

2015-12-08 Thread Andreas Mueller
contribute to? On Dec 8, 2015 11:51 PM, "Raghu Mohan" <ra...@hackerearth.com> wrote: Hey Andreas, Would it be possible to have a 1-2 hour slot where core devs could be remotely available? We have the platform built into the hackerearth system

Re: [Scikit-learn-general] Latent Dirichlet Allocation topic-word-matrix and the document-topic-matrix

2015-12-08 Thread Andreas Mueller
Hi Christian. The document-topic-matrix is lda.transform(X), the word-topic-matrix is lda.components_. See http://scikit-learn.org/dev/modules/decomposition.html#latent-dirichlet-allocation-lda "When LatentDirichletAllocation
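
A minimal sketch of where the two matrices named above come from. It assumes a recent scikit-learn, where the number of topics is set with `n_components` (older releases used a different parameter name); the toy corpus and variable names are invented for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, only to give the matrices a shape.
docs = [
    "apples and oranges are fruit",
    "bananas are fruit too",
    "cats and dogs are pets",
    "dogs chase cats",
]
X = CountVectorizer().fit_transform(docs)  # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Document-topic matrix: one row per document, one column per topic.
doc_topic = lda.fit_transform(X)

# components_ holds one row per topic and one column per vocabulary term.
topic_word = lda.components_
```

Calling `lda.transform(X_new)` on unseen documents gives their document-topic rows in the same way.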

Re: [Scikit-learn-general] Supervised principal component analysis in scikit-learn?

2015-12-08 Thread Andreas Mueller
Hi Henry. Please discuss issues like these on the mailing list. Any one particular developer might not have time to respond. Blair's SPC is just "make_pipeline(SelectKBest(), PCA(), LogisticRegression())". So I wouldn't say "it didn't make it through". I'd rather say "it's already implemented".
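
The pipeline quoted above can be run as is. The sketch below only adds toy data and explicit `k` and `n_components` values, which are arbitrary choices for the example and not part of the original message.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: 30 features, only a few of them informative.
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# 1) keep the features most associated with y (supervised screening),
# 2) compute principal components of the kept features,
# 3) fit a classifier on those components.
spc = make_pipeline(
    SelectKBest(k=10),      # k=10 is an arbitrary illustrative value
    PCA(n_components=3),    # so is n_components=3
    LogisticRegression(),
)
spc.fit(X, y)
pred = spc.predict(X)
```

Because the selection step sits inside the pipeline, it is refit on each training fold when the whole object is passed to cross-validation or a grid search.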

Re: [Scikit-learn-general] Analyzer and tokenizer in (Count/TfIdf)Vectorizer

2015-12-07 Thread Andreas Mueller
Hi. I would say what you are doing with lemmatization is not tokenization but preprocessing. You are not creating tokens, right? The tokens are the char n-grams. So what is the problem in using the preprocessing option? I'm not super familiar with the NLP lingo, though, so I might be missing
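
A rough sketch of the suggestion above: pass the lemmatizer as the `preprocessor` and let the vectorizer build character n-grams from its output. The `toy_lemmatize` function and its lookup table are hypothetical stand-ins for a real lemmatizer; note that supplying a custom `preprocessor` replaces the built-in lowercasing and accent stripping, so the function lowercases explicitly.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_lemmatize(text):
    """Hypothetical stand-in for a real lemmatizer (dictionary lookup only)."""
    lemmas = {"running": "run", "ran": "run", "cats": "cat"}
    # Lowercase here: a custom preprocessor overrides the default lowercasing.
    return " ".join(lemmas.get(word, word) for word in text.lower().split())


# The tokens are the character n-grams; lemmatization happens before they are built.
vec = CountVectorizer(
    preprocessor=toy_lemmatize,
    analyzer="char",
    ngram_range=(2, 3),
)
X = vec.fit_transform(["Cats running", "cat ran"])
```

Both toy documents preprocess to the same string, so they end up with identical n-gram counts.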

Re: [Scikit-learn-general] Multi Label classification using OneVsRest Classifier

2015-12-01 Thread Andreas Mueller
Please provide the full traceback. What is the type of y here, and what are its entries? On 11/30/2015 07:45 PM, Startup Hire wrote: Hi Pypers, Hope you are doing well. I am doing multi label classification in which my X and Y are sparse matrices with Y properly binarized. Though my Y has m

Re: [Scikit-learn-general] Import error for Robust scaler

2015-12-01 Thread Andreas Mueller
You are likely using an old version of scikit-learn that doesn't include RobustScaler. Update your installation. On 11/28/2015 08:18 PM, Sumedh Arani wrote: Dear developers, In my due process to correct am way bug posted in the issues section in github, I tried to work on robust scaler. I t

Re: [Scikit-learn-general] Spectral / Kmeans Clustering taking prohibitively long to run

2015-12-01 Thread Andreas Mueller
You should check which solver is used. There was some odd regression in the time the example takes. On 11/26/2015 03:29 AM, Nelson Liu wrote: Hi everyone, I was modifying the plot_lena_segmentation.py (http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_segmentation.html) example to

Re: [Scikit-learn-general] Jeff Levesque: '.predict_proba()' me tho for smaller datasets

2015-12-01 Thread Andreas Mueller
I don't understand the question. By definition this function provides probability estimates. In the case of SVC, it is possible that these probabilities don't coincide with the prediction. You could make predictions using the probabilities if you'd like. There is no other way to ensure consisten
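
To illustrate the point about SVC, the sketch below derives labels from `predict_proba` and counts how often they differ from `predict`. The data are synthetic and the count depends on the data and the fitted model, so no particular number is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# probability=True fits an extra calibration step on top of the SVM decision values.
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X)                           # shape (n_samples, n_classes)
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
pred = clf.predict(X)                                  # based on the decision function

# Number of samples where the two ways of predicting disagree (may be zero).
n_disagree = int(np.sum(pred != pred_from_proba))
```

Using `pred_from_proba` everywhere is one way to keep labels and probabilities consistent with each other.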

Re: [Scikit-learn-general] "Need Review" tag

2015-12-01 Thread Andreas Mueller
Yeah that was the intention of [MRG]. Though it might be easier to filter by tag. No strong opinion though. On 12/02/2015 12:44 AM, Gael Varoquaux wrote: >> How about adding a "Need Review(s?)(er?)" tag? > For me, it's the '[MRG]' in the PR name. > > --

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Andreas Mueller
cabulary_) and it is also 1900 instead of 1914. I have another program that counts distinct terms and it is 1914 there. Best, Ehsan On Thu, Nov 19, 2015 at 9:36 AM, Andreas Mueller <t3k...@gmail.com> wrote: You shou

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Andreas Mueller
rd frequency cutoff filters? (min_df, max_df) On Thu, Nov 19, 2015 at 8:55 AM, Andreas Mueller <t3k...@gmail.com> wrote: Hi Ehsan. Which version of scikit-learn are you using? And why do you think the vocabulary size is 1860? What is len(tf.vocabulary_)? Andy

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Andreas Mueller
Hi Ehsan. Which version of scikit-learn are you using? And why do you think the vocabulary size is 1860? What is len(tf.vocabulary_)? Andy On 11/18/2015 11:45 PM, Ehsan Asgari wrote: Hi, I am using TfidfVectorizer of sklearn.feature_extraction.text for generating tf-idf matrix of a corpus. H
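
A small sketch of the check asked for above, on a made-up corpus. It also shows two settings that can make the fitted vocabulary smaller than a hand count of distinct terms: the default token pattern ignores single-character tokens, and `min_df` / `max_df` drop terms by document frequency. Whether either applies to the corpus in this thread is not known.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "a bird flew"]

# Default settings: tokens need at least two word characters, so "a" is dropped.
tf_all = TfidfVectorizer().fit(docs)
size_all = len(tf_all.vocabulary_)

# min_df=2 keeps only terms that appear in at least two documents.
tf_cut = TfidfVectorizer(min_df=2).fit(docs)
size_cut = len(tf_cut.vocabulary_)
```

Comparing `sorted(tf_all.vocabulary_)` against the term list from another program is a quick way to see which terms went missing.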

Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 70, Issue 8

2015-11-12 Thread Andreas Mueller
I would really like to hear from people that used the simple averaging for regression. The classification case is something a lot of people use (at least on kaggle), and has been much asked for. It would be good to know if there are consistent improvements for regression, too. There is no reason
