Re: [Scikit-learn-general] Online learning

2012-07-30 Thread Abhi
Abhi writes: > > Olivier Grisel writes: > > > Could you please try to come up with one or two minimalistic > > reproduction scripts for the ch2.fit_transform and LinearSVC.fit > > segfaults? Is it just that it is exhausting memory on your system? Are > > you running a 32bit or a 64bit OS? How

Re: [Scikit-learn-general] Get the LASSO regularization path only? (i.e., just the alphas)

2012-07-30 Thread Alexandre Gramfort
> Sorry, I meant "a way to only calculate the alphas". The problem is that I > need to do this for a large number of datasets, so fewer calculations means > more speed! You mean the position of the kinks in the paths? For this, look at lars_path with method='lasso'. lasso_path uses coordinate descent
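
A minimal sketch of the lars_path suggestion, assuming a recent scikit-learn release and synthetic data (the arrays `X`, `y` and `true_coef` are made up for illustration). With method="lasso", the returned `alphas` are the regularization values at which the active set changes, i.e. the kinks of the path. Note that the call still returns the coefficients along the path as well, so it is not an "alphas only" computation.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic regression problem (illustrative only, not from the thread).
rng = np.random.RandomState(0)
X = rng.randn(50, 8)
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ true_coef + 0.1 * rng.randn(50)

# method="lasso" follows the LASSO path with LARS; `alphas` holds the
# regularization values at the kinks, `coefs` the coefficients at each kink.
alphas, active, coefs = lars_path(X, y, method="lasso")

print("alphas at the kinks:", alphas)
print("coefficient path shape:", coefs.shape)
```

Each column of `coefs` corresponds to one entry of `alphas`, so the path between two kinks can be recovered by linear interpolation.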

Re: [Scikit-learn-general] Get the LASSO regularization path only? (i.e., just the alphas)

2012-07-30 Thread Néstor Espinoza
Sorry, I meant "a way to only calculate the alphas". The problem is that I need to do this for a large number of datasets, so fewer calculations means more speed! 2012/7/30 Néstor Espinoza > Dear all, > > First of all, thanks a lot for your work with the Scikit-learn package, > it is great and

[Scikit-learn-general] Get the LASSO regularization path only? (i.e., just the alphas)

2012-07-30 Thread Néstor Espinoza
Dear all, First of all, thanks a lot for your work with the Scikit-learn package, it is great and saved a lot of implementation time for me. I've been having some trouble figuring out how to use the LASSO (via LARS), though. I was trained with the "gamma" notation for the minimization function for

Re: [Scikit-learn-general] Online learning

2012-07-30 Thread Abhi
Olivier Grisel writes: > > 2012/7/25 Abhi : > > > > Hello, > > Sorry for getting back late. I had originally experimented with different > > classifiers, including SGDClassifier; it seemed faster but much less accurate, > > about 93% for 3 emails [and decreasing as the number of emails

Re: [Scikit-learn-general] Gradient Boosting: Early stopping for setting learn_rate and n_estimators

2012-07-30 Thread Mathieu Blondel
OrthogonalMatchingPursuit has a similar issue: https://github.com/scikit-learn/scikit-learn/issues/930 BTW, Gradient Boosting (in its general form) and Matching Pursuit are very similar algorithms. Mathieu

[Scikit-learn-general] Custom class priors in GaussianNB

2012-07-30 Thread Martin Ledoux
Hi, We're working on adding a parameter to set custom class prior probabilities in the Gaussian Naive Bayes classifier (see pull request 987). Unfortunately, because of a name clash with an existing deprecated parameter, we can't use the name

Re: [Scikit-learn-general] Gradient Boosting: Early stopping for setting learn_rate and n_estimators

2012-07-30 Thread Conrad Lee
Thanks Peter for your quick reply. ``fit_stage`` is an internal method and not intended for "public" use > - I'll add a scope guard to make this explicit. > Ok, if it's private then it's fair enough to be undocumented -- by "guard", do you mean that you'll add another underscore to the name? I i

Re: [Scikit-learn-general] advice on classification task ?

2012-07-30 Thread Jim Vickroy
On 7/30/2012 7:41 AM, Gael Varoquaux wrote: > Hi Jim, > > It is not possible for us to give general advice: there is no universal > classifier that works for all datasets (this is known as the "no free lunch" > theorem). > > If you have a lot of training data, you can try gradient boosted trees, > or

Re: [Scikit-learn-general] advice on classification task ?

2012-07-30 Thread Gael Varoquaux
Hi Jim, It is not possible for us to give general advice: there is no universal classifier that works for all datasets (this is known as the "no free lunch" theorem). If you have a lot of training data, you can try gradient boosted trees, or maybe random forests. If your training data is limited, I

Re: [Scikit-learn-general] Gradient Boosting: Early stopping for setting learn_rate and n_estimators

2012-07-30 Thread Peter Prettenhofer
Hi Conrad, 2012/7/30 Conrad Lee : > Two of the most important parameters of the gradient boosting classifier are > the learn_rate and n_estimators. In order to set these, the documentation > states: > >> [HTF2009] recommend to set the learning rate to a small constant (e.g. >> learn_rate <= 0.1)

[Scikit-learn-general] Gradient Boosting: Early stopping for setting learn_rate and n_estimators

2012-07-30 Thread Conrad Lee
Two of the most important parameters of the gradient boosting classifier are the learn_rate and n_estimators. In order to set these, the documentation states: [HTF2009] recommend to set the learning rate to a small constant (e.g. > learn_
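
One way to approach the question is sketched below, assuming a recent scikit-learn release in which the parameter is called `learning_rate` (the thread's `learn_rate` is the older name). The idea is to fix a small learning rate, fit a generous number of stages once, and then use `staged_predict` on a held-out set to see at which stage the validation error bottoms out. The dataset, split sizes, and variable names are illustrative assumptions, not taken from the thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data and a held-out validation split (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small learning rate, deliberately large number of boosting stages.
clf = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=200, random_state=0
)
clf.fit(X_train, y_train)

# staged_predict yields the predictions after each boosting stage, so the
# validation error for every candidate n_estimators comes from one fit.
val_errors = [np.mean(pred != y_val) for pred in clf.staged_predict(X_val)]
best_n_estimators = int(np.argmin(val_errors)) + 1

print("best n_estimators on the validation set:", best_n_estimators)
```

This avoids refitting the model for every candidate value of n_estimators; a different learning rate still needs its own fit.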

Re: [Scikit-learn-general] stepwise regression

2012-07-30 Thread Skipper Seabold
On Sat, Jul 28, 2012 at 3:13 PM, Zach Bastick wrote: > The docs do not indicate whether there is any way to do a stepwise > regression in scikit-learn or in Python. > All there seems to be is linear_model.LinearRegression(). > > This function outputs resulting x-values/beta-values/coefficients that

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
> Take the example of 1-NN: it can very well happen that for samples > close to the Voronoi boundary, the closest neighbor is on the other > side of the boundary. Indeed, I was bullshitting.

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Mathieu Blondel
Solving this issue in a generic way would be nice: https://github.com/scikit-learn/scikit-learn/issues/325 On Mon, Jul 30, 2012 at 6:43 PM, Olivier Grisel wrote: > Actually I think the KNeighborsClassifier implementation in > scikit-learn has a real memory usage issue in "brute" mode

Re: [Scikit-learn-general] Why do GridSearch and CrossValidation results differ?

2012-07-30 Thread Jaques Grobler
Hey Tobias, I ran that script and got the same output as you:

GRID SEARCH:
Best f1_score: 0.556
Best parameters set:
    alpha: 0.0001
    loss: 'log'
    penalty: 'l1'
    seed: 0

CROSS VALIDATION:
Best f1_score: 0.52 (+/- 0.05)

I haven't had time this morning (chaotic and all!) to see if I can figure anythi

Re: [Scikit-learn-general] Why do GridSearch and CrossValidation results differ?

2012-07-30 Thread Tobias Günther
Hey! Did anybody run that script and can confirm that the results differ on their machine as well, or maybe even have an idea why the results between the CrossValidation and the best classifier of the GridSearch differ? Should I put this as an issue on Github? Best, Tobias On Fri, Jul 27, 2012 a

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
2012/7/30 Gael Varoquaux : > On Mon, Jul 30, 2012 at 11:52:36AM +0200, Olivier Grisel wrote: >> > In addition, a Voronoi tessellation computed with KMeans during >> > training could be used to avoid testing all the samples in the large n >> > situation. > >> Hum, that won't work for exact k-NN. >

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Andreas Müller
In general, looking at neighboring centroids only is approximate and one of the strategies implemented in FLANN (nearest neighbors using K-Means trees). If you use the k nearest centroids, I guess the chance that you find the exact k nearest neighbors is quite good, though. I am not sure how wel

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
On Mon, Jul 30, 2012 at 11:52:36AM +0200, Olivier Grisel wrote: > > In addition, a Voronoi tessellation computed with KMeans during > > training could be used to avoid testing all the samples in the large n > > situation. > Hum, that won't work for exact k-NN. I don't understand. Yes I do belie

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
2012/7/30 Gael Varoquaux : > On Mon, Jul 30, 2012 at 11:43:01AM +0200, Olivier Grisel wrote: >> This could be worked around by chunking the data argument of the >> predict calls instead. > > Indeed. > > In addition, a Voronoi tessellation computed with KMeans during > training could be used to av

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
Ji, could you please create a new github issue to track this bug? https://github.com/scikit-learn/scikit-learn/issues

Please include the Python snippets of your notebook as a verbatim markdown block in the issue:

```python
from sklearn.neighbors import KNeighborsClassifier
...
```

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
On Mon, Jul 30, 2012 at 11:43:01AM +0200, Olivier Grisel wrote: > This could be worked around by chunking the data argument of the > predict calls instead. Indeed. In addition, a Voronoi tessellation computed with KMeans during training could be used to avoid testing all the samples in the larg
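
The chunking workaround can be sketched roughly as follows, assuming a recent scikit-learn release, where the classifier is named `KNeighborsClassifier`. The helper `predict_in_chunks` and its `chunk_size` argument are hypothetical names introduced here, not part of the library. The point is only that each `predict` call sees a slice of the query data, so at most a (chunk_size, n_samples_train) block of distances is needed at a time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def predict_in_chunks(clf, X, chunk_size=1000):
    """Call clf.predict on successive row slices of X and join the results.

    Hypothetical helper: it bounds the number of query rows handled per
    call, which bounds the size of the distance block computed at once.
    """
    parts = [
        clf.predict(X[start:start + chunk_size])
        for start in range(0, X.shape[0], chunk_size)
    ]
    return np.concatenate(parts)


# Synthetic data (illustrative only).
rng = np.random.RandomState(0)
X_train = rng.randn(500, 5)
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.randn(230, 5)

clf = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
clf.fit(X_train, y_train)

y_chunked = predict_in_chunks(clf, X_test, chunk_size=50)
print(y_chunked.shape)
```

Since each prediction depends only on its own query row, the chunked result should match a single `predict` call on the full array.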

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
Actually I think the KNeighborsClassifier implementation in scikit-learn has a real memory usage issue in "brute" mode (which is selected). I suspect it is materializing the whole (n_samples_train, n_samples_predict) distances array in memory before computing the (n_samples_predict * k

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
Hi Ji, What you are trying to do is called 'online fitting'. Only a small number of models can do online fitting. This is implemented in the scikit-learn with a 'partial_fit' method. As far as supervised learning goes, only SGD does online learning, I believe. http://scikit-learn.org/stable/module
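
A minimal sketch of what online fitting with `partial_fit` can look like, assuming a recent scikit-learn release and an in-memory synthetic array standing in for chunks read from a large file (the data and the batch size of 100 are illustrative assumptions). The full list of classes has to be passed on the first call, because a single batch may not contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit at once.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # all labels must be declared up front

# Feed the model one batch at a time; only the current batch is needed.
for start in range(0, X.shape[0], 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

accuracy = clf.score(X, y)
print("accuracy after one pass:", accuracy)
```

In a real setting each batch would come from reading the next block of rows from disk, and several passes over the data may be needed.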