[Scikit-learn-general] Confusing return results of SVC.decision_function()

2012-03-25 Thread xinfan meng
I use the following code to obtain decision values for the SVC classifier clf. --- In [5]: >>> clf = svm.SVC() In [23]: >>> X = [[0], [1], [2]] In [24]: >>> Y = [0, 1, 2] In [25]: clf.fit(X, Y) Out[25]: SV
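
A minimal sketch of the behaviour in question, assuming the confusion is about the shape of the returned array: for three classes, SVC trains one-vs-one, so decision_function returns one column per pair of classes rather than one per class (recent scikit-learn versions reshape this to per-class columns via the decision_function_shape='ovr' default).

    from sklearn import svm

    X = [[0], [1], [2]]
    Y = [0, 1, 2]
    clf = svm.SVC().fit(X, Y)
    # one column per class pair: 0-vs-1, 0-vs-2, 1-vs-2
    print(clf.decision_function([[1]]).shape)  # (1, 3)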

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Mon, Mar 26, 2012 at 09:48:52AM +1100, Robert Layton wrote: >It's a good description of DBSCAN. I would point out that the outliers are >found as "The points which do not belong to any current cluster and do not >have enough close neighbours to start a new cluster." Thanks, I have a
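
A toy illustration of the outlier definition quoted above (eps and min_samples values are illustrative only): points that end up in no cluster receive the label -1.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[0.0], [0.1], [0.2], [10.0]])  # last point is isolated
    print(DBSCAN(eps=0.5, min_samples=2).fit(X).labels_)  # [ 0  0  0 -1]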

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Robert Layton
On 26 March 2012 09:38, Gael Varoquaux wrote: > On Mon, Mar 26, 2012 at 12:27:37AM +0200, Andreas wrote: > > Well as you can tell my motivation for working on the examples > > and the data sets was not all altruistic ;) > > The key to success in a shared project is that every actor should get a >

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Mon, Mar 26, 2012 at 12:27:37AM +0200, Andreas wrote: > Well as you can tell my motivation for working on the examples > and the data sets was not all altruistic ;) The key to success in a shared project is that every actor should get a benefit. I don't work on the scikit for the glory of manki

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/26/2012 12:31 AM, Gael Varoquaux wrote: > On Mon, Mar 26, 2012 at 12:22:53AM +0200, Andreas wrote: > >> Thanks for the great work. This is really a step forward for the docs! >> > Thanks guys. I must confess that I had a presentation to give tomorrow > about clustering and I jumped o

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Mon, Mar 26, 2012 at 12:22:53AM +0200, Andreas wrote: > Thanks for the great work. This is really a step forward for the docs! Thanks guys. I must confess that I had a presentation to give tomorrow about clustering and I jumped on the occasion to improve the docs. Gael

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/26/2012 12:19 AM, Gael Varoquaux wrote: > Thanks for all the feedback. I have included it and merged it to master, > because I was running out of time, but it can still be improved! > > Thanks for the great work. This is really a step forward for the docs!

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Mon, Mar 26, 2012 at 09:21:21AM +1100, Robert Layton wrote: >This is great, Thanks, >and I think it would be a good idea to include such a >summary table for classification at some point as well. Yes. Actually I believe that every main use case should have one, at the beginning of

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Robert Layton
On 26 March 2012 09:19, Gael Varoquaux wrote: > Thanks for all the feedback. I have included it and merged it to master, > because I was running out of time, but it can still be improved! > > Gael

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
Thanks for all the feedback. I have included it and merged it to master, because I was running out of time, but it can still be improved! Gael

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/26/2012 12:06 AM, Gael Varoquaux wrote: > On Sun, Mar 25, 2012 at 11:56:31PM +0200, Andreas wrote: > >> As far as I can see, your groups are "KMeans + Ward" and "rest". >> I don't know how ward works but looking at the lena example, >> the clusters don't seem to be convex. >> > But

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 11:56:31PM +0200, Andreas wrote: > As far as I can see, your groups are "KMeans + Ward" and "rest". > I don't know how ward works but looking at the lena example, > the clusters don't seem to be convex. But you are looking in the wrong space: the physical space, and not the

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/25/2012 11:47 PM, Gael Varoquaux wrote: > On Sun, Mar 25, 2012 at 11:38:50PM +0200, Andreas wrote: > >>> Unlike something like spectral clustering, it is the Euclidean distance >>> to the centers that is minimized. Thus K-Means will seek clusters that >>> are regular in the flat Euclidean

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 11:38:50PM +0200, Andreas wrote: > > Unlike something like spectral clustering, it is the Euclidean distance > > to the centers that is minimized. Thus K-Means will seek clusters that > > are regular in the flat Euclidean space. > Ok, that's right. Though I would argue that

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
> Unlike something like spectral clustering, it is the Euclidean distance > to the centers that is minimized. Thus K-Means will seek clusters that > are regular in the flat Euclidean space. > > Ok, that's right. Though I would argue that the distance measure is not the only factor here. MeanSh

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 11:30:36PM +0200, Andreas wrote: > On 03/25/2012 11:32 PM, Gael Varoquaux wrote: > > On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote: > >> Without looking at the source, it could be that we initialize GMM > >> with the result of KMeans. > > We do. > Then I would s

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/25/2012 11:32 PM, Gael Varoquaux wrote: > On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote: > >> Without looking at the source, it could be that we initialize GMM >> with the result of KMeans. >> > We do. > > Then I would suggest changing that. Although not sure what the

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 11:22:32PM +0200, Andreas wrote: > >> I'm not sure if "flat geometry" is a good way to describe the case that > >> KMeans works in. I would have said "convex clusters". Not sure to what extent > >> that applies to hierarchical clustering, though. > > Euclidean distance. > Can

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote: > Without looking at the source, it could be that we initialize GMM > with the result of KMeans. We do. > I read that if you do this, the GMM > solution rarely changes. Not surprising. > Instead, one should only run KMeans for one or two i
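
A hedged sketch of the suggestion being made here, using the modern GaussianMixture API (the code discussed in 2012 used sklearn.mixture.GMM; the means_init wiring and the truncated k-means run below are assumptions about how one would implement it):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X = np.random.RandomState(0).randn(300, 2)

    # run k-means for only a couple of iterations, as suggested
    km = KMeans(n_clusters=3, max_iter=2, n_init=1, random_state=0).fit(X)

    # seed the mixture model with the rough k-means centers
    gmm = GaussianMixture(n_components=3,
                          means_init=km.cluster_centers_,
                          random_state=0).fit(X)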

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/25/2012 11:20 PM, Gael Varoquaux wrote: > On Sun, Mar 25, 2012 at 10:51:36PM +0200, Gael Varoquaux wrote: > >>> - You should at least refer to GMMs, as this is the most popular >>> clustering framework that comes with a natural probabilistic setting >>> > >> Agreed. >> >

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
>> I'm not sure if "flat geometry" is a good way to describe the case that >> KMeans works in. I would have said "convex clusters". Not sure to what extent >> that applies to hierarchical clustering, though. >> > Euclidean distance. > Can you please elaborate? >> Also, I would mention explic

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 10:51:36PM +0200, Gael Varoquaux wrote: > > - You should at least refer to GMMs, as this is the most popular > > clustering framework that comes with a natural probabilistic setting > Agreed. Actually, on our various examples, it is impressive how similarly GMMs behave

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 10:12:59PM +0200, Andreas wrote: > For the input, I would hope we can implement Olivier's proposal soon > so that we don't need to differentiate the different input types. Agreed. It was literally itching me when I was playing with the example. > I'm not sure if "flat geomet

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 10:22:39PM +0200, Andreas wrote: > I might not have the time next week but after that I can give > it a shot if you don't have the time. It would be great, as I am not a specialist of this method. Gael --

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 10:10:51PM +0200, bthirion wrote: > - "Hierarchical clustering -> Few clusters": I thought it was not the > best use case for these algorithms Yes, this is clearly a typo. > - "Hierarchical clustering -> even cluster size": this is not true if > you consider single linka

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
> - You should at least refer to GMMs, as this is the most popular > clustering framework that comes with a natural probabilistic setting > +1 > - With mean shift, I would refer to 'modes' rather than 'blobs'. > +1 In general the mean shift docs could be improved a lot. There is quite a n

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
On 03/25/2012 10:25 PM, Olivier Grisel wrote: > On 25 March 2012 22:12, Andreas wrote: > >> ps: Maybe I'll find time to do the "fit_distance"/"fit_kernel" API in >> one or two weeks. >> > As discussed earlier, I would prefer `fit_symmetric` or `fit_pairwise` > when working with squared

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Olivier Grisel
On 25 March 2012 22:12, Andreas wrote: > > ps: Maybe I'll find time to do the "fit_distance"/"fit_kernel" API in > one or two weeks. As discussed earlier, I would prefer `fit_symmetric` or `fit_pairwise` when working with squared distance / affinity / kernel matrices as main data input. -- Ol
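
A purely hypothetical sketch of the API being debated; none of fit_distance, fit_kernel, fit_symmetric, or fit_pairwise exists in scikit-learn, which later settled on metric='precomputed' / affinity='precomputed' constructor flags instead:

    class PairwiseClusterer:
        def fit(self, X):
            """Fit from an (n_samples, n_features) data matrix."""
            raise NotImplementedError

        def fit_pairwise(self, D):
            """Fit from a precomputed symmetric (n_samples, n_samples)
            distance / affinity / kernel matrix, as proposed here."""
            raise NotImplementedError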

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Andreas
> I am working on a summary table on clustering methods. It is not > finished; I need to do a bit more literature review. However, I'd love > some feedback on the current status: > https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst > > > Thanks for starting on

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread bthirion
Hi Gael, Here are some suggestions regarding details of the page: - "Hierarchical clustering -> Few clusters": I thought it was not the best use case for these algorithms - "Hierarchical clustering -> even cluster size": this is not true if you consider single linkage, or even in general with Wa

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Andreas
On 03/25/2012 07:40 PM, Peter Prettenhofer wrote: > 2012/3/25 Scott White: > >> Btw, another thing I think we should add is the ability to monitor the >> out-of-bag estimates after each iteration and allow the fitting to be >> terminated early. It's usually hard to guess the right number of >>

Re: [Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Olivier Grisel
On 25 March 2012 20:40, Gael Varoquaux wrote: > Hi list, > > I am working on a summary table on clustering methods. It is not > finished; I need to do a bit more literature review. However, I'd love > some feedback on the current status: > https://github.com/GaelVaroquaux/scikit-learn/blob/maste

[Scikit-learn-general] Summary table on clustering

2012-03-25 Thread Gael Varoquaux
Hi list, I am working on a summary table on clustering methods. It is not finished; I need to do a bit more literature review. However, I'd love some feedback on the current status: https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst Cheers, Gaël

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Scott White
On Mar 25, 2012, at 10:37 AM, Peter Prettenhofer wrote: > 2012/3/25 Scott White : >> I have noticed also that GBRT can be quite slow even for small k. >> Since GBRT fits n*k decision trees, by design, it's crucial that the >> decision tree code be highly optimized. I did some quick performance

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Scott White
Yes, and with oob_estimates and monitor some of the basic building blocks are in place, but they need fleshing out. It would also be nice to be able to resume a model from where you left off if it was terminated too early. Sent from my iPhone On Mar 25, 2012, at 10:40 AM, Peter Prettenhofer wrote: > 2012/3/25 Scott

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Peter Prettenhofer
2012/3/25 Scott White : > Btw, another thing I think we should add is the ability to monitor the > out-of-bag estimates after each iteration and allow the fitting to be > terminated early. It's usually hard to guess the right number of > iterations required and if one can terminate the fitting earl

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Peter Prettenhofer
2012/3/25 Scott White : > I have noticed also that GBRT can be quite slow even for small k. > Since GBRT fits n*k decision trees, by design, it's crucial that the > decision tree code be highly optimized. I did some quick performance > profiling the other day, which showed that the performance bott

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Scott White
Btw, another thing I think we should add is the ability to monitor the out-of-bag estimates after each iteration and allow the fitting to be terminated early. It's usually hard to guess the right number of iterations required and if one can terminate the fitting early based on good oob results that
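
scikit-learn later grew roughly the feature requested here; the sketch below assumes that later API (oob_improvement_ populated when subsample < 1, plus a monitor callable passed to fit that stops training by returning True) rather than any code from this thread:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=500, random_state=0)

    def early_stop(i, est, locals_):
        # stop once the last 10 out-of-bag improvements are all negative
        return i > 10 and all(v < 0 for v in est.oob_improvement_[i - 10:i])

    gbrt = GradientBoostingRegressor(n_estimators=1000, subsample=0.5,
                                     random_state=0)
    gbrt.fit(X, y, monitor=early_stop)
    print(len(gbrt.train_score_))  # number of stages actually fitted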

Re: [Scikit-learn-general] More easy fix issues

2012-03-25 Thread Lars Buitinck
On 25 March 2012 18:14, Andreas wrote the following: > Hi everybody. > Rachee (?) just asked at > https://github.com/scikit-learn/scikit-learn/issues/719 whether that would > be a good place to start. > Could someone please voice their opinion on whether this is a good thing to > have? > Oth

[Scikit-learn-general] More easy fix issues

2012-03-25 Thread Andreas
Hi everybody. Rachee (?) just asked at https://github.com/scikit-learn/scikit-learn/issues/719 whether that would be a good place to start. Could someone please voice their opinion on whether this is a good thing to have? Otherwise I would suggest doing #559, as not much has happened there recently.

Re: [Scikit-learn-general] extra trees

2012-03-25 Thread Gilles Louppe
Hi Satrajit, Adding more trees should never hurt accuracy. The more, the better. Since you have a lot of irrelevant features, I'd advise increasing max_features in order to capture the relevant features when computing the random splits. Otherwise, your trees will indeed fit on noise. Best, Gi
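
A minimal sketch of this advice; the dataset shape and parameter values are illustrative assumptions. Raise max_features so the random splits have a chance of seeing the few informative features, and keep the forest large:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier

    # few samples x many mostly-irrelevant features
    X, y = make_classification(n_samples=100, n_features=1000,
                               n_informative=10, random_state=0)
    clf = ExtraTreesClassifier(n_estimators=500, max_features=100,
                               random_state=0).fit(X, y)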

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Mathieu Blondel
On Mon, Mar 26, 2012 at 12:09 AM, Peter Prettenhofer wrote: >  1. We need to support the query id (=``qid``) field in > ``svmlight_loader``; Pair-wise approaches such as RankingSVMs need > this information to form example pairs. My personal experience is that > RankingSVMs do surprisingly poorly on
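
The qid support requested here was later added to scikit-learn's loader; the sketch below uses that later API with a hypothetical file name:

    from sklearn.datasets import load_svmlight_file

    # query_id=True makes the loader return the qid field as a third array
    X, y, qid = load_svmlight_file("ranking_data.txt", query_id=True)
    # pairwise RankingSVM examples are then formed within each qid group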

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Olivier Grisel
On 25 March 2012 17:09, Peter Prettenhofer wrote: > 2012/3/25 Olivier Grisel : >> On 25 March 2012 12:44, Peter Prettenhofer >> wrote: >>> Olivier, >>> >>> In my experience GBRT usually requires more base learners than random >>> forests to get the same level of accuracy. I hardly use less th

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Peter Prettenhofer
2012/3/25 Olivier Grisel : > On 25 March 2012 12:44, Peter Prettenhofer > wrote: >> Olivier, >> >> In my experience GBRT usually requires more base learners than random >> forests to get the same level of accuracy. I hardly use less than 100. >> Regarding the poor performance of GBRT on the oliv

Re: [Scikit-learn-general] Stacking classifier

2012-03-25 Thread xinfan meng
This paper is about using Stacking to significantly boost the multi-class accuracy of sentiment classification: "The Importance of Neutral Examples for Learning Sentiment". I found this paper interesting and the experimental results a bit surprising. On Sun, Mar 25, 2012 at 10:28 PM, Satrajit Ghos

Re: [Scikit-learn-general] Stacking classifier

2012-03-25 Thread Satrajit Ghosh
hi olivier, > Good to know. Among the references from the Wikipedia article, the > following seems particularly interesting: > > http://arxiv.org/abs/0911.0460 i just read that last night after this series of emails. the introduction is nice, but the point they are trying to make is a little stra

Re: [Scikit-learn-general] Stacking classifier

2012-03-25 Thread Olivier Grisel
On 25 March 2012 03:33, xinfan meng wrote: > > > On Sun, Mar 25, 2012 at 9:26 AM, Olivier Grisel > wrote: >> >> On 25 March 2012 01:38, xinfan meng wrote: >> > Hi, list >> > >> > I am looking for a stacking classifier implementation. It falls >> > into >> > the category of ensemble cl

Re: [Scikit-learn-general] extra trees

2012-03-25 Thread Satrajit Ghosh
thanks paolo, will give all of this a try. i'll also send a pr with a section on patterns for sklearn. although this pattern might be specific to my problem domain, having more real-world scripts/examples that reflect such considerations might be useful to the community. cheers, satra On Sun, M

Re: [Scikit-learn-general] extra trees

2012-03-25 Thread Paolo Losi
Hi Satrajit, On Sun, Mar 25, 2012 at 3:02 PM, Satrajit Ghosh wrote: > hi gilles, > > when dealing with skinny matrices of the type few samples x lots of > features, what are the recommendations for extra trees in terms of max > features and number of estimators? as far as number of estimators (t

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Olivier Grisel
On 25 March 2012 15:47, Gael Varoquaux wrote: > On Sun, Mar 25, 2012 at 03:40:52PM +0200, Olivier Grisel wrote: >> BTW Learning-to-Rank seems to be a very important application domain >> that we do not cover well in scikit-learn. > > Yes, @fabianp is working on things in this flavour currently.

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Olivier Grisel
On 25 March 2012 12:16, Gilles Louppe wrote: > Hi Olivier, > > The higher the number of estimators, the better. The more random the > trees (e.g., the lower max_features), the more important it usually is > to have a large forest to decrease the variance. To me, 10 is actually > a very low defau

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Gael Varoquaux
On Sun, Mar 25, 2012 at 03:40:52PM +0200, Olivier Grisel wrote: > BTW Learning-to-Rank seems to be a very important application domain > that we do not cover well in scikit-learn. Yes, @fabianp is working on things in this flavour currently. I think that he needs a bit of cheering to integrate the

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Olivier Grisel
On 25 March 2012 12:44, Peter Prettenhofer wrote: > Olivier, > > In my experience GBRT usually requires more base learners than random > forests to get the same level of accuracy. I hardly use less than 100. > Regarding the poor performance of GBRT on the olivetti dataset: > multi-class GBRT fit

Re: [Scikit-learn-general] extra trees

2012-03-25 Thread Paolo Losi
On Sun, Mar 25, 2012 at 3:32 PM, Paolo Losi wrote: > You could rank features by feature importance and perform recursive feature > limitation s/recursive feature limitation/recursive feature elimination/
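
A sketch of the recursive feature elimination being described, using scikit-learn's RFE wrapper around an importance-producing estimator (the parameter values, and the use of this particular wrapper, are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import RFE

    X, y = make_classification(n_samples=100, n_features=500,
                               n_informative=10, random_state=0)
    # drop 20% of the remaining features per round, ranked by importance
    rfe = RFE(ExtraTreesClassifier(n_estimators=100, random_state=0),
              n_features_to_select=20, step=0.2).fit(X, y)
    print(rfe.support_.sum())  # 20 features retained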

[Scikit-learn-general] extra trees

2012-03-25 Thread Satrajit Ghosh
hi gilles, when dealing with skinny matrices of the type few samples x lots of features, what are the recommendations for extra trees in terms of max features and number of estimators? also, if a lot of the features are nuisance and most are noisy, are there any recommendations for feature reductio

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Peter Prettenhofer
A quick follow-up on my previous email: >> On 25 March 2012 03:49, Olivier Grisel wrote: >>> [..] >>> >>> Another way to rephrase that question: what is the typical sweet spot >>> for the dataset shape when doing classification Gradient Boosted >>> Trees? What are reasonable values for the number

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Peter Prettenhofer
Olivier, In my experience GBRT usually requires more base learners than random forests to get the same level of accuracy. I hardly use less than 100. Regarding the poor performance of GBRT on the olivetti dataset: multi-class GBRT fits ``k`` trees at each stage, thus, if you have ``n_estimators``
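
Illustrating this point about multi-class GBRT on toy data (dataset and settings are illustrative only): one tree per class is fitted at every boosting stage, so estimators_ holds n_estimators * n_classes trees in total.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=200, n_classes=3, n_informative=6,
                               random_state=0)
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.estimators_.shape)  # (50, 3): 150 trees in total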

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Gilles Louppe
Hi Olivier, The higher the number of estimators, the better. The more random the trees (e.g., the lower max_features), the more important it usually is to have a large forest to decrease the variance. To me, 10 is actually a very low default value. In my daily research, I deal with hundreds of tre
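
A rough sketch of the variance argument with illustrative settings: with very random trees (max_features=1 here), averaging more of them stabilizes the cross-validated scores.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=50, random_state=0)
    for n in (10, 100, 500):
        scores = cross_val_score(
            ExtraTreesClassifier(n_estimators=n, max_features=1,
                                 random_state=0),
            X, y, cv=5)
        # mean accuracy rises and spread shrinks as the forest grows
        print(n, scores.mean().round(3), scores.std().round(3))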