I use the following code to obtain decision values for the SVC classifier clf.
---
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> X = [[0], [1], [2]]
>>> Y = [0, 1, 2]
>>> clf.fit(X, Y)
SVC(...)
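For reference, a minimal sketch of pulling the decision values out of the fitted classifier, assuming the standard scikit-learn SVC API:

from sklearn import svm

X = [[0], [1], [2]]
Y = [0, 1, 2]
clf = svm.SVC().fit(X, Y)

# decision_function gives the signed distances to the separating
# hyperplanes; the exact column layout depends on the multi-class
# scheme in use (one-vs-one pairs vs one-vs-rest).
scores = clf.decision_function(X)
print(scores.shape)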
On Mon, Mar 26, 2012 at 09:48:52AM +1100, Robert Layton wrote:
>It's a good description of DBSCAN. I would point out that the outliers are
>found as "The points which do not belong to any current cluster and do not
>have enough close neighbours to start a new cluster."
Thanks, I have a
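For what it's worth, that outlier convention is visible directly in the estimator's output: points assigned to no cluster get the label -1. A minimal sketch on toy data, assuming the current DBSCAN API:

import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups plus one isolated point that has too few close
# neighbours to start a cluster of its own
X = np.array([[0.0], [0.1], [0.2],
              [5.0], [5.1], [5.2],
              [20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the isolated point gets the noise label -1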
On 26 March 2012 09:38, Gael Varoquaux wrote:
> On Mon, Mar 26, 2012 at 12:27:37AM +0200, Andreas wrote:
> > Well as you can tell my motivation for working on the examples
> > and the data sets was not all altruistic ;)
>
> The key to success in a shared project is that every actor should get a
>
On Mon, Mar 26, 2012 at 12:27:37AM +0200, Andreas wrote:
> Well as you can tell my motivation for working on the examples
> and the data sets was not all altruistic ;)
The key to success in a shared project is that every actor should get a
benefit. I don't work on the scikit for the glory of mankind
On 03/26/2012 12:31 AM, Gael Varoquaux wrote:
> On Mon, Mar 26, 2012 at 12:22:53AM +0200, Andreas wrote:
>
>> Thanks for the great work. This is really a step forward for the docs!
>>
> Thanks guys. I must confess that I had a presentation to give tomorrow
> about clustering and I jumped o
On Mon, Mar 26, 2012 at 12:22:53AM +0200, Andreas wrote:
> Thanks for the great work. This is really a step forward for the docs!
Thanks guys. I must confess that I had a presentation to give tomorrow
about clustering and I jumped on the occasion to improve the docs.
Gael
On 03/26/2012 12:19 AM, Gael Varoquaux wrote:
> Thanks for all the feedback. I have included it and merged it to master,
> because I was running out of time, but it can still be improved!
>
>
Thanks for the great work. This is really a step forward for the docs!
---
On Mon, Mar 26, 2012 at 09:21:21AM +1100, Robert Layton wrote:
>This is great,
Thanks,
>and I think it would be a good idea to include such a
>summary table for classification at some point as well.
Yes. Actually I believe that every main use case should have one, at the
beginning of
On 26 March 2012 09:19, Gael Varoquaux wrote:
> Thanks for all the feedback. I have included it and merged it to master,
> because I was running out of time, but it can still be improved!
>
> Gael
>
>
Thanks for all the feedback. I have included it and merged it to master,
because I was running out of time, but it can still be improved!
Gael
On 03/26/2012 12:06 AM, Gael Varoquaux wrote:
> On Sun, Mar 25, 2012 at 11:56:31PM +0200, Andreas wrote:
>
>> As far as I can see, your groups are "KMeans + Ward" and "rest".
>> I don't know how ward works but looking at the lena example,
>> the clusters don't seem to be convex.
>>
> But
On Sun, Mar 25, 2012 at 11:56:31PM +0200, Andreas wrote:
> As far as I can see, your groups are "KMeans + Ward" and "rest".
> I don't know how ward works but looking at the lena example,
> the clusters don't seem to be convex.
But you are looking in the wrong space: the physical space, and not the
On 03/25/2012 11:47 PM, Gael Varoquaux wrote:
> On Sun, Mar 25, 2012 at 11:38:50PM +0200, Andreas wrote:
>
>>> Unlike something like spectral clustering, it is the euclidean distance
>>> to the centers that is minimized. Thus K-Means will seek clusters that
>>> are regular in the flat euclidean
On Sun, Mar 25, 2012 at 11:38:50PM +0200, Andreas wrote:
> > Unlike something like spectral clustering, it is the euclidean distance
> > to the centers that is minimized. Thus K-Means will seek clusters that
> > are regular in the flat euclidean space.
> Ok, that's right. Though I would argue that
> Unlike something like spectral clustering, it is the euclidean distance
> to the centers that is minimized. Thus K-Means will seek clusters that
> are regular in the flat euclidean space.
>
>
Ok, that's right. Though I would argue that the distance measure
is not the only factor here. MeanSh
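To make the geometry point concrete, a small sketch on the classic two-moons toy data (a hypothetical example, not from the docs): K-Means, which minimizes euclidean distance to the centers, tends to cut the moons in half, while spectral clustering, working in the space induced by a neighbourhood graph, typically recovers them.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-Means clusters are Voronoi cells around the centers, hence convex
# in the input space, so the two non-convex moons get split.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering works in the space induced by a nearest-neighbour
# affinity graph, where the two moons are well separated.
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               random_state=0).fit_predict(X)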
On Sun, Mar 25, 2012 at 11:30:36PM +0200, Andreas wrote:
> On 03/25/2012 11:32 PM, Gael Varoquaux wrote:
> > On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote:
> >> Without looking at the source, it could be that we initialize GMM
> >> with the result of KMeans.
> > We do.
> Then I would s
On 03/25/2012 11:32 PM, Gael Varoquaux wrote:
> On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote:
>
>> Without looking at the source, it could be that we initialize GMM
>> with the result of KMeans.
>>
> We do.
>
>
Then I would suggest changing that.
Although not sure what the
On Sun, Mar 25, 2012 at 11:22:32PM +0200, Andreas wrote:
> >> I'm not sure if "flat geometry" is a good way to describe the case that
> >> KMeans works in. I would have said "convex clusters". Not sure in how far
> >> that applies to hierarchical clustering, though.
> > Euclidean distance.
> Can
On Sun, Mar 25, 2012 at 11:23:55PM +0200, Andreas wrote:
> Without looking at the source, it could be that we initialize GMM
> with the result of KMeans.
We do.
> I read that if you do this, the GMM
> solution rarely changes.
Not surprising.
> Instead, one should only run KMeans for one or two i
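In the current API the initialization is an explicit parameter, so the two behaviours can at least be compared; a sketch using GaussianMixture, the class that later replaced the GMM discussed here:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# means initialized from a k-means run (the behaviour discussed above)
gmm_kmeans = GaussianMixture(n_components=3, init_params='kmeans',
                             random_state=0).fit(X)

# purely random responsibilities, so EM is not anchored to the
# k-means solution
gmm_random = GaussianMixture(n_components=3, init_params='random',
                             random_state=0).fit(X)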
On 03/25/2012 11:20 PM, Gael Varoquaux wrote:
> On Sun, Mar 25, 2012 at 10:51:36PM +0200, Gael Varoquaux wrote:
>
>>> - You should at least refer to GMMs, as this is the most popular
>>> clustering framework that comes with a natural probabilistic setting
>>>
>
>> Agreed.
>>
>
>> I'm not sure if "flat geometry" is a good way to describe the case that
>> KMeans works in. I would have said "convex clusters". Not sure in how far
>> that applies to hierarchical clustering, though.
>>
> Euclidean distance.
>
Can you please elaborate?
>> Also, I would mention explic
On Sun, Mar 25, 2012 at 10:51:36PM +0200, Gael Varoquaux wrote:
> > - You should at least refer to GMMs, as this is the most popular
> > clustering framework that comes with a natural probabilistic setting
> Agreed.
Actually, on our various examples, it is impressive how much GMMs behave
similar
On Sun, Mar 25, 2012 at 10:12:59PM +0200, Andreas wrote:
> For the input, I would hope we can implement Olivier's proposal soon
> so that we don't need to differentiate the different input types.
Agreed. It was literally itching me when I was playing with the example.
> I'm not sure if "flat geomet
On Sun, Mar 25, 2012 at 10:22:39PM +0200, Andreas wrote:
> I might not have the time next week but after that I can give
> it a shot if you don't have the time.
It would be great, as I am not a specialist of this method.
Gael
On Sun, Mar 25, 2012 at 10:10:51PM +0200, bthirion wrote:
> - "Hierarchical clustering -> Few clusters": I thought it was not the
> best use case for these algorithms
Yes, this is clearly a typo.
> - "Hierarchical clustering -> even cluster size": this is not true if
> you consider single linka
> - You should at least refer to GMMs, as this is the most popular
> clustering framework that comes with a natural probabilistic setting
>
+1
> - With mean shift, I would refer to 'modes' rather than 'blobs'.
>
+1
In general the mean shift docs could be improved a lot.
There is quite a n
On 03/25/2012 10:25 PM, Olivier Grisel wrote:
> On 25 March 2012 22:12, Andreas wrote:
>
>> ps: Maybe I'll find time to do the "fit_distance"/"fit_kernel" API in
>> one or two weeks.
>>
> As discussed earlier, I would prefer `fit_symmetric` or `fit_pairwise`
> when working with squared
On 25 March 2012 22:12, Andreas wrote:
>
> ps: Maybe I'll find time to do the "fit_distance"/"fit_kernel" API in
> one or two weeks.
As discussed earlier, I would prefer `fit_symmetric` or `fit_pairwise`
when working with squared distance / affinity / kernel matrices as
main data input.
--
Olivier
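Whatever name the method ends up with, the underlying pattern is passing a precomputed square matrix instead of raw samples; several estimators already accept this through a 'precomputed' option. A rough sketch:

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.cluster import DBSCAN
from sklearn.svm import SVC

X = np.random.RandomState(0).rand(20, 3)
y = (X[:, 0] > 0.5).astype(int)

# square distance matrix as the main data input
D = pairwise_distances(X)
db = DBSCAN(eps=0.4, min_samples=3, metric='precomputed').fit(D)

# square kernel (affinity) matrix as the main data input
K = X @ X.T
svc = SVC(kernel='precomputed').fit(K, y)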
> I am working on a summary table on clustering methods. It is not
> finished, I need to do a bit more literature review, however, I'd love
> some feedback on the current status:
> https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst
>
>
>
Thanks for starting on
Hi Gael,
Here are some suggestions regarding details of the page:
- "Hierarchical clustering -> Few clusters": I thought it was not the
best use case for these algorithms
- "Hierarchical clustering -> even cluster size": this is not true if
you consider single linkage, or even in general with Wa
On 03/25/2012 07:40 PM, Peter Prettenhofer wrote:
> 2012/3/25 Scott White:
>
>> Btw, another thing I think we should add is the ability to monitor the
>> out-of-bag estimates after each iteration and allow the fitting to be
>> terminated early. It's usually hard to guess the right number of
>>
On 25 March 2012 20:40, Gael Varoquaux wrote:
> Hi list,
>
> I am working on a summary table on clustering methods. It is not
> finished, I need to do a bit more literature review, however, I'd love
> some feedback on the current status:
> https://github.com/GaelVaroquaux/scikit-learn/blob/maste
Hi list,
I am working on a summary table on clustering methods. It is not
finished, I need to do a bit more literature review, however, I'd love
some feedback on the current status:
https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst
Cheers,
Gaël
---
On Mar 25, 2012, at 10:37 AM, Peter Prettenhofer
wrote:
> 2012/3/25 Scott White :
>> I have noticed also that GBRT can be quite slow even for small k.
>> Since GBRT fits n*k decision trees, by design, it's crucial that the
>> decision tree code be highly optimized. I did some quick performance
Yes, and with oob_estimates and monitor some of the basic building blocks are
in place but need fleshing out.
It would also be nice to be able to resume a model from where you left off if it was
terminated too early.
On Mar 25, 2012, at 10:40 AM, Peter Prettenhofer
wrote:
> 2012/3/25 Scott
2012/3/25 Scott White :
> Btw, another thing I think we should add is the ability to monitor the
> out-of-bag estimates after each iteration and allow the fitting to be
> terminated early. It's usually hard to guess the right number of
> iterations required and if one can terminate the fitting earl
2012/3/25 Scott White :
> I have noticed also that GBRT can be quite slow even for small k.
> Since GBRT fits n*k decision trees, by design, it's crucial that the
> decision tree code be highly optimized. I did some quick performance
> profiling the other day, which showed that the performance bott
Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations required and if one can terminate the fitting early based
on good oob results that
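For reference, both pieces exist in some form in today's gradient boosting estimators; a sketch of early termination driven by the out-of-bag improvements (this assumes the monitor callback of GradientBoostingClassifier.fit and subsample < 1 so that oob_improvement_ is computed):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

def early_stop(i, est, locals_):
    # called after each boosting iteration; returning True stops fitting.
    # stop once the recent out-of-bag improvement turns negative on average.
    oob = est.oob_improvement_[:i + 1]
    return i > 20 and oob[-10:].mean() < 0

clf = GradientBoostingClassifier(n_estimators=500, subsample=0.8,
                                 random_state=0)
clf.fit(X, y, monitor=early_stop)
print(len(clf.estimators_))  # number of stages actually fitted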
On 25 March 2012 18:14, Andreas wrote:
> Hi everybody.
> Rachee (?) just asked at
> https://github.com/scikit-learn/scikit-learn/issues/719 whether that would
> be a good place to start.
> Could someone please voice their opinion on whether this is a good thing to
> have?
> Oth
Hi everybody.
Rachee (?) just asked at
https://github.com/scikit-learn/scikit-learn/issues/719 whether that would
be a good place to start.
Could someone please voice their opinion on whether this is a good thing to
have?
Otherwise I would suggest doing #559, as not much has happened there recently.
Hi Satrajit,
Adding more trees should never hurt accuracy. The more, the better.
Since you have a lot of irrelevant features, I'd advise increasing
max_features in order to capture the relevant features when computing
the random splits. Otherwise, your trees will indeed fit on noise.
Best,
Gilles
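A sketch of that advice with made-up numbers, only to show where max_features enters:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# skinny data: few samples, many mostly irrelevant features
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)

# the default max_features='sqrt' would look at ~45 features per split;
# raising it makes it more likely that an informative one is among them.
clf = ExtraTreesClassifier(n_estimators=500, max_features=0.3,
                           n_jobs=-1, random_state=0).fit(X, y)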
On Mon, Mar 26, 2012 at 12:09 AM, Peter Prettenhofer
wrote:
> 1. We need to support the query id (=``qid``) field in
> ``svmlight_loader``; pair-wise approaches such as RankingSVMs need
> this information to form example pairs. My personal experience is that
> RankingSVMs do surprisingly poorly on
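The qid field did eventually get loader support; roughly, with the current API (the file name below is just a placeholder):

from sklearn.datasets import load_svmlight_file

# a ranking dataset in svmlight format with lines like
#   2 qid:1 1:0.3 2:0.7 ...
# query_id=True returns the qid column so that example pairs can be
# formed within each query only.
X, y, qid = load_svmlight_file("ranking_data.svmlight", query_id=True)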
On 25 March 2012 17:09, Peter Prettenhofer wrote:
> 2012/3/25 Olivier Grisel :
>> On 25 March 2012 12:44, Peter Prettenhofer wrote:
>>> Olivier,
>>>
>>> In my experience GBRT usually requires more base learners than random
>>> forests to get the same level of accuracy. I hardly use less th
2012/3/25 Olivier Grisel :
> On 25 March 2012 12:44, Peter Prettenhofer wrote:
>> Olivier,
>>
>> In my experience GBRT usually requires more base learners than random
>> forests to get the same level of accuracy. I hardly use less than 100.
>> Regarding the poor performance of GBRT on the oliv
This paper is about using stacking to significantly boost the multi-class accuracy of
sentiment classification: "The Importance of Neutral Examples
for Learning Sentiment". I found this paper interesting and the experimental
results a bit surprising.
On Sun, Mar 25, 2012 at 10:28 PM, Satrajit Ghos
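As an aside, a stacking ensemble of the kind being discussed can now be written with scikit-learn's StackingClassifier (a much later addition; this is a generic sketch, not the setup from the paper):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_classes=3, n_informative=10,
                           random_state=0)

# cross-validated predictions of the base learners feed a final meta-learner
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svm', LinearSVC(random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X, y)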
hi olivier,
> Good to know. Among the references from the Wikipedia article, the
> following seems particularly interesting:
>
> http://arxiv.org/abs/0911.0460
i just read that last night after this series of emails. the introduction
is nice, but the point they are trying to make is a little stra
On 25 March 2012 03:33, xinfan meng wrote:
>
>
> On Sun, Mar 25, 2012 at 9:26 AM, Olivier Grisel
> wrote:
>>
>> On 25 March 2012 01:38, xinfan meng wrote:
>> > Hi, list
>> >
>> > I am looking for a a stacking classifier implementation. It falls
>> > into
>> > the category of ensemble cl
thanks paolo, will give all of this a try.
i'll also send a pr with a section on patterns for sklearn. although this
pattern might be specific to my problem domain, having more real-world
scripts/examples that reflect such considerations might be useful to the
community.
cheers,
satra
On Sun, M
Hi Satrajit,
On Sun, Mar 25, 2012 at 3:02 PM, Satrajit Ghosh wrote:
> hi giles,
>
> when dealing with skinny matrices of the type few samples x lots of
> features what are the recommendations for extra trees in terms of max
> features and number of estimators?
as far as number of estimators (t
On 25 March 2012 15:47, Gael Varoquaux wrote:
> On Sun, Mar 25, 2012 at 03:40:52PM +0200, Olivier Grisel wrote:
>> BTW Learning-to-Rank seems to be a very important application domain
>> that we do not cover well in scikit-learn.
>
> Yes, @fabianp is working on things in this flavour currently.
On 25 March 2012 12:16, Gilles Louppe wrote:
> Hi Olivier,
>
> The higher the number of estimators, the better. The more random the
> trees (e.g., the lower max_features), the more important it usually is
> to have a large forest to decrease the variance. To me, 10 is actually
> a very low defau
On Sun, Mar 25, 2012 at 03:40:52PM +0200, Olivier Grisel wrote:
> BTW Learning-to-Rank seems to be a very important application domain
> that we do not cover well in scikit-learn.
Yes, @fabianp is working on things in this flavour currently. I think
that he needs a bit of cheering to integrate the
On 25 March 2012 12:44, Peter Prettenhofer wrote:
> Olivier,
>
> In my experience GBRT usually requires more base learners than random
> forests to get the same level of accuracy. I hardly use less than 100.
> Regarding the poor performance of GBRT on the olivetti dataset:
> multi-class GBRT fit
On Sun, Mar 25, 2012 at 3:32 PM, Paolo Losi wrote:
> You could rank features by feature importance and perform recursive feature
> limitation
s/recursive feature limitation/recursive feature elimination/
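That is, RFE driven by the forest's feature_importances_; a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=15, random_state=0)

# RFE repeatedly fits the forest, ranks features by importance and
# drops the weakest ones until n_features_to_select remain.
selector = RFE(ExtraTreesClassifier(n_estimators=100, random_state=0),
               n_features_to_select=50, step=0.1).fit(X, y)
X_reduced = selector.transform(X)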
hi giles,
when dealing with skinny matrices of the type few samples x lots of
features what are the recommendations for extra trees in terms of max
features and number of estimators?
also if a lot of the features are nuisance and most are noisy, are there
any recommendations for feature reductio
A quick follow-up on my previous email:
>> On 25 March 2012 03:49, Olivier Grisel wrote:
>>> [..]
>>>
>>> Another way to rephrase that question: what is the typical sweet spot
>>> for the dataset shape when doing classification Gradient Boosted
>>> Trees? What are reasonable values for the number
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
Regarding the poor performance of GBRT on the olivetti dataset:
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators``
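The k-trees-per-stage point is easy to check on a fitted model (a sketch, using the current estimators_ layout):

from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_digits(return_X_y=True)   # 10 classes

clf = GradientBoostingClassifier(n_estimators=50).fit(X, y)
# one tree per class per boosting stage: estimators_ has shape
# (n_estimators, n_classes), i.e. 50 x 10 = 500 trees in total.
print(clf.estimators_.shape)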
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of tre