Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-23 Thread Bao Thien
Hi Alexandre, Sorry for the late reply. There was a tutorial here for the last two weeks, so I have not had time to try the new multi-core version. After this week I will be back at work and will let you know soon. By the way, thank you for sharing :) On Thu, May 23, 2013 at 10:15 AM, Alexand

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-23 Thread Alexandre ABRAHAM
Hi Bao, I haven't heard from you so I guess that it is working. FYI, I opened a PR for this feature here: https://github.com/scikit-learn/scikit-learn/pull/1976 Alexandre. On Fri, May 10, 2013 at 6:26 PM, Bao Thien wrote: > Hi Alexandre, > > It sounds great. I will try it and let you kn

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-13 Thread Alexandre ABRAHAM
Hi Ronnie, As others have said, Theano won't be added as a dependency to the scikit. However, the code is fairly simple and I guess that it would not be difficult to make it work using Theano. If you do so, you may consider doing a PR to Theano rather than the scikit. Alexandre. On Sun, May 12,

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Gael Varoquaux
On Sun, May 12, 2013 at 01:35:07PM +0200, Alexandre ABRAHAM wrote: > I know that the first purpose of scikit is not to handle big data but > would you be interested in a PR of my silhouette block implementation? +1 for PR. I think that I would introduce a keyword argument to switch between the 2

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Matthieu Brucher
-parallelism. Cheers, -- Forwarded message -- From: Ronnie Ghose Date: 2013/5/12 Subject: Re: [Scikit-learn-general] Out of memory when running silhouette score function To: scikit-learn-general@lists.sourceforge.net theano for the parallelization? from what i understand your PR uses on

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Ronnie Ghose
theano for the parallelization? from what i understand your PR uses on-the-fly computation to reduce memory usage vs all at once. Wouldn't Theano help? As in could you per chance 'theano-ize' the parallel calculation maybe? I consider heavy numerical processes to be (at least now) mostly the doma

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Alexandre ABRAHAM
Hi Ronnie, I have never used Theano, could you be a little more specific? What do you want to compute? What is your input data? Basically, all these metrics are independent of the scikit and take numpy arrays as input, so you can use them with any data in this format. Now, if you want to integ

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Ronnie Ghose
uhhh +1. any chance of using theano with it? On Sun, May 12, 2013 at 7:35 AM, Alexandre ABRAHAM < abraham.alexan...@gmail.com> wrote: > Hey scikit people, > > I know that the first purpose of scikit is not to handle big data but > would you be interested in a PR of my silhouette block implementa

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-12 Thread Alexandre ABRAHAM
Hey scikit people, I know that the first purpose of scikit is not to handle big data but would you be interested in a PR of my silhouette block implementation? My benchmarks show that it is a bit slower than the scikit one when the data is small, but it divides memory usage by n_cluster ^ 2. Plus i

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Bao Thien
Hi Alexandre, It sounds great. I will try it and let you know soon. Regards, T.Bao On Fri, May 10, 2013 at 6:19 PM, Alexandre ABRAHAM < abraham.alexan...@gmail.com> wrote: > Bao, > > Sorry for the delay. I have pushed a new version of the code to the gist > (there is now an n_jobs keyword p

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Alexandre ABRAHAM
Bao, Sorry for the delay. I have pushed a new version of the code to the gist (there is now an n_jobs keyword parameter). It should use a bit more memory. Quick benchmark (see main in the gist): Scikit silhouette (113.294149s): -0.013992 Block silhouette (23.485517s): -0.013992 Block silhouette parallel

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Alexandre ABRAHAM
the dataset is clustered into 50 clusters > OK, so each cluster contains approximately 5K elements, which means distance matrices of about 25 000K entries. > I have not monitored the memory usage. But the computation time here is > the real CPU time, not the elapsed time > OK. > I only can run the
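The block-size estimate in this exchange can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the sample count 260486 is the dataset size Bao reports elsewhere in the thread, the 50 clusters are assumed to be of equal size (they are only "approximately" 5K each), and distances are assumed to be stored as float64.

```python
# Rough comparison of one per-cluster distance block against the full matrix.
n_samples = 260_486        # dataset size reported by Bao in this thread
n_clusters = 50            # number of clusters reported by Bao
bytes_per_float64 = 8      # assumption: distances stored as float64

# Assumption: all clusters have the same size.
per_cluster = n_samples // n_clusters

block_entries = per_cluster ** 2   # one cluster-vs-cluster distance block
full_entries = n_samples ** 2      # the full pairwise distance matrix

block_mb = block_entries * bytes_per_float64 / 10 ** 6
full_gb = full_entries * bytes_per_float64 / 10 ** 9
ratio = full_entries / block_entries   # close to n_clusters ** 2

print(f"per cluster: {per_cluster} samples, {block_entries:.2e} entries")
print(f"one block: {block_mb:.0f} MB, full matrix: {full_gb:.0f} GB")
print(f"full / block ratio: {ratio:.0f}")
```

With equal-sized clusters the ratio comes out close to n_clusters squared, which matches the memory reduction Alexandre describes for the block implementation.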

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Bao Thien
Hi Alexandre, I have a few questions on your experiment though: > - how many clusters do you have (as the block method speed and memory > consumption depend on the number of clusters) > the dataset is clustered into 50 clusters > - have you monitored memory usage ? In particular, did you

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Alexandre ABRAHAM
Hi Bao, Thanks for your feedback ! I am not surprised that the sampling method saves time and gives a good approximation, especially considering the size of your data. I have a few questions on your experiment though: - how many clusters do you have (as the block method speed and memory consumpti

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-10 Thread Bao Thien
Hi Alexandre, I ran silhouette_score_block on my dataset, and this is the result: dataset size |X| = 260486, dimension 40, RAM 4GB Trial Original Ward (whole data)(1) *Original Ward (sub_sample=50K)(2)* Silhouette Score Time(s) Silhouette Score Time(s) 1st 0.19045893 6250.758648 0.189+/

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-09 Thread Bao Thien
Hi Alexandre, Thank you very much for your help. This is exactly what fits my problem. Your help is much appreciated. I am also running the sampling method as Robert suggested. I will try the block version and compare the results. Then I will let you all know the results as soo

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-09 Thread Alexandre ABRAHAM
Hi Bao, Sorry for the late reply, I set up some code yesterday evening and my post got blocked because of its size. The code is really simple and I kept the scikit formalism, so if you have looked at the scikit function, this should be familiar to you. Gist : https://gist.github.com/AlexandreAbraham/554

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-08 Thread Alexandre ABRAHAM
Bao, To compute the silhouette, the scikit precomputes the matrix of distances between the elements of X (samples). But it is possible to do without this matrix and compute the distance between two samples only when it's needed. This is the most naive implementation of the silhouette. Ther
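A minimal sketch of the naive "on the fly" idea described in this message, in pure Python. It is not the code from the gist or from scikit-learn: the function name silhouette_on_the_fly is hypothetical, Euclidean distance is assumed, and singleton clusters are given a silhouette of 0 by convention. Each distance is computed when it is needed and then discarded, so no n x n matrix is ever stored.

```python
import math
from collections import Counter


def silhouette_on_the_fly(X, labels):
    """Mean silhouette computed without storing a pairwise distance matrix.

    Hypothetical helper for illustration only. It performs O(n^2) distance
    evaluations but keeps only one running sum per cluster in memory.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, (xi, li) in enumerate(zip(X, labels)):
        if counts[li] == 1:
            continue  # singleton cluster: contributes 0 by convention
        # Sum of distances from sample i to the members of each cluster.
        sums = dict.fromkeys(counts, 0.0)
        for j, (xj, lj) in enumerate(zip(X, labels)):
            if i != j:
                sums[lj] += math.dist(xi, xj)  # computed, used, discarded
        a = sums[li] / (counts[li] - 1)  # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in counts if c != li)  # nearest other cluster
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(X)


# Two well-separated pairs of points should give a score close to 1.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(silhouette_on_the_fly(X, labels))
```

The trade-off is the one discussed in the thread: memory stays flat, but the double loop is slow for large n.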

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-08 Thread Bao Thien
Hi Alexandre, Thanks for your feedback. Could you please clarify what you mean by computing the distance between samples "on the fly"? In my case, running time is not a serious constraint. If you can explain this, I think it would be a suitable solution for my case. Regards, T.Bao O

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-08 Thread Alexandre ABRAHAM
Hi Bao, If I am not mistaken, precomputing the pairwise distances is a way to speed up the silhouette computation and to make the code simpler. It is possible to compute the silhouette by computing the distance between samples "on the fly". This will be very slow indeed but no additional memory is required.

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Bao Thien
I got it. Thank you Robert. On Tue, May 7, 2013 at 11:31 PM, Robert Layton wrote: > By sampling, I meant to take X% of the data and calculate the silhouette > of just those points. > > > On 8 May 2013 06:49, Bao Thien wrote: > >> Thank Lars for correcting my mistake about the size of memory. >>

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Robert Layton
By sampling, I meant to take X% of the data and calculate the silhouette of just those points. On 8 May 2013 06:49, Bao Thien wrote: > Thank Lars for correcting my mistake about the size of memory. > Thank Robert for your suggestion. > > To be honest, first, I cluster the data using Hierarchica
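One way to sketch the sampling suggestion above in pure Python: draw a random fraction of the points together with their labels, then compute the silhouette on that subset only. The helper name subsample and its parameters are hypothetical. scikit-learn's silhouette_score also accepts a sample_size argument that serves the same purpose.

```python
import random


def subsample(X, labels, fraction, seed=0):
    """Return a random fraction of the samples with their matching labels.

    Hypothetical helper for illustration; the result can be passed to any
    silhouette implementation in place of the full data set.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n_keep = min(len(X), max(2, int(len(X) * fraction)))
    idx = sorted(rng.sample(range(len(X)), n_keep))
    return [X[i] for i in idx], [labels[i] for i in idx]


# Keep 10% of a toy data set of 100 one-dimensional points.
X = [(float(i),) for i in range(100)]
labels = [i % 2 for i in range(100)]
X_small, labels_small = subsample(X, labels, fraction=0.1)
print(len(X_small), len(labels_small))
```

Repeating the draw with several seeds gives a spread of scores, which indicates how stable the approximation is.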

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Bao Thien
Thanks to Lars for correcting my mistake about the size of memory. Thanks to Robert for your suggestion. To be honest, I first cluster the data using Hierarchical Clustering (Ward in scikit-learn). And because the data is too large, I need to provide the connectivity matrix. And this matrix totall

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Ronnie Ghose
sorry, my brain left me for a moment there xD On Tue, May 7, 2013 at 4:12 PM, Bao Thien wrote: > Thanks Ronnie, > > But the data size is 300K, then the memory requirement is about > 300Kx300K~90.000MB~90GB. It is impossible to upgrade the ram :(. > Do you have any other suggestion? > > Regards,

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Robert Layton
Hi Bao, The Silhouette Function hasn't been written with this type of scalability in mind. It requires a pairwise distance matrix, which is prohibitive (as others have said). If the number of clusters is low, sampling should give you a good approximation of the silhouette score, although I can't

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Lars Buitinck
2013/5/7 Bao Thien : > But the data size is 300K, then the memory requirement is about > 300Kx300K~90.000MB~90GB. It is impossible to upgrade the ram :(. > Do you have any other suggestion? Since a float64 is eight bytes large, you'd actually need 670GB. No, this is a deficiency in silhouette_sco
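The memory estimate in this reply can be reproduced with a short calculation. It is a back-of-the-envelope sketch that assumes exactly 300,000 samples and a dense float64 matrix; the roughly 670 figure corresponds to binary gigabytes (GiB).

```python
# Memory needed for a dense n x n float64 pairwise distance matrix.
n_samples = 300_000        # "order of 300K" from the original post
bytes_per_float64 = 8

n_entries = n_samples ** 2                  # full n x n matrix
total_bytes = n_entries * bytes_per_float64

gb = total_bytes / 10 ** 9    # decimal gigabytes
gib = total_bytes / 2 ** 30   # binary gigabytes (GiB)

print(f"{n_entries:.1e} entries, about {gb:.0f} GB or {gib:.1f} GiB")
```

The 90GB figure in the quoted message matches one byte per entry, which is why the requirement grows by a factor of eight once float64 storage is taken into account.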

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Bao Thien
Thanks Ronnie, But the data size is 300K, so the memory requirement is about 300Kx300K~90.000MB~90GB. It is impossible to upgrade the RAM :(. Do you have any other suggestion? Regards, On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose wrote: > can you just get more ram? > On May 7, 2013 2:4

Re: [Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Ronnie Ghose
can you just get more ram? On May 7, 2013 2:42 PM, "Bao Thien" wrote: > I run a clustering algorithm and want to evaluate the result by using > silhouette score in scikit-learn. But in the scikit-learn, it needs to > calculate the distance matrix: distances = pairwise_distances(X, > metric=me

[Scikit-learn-general] Out of memory when running silhouette score function

2013-05-07 Thread Bao Thien
I ran a clustering algorithm and want to evaluate the result using the silhouette score in scikit-learn. However, scikit-learn needs to compute the full distance matrix: distances = pairwise_distances(X, metric=metric, **kwds) Because my data is on the order of 300K samples and my memory is 2GB,