Hi Alexandre,
Sorry for the late reply. There was a tutorial here for the last two weeks,
so I did not have time to try the new multi-core version. After this week I
will be back at work and will let you know soon.
By the way, thank you for sharing :)
On Thu, May 23, 2013 at 10:15 AM, Alexandre ABRAHAM wrote:
Hi Bao,
I haven't heard from you, so I guess it is working. FYI, I opened a PR
for this feature here:
https://github.com/scikit-learn/scikit-learn/pull/1976
Alexandre.
On Fri, May 10, 2013 at 6:26 PM, Bao Thien wrote:
> Hi Alexandre,
>
> It sounds great. I will try it and let you know soon.
Hi Ronnie,
As other people have said, Theano won't be added as a dependency to the
scikit. However, the code is fairly simple and I guess that it would not be
difficult to make it work using Theano. If you do so, you may consider
doing a PR to Theano rather than to the scikit.
Alexandre.
On Sun, May 12,
On Sun, May 12, 2013 at 01:35:07PM +0200, Alexandre ABRAHAM wrote:
> I know that the first purpose of scikit is not to handle big data, but
> would you be interested in a PR of my silhouette block implementation?
+1 for PR. I think that I would introduce a keyword argument to switch
between the 2
-parallelism.
Cheers,
-- Forwarded message --
From: Ronnie Ghose
Date: 2013/5/12
Subject: Re: [Scikit-learn-general] Out of memory when running silhouette score function
To: scikit-learn-general@lists.sourceforge.net
Theano for the parallelization?
From what I understand, your PR uses on-the-fly computation to reduce memory
usage vs. computing everything at once. Wouldn't Theano help? As in, could
you perchance 'theano-ize' the parallel calculation? I consider heavy
numerical processes to be (at least now) mostly the doma
Hi Ronnie,
I have never used Theano, could you be a little more specific? What do you
want to compute? What is your input data? Basically, all these metrics are
independent of the scikit and take numpy arrays as input, so you can use
them with any data in this format.
Now, if you want to integ
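A minimal example of that point: the metric only needs numpy arrays, so any
data in this format works (random placeholder data below).

    # Minimal sketch: sklearn.metrics.silhouette_score takes plain numpy
    # arrays, regardless of what produced them.
    import numpy as np
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 40)            # any (n_samples, n_features) array
    labels = np.random.randint(0, 5, 200)  # any integer cluster assignment
    print(silhouette_score(X, labels, metric='euclidean'))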
uhhh +1. any chance of using theano with it?
On Sun, May 12, 2013 at 7:35 AM, Alexandre ABRAHAM <
abraham.alexan...@gmail.com> wrote:
> Hey scikit people,
>
> I know that the first purpose of scikit is not to handle big data, but
> would you be interested in a PR of my silhouette block implementation?
Hey scikit people,
I know that the first purpose of scikit is not to handle big data, but would
you be interested in a PR of my silhouette block implementation? My
benchmarks have shown that it is a bit slower than the scikit one when the
data is small, but it divides memory usage by n_clusters^2. Plus i
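A minimal sketch of the block idea (not the gist code, which is only linked
later in the thread): build distance matrices one pair of clusters at a
time. With roughly balanced clusters, the largest block is about
(n / n_clusters)^2 entries instead of n^2, hence the n_clusters^2 reduction.

    # Sketch of a block silhouette. Only one inter- or intra-cluster
    # distance block is in memory at a time, never the full n x n matrix.
    # Singleton clusters are not special-cased, unlike scikit-learn.
    import numpy as np
    from sklearn.metrics import pairwise_distances

    def silhouette_block(X, labels, metric='euclidean'):
        n = X.shape[0]
        a = np.zeros(n)         # mean intra-cluster distance
        b = np.full(n, np.inf)  # mean distance to the nearest other cluster
        for ci in np.unique(labels):
            in_i = labels == ci
            Xi = X[in_i]
            d = pairwise_distances(Xi, metric=metric)  # |C_i| x |C_i| block
            if len(Xi) > 1:
                a[in_i] = d.sum(axis=1) / (len(Xi) - 1)
            for cj in np.unique(labels):
                if cj == ci:
                    continue
                # |C_i| x |C_j| block
                d = pairwise_distances(Xi, X[labels == cj], metric=metric)
                b[in_i] = np.minimum(b[in_i], d.mean(axis=1))
        return np.mean((b - a) / np.maximum(a, b))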
Hi Alexandre,
It sounds great. I will try it and let you know soon.
Regards,
T.Bao
On Fri, May 10, 2013 at 6:19 PM, Alexandre ABRAHAM <
abraham.alexan...@gmail.com> wrote:
> Bao,
>
> Sorry for the delay. I have pushed a new version of the code on the gist
> (there is now an n_jobs keyword parameter).
Bao,
Sorry for the delay. I have pushed a new version of the code on the gist
(there is now an n_jobs keyword parameter). It should use a bit more memory.
Fast bench (see main in the gist):
Scikit silhouette (113.294149s): -0.013992
Block silhouette (23.485517s): -0.013992
Block silhouette parallel
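The bench is cut off above, and how n_jobs is implemented is not shown. One
plausible reading (an assumption, not necessarily what the gist does) is to
farm the per-cluster blocks out with joblib, as scikit-learn does elsewhere.
Each worker then holds its own block, which would explain the slightly
higher memory use:

    # Hypothetical n_jobs sketch using joblib (an assumption about the
    # gist, not its actual code): one distance block per worker, so peak
    # memory grows roughly with n_jobs.
    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.metrics import pairwise_distances

    def _mean_dist_to_cluster(X, labels, cj, metric):
        # Column of mean distances from every sample to cluster cj;
        # one n x |C_j| block is materialised per call.
        return pairwise_distances(X, X[labels == cj], metric=metric).mean(axis=1)

    X = np.random.rand(1000, 40)
    labels = np.random.randint(0, 10, 1000)
    cols = Parallel(n_jobs=2)(
        delayed(_mean_dist_to_cluster)(X, labels, cj, 'euclidean')
        for cj in np.unique(labels))
    # a(i) and b(i) follow from these columns, after correcting the
    # own-cluster column for the self-distance and cluster size.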
> the dataset is clustered into 50 clusters
>
OK, so each cluster contains approximately 5K elements, which means
distance matrices of about 25,000K entries.
> I have not monitored the memory usage. But the computation time here is
> the real CPU time, not the elapsed time
>
OK.
> I only can run the
Hi Alexandre,
> I have a few questions on your experiment though:
> - how many clusters do you have (as the block method's speed and memory
> consumption depend on the number of clusters)
>
the dataset is clustered into 50 clusters
> - have you monitored memory usage ? In particular, did you
Hi Bao,
Thanks for your feedback! I am not surprised that the sampling method
saves time and gives a good approximation, especially considering the size
of your data.
I have a few questions on your experiment though:
- how many clusters do you have (as the block method's speed and memory
consumption depend on the number of clusters)
Hi Alexandre,
I ran silhouette_score_block on my dataset, and this is the result:
dataset size |X| = 260486, dimension 40, RAM 4GB
         Original Ward (whole data) (1)     Original Ward (sub_sample=50K) (2)
Trial    Silhouette Score    Time(s)        Silhouette Score    Time(s)
1st      0.19045893          6250.758648    0.189+/
Hi Alexandre,
Thank you very much for your help. This is exactly the thing that fits my
problem. Your help is very much appreciated.
I am also running the sampling method as Robert suggested.
I will try the block version and compare the results. Then I will let
all you guys know the results as soo
Hi Bao,
Sorry for the late reply. I set up some code yesterday evening but my post
got blocked because of its size. The code is really simple and I kept the
scikit formalism, so if you look at the scikit function, this should be
familiar to you.
Gist: https://gist.github.com/AlexandreAbraham/554
Bao,
To compute the silhouette distance, the scikit precomputes the matrix of
distances between the elements of X (the samples). But it is possible to do
without this matrix and compute the distance between two samples only when
it is needed. This is the most naive implementation of the silhouette. Ther
Hi Alexandre,
Thanks for your feedback. But could you please clarify what you mean by
computing the distance between samples "on the fly"? In my case, the time
requirement is not very strict. If you can make this clear to me, I think
it would be a suitable solution for my case.
Regards,
T.Bao
O
Hi Bao,
If I am not mistaken, the computation of pairwise distances is a way to
speed up the silhouette calculation and make the code simpler. It is
possible to compute the silhouette by computing the distance between
samples "on the fly". This will be very slow indeed, but no additional
memory is required.
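As a sketch, the naive on-the-fly version described above looks like this:
constant extra memory, but Python-level loops, so very slow.

    # Naive on-the-fly silhouette: no distance matrix is stored, every
    # distance is recomputed when needed. O(1) extra memory, but O(n^2)
    # recomputed distances in pure Python. Assumes at least 2 clusters;
    # singleton clusters are not special-cased.
    import numpy as np

    def silhouette_on_the_fly(X, labels):
        n = len(X)
        s = np.empty(n)
        for i in range(n):
            sums, counts = {}, {}
            for j in range(n):
                if i == j:
                    continue
                d = np.linalg.norm(X[i] - X[j])  # euclidean, on the fly
                sums[labels[j]] = sums.get(labels[j], 0.0) + d
                counts[labels[j]] = counts.get(labels[j], 0) + 1
            a = sums.get(labels[i], 0.0) / max(counts.get(labels[i], 1), 1)
            b = min(sums[c] / counts[c] for c in counts if c != labels[i])
            s[i] = (b - a) / max(a, b)
        return s.mean()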
I got it. Thank you Robert.
On Tue, May 7, 2013 at 11:31 PM, Robert Layton wrote:
> By sampling, I meant to take X% of the data and calculate the silhouette
> of just those points.
>
>
> On 8 May 2013 06:49, Bao Thien wrote:
>
>> Thank Lars for correcting my mistake about the size of memory.
>>
By sampling, I meant to take X% of the data and calculate the silhouette of
just those points.
On 8 May 2013 06:49, Bao Thien wrote:
> Thank Lars for correcting my mistake about the size of memory.
> Thank Robert for your suggestion.
>
> To be honest, first, I cluster the data using Hierarchica
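Robert's suggestion above, as a sketch: score a random subsample of the
points. scikit-learn's silhouette_score also exposes a sample_size argument
that does essentially this.

    # Sketch of the sampling approach: silhouette on a fraction of the
    # points. Keep the fraction large enough that every cluster retains
    # at least two samples.
    import numpy as np
    from sklearn.metrics import silhouette_score

    def silhouette_sampled(X, labels, fraction=0.1, seed=0):
        labels = np.asarray(labels)
        rng = np.random.RandomState(seed)
        idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
        return silhouette_score(X[idx], labels[idx])

    # Equivalent shortcut built into scikit-learn:
    # silhouette_score(X, labels, sample_size=int(0.1 * len(X)))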
Thank Lars for correcting my mistake about the size of memory.
Thank Robert for your suggestion.
To be honest, first, I cluster the data using Hierarchical Clustering (Ward
in scikit-learn). And because the data is so large, I need to provide the
connectivity matrix. And this matrix totall
sorry, my brain left me for a moment there xD
On Tue, May 7, 2013 at 4:12 PM, Bao Thien wrote:
> Thanks Ronnie,
>
> But the data size is 300K, so the memory requirement is about
> 300K x 300K ~ 90,000MB ~ 90GB. It is impossible to upgrade the RAM :(.
> Do you have any other suggestion?
>
> Regards,
Hi Bao,
The Silhouette Function hasn't been written with this type of scalability
in mind.
It requires a pairwise distance matrix, which is prohibitive (as others
have said).
If the number of clusters is low, sampling should give you a good
approximation of the silhouette score, although I can't
2013/5/7 Bao Thien:
> But the data size is 300K, so the memory requirement is about
> 300K x 300K ~ 90,000MB ~ 90GB. It is impossible to upgrade the RAM :(.
> Do you have any other suggestion?
Since a float64 is eight bytes large, you'd actually need 670GB.
No, this is a deficiency in silhouette_sco
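The arithmetic behind that correction:

    # Dense float64 distance matrix for n = 300000 samples:
    n = 300000
    print(n * n * 8 / 2.0 ** 30)  # ~670 GiB; the 90GB figure assumed 1 byte/entry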
Thanks Ronnie,
But the data size is 300K, so the memory requirement is about
300K x 300K ~ 90,000MB ~ 90GB. It is impossible to upgrade the RAM :(.
Do you have any other suggestion?
Regards,
On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose wrote:
> can you just get more ram?
> On May 7, 2013 2:4
can you just get more ram?
On May 7, 2013 2:42 PM, "Bao Thien" wrote:
> I ran a clustering algorithm and want to evaluate the result using the
> silhouette score in scikit-learn. But in scikit-learn, it needs to
> calculate the distance matrix: distances = pairwise_distances(X,
> metric=me
I ran a clustering algorithm and want to evaluate the result using the
silhouette score in scikit-learn. But in scikit-learn, it needs to
calculate the distance matrix: distances = pairwise_distances(X,
metric=metric, **kwds)
Since my data is on the order of 300K samples and my memory is 2GB,