Thanks, Lars, for correcting my mistake about the memory size.
Thanks, Robert, for your suggestion.
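
For reference, here is a rough sketch of the memory estimate for a dense pairwise distance matrix. It assumes 8-byte (float64) entries, which is my assumption about the default dtype; the helper name pairwise_matrix_bytes is only illustrative and not part of scikit-learn.

```python
def pairwise_matrix_bytes(n_samples, bytes_per_entry=8):
    """Bytes needed for a dense n_samples x n_samples distance matrix."""
    return n_samples * n_samples * bytes_per_entry


n = 300_000  # approximate number of samples in my data
total = pairwise_matrix_bytes(n)

# 300,000^2 = 9e10 entries, times 8 bytes each
print(f"{total / 1e9:.0f} GB")  # 720 GB
```

With 4-byte floats the figure halves, but it is still far beyond 2GB.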

To give more context: I first cluster the data with hierarchical clustering
(Ward in scikit-learn). Because the data set is so large, I need to provide
a connectivity matrix, and this matrix depends entirely on the parameter k
(the number of neighbors of each sample). Since we don't know k in advance,
I need to evaluate the clusterings produced by different values of k and
then decide which value of k works well for my data.
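
A minimal sketch of that procedure on toy data is below. It assumes a recent scikit-learn, where Ward clustering is available as AgglomerativeClustering(linkage="ward") rather than the older Ward class. The helper ward_labels, the toy array X, and the candidate values of k are only illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Toy stand-in for the real data (which has about 300K samples).
rng = np.random.RandomState(0)
X = rng.rand(500, 3)


def ward_labels(X, n_neighbors, n_clusters):
    """Ward clustering constrained by a k-nearest-neighbors graph."""
    # Sparse graph: each sample is linked to its n_neighbors nearest samples.
    connectivity = kneighbors_graph(X, n_neighbors=n_neighbors,
                                    include_self=False)
    # Make the graph symmetric so the merge constraints are undirected.
    connectivity = 0.5 * (connectivity + connectivity.T)
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward",
                                    connectivity=connectivity)
    return model.fit_predict(X)


# One clustering per candidate k; each one can then be scored and compared.
labels_by_k = {k: ward_labels(X, k, n_clusters=5) for k in (5, 10, 20)}
```

The connectivity graph stays sparse (about k entries per sample), so this step does not need the full distance matrix.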

As far as I can see, the Silhouette Score is the only score in scikit-learn
that is suitable for evaluating a clustering without labels; all the other
scores need ground-truth labels. If you know of another score in
scikit-learn that can be used in my case, please let me know!
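
For anyone reading this with a newer scikit-learn release than the one discussed in this thread: sklearn.metrics now also has calinski_harabasz_score and davies_bouldin_score, which need no ground-truth labels. As far as I understand, both work from cluster centroids rather than the full pairwise matrix. A small sketch on toy data (the blobs and the KMeans step are only stand-ins for a real clustering):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Toy data and a toy clustering, only to have labels to score.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # lower is better
print(f"Calinski-Harabasz: {ch:.1f}, Davies-Bouldin: {db:.3f}")
```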

P.S. Robert: Sorry, but I did not understand what you meant by "sampling
should give you a good approximation of the silhouette score". In my case, I
can cut the tree and get about 50 clusters. Would you mind explaining in more
detail how to do the sampling? I don't need a strict mathematical guarantee,
just a way to estimate the score and choose the value of k. Thank you
in advance for your help.
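
One possible reading of "sampling", sketched below, is the sample_size parameter of silhouette_score: the score is computed on a random subset of the samples, so only a sample_size x sample_size distance matrix is needed. This is my interpretation and not necessarily what Robert had in mind. The helper sampled_silhouette and the toy data are illustrative; averaging over several random subsets is my own addition to get a feel for the variance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data and a toy clustering standing in for the real labels.
X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)


def sampled_silhouette(X, labels, sample_size=1000, n_repeats=5, seed=0):
    """Mean and std of the silhouette score over random subsamples."""
    scores = [
        silhouette_score(X, labels, sample_size=sample_size,
                         random_state=seed + i)
        for i in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))


mean_score, std_score = sampled_silhouette(X, labels)
print(f"silhouette ~ {mean_score:.3f} +/- {std_score:.3f}")
```

With about 50 clusters the subset has to be large enough that every cluster is still represented by several samples, otherwise the estimate becomes noisy.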

Regards,

T.Bao


On Tue, May 7, 2013 at 10:19 PM, Robert Layton <[email protected]> wrote:

> Hi Bao,
>
> The Silhouette Function hasn't been written with this type of scalability
> in mind.
> It requires a pairwise distance matrix, which is prohibitive (as others
> have said).
>
> If the number of clusters is low, sampling should give you a good
> approximation of the silhouette score, although I can't offer any
> mathematical guarantees on this.
>
> Thanks,
>
> - Robert
>
>
>
>
> On 8 May 2013 06:12, Bao Thien <[email protected]> wrote:
>
>> Thanks Ronnie,
>>
>> But the data size is 300K, so the memory requirement is about
>> 300K x 300K ~ 90.000MB ~ 90GB. It is impossible to upgrade the RAM :(.
>> Do you have any other suggestions?
>>
>> Regards,
>>
>>
>>
>>
>> On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose <[email protected]> wrote:
>>
>>> ....can you just get more ram?
>>> On May 7, 2013 2:42 PM, "Bao Thien" <[email protected]> wrote:
>>>
>>>> I ran a clustering algorithm and want to evaluate the result using the
>>>> silhouette score in scikit-learn. However, scikit-learn needs to
>>>> compute the full distance matrix: distances = pairwise_distances(X,
>>>> metric=metric, **kwds)
>>>>
>>>> My data has on the order of 300K samples and my machine has 2GB of
>>>> memory, so this runs out of memory.
>>>>
>>>> Does anyone know how to overcome this problem? Thank you for
>>>> your help.
>>>>
>>>>
>>>> --
>>>> Nguyen Thien Bao
>>>>
>>>> NeuroInformatics Laboratory (NILab),
>>>> Fondazione Bruno Kessler (FBK), Trento, Italy
>>>> Centro Interdipartimentale Mente e Cervello (CIMeC)
>>>> Universit`a degli Studi di Trento, Italy
>>>> Email: [email protected]  or  [email protected]
>>>> Cellphone: +39.345.293.1006 (Italy)
>>>> Cellphone: +84.996.352.452 (VietNam)
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their applications. This 200-page book is written by three acclaimed
>>>> leaders in the field. The early access version is available now.
>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>
>>
>
>
> --
>
> Public key at: http://pgp.mit.edu/ Search for this email address and
> select the key from "2011-08-19" (key id: 54BA8735)
>
>


-- 
Nguyen Thien Bao

NeuroInformatics Laboratory (NILab),
Fondazione Bruno Kessler (FBK), Trento, Italy
Centro Interdipartimentale Mente e Cervello (CIMeC)
Universit`a degli Studi di Trento, Italy
Email: [email protected]  or  [email protected]
Cellphone: +39.345.293.1006 (Italy)
Cellphone: +84.996.352.452 (VietNam)
