By sampling, I meant taking X% of the data and calculating the silhouette
score on just those points.
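
A minimal sketch of that idea, assuming X is a NumPy array and labels holds the
cluster assignments. The helper name `sampled_silhouette` and its parameters
`fraction`, `n_repeats` and `seed` are made up for illustration. Averaging over
a few subsamples only reduces the noise of the estimate; it gives no guarantee.
scikit-learn's `silhouette_score` also accepts `sample_size` and `random_state`
arguments that draw a single subsample internally.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def sampled_silhouette(X, labels, fraction=0.05, n_repeats=3, seed=0):
    """Estimate the silhouette score from random subsamples (illustrative helper)."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    # The pairwise distance matrix is only n_sample x n_sample.
    n_sample = max(2, int(fraction * X.shape[0]))
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(X.shape[0], size=n_sample, replace=False)
        # Raises ValueError if the subsample happens to contain a single
        # cluster; use a larger fraction if that occurs.
        scores.append(silhouette_score(X[idx], labels[idx]))
    return float(np.mean(scores))
```

With 300K points and fraction=0.02, each distance matrix covers 6,000 points
instead of 300,000.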
On 8 May 2013 06:49, Bao Thien <[email protected]> wrote:
> Thanks to Lars for correcting my mistake about the memory size.
> Thanks to Robert for the suggestion.
>
> To be honest, I first cluster the data using hierarchical clustering
> (Ward in scikit-learn). Because the data set is so large, I need to
> provide a connectivity matrix, and this matrix depends entirely on the
> parameter k (the number of neighbors of each sample). Since we don't know
> k in advance, I need to evaluate the clusterings produced by different
> values of k and then decide which value of k works well for my data.
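
The procedure described above could be sketched roughly as follows. This is
only an illustration under assumptions: it uses `AgglomerativeClustering` with
Ward linkage from current scikit-learn rather than the `Ward` class mentioned
in the thread, builds the connectivity with `kneighbors_graph`, and scores each
clustering on a subsample. The function name `score_connectivity_choices` and
its parameters are hypothetical.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph


def score_connectivity_choices(X, k_values, n_clusters, sample_size=1000, seed=0):
    """Return {k: sampled silhouette} for Ward clustering on a k-NN graph."""
    scores = {}
    for k in k_values:
        # Sparse k-nearest-neighbour graph that restricts which merges are allowed.
        connectivity = kneighbors_graph(X, n_neighbors=k, include_self=False)
        model = AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward", connectivity=connectivity
        )
        labels = model.fit_predict(X)
        # Score on a random subsample so the distance matrix stays small.
        scores[k] = silhouette_score(
            X, labels, sample_size=min(sample_size, X.shape[0]), random_state=seed
        )
    return scores
```

The k with the highest sampled score is then a candidate, keeping in mind that
the score is a noisy estimate.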
>
> In scikit-learn, only the Silhouette Score seems suitable for my case;
> all the other clustering scores need ground-truth labels. If you know of
> another score in scikit-learn that could be used here, please let me
> know!
>
> P.S. Robert: Sorry, but I did not understand what you meant by "sampling
> should give you a good approximation of the silhouette score". In my
> case, I can cut the tree and get about 50 clusters. Would you mind
> explaining in more detail how to do the 'sampling'? I don't need a strict
> mathematical guarantee, just a way to estimate the score and choose the
> value of k. Thank you in advance for your help.
>
> Regards,
>
> T.Bao
>
>
> On Tue, May 7, 2013 at 10:19 PM, Robert Layton <[email protected]> wrote:
>
>> Hi Bao,
>>
>> The Silhouette Function hasn't been written with this type of scalability
>> in mind.
>> It requires a pairwise distance matrix, which is prohibitive (as others
>> have said).
>>
>> If the number of clusters is low, sampling should give you a good
>> approximation of the silhouette score, although I can't offer any
>> mathematical guarantees on this.
>>
>> Thanks,
>>
>> - Robert
>>
>>
>>
>>
>> On 8 May 2013 06:12, Bao Thien <[email protected]> wrote:
>>
>>> Thanks Ronnie,
>>>
>>> But the data size is 300K, so the memory requirement is about
>>> 300K x 300K ~ 90,000 MB ~ 90 GB. It is impossible to upgrade the RAM :(.
>>> Do you have any other suggestions?
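
As a rough check on the estimate above: 300K x 300K is 9e10 entries, so 90 GB
corresponds to one byte per entry. Assuming the distance matrix is stored as
8-byte floats (an assumption; the thread does not state the dtype), the dense
matrix is considerably larger. `dense_matrix_gb` is a hypothetical helper:

```python
def dense_matrix_gb(n_samples, bytes_per_entry=8):
    """Size in GB (1e9 bytes) of a dense n_samples x n_samples matrix."""
    return n_samples * n_samples * bytes_per_entry / 1e9


print(dense_matrix_gb(300_000))      # 8-byte floats
print(dense_matrix_gb(300_000, 4))   # 4-byte floats
print(dense_matrix_gb(6_000))        # a 2% subsample of 300K points
```

The last line shows why subsampling helps: the cost grows with the square of
the number of points scored.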
>>>
>>> Regards,
>>>
>>>
>>>
>>>
>>> On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose <[email protected]> wrote:
>>>
>>>> ....can you just get more ram?
>>>> On May 7, 2013 2:42 PM, "Bao Thien" <[email protected]> wrote:
>>>>
>>>>> I ran a clustering algorithm and want to evaluate the result using
>>>>> the silhouette score in scikit-learn. However, scikit-learn needs
>>>>> to compute the full distance matrix: distances = pairwise_distances(X,
>>>>> metric=metric, **kwds)
>>>>>
>>>>> My data has on the order of 300K samples and my machine has 2GB of
>>>>> memory, so the computation runs out of memory.
>>>>>
>>>>> Does anyone know how to overcome this problem? Thank you for
>>>>> your help.
>>>>>
>>>>>
>>>>> --
>>>>> Nguyen Thien Bao
>>>>>
>>>>> NeuroInformatics Laboratory (NILab),
>>>>> Fondazione Bruno Kessler (FBK), Trento, Italy
>>>>> Centro Interdipartimentale Mente e Cervello (CIMeC)
>>>>> Università degli Studi di Trento, Italy
>>>>> Email: [email protected] or [email protected]
>>>>> Cellphone: +39.345.293.1006 (Italy)
>>>>> Cellphone: +84.996.352.452 (VietNam)
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>> their applications. This 200-page book is written by three acclaimed
>>>>> leaders in the field. The early access version is available now.
>>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>
>>
>> --
>>
>> Public key at: http://pgp.mit.edu/ Search for this email address and
>> select the key from "2011-08-19" (key id: 54BA8735)
>>
>>
>
>