Hi Bao,
If I am not mistaken, precomputing the pairwise distances is only a way to
speed up the silhouette computation and keep the code simpler. It is also
possible to compute the silhouette by computing the distances between samples
"on the fly". That will be very slow, but it requires no additional memory.
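
To make the "on the fly" idea concrete, here is a rough sketch. It is only an illustration, not the scikit-learn implementation. It assumes Euclidean distance and NumPy, and the names silhouette_on_the_fly and chunk_size are made up for this example. Distances are computed one block of rows at a time, so only a chunk_size x n_samples block is held in memory at once.

```python
import numpy as np


def silhouette_on_the_fly(X, labels, chunk_size=64):
    """Mean silhouette (Euclidean) without storing the n x n distance matrix."""
    X = np.asarray(X, dtype=float)
    # Map arbitrary labels to 0..n_clusters-1.
    classes, y = np.unique(labels, return_inverse=True)
    n, n_clusters = X.shape[0], classes.size
    if not 2 <= n_clusters <= n - 1:
        raise ValueError("need between 2 and n_samples - 1 clusters")
    counts = np.bincount(y, minlength=n_clusters)
    onehot = np.eye(n_clusters)[y]  # (n, n_clusters) cluster membership
    sq_norms = (X ** 2).sum(axis=1)
    total = 0.0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        own = y[start:stop]
        # Distances from the samples in this chunk to every sample.
        d2 = sq_norms[start:stop, None] - 2.0 * X[start:stop] @ X.T + sq_norms[None, :]
        dist = np.sqrt(np.maximum(d2, 0.0))
        dist[rows, start + rows] = 0.0  # exact zero for self-distances
        # Sum of distances from each chunk sample to each cluster.
        sums = dist @ onehot
        own_count = counts[own]
        # a: mean distance to the other members of the sample's own cluster.
        a = sums[rows, own] / np.maximum(own_count - 1, 1)
        # b: smallest mean distance to any other cluster.
        means = sums / counts
        means[rows, own] = np.inf
        b = means.min(axis=1)
        denom = np.maximum(a, b)
        s = (b - a) / np.where(denom > 0, denom, 1.0)
        s[own_count == 1] = 0.0  # convention: singleton clusters score 0
        total += s.sum()
    return total / n
```

With chunk_size=1 this is the literal sample-by-sample version; larger chunks trade memory for speed. The total work is still quadratic in the number of samples, so it stays slow for 300K points.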
Alexandre.
On Tue, May 7, 2013 at 11:41 PM, Bao Thien <[email protected]> wrote:
> I got it. Thank you Robert.
>
>
> On Tue, May 7, 2013 at 11:31 PM, Robert Layton <[email protected]> wrote:
>
>> By sampling, I meant to take X% of the data and calculate the silhouette
>> of just those points.
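
One way to try this is the sample_size argument of silhouette_score, which scores a random subset of the data. A minimal sketch follows; the helper name sampled_silhouette, the 2% fraction and the number of repeats are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def sampled_silhouette(X, labels, fraction=0.02, n_repeats=3, seed=0):
    """Mean and spread of the silhouette over a few random subsamples."""
    n_samples = X.shape[0]
    sample_size = max(2, int(fraction * n_samples))
    scores = [
        # Each call draws its own random subset of `sample_size` points.
        silhouette_score(X, labels, sample_size=sample_size, random_state=seed + r)
        for r in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))
```

Repeating with different seeds gives a rough idea of how much the estimate varies. Each subsample must still contain at least two clusters, so very small fractions can fail when some clusters are tiny.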
>>
>>
>> On 8 May 2013 06:49, Bao Thien <[email protected]> wrote:
>>
>>> Thanks, Lars, for correcting my mistake about the memory size.
>>> Thanks, Robert, for your suggestion.
>>>
>>> To be honest, I first cluster the data using hierarchical clustering
>>> (Ward in scikit-learn). Because the data set is so large, I need to
>>> provide a connectivity matrix, and this matrix depends entirely on the
>>> parameter k (the number of neighbors of each sample). Since we don't know
>>> k in advance, I need to evaluate the clusterings obtained with different
>>> values of k and then decide which value of k works well for my data.
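
The loop over k described here could look roughly like the sketch below. It assumes a current scikit-learn release, where Ward clustering is available as AgglomerativeClustering(linkage="ward"); the helper name score_k_values and its defaults are hypothetical. The silhouette is estimated on a random subsample to keep memory use low.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph


def score_k_values(X, k_values, n_clusters, sample_size=1000, seed=0):
    """Return {k: sampled silhouette} for Ward clustering on a k-NN graph."""
    results = {}
    for k in k_values:
        # Sparse k-nearest-neighbors graph used as the connectivity constraint.
        connectivity = kneighbors_graph(X, n_neighbors=k, include_self=False)
        model = AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward", connectivity=connectivity
        )
        labels = model.fit_predict(X)
        # Estimate the silhouette on a random subset of the points.
        results[k] = silhouette_score(
            X, labels, sample_size=min(sample_size, X.shape[0]), random_state=seed
        )
    return results
```

Using the same seed for every k means each clustering is scored on the same subset of points, which makes the scores easier to compare.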
>>>
>>> In scikit-learn, only the silhouette score seems suitable for my case;
>>> all the other clustering scores need ground-truth labels. If you know of
>>> another score in scikit-learn that could be used here, please let me
>>> know!
>>>
>>> P.S. Robert: Sorry, but I did not understand what you meant by "sampling
>>> should give you a good approximation of the silhouette score". In my
>>> case, I can cut the tree and get about 50 clusters. Would you mind
>>> explaining in more detail how to do the sampling? I don't need a strict
>>> mathematical guarantee, just a way to estimate the score and choose the
>>> value of k. Thank you in advance for your help.
>>>
>>> Regards,
>>>
>>> T.Bao
>>>
>>>
>>> On Tue, May 7, 2013 at 10:19 PM, Robert Layton
>>> <[email protected]> wrote:
>>>
>>>> Hi Bao,
>>>>
>>>> The Silhouette Function hasn't been written with this type of
>>>> scalability in mind.
>>>> It requires a pairwise distance matrix, which is prohibitive (as others
>>>> have said).
>>>>
>>>> If the number of clusters is low, sampling should give you a good
>>>> approximation of the silhouette score, although I can't offer any
>>>> mathematical guarantees on this.
>>>>
>>>> Thanks,
>>>>
>>>> - Robert
>>>>
>>>>
>>>>
>>>>
>>>> On 8 May 2013 06:12, Bao Thien <[email protected]> wrote:
>>>>
>>>>> Thanks Ronnie,
>>>>>
>>>>> But the data size is 300K, so the memory requirement is about
>>>>> 300K x 300K ~ 90,000 MB ~ 90 GB. Upgrading the RAM is not feasible :(.
>>>>> Do you have any other suggestions?
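
As a quick check of this estimate: 300K x 300K is 9 x 10^10 matrix entries, not bytes. Assuming each distance is stored as an 8-byte float (an assumption about the dtype; the helper name below is made up), the dense matrix is considerably larger than 90 GB:

```python
def dense_distance_matrix_bytes(n_samples, bytes_per_value=8):
    """Bytes needed to store a dense n_samples x n_samples distance matrix."""
    return n_samples * n_samples * bytes_per_value


n = 300_000
print(dense_distance_matrix_bytes(n) / 1e9, "GB with 8-byte floats")   # 720.0
print(dense_distance_matrix_bytes(n, 4) / 1e9, "GB with 4-byte floats")  # 360.0
```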
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> ....can you just get more ram?
>>>>>> On May 7, 2013 2:42 PM, "Bao Thien" <[email protected]> wrote:
>>>>>>
>>>>>>> I ran a clustering algorithm and want to evaluate the result using
>>>>>>> the silhouette score in scikit-learn. However, scikit-learn needs
>>>>>>> to compute the full distance matrix:
>>>>>>> distances = pairwise_distances(X, metric=metric, **kwds)
>>>>>>>
>>>>>>> My data has on the order of 300K samples and my machine has 2 GB of
>>>>>>> memory, so the computation runs out of memory.
>>>>>>>
>>>>>>> Does anyone know how to overcome this problem? Thank you for your
>>>>>>> help.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nguyen Thien Bao
>>>>>>>
>>>>>>> NeuroInformatics Laboratory (NILab),
>>>>>>> Fondazione Bruno Kessler (FBK), Trento, Italy
>>>>>>> Centro Interdipartimentale Mente e Cervello (CIMeC)
>>>>>>> Università degli Studi di Trento, Italy
>>>>>>> Email: [email protected] or [email protected]
>>>>>>> Cellphone: +39.345.293.1006 (Italy)
>>>>>>> Cellphone: +84.996.352.452 (VietNam)
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>>>> their applications. This 200-page book is written by three acclaimed
>>>>>>> leaders in the field. The early access version is available now.
>>>>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>>>>> _______________________________________________
>>>>>>> Scikit-learn-general mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Public key at: http://pgp.mit.edu/ Search for this email address and
>>>> select the key from "2011-08-19" (key id: 54BA8735)
>>>>
>>>>