Bao,

To compute the silhouette, scikit-learn precomputes the matrix of pairwise
distances between the elements of X (the samples). But it is possible to do
without this matrix and compute the distance between two samples only when
it is needed. This is the most naive implementation of the silhouette. There
may even be a cleverer strategy, for example computing distances one pair of
clusters at a time, so that you only need a distance matrix the size of two
clusters.
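
As a rough illustration of the "compute distances only when needed" idea,
here is a minimal sketch that processes the samples in chunks, so that only a
(chunk_size x n_samples) block of distances exists at any time. It is not
scikit-learn's implementation: the function name silhouette_on_the_fly and
the chunk_size parameter are made up for this example, and it assumes the
Euclidean metric and float64 data.

```python
import numpy as np


def silhouette_on_the_fly(X, labels, chunk_size=100):
    """Mean silhouette (Euclidean) without building the full n x n matrix.

    Hypothetical helper for illustration; only a (chunk_size x n) block of
    distances is held in memory at once.
    """
    X = np.asarray(X, dtype=float)
    # Map arbitrary labels to 0..k-1 so they can be used as column indices.
    classes, y = np.unique(labels, return_inverse=True)
    n, k = X.shape[0], classes.size
    if not 2 <= k <= n - 1:
        raise ValueError("need between 2 and n_samples - 1 clusters")
    counts = np.bincount(y, minlength=k)
    # One-hot membership (n x k): D @ onehot gives summed distances per cluster.
    onehot = np.zeros((n, k))
    onehot[np.arange(n), y] = 1.0
    sq_norms = (X ** 2).sum(axis=1)

    total = 0.0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        # Distances from the samples in this chunk to every sample.
        sq = (sq_norms[start:stop, None] + sq_norms[None, :]
              - 2.0 * X[start:stop] @ X.T)
        D = np.sqrt(np.maximum(sq, 0.0))
        D[rows, start + rows] = 0.0  # exact zero distance to itself
        sums = D @ onehot            # (chunk x k) summed distances
        own = y[start:stop]
        own_count = counts[own]
        # a: mean distance to the other members of the sample's own cluster.
        a = sums[rows, own] / np.maximum(own_count - 1, 1)
        # b: smallest mean distance to any other cluster.
        means = sums / counts
        means[rows, own] = np.inf
        b = means.min(axis=1)
        denom = np.maximum(a, b)
        s = np.divide(b - a, denom, out=np.zeros_like(denom), where=denom > 0)
        s[own_count == 1] = 0.0      # usual convention for singleton clusters
        total += s.sum()
    return total / n
```

With 300K samples and chunk_size=100, one block of distances is
100 x 300,000 x 8 bytes, about 240 MB, plus temporaries of similar size. The
sampling idea from earlier in the thread amounts to calling the same function
on a random subset, e.g. idx = np.random.RandomState(0).choice(len(X), 10000,
replace=False) followed by silhouette_on_the_fly(X[idx], labels[idx]), with
labels as a NumPy array. scikit-learn's silhouette_score also accepts a
sample_size argument for the same purpose.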

Alexandre.


On Wed, May 8, 2013 at 1:58 PM, Bao Thien <[email protected]> wrote:

> Hi Alexandre,
>
> Thanks for your feedback. Could you please clarify what you mean by
> computing the distance between samples "on the fly"? In my case, running
> time is not a serious constraint. If you can explain this, I think it
> would be a suitable solution for my case.
>
> Regards,
>
> T.Bao
>
>
> On Wed, May 8, 2013 at 12:28 PM, Alexandre ABRAHAM <
> [email protected]> wrote:
>
>> Hi Bao,
>>
>> If I am not mistaken, precomputing the pairwise distances is a way to
>> speed up the silhouette computation and make the code simpler. It is
>> possible to compute the silhouette by computing the distances between
>> samples "on the fly". This will indeed be very slow, but no additional
>> memory is required.
>>
>> Alexandre.
>>
>>
>> On Tue, May 7, 2013 at 11:41 PM, Bao Thien <[email protected]> wrote:
>>
>>> I got it. Thank you Robert.
>>>
>>>
>>> On Tue, May 7, 2013 at 11:31 PM, Robert Layton
>>> <[email protected]> wrote:
>>>
>>>> By sampling, I meant to take X% of the data and calculate the
>>>> silhouette of just those points.
>>>>
>>>>
>>>> On 8 May 2013 06:49, Bao Thien <[email protected]> wrote:
>>>>
>>>>> Thanks, Lars, for correcting my mistake about the memory size.
>>>>> Thanks, Robert, for your suggestion.
>>>>>
>>>>> To be honest, I first cluster the data using hierarchical clustering
>>>>> (Ward in scikit-learn). Because the data is so large, I need to provide
>>>>> a connectivity matrix, and this matrix depends entirely on the parameter
>>>>> k (the number of neighbors of each sample). Since we don't know k in
>>>>> advance, I need to evaluate the clusterings resulting from different
>>>>> values of k, and then decide which value of k is reasonably good for my
>>>>> data.
>>>>>
>>>>> In scikit-learn, as far as I can see, only the Silhouette Score is
>>>>> suitable for my clustering; all the other scores need ground-truth
>>>>> labels. If you know of another score in scikit-learn that can be used
>>>>> in my case, please let me know!
>>>>>
>>>>> P.S. Robert: Sorry, but I don't understand what you mean by "sampling
>>>>> should give you a good approximation of the silhouette score". In my
>>>>> case, I can cut the tree and get about 50 clusters. Would you mind
>>>>> explaining in more detail how to do the sampling? I don't need a very
>>>>> strict mathematical guarantee, just a way to estimate the score and
>>>>> choose the value of k. Thank you in advance for your help.
>>>>>
>>>>> Regards,
>>>>>
>>>>> T.Bao
>>>>>
>>>>>
>>>>> On Tue, May 7, 2013 at 10:19 PM, Robert Layton <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Bao,
>>>>>>
>>>>>> The Silhouette Function hasn't been written with this type of
>>>>>> scalability in mind.
>>>>>> It requires a pairwise distance matrix, which is prohibitive (as
>>>>>> others have said).
>>>>>>
>>>>>> If the number of clusters is low, sampling should give you a good
>>>>>> approximation of the silhouette score, although I can't offer any
>>>>>> mathematical guarantees on this.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8 May 2013 06:12, Bao Thien <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Ronnie,
>>>>>>>
>>>>>>> But the data size is 300K, so the memory requirement is about
>>>>>>> 300Kx300K~90.000MB~90GB. It is impossible to upgrade the RAM :(.
>>>>>>> Do you have any other suggestions?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> ....can you just get more ram?
>>>>>>>> On May 7, 2013 2:42 PM, "Bao Thien" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I ran a clustering algorithm and want to evaluate the result
>>>>>>>>> using the silhouette score in scikit-learn. But scikit-learn needs
>>>>>>>>> to calculate the distance matrix: distances = pairwise_distances(X,
>>>>>>>>> metric=metric, **kwds)
>>>>>>>>>
>>>>>>>>> My data has on the order of 300K samples and my machine has 2GB of
>>>>>>>>> memory, so this runs out of memory.
>>>>>>>>>
>>>>>>>>> Does anyone know how to overcome this problem? Thank you for your
>>>>>>>>> help.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nguyen Thien Bao
>>>>>>>>>
>>>>>>>>> NeuroInformatics Laboratory (NILab),
>>>>>>>>> Fondazione Bruno Kessler (FBK), Trento, Italy
>>>>>>>>> Centro Interdipartimentale Mente e Cervello (CIMeC)
>>>>>>>>> Università degli Studi di Trento, Italy
>>>>>>>>> Email: [email protected]  or  [email protected]
>>>>>>>>> Cellphone: +39.345.293.1006 (Italy)
>>>>>>>>> Cellphone: +84.996.352.452 (VietNam)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>>>>> "Graph Databases" is the definitive new guide to graph databases
>>>>>>>>> and
>>>>>>>>> their applications. This 200-page book is written by three
>>>>>>>>> acclaimed
>>>>>>>>> leaders in the field. The early access version is available now.
>>>>>>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>>>>>>> _______________________________________________
>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Public key at: http://pgp.mit.edu/ Search for this email address and
>>>>>> select the key from "2011-08-19" (key id: 54BA8735)
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
