I got it. Thank you, Robert.
On Tue, May 7, 2013 at 11:31 PM, Robert Layton <[email protected]> wrote:
> By sampling, I meant to take X% of the data and calculate the silhouette
> of just those points.
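
The sampling described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name sampled_silhouette, the 10% fraction, and the toy data from make_blobs are all assumptions standing in for the real 300K-sample data set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sampled_silhouette(X, labels, fraction=0.1, seed=0):
    """Silhouette score computed on a random fraction of the points only.

    Hypothetical helper: the name and defaults are illustrative.
    """
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    n_keep = max(2, int(fraction * n_samples))
    # Draw the subset without replacement so no point is counted twice.
    idx = rng.choice(n_samples, size=n_keep, replace=False)
    # Only distances among the n_keep sampled points are needed here.
    return silhouette_score(X[idx], labels[idx])


# Toy data standing in for the real data set (assumption).
X, labels = make_blobs(n_samples=2000, centers=5, random_state=0)
score = sampled_silhouette(X, labels, fraction=0.1)
print("approximate silhouette on a 10% sample:", score)
```

silhouette_score also accepts sample_size and random_state arguments, which do the subsampling internally. Either way the result is an estimate, and every cluster should be represented in the sample for it to be meaningful.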
>
>
> On 8 May 2013 06:49, Bao Thien <[email protected]> wrote:
>
>> Thanks to Lars for correcting my mistake about the memory size.
>> Thanks to Robert for the suggestion.
>>
>> To be honest, I first cluster the data using hierarchical clustering
>> (Ward in scikit-learn). Because the data set is so large, I need to
>> provide a connectivity matrix, and this matrix depends entirely on the
>> parameter k (the number of neighbors of each sample). Since we don't know
>> k in advance, I need to evaluate the clusterings that result from
>> different values of k and then decide which value of k works well
>> for my data.
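
A minimal sketch of that model-selection loop, under several assumptions: it uses AgglomerativeClustering with linkage="ward" (the current scikit-learn interface to Ward clustering), builds the connectivity matrix with kneighbors_graph, and scores each k with a subsampled silhouette. The candidate k values, the cluster count, the sample size, and the toy data are all illustrative and not taken from the thread.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph

# Toy data standing in for the real data set (assumption).
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

n_clusters = 5            # illustrative; the thread mentions about 50
candidate_ks = (5, 10, 20)  # illustrative neighbor counts

scores = {}
for k in candidate_ks:
    # Sparse k-nearest-neighbor graph used as the connectivity constraint.
    connectivity = kneighbors_graph(X, n_neighbors=k)
    model = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=connectivity
    )
    labels = model.fit_predict(X)
    # Score on a random subset so the full distance matrix is never needed.
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=0)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Fixing random_state keeps the same subset for every k, so the scores are compared on equal terms.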
>>
>> In scikit-learn, only the Silhouette Score is suitable for my
>> clustering; all the other scores need ground-truth labels. If you know of
>> another score in scikit-learn that can be used in my case, please
>> let me know!
>>
>> P.S. Robert: Sorry, but I did not understand what you meant by "sampling
>> should give you a good approximation of the silhouette score". In my
>> case, I can cut the tree and get about 50 clusters. Would you mind
>> explaining in more detail how to do the 'sampling'? I don't need a strict
>> mathematical guarantee, just a way to estimate the score and choose the k
>> value. Thank you in advance for your help.
>>
>> Regards,
>>
>> T.Bao
>>
>>
>> On Tue, May 7, 2013 at 10:19 PM, Robert Layton <[email protected]> wrote:
>>
>>> Hi Bao,
>>>
>>> The Silhouette Function hasn't been written with this type of
>>> scalability in mind.
>>> It requires a pairwise distance matrix, which is prohibitive (as others
>>> have said).
>>>
>>> If the number of clusters is low, sampling should give you a good
>>> approximation of the silhouette score, although I can't offer any
>>> mathematical guarantees on this.
>>>
>>> Thanks,
>>>
>>> - Robert
>>>
>>>
>>>
>>>
>>> On 8 May 2013 06:12, Bao Thien <[email protected]> wrote:
>>>
>>>> Thanks, Ronnie,
>>>>
>>>> But the data size is 300K, so the memory requirement is about
>>>> 300K x 300K ~ 90,000 MB ~ 90 GB. It is impossible to upgrade the RAM
>>>> that far :(. Do you have any other suggestions?
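
For reference, the arithmetic behind this kind of estimate can be checked in a few lines. The bytes-per-entry values are assumptions: 90 GB corresponds to one byte per distance, while a dense float64 matrix (NumPy's default floating-point type) needs eight bytes per entry. The thread does not say which storage type was intended.

```python
n_samples = 300_000

# Number of entries in a dense n_samples x n_samples distance matrix.
n_entries = n_samples ** 2

# One byte per entry reproduces the 90 GB figure quoted above.
gb_at_1_byte = n_entries * 1 / 1e9

# Assumption: distances stored as float64, i.e. 8 bytes per entry.
total_gb = n_entries * 8 / 1e9

print(f"{n_entries:.1e} entries")
print(f"{gb_at_1_byte:.0f} GB at 1 byte per entry")
print(f"{total_gb:.0f} GB at 8 bytes per entry")
```

Under either assumption the full matrix is far beyond 2 GB of RAM.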
>>>>
>>>> Regards,
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 7, 2013 at 10:06 PM, Ronnie Ghose
>>>> <[email protected]> wrote:
>>>>
>>>>> ....can you just get more ram?
>>>>> On May 7, 2013 2:42 PM, "Bao Thien" <[email protected]> wrote:
>>>>>
>>>>>> I ran a clustering algorithm and want to evaluate the result
>>>>>> using the silhouette score in scikit-learn. But scikit-learn needs
>>>>>> to calculate the distance matrix: distances = pairwise_distances(X,
>>>>>> metric=metric, **kwds)
>>>>>>
>>>>>> My data is on the order of 300K samples and my memory is 2 GB,
>>>>>> so I run out of memory.
>>>>>>
>>>>>> Does anyone know how to overcome this problem? Thank you for
>>>>>> your help.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nguyen Thien Bao
>>>>>>
>>>>>> NeuroInformatics Laboratory (NILab),
>>>>>> Fondazione Bruno Kessler (FBK), Trento, Italy
>>>>>> Centro Interdipartimentale Mente e Cervello (CIMeC)
>>>>>> Università degli Studi di Trento, Italy
>>>>>> Email: [email protected] or [email protected]
>>>>>> Cellphone: +39.345.293.1006 (Italy)
>>>>>> Cellphone: +84.996.352.452 (VietNam)
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>>> their applications. This 200-page book is written by three acclaimed
>>>>>> leaders in the field. The early access version is available now.
>>>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Public key at: http://pgp.mit.edu/ Search for this email address and
>>> select the key from "2011-08-19" (key id: 54BA8735)
>>>
>>>
>>
>>
>
>
--
Nguyen Thien Bao
NeuroInformatics Laboratory (NILab),
Fondazione Bruno Kessler (FBK), Trento, Italy
Centro Interdipartimentale Mente e Cervello (CIMeC)
Università degli Studi di Trento, Italy
Email: [email protected] or [email protected]
Cellphone: +39.345.293.1006 (Italy)
Cellphone: +84.996.352.452 (VietNam)