Re: [scikit-learn] clustering on big dataset

2018-02-07 Thread Manuel Castejón Limas
Hope this helps! Manuel @Article{Ciampi2008, author="Ciampi, Antonio and Lechevallier, Yves and Limas, Manuel Castej{\'o}n and Marcos, Ana Gonz{\'a}lez", title="Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering

Re: [scikit-learn] clustering on big dataset

2018-01-04 Thread Joel Nothman
Yes, use an approximate nearest neighbors approach. None is included in scikit-learn, but there are numerous implementations with Python interfaces. On 5 January 2018 at 12:51, Shiheng Duan wrote: > Thanks, Joel, > I am working on KD-tree to find the nearest neighbors. Basically, I find > the ne

Re: [scikit-learn] clustering on big dataset

2018-01-04 Thread Shiheng Duan
Thanks, Joel. I am working on a KD-tree to find the nearest neighbors. Basically, I find the nearest neighbors for each point and then merge a pair of points if each is the other's nearest neighbor. The problem is that after each iteration, we will have a new bunch of points, where new clusters are adde
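The merge step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's code: a brute-force search stands in for the KD-tree query, the function names are hypothetical, and each merged pair is replaced by its midpoint, since the message does not say how a new cluster is represented.

```python
import math


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself.

    Brute force is used here as a stand-in for a KD-tree query.
    """
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def merge_mutual_pairs(points):
    """Run one iteration: merge every pair of mutual nearest neighbors.

    A merged pair is replaced by its midpoint (an assumption made for this
    sketch). Points without a mutual partner are carried over unchanged.
    """
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    merged = []
    used = set()
    for i, j in enumerate(nn):
        if i in used:
            continue
        if nn[j] == i:
            # i and j are each other's nearest neighbor: merge them.
            merged.append(tuple((a + b) / 2 for a, b in zip(points[i], points[j])))
            used.update((i, j))
        else:
            merged.append(points[i])
            used.add(i)
    return merged


points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 5)]
print(merge_mutual_pairs(points))
```

Because the output of one pass is a new set of points, any spatial index built on the old points is stale after every iteration, which is the rebuild cost raised in the thread.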

Re: [scikit-learn] clustering on big dataset

2018-01-04 Thread Joel Nothman
Can you use nearest neighbors with a KD tree to build a distance matrix that is sparse, in that distances to all but the nearest neighbors of a point are (near-)infinite? Yes, this again has an additional parameter (neighborhood size), just as BIRCH has its threshold. I suspect you will not be able
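A minimal sketch of the sparse distance matrix idea, assuming a plain dictionary as the storage format and a brute-force neighbor search in place of the KD-tree. The names `knn_sparse_distances` and `distance` are hypothetical, and `k` plays the role of the neighborhood-size parameter mentioned above.

```python
import math


def knn_sparse_distances(points, k):
    """Store distances only from each point to its k nearest neighbors.

    Returns {i: {j: distance}}. Brute force is used for clarity; a KD-tree
    query would replace the sort on large datasets.
    """
    sparse = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        sparse[i] = {j: d for d, j in dists[:k]}
    return sparse


def distance(sparse, i, j):
    """Look up a distance; pairs that were not stored count as infinitely far."""
    if i == j:
        return 0.0
    return sparse[i].get(j, sparse[j].get(i, math.inf))


points = [(0, 0), (1, 0), (3, 0), (10, 0)]
sparse = knn_sparse_distances(points, k=1)
print(distance(sparse, 0, 1), distance(sparse, 1, 2), distance(sparse, 0, 3))
```

Memory grows with the number of points times `k` rather than with the square of the number of points, at the cost of treating every non-neighbor pair as unreachable.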

Re: [scikit-learn] clustering on big dataset

2018-01-03 Thread Shiheng Duan
Yes, it is an efficient method, but we still need to specify the number of clusters or the threshold. Is there another way to run hierarchical clustering on a big dataset? The main problem is the distance matrix. Thanks. On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel wrote: > Have you had a look at

Re: [scikit-learn] clustering on big dataset

2018-01-02 Thread Olivier Grisel
Have you had a look at BIRCH? http://scikit-learn.org/stable/modules/clustering.html#birch -- Olivier

[scikit-learn] clustering on big dataset

2018-01-01 Thread Shiheng Duan
Hi all, I wonder if there is any method to do exact clustering (hierarchical clustering) on a huge dataset where it is impossible to use a distance matrix. I am considering a KD-tree, but it needs to be rebuilt every time, which consumes a lot of time. Any ideas?