Hi,

I'm doing this here for multiple tens of millions of elements (the goal is to reach multiple billions), on a relatively small cluster (7 nodes, 4 cores, 32 GB RAM each). We use multiprobe KLSH.

All you have to do is run a K-means on your data, then compute the distance between each element and each cluster centre, keep a few of the closest clusters, and only look inside those clusters for nearest neighbours. This method is known to perform very well and vastly speeds up the computation.

The hardest part is deciding how many clusters to compute and how many to keep. As a rule of thumb, I generally aim for 300-10000 elements per cluster, and probe 5-20 clusters.

Guillaume
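(A minimal single-machine sketch of the idea described above, using NumPy and scikit-learn; all names, sizes, and parameters here are illustrative assumptions, not the poster's actual code, which would run distributed.)

```python
# Sketch of multiprobe KMeans-based approximate nearest-neighbour search:
# cluster the data, then for each query search only the few nearest clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 16))   # toy dataset (the real case is millions of rows)

n_clusters = 50                      # chosen so each cluster holds a few hundred points
n_probes = 5                         # clusters searched per query (5-20 in practice)

km = KMeans(n_clusters=n_clusters, n_init=5, random_state=0).fit(data)

# Pre-group row indices by cluster for fast candidate lookup
members = {c: np.flatnonzero(km.labels_ == c) for c in range(n_clusters)}

def knn(query, k=10):
    # 1) distance from the query to every cluster centre
    d_centroids = np.linalg.norm(km.cluster_centers_ - query, axis=1)
    # 2) probe only the closest few clusters
    probe = np.argsort(d_centroids)[:n_probes]
    cand = np.concatenate([members[c] for c in probe])
    # 3) exact search restricted to the candidate set
    d = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(d)[:k]]

neighbours = knn(data[0])            # query with a point from the dataset itself
```

Since a query drawn from the dataset lies in its own assigned cluster, that cluster is always among the probed ones, so the point itself comes back as its own nearest neighbour; only n_probes/n_clusters of the data is scanned per query.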