Hi Guillaume,

This sounds like a good idea to me. I am a newbie here, so could you
explain a bit further how you determine which clusters to keep? Is it based
on the distance between each element and each cluster center? And do you
keep several clusters for each element when searching for nearest
neighbours? I have put a rough sketch of my current understanding inline
below. Thanks.

  Xiaoli




On Sat, Apr 12, 2014 at 9:46 AM, Guillaume Pitel <guillaume.pi...@exensa.com> wrote:

>  Hi,
>
> I'm doing this here for multiple tens of millions of elements (and the
> goal is to reach multiple billions), on a relatively small cluster (7
> nodes, 4 cores, 32GB RAM). We use multiprobe KLSH. All you have to do is
> run a KMeans on your data, then compute the distance between each element
> and each cluster center, keep a few of the closest clusters, and only look
> into these clusters for nearest neighbours.
>
> This method is known to perform very well and to vastly speed up the
> computation.
>
> The hardest part is deciding how many clusters to compute and how many to
> keep. As a rule of thumb, I generally want 300-10000 elements per cluster,
> and keep 5-20 clusters per element.
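
Just to check that I follow the probing step, here is a rough sketch of how
I would try it in Spark (Scala), assuming MLlib's KMeans with the
Vector-based API; the toy data and the names (numClusters, numProbes,
sqDist) are only illustrative. Is this roughly what you mean?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.util.Random

object MultiprobeSketch extends Serializable {

  // Squared Euclidean distance between two dense vectors.
  def sqDist(a: Vector, b: Vector): Double =
    a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("multiprobe-sketch").setMaster("local[4]"))

    // Toy (userId, features) data; replace with the real user vectors.
    val users = sc.parallelize(0L until 100000L).map { id =>
      (id, Vectors.dense(Array.fill(10)(Random.nextDouble())))
    }

    val numClusters = 200 // aim for a few hundred to a few thousand elements per cluster
    val numProbes   = 5   // how many of the closest clusters to keep per element

    val model   = KMeans.train(users.map(_._2), numClusters, 20)
    val centers = model.clusterCenters

    // Assign each user to its numProbes closest cluster centers; the
    // neighbour search is then restricted to users sharing a probed cluster.
    val probed = users.flatMap { case (id, v) =>
      centers.zipWithIndex
        .map { case (c, cid) => (cid, sqDist(v, c)) }
        .sortBy(_._2)
        .take(numProbes)
        .map { case (cid, _) => (cid, (id, v)) }
    }

    println("cluster assignments kept: " + probed.count())
    sc.stop()
  }
}
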
>
> Guillaume
>
>
>  I am implementing an algorithm using Spark. I have one million users. I
> need to compute the similarity between each pair of users using some of
> the users' attributes. For each user, I need to get the top k most similar
> users. What is the best way to implement this?
>
>
>  Thanks.
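
And for the top-k part of the original question, I imagine the final step
would be something like the sketch below: similarities are computed only
inside each probed cluster, and the per-cluster top-k lists are merged.
Cosine similarity and all the names (topKNeighbours, probed, k) are just
placeholders for whatever measure is actually used.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
// (on Spark < 1.3 you also need: import org.apache.spark.SparkContext._
//  for pair-RDD operations such as groupByKey / reduceByKey)

object TopKSketch extends Serializable {

  // Cosine similarity between two dense vectors.
  def cosine(a: Vector, b: Vector): Double = {
    val (x, y) = (a.toArray, b.toArray)
    val dot   = x.zip(y).map { case (u, v) => u * v }.sum
    val norms = math.sqrt(x.map(u => u * u).sum) * math.sqrt(y.map(v => v * v).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // probed holds (clusterId, (userId, features)) pairs from the probing step above.
  def topKNeighbours(probed: RDD[(Int, (Long, Vector))],
                     k: Int): RDD[(Long, Seq[(Long, Double)])] =
    probed.groupByKey().flatMap { case (_, members) =>
      val inCluster = members.toArray
      inCluster.map { case (id, v) =>
        val neighbours = inCluster
          .filter { case (otherId, _) => otherId != id }
          .map { case (otherId, w) => (otherId, cosine(v, w)) }
          .sortBy(-_._2) // most similar first
          .take(k)
        (id, neighbours.toSeq)
      }
    }.reduceByKey { (a, b) =>
      // A user shows up in several probed clusters; merge and keep the overall top-k.
      (a ++ b).distinct.sortBy(-_._2).take(k)
    }
}
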
>
>
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
>  eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
