Hi Tom,

Thank you very much for your detailed explanation; it is very helpful to me.
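To check that I understood the three approaches in this thread, I tried to sketch each one in plain Python/NumPy. All function names, data layouts, and parameters below are my own guesses, not anything from your mails. First, the triangle-inequality bound: for a query x, no point in cluster C can be closer than ||x - c|| - r(C), so a whole cluster can be skipped once that bound exceeds the current k-th best distance.

import numpy as np

def knn_with_cluster_bounds(x, centroids, radii, clusters, k):
    """Exact k-NN using the triangle-inequality bound: no point in
    cluster C can be closer to x than ||x - c|| - r(C), where c is the
    centroid and r(C) the precomputed radius of C."""
    # Lower bound on the distance from x to any member of each cluster.
    lower = np.linalg.norm(centroids - x, axis=1) - radii

    best = []  # (distance, point_id), sorted ascending, length <= k
    for ci in np.argsort(lower):  # visit most promising clusters first
        # Once k candidates exist and even the best possible point of the
        # next cluster cannot beat the current k-th distance, stop.
        if len(best) == k and lower[ci] > best[-1][0]:
            break
        for pid, p in clusters[ci]:  # clusters[ci]: list of (id, vector)
            d = np.linalg.norm(p - x)
            if len(best) < k or d < best[-1][0]:
                best.append((d, pid))
                best.sort(key=lambda t: t[0])
                del best[k:]
    return best

Here r(C) would be precomputed once per cluster as the distance of its farthest member from c; keeping the cluster ordering but dropping the exact stopping rule gives the heuristic variant you mention.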
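For the sparse inverted-index idea, this is how I picture the "diagonal frontier" for cosine on unit-normalized vectors, so the score is just a dot product over shared attributes (again, the names and layout are assumptions on my part):

import heapq

def top_k_similar(i, user_vecs, attr_index, k, max_pool=10000):
    """user_vecs[u]: dict attribute -> weight (rows unit-normalized).
    attr_index[a]: list of (weight, user), sorted by weight descending."""
    qi = sorted(user_vecs[i].items(), key=lambda t: -t[1])  # strongest first
    cursors = [0] * len(qi)  # read depth in each posting list
    seen = {i}
    topk = []                # min-heap of (score, user)

    def exact(u):
        # A dot product restricted to i's attributes covers all shared terms.
        return sum(w * user_vecs[u].get(a, 0.0) for a, w in qi)

    def unseen_bound():
        # A user never encountered in any list has, for each of i's
        # attributes, at most the weight at the current frontier position.
        return sum(w * attr_index[a][pos][0]
                   for (a, w), pos in zip(qi, cursors)
                   if pos < len(attr_index[a]))

    while len(seen) - 1 < max_pool:
        advanced = False
        # Diagonal frontier: one step down every posting list per round,
        # so weak candidates of strong attributes keep being fetched too.
        for j, (a, _) in enumerate(qi):
            plist = attr_index[a]
            if cursors[j] < len(plist):
                _, u = plist[cursors[j]]
                cursors[j] += 1
                advanced = True
                if u not in seen:
                    seen.add(u)
                    s = exact(u)
                    if len(topk) < k:
                        heapq.heappush(topk, (s, u))
                    elif s > topk[0][0]:
                        heapq.heapreplace(topk, (s, u))
        # Stop once no unseen user can beat the current k-th best score.
        if not advanced or (len(topk) == k and unseen_bound() <= topk[0][0]):
            break
    return sorted(topk, reverse=True)

If I read your mail correctly, the bound only has to cover users never touched by any posting list, since anyone encountered once is immediately scored exactly by random access.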
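And for Guillaume's k-means pruning, a toy version of the probe step (at this scale I assume the clustering itself would run in Spark, e.g. with MLlib's KMeans; these helpers are only meant to show the lookup logic):

import numpy as np

def assign_to_clusters(X, centroids, chunk=10000):
    """Nearest-centroid assignment, chunked to keep memory bounded."""
    assign = np.empty(len(X), dtype=np.int64)
    for s in range(0, len(X), chunk):
        d = np.linalg.norm(X[s:s + chunk, None, :] - centroids[None, :, :],
                           axis=2)
        assign[s:s + chunk] = d.argmin(axis=1)
    return assign

def probe_query(x, X, centroids, assign, k, n_probe=5):
    """Approximate k-NN: search only the n_probe clusters whose
    centroids are closest to x, instead of all of X."""
    probes = np.argsort(np.linalg.norm(centroids - x, axis=1))[:n_probe]
    ids = np.nonzero(np.isin(assign, probes))[0]  # candidate row indices
    d = np.linalg.norm(X[ids] - x, axis=1)
    return ids[np.argsort(d)[:k]]                 # global row indices

With one million users and roughly 1000 elements per cluster that would mean about 1000 clusters, so probing 5-20 of them scans only around 0.5-2% of the data per query, which would explain the large speedup.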
On Sat, Apr 12, 2014 at 1:06 PM, Tom V <minnesota...@gmail.com> wrote:

> The last writer is suggesting using the triangle inequality to cut down
> the search space. If c is the centroid of cluster C, then the closest any
> point in C can be to x is ||x - c|| - r(C), where r(C) is the (precomputed)
> radius of the cluster, i.e., the distance of the farthest point in C from c.
> Whether you want to use the bound for an exact algorithm or a search
> heuristic is up to you.
>
> Another approach can be very successful if the user-attributes matrix is
> very sparse. The idea is to exploit the additive nature of common
> similarity measures; I have used this for cosine and Bayesian posterior.
> Here is the basic idea: a "User" index contains all the attributes for a
> certain user, and an "Attribute" index contains all the users with a
> certain attribute (both sorted by strength). To find the best k matches
> for user i, start with user i's strongest attribute and search the
> attribute index to find some users with high scores for that attribute,
> then move to user i's next best attribute. If you proceed with a
> "diagonal frontier" (as you fetch candidates based on i's weaker
> attributes, also go back and fetch more weak candidates for i's strong
> attributes), you get a candidate pool and a bound on the highest score
> of any user not in the pool. You just expand the candidate pool until the
> bound drops below the scores of the k best candidates. Practically, you'd
> want to limit the candidate pool size, but it's almost always exact.
>
> For a million users, you should be able to distribute the things needed
> to make a recommendation (either the centroids or the attributes matrix)
> and just break up the work based on the users you want to generate
> recommendations for. I hope this helps.
>
> Tom
>
>
> On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>
>> Hi Guillaume,
>>
>> This sounds like a good idea to me. I am a newbie here. Could you
>> explain further how you determine which clusters to keep? Is it
>> according to the distance between each element and each cluster center?
>> Will you keep several clusters for each element when searching for
>> nearest neighbours? Thanks.
>>
>> Xiaoli
>>
>>
>> On Sat, Apr 12, 2014 at 9:46 AM, Guillaume Pitel <
>> guillaume.pi...@exensa.com> wrote:
>>
>>> Hi,
>>>
>>> I'm doing this here for multiple tens of millions of elements (and the
>>> goal is to reach multiple billions) on a relatively small cluster
>>> (7 nodes, 4 cores, 32 GB RAM). We use multiprobe KLSH. All you have to
>>> do is run a k-means on your data, then compute the distance between
>>> each element and each cluster center, keep a few clusters, and only
>>> look into these clusters for nearest neighbours.
>>>
>>> This method is known to perform very well and to vastly speed up the
>>> computation.
>>>
>>> The hardest part is deciding how many clusters to compute and how many
>>> to keep. As a rule of thumb, I generally want 300-10000 elements per
>>> cluster, and use 5-20 clusters.
>>>
>>> Guillaume
>>>
>>>
>>> I am implementing an algorithm using Spark. I have one million users.
>>> I need to compute the similarity between each pair of users using some
>>> user attributes. For each user, I need to get the top k most similar
>>> users. What is the best way to implement this?
>>>
>>> Thanks.
>>>
>>>
>>> --
>>> *Guillaume PITEL, Président*
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>> eXenSa S.A.S. <http://www.exensa.com/>
>>> 41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05