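The naive way would be to put all the users and their attributes into an
RDD, then take the cartesian product of that RDD with itself.  Run the
similarity score on every pair (1M * 1M => 1T scores), map each result to
(user, (score, otherUser)), and keep the k highest-scoring entries for each
user.  Note that .top(k) on an RDD is global, so the per-user top k needs a
per-key step.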
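
To make the shape of that pipeline concrete, here is a minimal single-machine
sketch in plain Python (not Spark code). The jaccard similarity, the
set-valued attributes, and the sample users are assumptions for illustration
only; the original question doesn't say what the attributes or the score are.

```python
import heapq
from itertools import product


def jaccard(a, b):
    """Hypothetical similarity: Jaccard overlap of two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_similar(users, k, similarity=jaccard):
    """Score every ordered pair of distinct users, keep the k best per user."""
    scored = {}
    # Cartesian product of the user set with itself: n * n pairs.
    for (u, attrs_u), (v, attrs_v) in product(users.items(), repeat=2):
        if u == v:
            continue  # skip comparing a user with themselves
        # Key by user, value is (score, otherUser), as in the description.
        scored.setdefault(u, []).append((similarity(attrs_u, attrs_v), v))
    # Per-user top k, highest score first.
    return {u: heapq.nlargest(k, pairs) for u, pairs in scored.items()}


# Made-up sample data.
users = {
    "alice": {"spark", "scala", "ml"},
    "bob": {"spark", "python", "ml"},
    "carol": {"cooking", "travel"},
    "dave": {"spark", "scala"},
}
result = top_k_similar(users, k=2)
```

The work grows with the square of the number of users, which is exactly why
this doesn't scale to 1M users.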

I doubt this approach is feasible with 1T pairs, though, so it's worth
looking at the recommender-systems literature for ways to avoid scoring
every pair.


On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

> Hi all,
>
> I am implementing an algorithm using Spark. I have one million users. I
> need to compute the similarity between each pair of users using some user's
> attributes.  For each user, I need to get top k most similar users. What is
> the best way to implement this?
>
>
> Thanks.
>
