Hi Andrew,

Thanks for your suggestion. I tried the method on 8 nodes, each with 8 GB of memory. The job stalled at one stage for several hours without printing any further information. I probably need to find a more efficient approach.
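A rough back-of-the-envelope estimate may help explain the stall. This is a sketch, not a measurement from the job: the 16 bytes per scored pair is a hypothetical and optimistic figure, and the function name is made up for illustration. Only the user count and the cluster size come from the thread.

```python
def estimate_pair_volume(users, nodes, gb_per_node, bytes_per_pair=16):
    """Estimate the size of an all-pairs similarity job.

    bytes_per_pair is an assumed figure (two IDs plus a score);
    real JVM objects in Spark are likely to be larger.
    """
    pairs = users * users                          # full cartesian product
    total_bytes = pairs * bytes_per_pair           # size if every pair is materialized
    cluster_bytes = nodes * gb_per_node * 1024**3  # total cluster memory
    return pairs, total_bytes / cluster_bytes

pairs, ratio = estimate_pair_volume(users=1_000_000, nodes=8, gb_per_node=8)
print(f"{pairs:.1e} pairs, about {ratio:.0f}x the cluster memory")
```

Even under these generous assumptions the scored pairs exceed the cluster's total memory by a factor of a few hundred, so heavy spilling or shuffling in a single long stage would not be surprising.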
On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

> The naive way would be to put all the users and their attributes into an
> RDD, then cartesian product that with itself. Run the similarity score on
> every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and
> take the .top(k) for each user.
>
> I doubt that you'll be able to take this approach with the 1T pairs
> though, so it might be worth looking at the literature for recommender
> systems to see what else is out there.
>
>
> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am implementing an algorithm using Spark. I have one million users. I
>> need to compute the similarity between each pair of users using some user's
>> attributes. For each user, I need to get top k most similar users. What is
>> the best way to implement this?
>>
>>
>> Thanks.
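The naive approach quoted above can be sketched on a single machine in plain Python, which makes the quadratic cost easy to see. This is a minimal illustration under assumptions, not the Spark job itself: the thread does not name a similarity measure, so cosine similarity is assumed here; self-pairs are dropped, which the original description does not specify; and the names `cosine` and `top_k_similar` are hypothetical.

```python
import heapq
from itertools import product

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_similar(users, k):
    """users maps user id -> attribute vector.

    Returns user id -> list of (score, other_user), best first.
    """
    scored = {}
    # Cartesian product of the user set with itself: n * n pairs.
    for (u, a), (v, b) in product(users.items(), repeat=2):
        if u == v:
            continue  # assumption: a user is not its own neighbour
        scored.setdefault(u, []).append((cosine(a, b), v))
    # Keep only the k highest-scoring neighbours per user.
    return {u: heapq.nlargest(k, pairs) for u, pairs in scored.items()}

users = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
for user, neighbours in top_k_similar(users, k=1).items():
    print(user, neighbours)
```

The loop touches every one of the n * n pairs, which is what makes the same pattern impractical at one million users regardless of how it is distributed.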