Hi Reza,

Thank you for the information. I will try it.
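For reference, here is a minimal Scala sketch of the brute-force approach Andrew suggests in the thread quoted below: cartesian the user RDD with itself, score every pair, and keep the top k per user. The names (TopKSimilarUsers, userFeatures, cosine, topK) are illustrative, and the cosine score is only a placeholder for the actual attribute similarity.

import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

object TopKSimilarUsers extends Serializable {

  // Cosine similarity over dense feature arrays; a stand-in for the real
  // attribute-based score.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i)
      i += 1
    }
    if (na == 0.0 || nb == 0.0) 0.0 else dot / math.sqrt(na * nb)
  }

  // For every user, score all candidate pairs and keep the k highest scores.
  def topK(userFeatures: RDD[(Long, Array[Double])],
           k: Int): RDD[(Long, Seq[(Long, Double)])] = {
    userFeatures.cartesian(userFeatures)
      .filter { case ((u, _), (v, _)) => u != v }                  // drop self-pairs
      .map { case ((u, fu), (v, fv)) => (u, (v, cosine(fu, fv))) }
      .groupByKey()                                                // all candidates per user
      .mapValues(_.toSeq.sortBy { case (_, score) => -score }.take(k))
  }
}

With one million users this still shuffles on the order of 1T scored pairs, and groupByKey collects the full candidate list for each user in memory, which may be why the 8-node run stalled. Replacing it with a bounded priority queue via combineByKey, or waiting for the DIMSUM sampling in Reza's PR, would be the next things to try.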
On Fri, Apr 11, 2014 at 11:21 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Xiaoli,
>
> There is a PR currently in progress to allow this, via the sampling scheme
> described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>
> The PR is at https://github.com/apache/spark/pull/336 though it will need
> refactoring given the recent changes to the matrix interface in MLlib. You
> may implement the sampling scheme for your own app since it's not much code.
>
> Best,
> Reza
>
>
> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and
>> each node has 8 GB of memory. The program just stopped at one stage for
>> several hours without any further information. Maybe I need to find a
>> more efficient way.
>>
>>
>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> The naive way would be to put all the users and their attributes into an
>>> RDD, then cartesian product that with itself. Run the similarity score on
>>> every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and
>>> take the .top(k) for each user.
>>>
>>> I doubt that you'll be able to take this approach with the 1T pairs,
>>> though, so it might be worth looking at the literature on recommender
>>> systems to see what else is out there.
>>>
>>>
>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am implementing an algorithm using Spark. I have one million users. I
>>>> need to compute the similarity between each pair of users using some
>>>> user attributes. For each user, I need to get the top k most similar
>>>> users. What is the best way to implement this?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>