Hi Deb,

We are adding all-pairs and thresholded all-pairs similarities via DIMSUM in this PR:
https://github.com/apache/spark/pull/1778
Your question wasn't entirely clear - does this answer it?

Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Reza,
>
> Have you compared with the brute-force algorithm for similarity
> computation, something like the following in Spark?
>
> https://github.com/echen/scaldingale
>
> I am adding cosine similarity computation, but I do want to compute
> all-pairs similarities...
>
> Note that my data is sparse (the data that goes to matrix
> factorization), so I don't think the join and group-by on
> (product, product) will be a big issue for me...
>
> Does it make sense to add all-pairs similarities as well, alongside the
> DIMSUM-based similarity?
>
> Thanks.
> Deb
>
> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Xiaoli,
>>
>> There is a PR currently in progress to allow this, via the sampling
>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>
>> The PR is at https://github.com/apache/spark/pull/336, though it will
>> need refactoring given the recent changes to the matrix interface in
>> MLlib. You may implement the sampling scheme for your own app, since
>> it's not much code.
>>
>> Best,
>> Reza
>>
>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Thanks for your suggestion. I have tried the method. I used 8 nodes,
>>> each with 8G of memory. The program just stopped at one stage for
>>> several hours without any further information. Maybe I need to find
>>> a more efficient way.
>>>
>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> The naive way would be to put all the users and their attributes into
>>>> an RDD, then cartesian-product that with itself. Run the similarity
>>>> score on every pair (1M * 1M => 1T scores), map to
>>>> (user, (score, otherUser)), and take the .top(k) for each user.
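For readers following the thread: the sampling scheme Reza references can be sketched in plain Python (not Spark). This is a loose illustration of the DIMSUM idea only - sample heavy columns less often, then divide by the sampling probabilities to keep the dot-product estimates unbiased - not the exact algorithm from the paper or the Spark implementation; the function name and `gamma` parameter are illustrative.

```python
import math
import random
from collections import defaultdict

def dimsum_cosine(rows, num_cols, gamma, seed=0):
    """Approximate all-pairs column cosine similarities via
    DIMSUM-style sampling (a simplified sketch of the idea in
    stanford.edu/~rezab/papers/dimsum.pdf, not the exact algorithm).

    rows: list of sparse rows, each a dict {col_index: value}.
    Larger gamma -> more pairs sampled, lower variance.
    """
    rng = random.Random(seed)

    # Column norms ||c_j||.
    sq = [0.0] * num_cols
    for row in rows:
        for j, v in row.items():
            sq[j] += v * v
    norms = [math.sqrt(s) for s in sq]

    # Per-column keep probability p_j = min(1, sqrt(gamma) / ||c_j||):
    # heavy columns are down-sampled.
    p = [min(1.0, math.sqrt(gamma) / n) if n > 0 else 0.0 for n in norms]

    dots = defaultdict(float)  # (j, k) -> estimated dot product
    for row in rows:
        kept = [j for j in row if p[j] >= 1.0 or rng.random() < p[j]]
        kept.sort()
        for a in range(len(kept)):
            for b in range(a + 1, len(kept)):
                j, k = kept[a], kept[b]
                # Divide by p_j * p_k so the estimate stays unbiased.
                dots[(j, k)] += row[j] * row[k] / (p[j] * p[k])

    # Normalize dot products by column norms to get cosine similarity.
    return {(j, k): d / (norms[j] * norms[k]) for (j, k), d in dots.items()}
```

With `gamma` large enough that every keep probability is 1, no sampling happens and the result equals the exact brute-force cosine similarities; smaller `gamma` trades accuracy for fewer emitted pairs.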
>>>> I doubt that you'll be able to take this approach with the 1T pairs,
>>>> though, so it might be worth looking at the recommender-systems
>>>> literature to see what else is out there.
>>>>
>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am implementing an algorithm using Spark. I have one million users,
>>>>> and I need to compute the similarity between each pair of users using
>>>>> some user attributes. For each user, I need to get the top k most
>>>>> similar users. What is the best way to implement this?
>>>>>
>>>>> Thanks.
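Andrew's naive cartesian-product approach can be sketched in plain Python (not Spark; the function name and data layout are illustrative). It scores every user pair by cosine similarity and keeps the top-k per user - O(n^2) pairs, which is exactly why it breaks down at 1M users (~1T scores) and why sampling schemes like DIMSUM exist.

```python
import heapq
import math

def top_k_similar(users, k):
    """Brute-force all-pairs cosine similarity with top-k per user.

    users: dict user_id -> sparse attribute vector {attribute: weight}.
    Returns dict user_id -> list of up to k (score, other_user) pairs,
    highest score first. Only feasible for small n; illustration only.
    """
    norms = {u: math.sqrt(sum(v * v for v in vec.values()))
             for u, vec in users.items()}
    ids = sorted(users)
    scores = {u: [] for u in ids}
    for i, u in enumerate(ids):
        for w in ids[i + 1:]:
            # Iterate over the smaller sparse vector for the dot product.
            small, big = ((users[u], users[w])
                          if len(users[u]) <= len(users[w])
                          else (users[w], users[u]))
            dot = sum(v * big.get(a, 0.0) for a, v in small.items())
            if dot == 0.0:
                continue  # no shared attributes, similarity 0
            sim = dot / (norms[u] * norms[w])
            # Record the score symmetrically, like the
            # (user, (score, otherUser)) pairs in the RDD version.
            scores[u].append((sim, w))
            scores[w].append((sim, u))
    # Equivalent of .top(k) per user.
    return {u: heapq.nlargest(k, s) for u, s in scores.items()}
```

For sparse data, a common cheaper variant (the scaldingale approach Deb links) only joins users through shared attributes instead of enumerating all pairs, so users with no overlap are never scored at all - the `if dot == 0.0` skip above hints at why that works.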