Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
https://github.com/apache/spark/pull/1778
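In case it helps, the sampling idea behind dimsum can be sketched in plain
Python (this is only an illustration of the scheme in the paper, not Spark's
implementation; the constant in gamma is a placeholder, and it assumes
nonzero columns):

```python
import math
import random
from collections import defaultdict

def dimsum_similarities(rows, n_cols, threshold, seed=42):
    """Approximate all-pairs column cosine similarities via dimsum-style
    sampling. `rows` is a list of sparse rows: [(col_index, value), ...].
    Pairs whose similarity is below `threshold` are sampled away with high
    probability; threshold=0.0 disables sampling and gives exact cosines."""
    # Column norms.
    sq = [0.0] * n_cols
    for row in rows:
        for j, v in row:
            sq[j] += v * v
    norms = [math.sqrt(s) for s in sq]

    # Oversampling parameter gamma (illustrative constant; the paper
    # derives the exact choice).
    gamma = float("inf") if threshold <= 0 else 10 * math.log(n_cols) / threshold
    sg = math.sqrt(gamma)

    rng = random.Random(seed)
    sims = defaultdict(float)
    for row in rows:
        for a in range(len(row)):
            j, aj = row[a]
            for b in range(a + 1, len(row)):
                k, ak = row[b]
                # Sample a pair with probability inversely proportional
                # to the column magnitudes, so heavy columns are damped.
                p = min(1.0, sg / norms[j]) * min(1.0, sg / norms[k])
                if rng.random() < p:
                    # Unbiased estimate of the cosine contribution.
                    sims[(min(j, k), max(j, k))] += (aj * ak) / (
                        min(sg, norms[j]) * min(sg, norms[k]))
    return dict(sims)
```

With threshold=0.0 every pair is kept and the divisor reduces to the product
of column norms, so the result is the exact cosine similarity.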

Your question wasn't entirely clear - does this answer it?

Best,
Reza


On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Hi Reza,
>
> Have you compared it with the brute-force algorithm for similarity
> computation, using something like the following in Spark?
>
> https://github.com/echen/scaldingale
>
> I am adding cosine similarity computation, but I do want to compute
> all-pairs similarities...
>
> Note that my data is sparse (the data that goes to matrix
> factorization), so I don't think the join and group-by on (product, product)
> will be a big issue for me...
>
> Does it make sense to add all-pairs similarities as well, via dimsum-based
> similarity?
>
> Thanks.
> Deb
>
>
> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Xiaoli,
>>
>> There is a PR currently in progress to allow this, via the sampling
>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>
>> The PR is at https://github.com/apache/spark/pull/336, though it will
>> need refactoring given the recent changes to the matrix interface in MLlib.
>> You may want to implement the sampling scheme in your own app, since it
>> isn't much code.
>>
>> Best,
>> Reza
>>
>>
>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>> wrote:
>>
>>> Hi Andrew,
>>>
>>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and
>>> each node has 8 GB of memory. The program just stopped at one stage for
>>> several hours without any further information. Maybe I need to find a
>>> more efficient way.
>>>
>>>
>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>> wrote:
>>>
>>>> The naive way would be to put all the users and their attributes into
>>>> an RDD, then cartesian product that with itself.  Run the similarity score
>>>> on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and
>>>> take the .top(k) for each user.
>>>>
>>>> I doubt that you'll be able to take this approach with the 1T pairs
>>>> though, so it might be worth looking at the literature for recommender
>>>> systems to see what else is out there.
>>>>
>>>>
>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am implementing an algorithm using Spark. I have one million users,
>>>>> and I need to compute the similarity between each pair of users using
>>>>> the users' attributes. For each user, I need to get the top k most
>>>>> similar users. What is the best way to implement this?
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>
>>
>
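For reference, the naive approach Andrew outlines above (score every pair,
then keep the top k per user) can be sketched in plain Python as a stand-in
for the RDD version; the function names here are illustrative, and at 1M
users the pair loop is exactly the 1T-pair blowup he warns about:

```python
import heapq
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two dense attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_similar(users, k):
    """users: dict user_id -> attribute vector. Scores every pair (the
    'cartesian product' step) and keeps a size-k heap per user, mirroring
    the map to (user, (score, otherUser)) followed by .top(k)."""
    heaps = {uid: [] for uid in users}
    for (u, uv), (w, wv) in combinations(users.items(), 2):
        s = cosine(uv, wv)
        for a, b in ((u, w), (w, u)):
            heapq.heappush(heaps[a], (s, b))
            if len(heaps[a]) > k:
                heapq.heappop(heaps[a])  # drop the current worst match
    return {uid: sorted(h, reverse=True) for uid, h in heaps.items()}
```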
