You might want to wait until Wednesday, since the interface in that PR will
be changing before then, probably over the weekend, so that you don't have
to redo your code. Your call if you need it sooner than that.
Reza


On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Oh cool... is all-pairs brute force also part of this PR? Let me pull it
> in and test it on our dataset...
>
> Thanks.
> Deb
>
>
> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Deb,
>>
>> We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
>> https://github.com/apache/spark/pull/1778
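>>
>> Once it's in, usage should look roughly like the sketch below (a sketch
>> only: the interface is still being refactored, so the method names and
>> signatures may shift before merge):
>>
>>   import org.apache.spark.mllib.linalg.Vector
>>   import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}
>>   import org.apache.spark.rdd.RDD
>>
>>   def similarities(rows: RDD[Vector]): (CoordinateMatrix, CoordinateMatrix) = {
>>     val mat = new RowMatrix(rows)
>>     val exact  = mat.columnSimilarities()     // brute-force all pairs
>>     val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling, threshold 0.1
>>     (exact, approx)
>>   }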
>>
>> Your question wasn't entirely clear - does this answer it?
>>
>> Best,
>> Reza
>>
>>
>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Hi Reza,
>>>
>>> Have you compared against a brute-force similarity computation in Spark,
>>> something like the following?
>>>
>>> https://github.com/echen/scaldingale
>>>
>>> I am adding cosine similarity computation, but I want to compute
>>> all-pairs similarities...
>>>
>>> Note that my data (the data that goes into matrix factorization) is
>>> sparse, so I don't think a join and group-by on (product, product) will
>>> be a big issue for me...
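>>>
>>> Roughly, I mean something like this sketch (field names and types are
>>> just placeholders; the input is (user, product, rating) triples):
>>>
>>>   import org.apache.spark.rdd.RDD
>>>
>>>   // cosine similarity between all product pairs via join + reduceByKey
>>>   def allPairs(ratings: RDD[(Int, Int, Double)]): RDD[((Int, Int), Double)] = {
>>>     val norms = ratings.map { case (_, p, r) => (p, r * r) }
>>>       .reduceByKey(_ + _).mapValues(math.sqrt).collectAsMap()
>>>     val byUser = ratings.map { case (u, p, r) => (u, (p, r)) }
>>>     byUser.join(byUser)                      // products co-rated by a user
>>>       .values
>>>       .collect { case ((p1, r1), (p2, r2)) if p1 < p2 => ((p1, p2), r1 * r2) }
>>>       .reduceByKey(_ + _)                    // dot product per product pair
>>>       .map { case ((p1, p2), dot) => ((p1, p2), dot / (norms(p1) * norms(p2))) }
>>>   }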
>>>
>>> Does it make sense to add exact all-pairs similarity alongside the
>>> DIMSUM-based similarity?
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Hi Xiaoli,
>>>>
>>>> There is a PR currently in progress to allow this, via the sampling
>>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>
>>>> The PR is at https://github.com/apache/spark/pull/336 though it will
>>>> need refactoring given the recent changes to the matrix interface in
>>>> MLlib. You could implement the sampling scheme in your own app in the
>>>> meantime, since it's not much code.
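>>>>
>>>> If you do roll your own, the mapper-side sampling looks roughly like the
>>>> sketch below (assuming sparse rows and precomputed column norms; see the
>>>> paper for how to pick gamma):
>>>>
>>>>   import scala.util.Random
>>>>   import org.apache.spark.rdd.RDD
>>>>
>>>>   // rows: sparse rows as (colIndex, value) pairs; colNorms: L2 norm per column
>>>>   def dimsum(rows: RDD[Seq[(Int, Double)]], colNorms: Map[Int, Double],
>>>>              gamma: Double): RDD[((Int, Int), Double)] = {
>>>>     rows.flatMap { row =>
>>>>       val rnd = new Random()
>>>>       for {
>>>>         (j, aij) <- row
>>>>         (k, aik) <- row if j < k
>>>>         // keep a pair with probability min(1, gamma / (||cj|| ||ck||))
>>>>         if rnd.nextDouble() < math.min(1.0, gamma / (colNorms(j) * colNorms(k)))
>>>>       } yield ((j, k), aij * aik)
>>>>     }.reduceByKey(_ + _)
>>>>      // rescale to approximate the cosine similarity of columns j and k
>>>>      .map { case ((j, k), s) =>
>>>>        ((j, k), s / math.min(gamma, colNorms(j) * colNorms(k))) }
>>>>   }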
>>>>
>>>> Best,
>>>> Reza
>>>>
>>>>
>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> Thanks for your suggestion. I tried that method on 8 nodes, each with
>>>>> 8 GB of memory. The program just stalled at one stage for several hours
>>>>> without any further information. Maybe I need to find a more efficient
>>>>> way.
>>>>>
>>>>>
>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>> wrote:
>>>>>
>>>>>> The naive way would be to put all the users and their attributes into
>>>>>> an RDD, then take the cartesian product of that RDD with itself. Run
>>>>>> the similarity score on every pair (1M * 1M => 1T scores), map to
>>>>>> (user, (score, otherUser)), and take the .top(k) for each user.
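>>>>>>
>>>>>> Roughly (a sketch; the RDD shape and the sim function are placeholders):
>>>>>>
>>>>>>   import org.apache.spark.rdd.RDD
>>>>>>
>>>>>>   def topK(users: RDD[(Long, Array[Double])], k: Int,
>>>>>>            sim: (Array[Double], Array[Double]) => Double)
>>>>>>     : RDD[(Long, Seq[(Double, Long)])] =
>>>>>>     users.cartesian(users)
>>>>>>       .filter { case ((u, _), (v, _)) => u != v }
>>>>>>       .map { case ((u, a), (v, b)) => (u, (sim(a, b), v)) }
>>>>>>       .groupByKey()                               // all scores per user
>>>>>>       .mapValues(_.toSeq.sortBy(-_._1).take(k))   // keep the top k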
>>>>>>
>>>>>> I doubt that you'll be able to take this approach with the 1T pairs
>>>>>> though, so it might be worth looking at the literature for recommender
>>>>>> systems to see what else is out there.
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am implementing an algorithm using Spark. I have one million users.
>>>>>>> I need to compute the similarity between each pair of users based on
>>>>>>> some user attributes. For each user, I need to get the top k most
>>>>>>> similar users. What is the best way to implement this?
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
