Hi Reza,

Thank you for the information. I will try it.



On Fri, Apr 11, 2014 at 11:21 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Xiaoli,
>
> There is a PR currently in progress to allow this, via the sampling scheme
> described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>
> The PR is at https://github.com/apache/spark/pull/336, though it will need
> refactoring given the recent changes to the matrix interface in MLlib. You
> may want to implement the sampling scheme yourself in your own app, since it
> isn't much code.
>
> Best,
> Reza
>
>
> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> Thanks for your suggestion. I have tried the method on 8 nodes, each with
>> 8 GB of memory. The job just stalled at one stage for several hours without
>> any further output. Maybe I need to find a more efficient approach.
>>
>>
>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> The naive way would be to put all the users and their attributes into an
>>> RDD, then take the cartesian product of that RDD with itself.  Compute the
>>> similarity score for every pair (1M * 1M => 1T scores), map to
>>> (user, (score, otherUser)), and take the .top(k) for each user.
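>>>
>>> A rough, untested sketch of that in Scala (assuming users is an
>>> RDD[(Long, Vector)] of (userId, attributes) and sim(a, b) is whatever
>>> similarity function you use, e.g. cosine):
>>>
>>>   // All ordered pairs of distinct users: ~1T elements at 1M users.
>>>   val pairs = users.cartesian(users)
>>>     .filter { case ((id1, _), (id2, _)) => id1 != id2 }
>>>     .map { case ((id1, v1), (id2, v2)) => (id1, (sim(v1, v2), id2)) }
>>>
>>>   // Collect each user's scores and keep the k most similar users.
>>>   val topK = pairs.groupByKey()
>>>     .mapValues(_.toSeq.sortBy(-_._1).take(k))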
>>>
>>> I doubt that you'll be able to take this approach with the 1T pairs
>>> though, so it might be worth looking at the literature on recommender
>>> systems to see what else is out there.
>>>
>>>
>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am implementing an algorithm using Spark. I have one million users and
>>>> need to compute the similarity between each pair of users based on their
>>>> attributes.  For each user, I need to get the top k most similar users.
>>>> What is the best way to implement this?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>
