Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme
described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need
refactoring given the recent changes to the matrix interface in MLlib. You
may also implement the sampling scheme directly in your own app, since it's
not much code.
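
For reference, the heart of the scheme is an importance-sampled estimate of
the pairwise dot products: each row's contribution a_i * a_j is kept with
some probability p and rescaled by 1/p, so the reduced sum stays unbiased.
Here is a rough, untested sketch of that idea in Spark (users as the matrix
columns, with column norms assumed precomputed; `gamma` trades accuracy for
the number of pairs kept). It captures the flavor of the mapper, not the
exact scheme from the paper:

import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair-RDD implicits

// rows:  RDD of sparse rows, each a Seq of (userId, value) entries,
//        with users as the columns of the matrix
// norms: L2 norm of each user's column (broadcast this in practice)
// gamma: oversampling parameter; larger gamma keeps more pairs
def sampledDotProducts(
    rows: RDD[Seq[(Int, Double)]],
    norms: Map[Int, Double],
    gamma: Double): RDD[((Int, Int), Double)] = {
  val sg = math.sqrt(gamma)
  rows.flatMap { row =>
    val rng = new Random()
    for {
      (i, ai) <- row
      (j, aj) <- row
      if i < j
      // keep the pair with probability p, then rescale by 1/p so the
      // summed contributions stay an unbiased estimate of c_i . c_j
      p = math.min(1.0, sg / (norms(i) * norms(j)))
      if rng.nextDouble() < p
    } yield ((i, j), ai * aj / p)
  }.reduceByKey(_ + _)  // estimated dot product per user pair
}

Dividing each estimate by the two column norms then gives approximate
cosine similarities.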

Best,
Reza


On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

> Hi Andrew,
>
> Thanks for your suggestion. I have tried that method, using 8 nodes with
> 8GB of memory each. The program just stalled at one stage for several
> hours without producing any further output. Maybe I need to find a more
> efficient approach.
>
>
> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> The naive way would be to put all the users and their attributes into an
>> RDD, then take the cartesian product of that RDD with itself.  Run the
>> similarity score on every pair (1M * 1M => 1T scores), map to (user,
>> (score, otherUser)), and take the .top(k) for each user (see the rough
>> sketch below).
>>
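>> Roughly, as an untested sketch (`users` is an RDD of (userId, attributes)
>> pairs, `similarity` is whatever scoring function you use, and `k` is how
>> many neighbours to keep):
>>
>>   // users: RDD[(Long, Array[Double])] of (userId, attribute vector)
>>   val pairs = users.cartesian(users)
>>     .filter { case ((u, _), (v, _)) => u != v }  // drop self-pairs
>>
>>   val topK = pairs
>>     .map { case ((u, ua), (v, va)) => (u, (similarity(ua, va), v)) }
>>     .groupByKey()  // note: all scores for a user land on one node
>>     .mapValues(_.toSeq.sortBy(-_._1).take(k))  // keep the k best
>>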
>> I doubt that you'll be able to take this approach with the 1T pairs
>> though, so it might be worth looking at the literature for recommender
>> systems to see what else is out there.
>>
>>
>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am implementing an algorithm using Spark. I have one million users. I
>>> need to compute the similarity between each pair of users using some of
>>> the users' attributes.  For each user, I need to get the top k most
>>> similar users. What is the best way to implement this?
>>>
>>>
>>> Thanks.
>>>
>>
>>
>
