For 60M x 10K, both brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work (depending on the cluster's
power), but dimsum thresholding should work with an appropriate threshold.
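
Once that PR lands, both options should be exposed on RowMatrix. A minimal
sketch against that interface (the RDD name rows and the threshold value are
illustrative, and the API may still change before the PR is merged):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // rows: RDD[Vector] of sparse rows (~60M rows, 10K or 10M columns) is assumed.
    val mat = new RowMatrix(rows)

    // Brute force: exact cosine similarity between every pair of columns.
    val exact = mat.columnSimilarities()

    // Dimsum thresholding: sampling-based, keeps only pairs likely to be above
    // the threshold, which cuts shuffle size dramatically for wide matrices.
    val approx = mat.columnSimilarities(threshold = 0.1)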

Dimensionality reduction should help; how effective it is will depend on your
application and domain. It's worth trying if the direct computation doesn't
work.
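
For example, a truncated SVD on the RowMatrix gives a low-dimensional
representation of every row. A rough sketch (k = 50 is arbitrary, and rows is
assumed to be your RDD of sparse vectors):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)                  // rows: RDD[Vector], sparse
    val svd = mat.computeSVD(50, computeU = true)  // project down to 60M x 50

    // svd.U holds the reduced row vectors (scale by svd.s for the projected
    // coordinates); similarity or clustering then runs on a much smaller matrix.
    val reducedRows = svd.U.rows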

You can also try running KMeans clustering (perhaps after dimensionality
reduction) if your goal is to find batches of similar points instead of all
pairs above a threshold.
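
A minimal sketch with MLlib's KMeans, assuming reducedRows is the RDD[Vector]
of (reduced) points and that ~1000 clusters is a reasonable granularity for
your data:

    import org.apache.spark.mllib.clustering.KMeans

    val model = KMeans.train(reducedRows, k = 1000, maxIterations = 20)

    // Each point gets a cluster id; candidate similar pairs can then be
    // searched within a cluster instead of across all 60M points.
    val clusterIds = model.predict(reducedRows)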

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Also for tall and wide (rows ~60M, columns 10M), I am considering running
> a matrix factorization to reduce the dimension to say ~60M x 50 and then
> run all pair similarity...
>
> Did you also try similar ideas and saw positive results ?
>
>
>
> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Ok...just to make sure: I have a RowMatrix[SparseVector] where rows are ~
>> 60M and columns are 10M, say, with a billion data points...
>>
>> I have another version that's around 60M and ~ 10K...
>>
>> I guess for the second one both all pair and dimsum will run fine...
>>
>> But for tall and wide, what do you suggest ? can dimsum handle it ?
>>
>> I might need jaccard as well...can I plug that in the PR ?
>>
>>
>>
>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> You might want to wait until Wednesday since the interface will be
>>> changing in that PR before Wednesday, probably over the weekend, so that
>>> you don't have to redo your code. Your call if you need it sooner than that.
>>> Reza
>>>
>>>
>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Ohh cool....all-pairs brute force is also part of this PR ? Let me pull
>>>> it in and test on our dataset...
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> Hi Deb,
>>>>>
>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this
>>>>> PR: https://github.com/apache/spark/pull/1778
>>>>>
>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>
>>>>> Best,
>>>>> Reza
>>>>>
>>>>>
>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Reza,
>>>>>>
>>>>>> Have you compared with the brute force algorithm for similarity
>>>>>> computation with something like the following in Spark ?
>>>>>>
>>>>>> https://github.com/echen/scaldingale
>>>>>>
>>>>>> I am adding cosine similarity computation, but I do want to compute
>>>>>> all-pair similarities...
>>>>>>
>>>>>> Note that the data is sparse for me (the data that goes to matrix
>>>>>> factorization) so I don't think joining and group-by on (product,product)
>>>>>> will be a big issue for me...
>>>>>>
>>>>>> Does it make sense to add all-pair similarities as well, alongside the
>>>>>> dimsum-based similarity?
>>>>>>
>>>>>> Thanks.
>>>>>> Deb
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Xiaoli,
>>>>>>>
>>>>>>> There is a PR currently in progress to allow this, via the sampling
>>>>>>> scheme described in this paper:
>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>
>>>>>>> The PR is at https://github.com/apache/spark/pull/336 though it
>>>>>>> will need refactoring given the recent changes to the matrix interface
>>>>>>> in MLlib. You may implement the sampling scheme for your own app since
>>>>>>> it's not much code.
>>>>>>>
>>>>>>> Best,
>>>>>>> Reza
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes,
>>>>>>>> and each node has 8G of memory. The program just stopped at a stage
>>>>>>>> for several hours without any further information. Maybe I need to
>>>>>>>> find a more efficient way.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The naive way would be to put all the users and their attributes
>>>>>>>>> into an RDD, then cartesian product that with itself.  Run the 
>>>>>>>>> similarity
>>>>>>>>> score on every pair (1M * 1M => 1T scores), map to (user, (score,
>>>>>>>>> otherUser)) and take the .top(k) for each user.
>>>>>>>>>
>>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>>> pairs though, so it might be worth looking at the literature for
>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>>> users. I need to compute the similarity between each pair of users
>>>>>>>>>> using some user attributes. For each user, I need to get the top k
>>>>>>>>>> most similar users. What is the best way to implement this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
