Hi Deb,

Did you mean to message me instead of Xiangrui?

For tall-skinny (TS) matrices, dimsum with gamma = PositiveInfinity and
computeGramian have the same cost, so you can use either one. For dense
matrices with, say, 1M columns, this won't be computationally feasible,
and you'll want to start sampling with dimsum.
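
For reference, a rough sketch of the two options, assuming mat is your
RowMatrix (similarColumns is the method name in the in-flight PR and may
change before it merges; computeGramianMatrix is the existing Gramian
method):

    // assuming: val mat: RowMatrix = ...
    // Exact cosine similarities between columns: gamma = PositiveInfinity
    // disables the sampling entirely, so nothing is approximated.
    val exact = mat.similarColumns(Double.PositiveInfinity)

    // Same cost for tall-skinny matrices: the full Gramian A^T A, from
    // which cosine similarities follow after normalizing by column norms.
    val gram = mat.computeGramianMatrix()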

It would be helpful to have a loadRowMatrix function; I would use it.
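
Something like the following is what I'd picture (purely a sketch, since
loadRowMatrix doesn't exist yet; the one-dense-row-per-line format is just
an assumption, mirroring how labeled points are loaded):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical loader: one row per line, space-separated values.
    def loadRowMatrix(sc: SparkContext, path: String): RowMatrix = {
      val rows = sc.textFile(path).map { line =>
        Vectors.dense(line.trim.split(' ').map(_.toDouble))
      }
      new RowMatrix(rows)
    }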

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Hi Xiangrui,
>
> For tall skinny matrices, if I can pass a similarityMeasure to
> computeGrammian, I could re-use the SVD's computeGrammian for similarity
> computation as well...
>
> Do you recommend using this approach for tall skinny matrices, or just
> using dimsum's routines?
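>
> To be concrete, the kind of API I have in mind is something like this
> (just a sketch; the SimilarityMeasure hook is hypothetical):
>
>     // Hypothetical pluggable measure: turn the accumulated dot product
>     // and the two column norms into a similarity score.
>     trait SimilarityMeasure {
>       def compute(dot: Double, normI: Double, normJ: Double): Double
>     }
>
>     object CosineSimilarity extends SimilarityMeasure {
>       def compute(dot: Double, normI: Double, normJ: Double): Double =
>         dot / (normI * normJ)
>     }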
>
> Right now RowMatrix does not have a loadRowMatrix function like the one
> available for LabeledPoint... should I add one? I want to export the
> matrix out from my stable code and then test dimsum...
>
> Thanks.
> Deb
>
>
>
> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> I will add dice, overlap, and jaccard similarity in a future PR, probably
>> still for 1.2
>>
>>
>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Awesome...Let me try it out...
>>>
>>> Any plans for adding other similarity measures in the future (Jaccard
>>> is one that would be useful)? I guess it makes sense to add some
>>> similarity measures to MLlib...
>>>
>>>
>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Yes, you're right: calling dimsum with gamma as PositiveInfinity turns
>>>> it into the usual brute-force algorithm for cosine similarity; there is
>>>> no sampling. This is by design.
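>>>>
>>>> You can see why from the sampling step. In the paper, each candidate
>>>> pair of columns (i, j) is kept with a probability like the following
>>>> (the implementation factors it per column, but the effect is the same):
>>>>
>>>>     // keep-probability for the pair (i, j); with gamma set to
>>>>     // Double.PositiveInfinity this is always 1, so every pair is
>>>>     // emitted and the result is exact.
>>>>     val p = math.min(1.0, gamma / (normI * normJ))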
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com>
>>>> wrote:
>>>>
>>>>> I looked at the code: similarColumns(Double.PositiveInfinity) generates
>>>>> the brute force...
>>>>>
>>>>> Basically, will dimsum with gamma as PositiveInfinity produce the exact
>>>>> same result as doing cartesian products of RDD[(product, vector)] and
>>>>> computing similarities, or will there be some approximation?
>>>>>
>>>>> Sorry I have not read your paper yet. Will read it over the weekend.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> For 60M x 10K, brute force and dimsum thresholding should both be fine.
>>>>>>
>>>>>> For 60M x 10M, brute force probably won't work, depending on the
>>>>>> cluster's power, but dimsum thresholding should work with an
>>>>>> appropriate threshold.
>>>>>>
>>>>>> Dimensionality reduction should help; how effective it is will depend
>>>>>> on your application and domain, so it's worth trying if the direct
>>>>>> computation doesn't work.
>>>>>>
>>>>>> You can also try running KMeans clustering (perhaps after
>>>>>> dimensionality reduction) if your goal is to find batches of similar 
>>>>>> points
>>>>>> instead of all pairs above a threshold.
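>>>>>>
>>>>>> A minimal sketch of that route, assuming features: RDD[Vector] holds
>>>>>> your (possibly reduced) points and k = 1000 is an arbitrary choice:
>>>>>>
>>>>>>     import org.apache.spark.mllib.clustering.KMeans
>>>>>>
>>>>>>     // Points landing in the same cluster form a batch of similar
>>>>>>     // points, avoiding the all-pairs computation entirely.
>>>>>>     val model = KMeans.train(features, k = 1000, maxIterations = 20)
>>>>>>     val assignments = features.map(v => (model.predict(v), v))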
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <
>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering
>>>>>>> running a matrix factorization to reduce the dimension to, say,
>>>>>>> ~60M x 50, and then running all-pairs similarity...
>>>>>>>
>>>>>>> Did you also try similar ideas and see positive results?
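>>>>>>>
>>>>>>> Roughly what I have in mind (a sketch, assuming mat is the 60M x 10M
>>>>>>> RowMatrix; the rank of 50 is arbitrary):
>>>>>>>
>>>>>>>     // Reduce dimensionality via SVD, then run all-pairs on the much
>>>>>>>     // smaller representation. U is 60M x 50, rows in the latent
>>>>>>>     // space; optionally scale its columns by the singular values
>>>>>>>     // svd.s before computing similarities.
>>>>>>>     val svd = mat.computeSVD(50, computeU = true)
>>>>>>>     val reduced: RowMatrix = svd.U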
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <
>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> OK... just to make sure: I have a RowMatrix[SparseVector] where rows
>>>>>>>> are ~60M and columns are ~10M, with around a billion data points...
>>>>>>>>
>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>
>>>>>>>> I guess for the second one both all-pairs and dimsum will run fine...
>>>>>>>>
>>>>>>>> But for tall and wide, what do you suggest? Can dimsum handle it?
>>>>>>>>
>>>>>>>> I might need Jaccard as well... can I plug that into the PR?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> You might want to wait until Wednesday, since the interface in that
>>>>>>>>> PR will be changing before then (probably over the weekend), so that
>>>>>>>>> you don't have to redo your code. Your call if you need it sooner.
>>>>>>>>> Reza
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <
>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR? Let me
>>>>>>>>>> pull it in and test it on our dataset...
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> Deb
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>
>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in
>>>>>>>>>>> this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>>
>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Reza
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <
>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>
>>>>>>>>>>>> Have you compared against a brute-force algorithm for similarity
>>>>>>>>>>>> computation, with something like the following in Spark?
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>
>>>>>>>>>>>> I am adding cosine similarity computation, but I do want to
>>>>>>>>>>>> compute all-pairs similarities...
>>>>>>>>>>>>
>>>>>>>>>>>> Note that my data is sparse (it's the data that goes into matrix
>>>>>>>>>>>> factorization), so I don't think joining and grouping by
>>>>>>>>>>>> (product, product) will be a big issue for me...
>>>>>>>>>>>>
>>>>>>>>>>>> Does it make sense to add all-pairs similarities as well,
>>>>>>>>>>>> alongside the dimsum-based similarity?
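>>>>>>>>>>>>
>>>>>>>>>>>> The brute force I have in mind is along these lines (a sketch in
>>>>>>>>>>>> the spirit of the scaldingale code, assuming ratings:
>>>>>>>>>>>> RDD[(Int, Int, Double)] of (user, product, rating) triples):
>>>>>>>>>>>>
>>>>>>>>>>>>     import org.apache.spark.SparkContext._
>>>>>>>>>>>>
>>>>>>>>>>>>     // Self-join on user: products co-rated by the same user
>>>>>>>>>>>>     // contribute to the pairwise dot product.
>>>>>>>>>>>>     val byUser = ratings.map { case (u, p, r) => (u, (p, r)) }
>>>>>>>>>>>>     val dots = byUser.join(byUser)
>>>>>>>>>>>>       .filter { case (_, ((p1, _), (p2, _))) => p1 < p2 }
>>>>>>>>>>>>       .map { case (_, ((p1, r1), (p2, r2))) =>
>>>>>>>>>>>>         ((p1, p2), r1 * r2)
>>>>>>>>>>>>       }
>>>>>>>>>>>>       .reduceByKey(_ + _) // divide by norms to get cosine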
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> Deb
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <
>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the
>>>>>>>>>>>>> sampling scheme described in this paper:
>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>
>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it
>>>>>>>>>>>>> will need refactoring given the recent changes to the matrix
>>>>>>>>>>>>> interface in MLlib. You may implement the sampling scheme in your
>>>>>>>>>>>>> own app, since it's not much code.
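>>>>>>>>>>>>>
>>>>>>>>>>>>> The core of the scheme is small. Roughly, per row (simplified
>>>>>>>>>>>>> from the paper; sg = sqrt(gamma), colMags(i) is the L2 norm of
>>>>>>>>>>>>> column i, nonZeros the row's (index, value) pairs, rnd a random
>>>>>>>>>>>>> generator, and emit however you collect contributions):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     // Keep a pair only if both per-column coin flips succeed;
>>>>>>>>>>>>>     // the scaling makes the summed contributions an estimate
>>>>>>>>>>>>>     // of the cosine similarity of columns i and j.
>>>>>>>>>>>>>     for ((i, vi) <- nonZeros; (j, vj) <- nonZeros if i < j) {
>>>>>>>>>>>>>       if (rnd.nextDouble() < math.min(sg / colMags(i), 1.0) &&
>>>>>>>>>>>>>           rnd.nextDouble() < math.min(sg / colMags(j), 1.0)) {
>>>>>>>>>>>>>         emit((i, j), vi * vj /
>>>>>>>>>>>>>           (math.min(sg, colMags(i)) * math.min(sg, colMags(j))))
>>>>>>>>>>>>>       }
>>>>>>>>>>>>>     }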
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <
>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your suggestion. I tried that method, using 8 nodes,
>>>>>>>>>>>>>> each with 8G of memory. The program just stalled at one stage
>>>>>>>>>>>>>> for several hours without any further output. Maybe I need to
>>>>>>>>>>>>>> find a more efficient approach.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <
>>>>>>>>>>>>>> and...@andrewash.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with 
>>>>>>>>>>>>>>> itself.  Run the
>>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map to 
>>>>>>>>>>>>>>> (user,
>>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user.
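>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In code, something like this (a sketch only; users is assumed
>>>>>>>>>>>>>>> to be RDD[(userId, attrs)], sim is whatever similarity function
>>>>>>>>>>>>>>> you pick, and the per-user top-k is a simple groupBy):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     import org.apache.spark.SparkContext._
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // Naive all-pairs: 1M x 1M = 1T candidate pairs, so this
>>>>>>>>>>>>>>>     // is illustrative, not something to run at full scale.
>>>>>>>>>>>>>>>     val scored = users.cartesian(users)
>>>>>>>>>>>>>>>       .filter { case ((u, _), (v, _)) => u != v }
>>>>>>>>>>>>>>>       .map { case ((u, ua), (v, va)) => (u, (sim(ua, va), v)) }
>>>>>>>>>>>>>>>     val topK = scored.groupByKey()
>>>>>>>>>>>>>>>       .mapValues(_.toSeq.sortBy(-_._1).take(k))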
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the
>>>>>>>>>>>>>>> 1T pairs though, so it might be worth looking at the literature 
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>>>>>>>>> users and need to compute the similarity between each pair of
>>>>>>>>>>>>>>>> users using the users' attributes. For each user, I need to
>>>>>>>>>>>>>>>> get the top k most similar users. What is the best way to
>>>>>>>>>>>>>>>> implement this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
