Hi Deb,

I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs.
In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product.

Best,
Reza

On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Reza,
>
> In similarColumns, it seems with cosine similarity I also need other numbers such as intersection, jaccard and other measures...
>
> Right now I modified the code to generate jaccard but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value...
>
> Are you considering this in your design ?
>
> Thanks.
> Deb
>
> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Better to do it in a PR of your own, it's not sufficiently related to dimsum.
>>
>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Cool... can I add loadRowMatrix in your PR ?
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> Did you mean to message me instead of Xiangrui?
>>>>
>>>> For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1m columns this won't be computationally feasible and you'll want to start sampling with dimsum.
>>>>
>>>> It would be helpful to have a loadRowMatrix function, I would use it.
>>>>
>>>> Best,
>>>> Reza
>>>>
>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Xiangrui,
>>>>>
>>>>> For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for similarity computation as well...
>>>>>
>>>>> Do you recommend using this approach for tall skinny matrices or just use the dimsum's routines ?
>>>>> Right now RowMatrix does not have a loadRowMatrix function like the one available in LabeledPoint... should I add one ? I want to export the matrix out from my stable code and then test dimsum...
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>
>>>>>> I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2.
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Awesome... Let me try it out...
>>>>>>>
>>>>>>> Any plans of putting other similarity measures in future (jaccard is something that will be useful) ? I guess it makes sense to add some similarity measures in mllib...
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Yes you're right, calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>>>>>>>
>>>>>>>>> Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation ?
>>>>>>>>>
>>>>>>>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
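For readers following the exchange: the computation that dimsum with gamma as PositiveInfinity reduces to is ordinary brute-force cosine similarity between all column pairs, with no sampling. A minimal plain-Scala sketch of that exact case (no Spark; the column layout and helper names are illustrative, not the PR's API):

```scala
// Columns of a tall-and-skinny matrix, each stored as a dense array of row values.
// Cosine similarity between columns i and j = dot(ci, cj) / (||ci|| * ||cj||).

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

// Brute force: score every column pair (i < j); this is the gamma = infinity case.
def allPairsCosine(cols: Array[Array[Double]]): Map[(Int, Int), Double] =
  (for {
    i <- cols.indices
    j <- (i + 1) until cols.length
  } yield (i, j) -> dot(cols(i), cols(j)) / (norm(cols(i)) * norm(cols(j)))).toMap

// Example: a 4-row, 3-column matrix.
val cols = Array(
  Array(1.0, 0.0, 1.0, 0.0),
  Array(1.0, 0.0, 1.0, 0.0), // identical to column 0 => cosine 1.0
  Array(0.0, 1.0, 0.0, 1.0)  // disjoint support with column 0 => cosine 0.0
)
val sims = allPairsCosine(cols)
```

On a real RowMatrix the same scores come from the distributed routine; this sketch is only the single-machine reference computation.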
>>>>>>>>>> For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.
>>>>>>>>>>
>>>>>>>>>> Dimensionality reduction should help; how effective it is will depend on your application and domain. It's worth trying if the direct computation doesn't work.
>>>>>>>>>>
>>>>>>>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50 and then run all-pair similarity...
>>>>>>>>>>>
>>>>>>>>>>> Did you also try similar ideas and see positive results ?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok... just to make sure: I have RowMatrix[SparseVector] where rows are ~60M and columns are 10M, say with a billion data points...
>>>>>>>>>>>>
>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>
>>>>>>>>>>>> I guess for the second one both all-pair and dimsum will run fine...
>>>>>>>>>>>>
>>>>>>>>>>>> But for tall and wide, what do you suggest ? Can dimsum handle it ?
>>>>>>>>>>>>
>>>>>>>>>>>> I might need jaccard as well... can I plug that in the PR ?
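The jaccard question above connects to Reza's un-normalization suggestion at the top of the thread: for 0/1 (binary) columns, the dot product is the intersection size and the squared norm is the number of non-zeros, so jaccard (and dice) fall out of the cosine output once the column norms are known. A plain-Scala sketch under that binary-data assumption (helper names are illustrative):

```scala
// For binary columns A and B:
//   |A ∩ B| = dot(A, B),  |A| = ||A||^2
//   cosine  = |A ∩ B| / sqrt(|A| * |B|)
//   jaccard = |A ∩ B| / (|A| + |B| - |A ∩ B|)
//   dice    = 2 |A ∩ B| / (|A| + |B|)

// Recover the dot product (intersection size) from a cosine score and the two norms.
def unnormalize(cosine: Double, normI: Double, normJ: Double): Double =
  cosine * normI * normJ

def jaccard(inter: Double, sizeI: Double, sizeJ: Double): Double =
  inter / (sizeI + sizeJ - inter)

def dice(inter: Double, sizeI: Double, sizeJ: Double): Double =
  2.0 * inter / (sizeI + sizeJ)

// Example: two binary columns with 3 and 4 non-zeros, intersecting in 2 rows.
val (sizeI, sizeJ) = (3.0, 4.0)
val cosine = 2.0 / math.sqrt(sizeI * sizeJ)                          // what dimsum reports
val inter  = unnormalize(cosine, math.sqrt(sizeI), math.sqrt(sizeJ)) // back to 2.0
val jac    = jaccard(inter, sizeI, sizeJ)                            // 2 / (3 + 4 - 2) = 0.4
```

This is why the cosine-only output is not a hard limitation for binary data: the norms are cheap to compute in one pass, and the other measures are arithmetic on (dot product, norms).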
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it sooner.
>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR ? Let me pull it in and test it on our dataset...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am adding cosine similarity computation but I do want to compute all-pair similarities...
>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does it make sense to add all-pair similarities as well, alongside the dimsum-based similarity ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336 though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app since it's not much code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.
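For context on the sampling scheme in the paper linked above: the core idea is to keep a contribution with probability proportional to gamma over the product of the column norms, clamped at 1, and divide by that probability so the estimate stays unbiased. A toy single-machine, per-pair variant of the idea (the real DIMSUM samples per matrix entry and runs distributed; this compressed sketch is only to show the unbiasing and the clamp):

```scala
import scala.util.Random

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

// Keep pair (i, j) with probability p = min(1, gamma / (||ci|| * ||cj||)),
// then divide by p so the expected value equals the true cosine.
// With gamma = PositiveInfinity every p clamps to 1: exact, no sampling.
def sampledCosine(cols: Array[Array[Double]], gamma: Double, rng: Random): Map[(Int, Int), Double] =
  (for {
    i <- cols.indices
    j <- (i + 1) until cols.length
    p = math.min(1.0, gamma / (norm(cols(i)) * norm(cols(j))))
    if rng.nextDouble() < p
  } yield (i, j) -> dot(cols(i), cols(j)) / (norm(cols(i)) * norm(cols(j)) * p)).toMap

val cols = Array(
  Array(1.0, 1.0, 0.0),
  Array(1.0, 0.0, 1.0),
  Array(1.0, 1.0, 1.0)
)
// gamma = +infinity: every pair is kept with p = 1, i.e. exact brute force.
val exact = sampledCosine(cols, Double.PositiveInfinity, new Random(42))
```

Lowering gamma trades accuracy for work: pairs of low-norm columns get dropped first, which is exactly what makes the thresholded all-pairs variant tractable at 10M columns.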
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs though, so it might be worth looking at the literature for recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks.
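The naive pipeline Andrew describes (all pairs, score each, top-k per user) can be sketched on plain Scala collections; on real data this would be the cartesian product of an RDD with itself, and the cosine scorer here is just one possible similarity function:

```scala
// Cosine similarity between two users' attribute vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na  = math.sqrt(a.map(x => x * x).sum)
  val nb  = math.sqrt(b.map(x => x * x).sum)
  dot / (na * nb)
}

// All pairs -> (user, (score, otherUser)) -> top-k per user by score.
def topKSimilar(users: Map[String, Array[Double]], k: Int): Map[String, List[(Double, String)]] =
  (for {
    (u, uAttrs) <- users.toList
    (v, vAttrs) <- users.toList
    if u != v
  } yield (u, (cosine(uAttrs, vAttrs), v)))
    .groupBy(_._1)
    .map { case (u, pairs) => u -> pairs.map(_._2).sortBy(-_._1).take(k) }

val users = Map(
  "a" -> Array(1.0, 0.0),
  "b" -> Array(1.0, 0.1),
  "c" -> Array(0.0, 1.0)
)
val top1 = topKSimilar(users, 1)
```

The quadratic pair generation is exactly what blows up at 1M users (the 1T scores above), which is what motivates the sampling and thresholding discussed earlier in the thread.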