Hi Deb,

I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs.
In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product.

Best,
Reza

On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Reza,
>
> In similarColumns, it seems with cosine similarity I also need other numbers such as intersection, jaccard and other measures...
>
> Right now I modified the code to generate jaccard but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value...
>
> Are you considering this in your design ?
>
> Thanks.
> Deb
>
> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Better to do it in a PR of your own, it's not sufficiently related to dimsum.
>>
>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Cool... can I add loadRowMatrix in your PR ?
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> Did you mean to message me instead of Xiangrui?
>>>>
>>>> For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1m columns this won't be computationally feasible and you'll want to start sampling with dimsum.
>>>>
>>>> It would be helpful to have a loadRowMatrix function, I would use it.
>>>>
>>>> Best,
>>>> Reza
>>>>
>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Xiangrui,
>>>>>
>>>>> For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for similarity computation as well...
>>>>>
>>>>> Do you recommend using this approach for tall skinny matrices or just use the dimsum's routines ?
>>>>> Right now RowMatrix does not have a loadRowMatrix function like the one available in LabeledPoint... should I add one ? I want to export the matrix out from my stable code and then test dimsum...
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>
>>>>>> I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2.
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Awesome... Let me try it out...
>>>>>>>
>>>>>>> Any plans of putting other similarity measures in future (jaccard is something that will be useful) ? I guess it makes sense to add some similarity measures in mllib...
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Yes you're right, calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>>>>>>>
>>>>>>>>> Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation ?
>>>>>>>>>
>>>>>>>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
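For readers following the exchange: the computation that dimsum with gamma as PositiveInfinity reduces to is ordinary brute-force cosine similarity between all column pairs, with no sampling. A minimal plain-Scala sketch of that exact case (no Spark; the column layout and helper names are illustrative, not the PR's API):

```scala
// Columns of a tall-and-skinny matrix, each stored as a dense array of row values.
// Cosine similarity between columns i and j = dot(ci, cj) / (||ci|| * ||cj||).

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

// Brute force: score every column pair (i < j); this is the gamma = infinity case.
def allPairsCosine(cols: Array[Array[Double]]): Map[(Int, Int), Double] =
  (for {
    i <- cols.indices
    j <- (i + 1) until cols.length
  } yield (i, j) -> dot(cols(i), cols(j)) / (norm(cols(i)) * norm(cols(j)))).toMap

// Example: a 4-row, 3-column matrix.
val cols = Array(
  Array(1.0, 0.0, 1.0, 0.0),
  Array(1.0, 0.0, 1.0, 0.0), // identical to column 0 => cosine 1.0
  Array(0.0, 1.0, 0.0, 1.0)  // disjoint support with column 0 => cosine 0.0
)
val sims = allPairsCosine(cols)
```

On a real RowMatrix the same scores come from the distributed routine; this sketch is only the single-machine reference computation.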
>>>>>>>>>> For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.
>>>>>>>>>>
>>>>>>>>>> Dimensionality reduction should help; how effective it is will depend on your application and domain. It's worth trying if the direct computation doesn't work.
>>>>>>>>>>
>>>>>>>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50 and then run all-pair similarity...
>>>>>>>>>>>
>>>>>>>>>>> Did you also try similar ideas and see positive results ?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok... just to make sure: I have RowMatrix[SparseVector] where rows are ~60M and columns are 10M, say with a billion data points...
>>>>>>>>>>>>
>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>
>>>>>>>>>>>> I guess for the second one both all-pair and dimsum will run fine...
>>>>>>>>>>>>
>>>>>>>>>>>> But for tall and wide, what do you suggest ? Can dimsum handle it ?
>>>>>>>>>>>>
>>>>>>>>>>>> I might need jaccard as well... can I plug that in the PR ?
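The jaccard question above connects to Reza's un-normalization suggestion at the top of the thread: for 0/1 (binary) columns, the dot product is the intersection size and the squared norm is the number of non-zeros, so jaccard (and dice) fall out of the cosine output once the column norms are known. A plain-Scala sketch under that binary-data assumption (helper names are illustrative):

```scala
// For binary columns A and B:
//   |A ∩ B| = dot(A, B),  |A| = ||A||^2
//   cosine  = |A ∩ B| / sqrt(|A| * |B|)
//   jaccard = |A ∩ B| / (|A| + |B| - |A ∩ B|)
//   dice    = 2 |A ∩ B| / (|A| + |B|)

// Recover the dot product (intersection size) from a cosine score and the two norms.
def unnormalize(cosine: Double, normI: Double, normJ: Double): Double =
  cosine * normI * normJ

def jaccard(inter: Double, sizeI: Double, sizeJ: Double): Double =
  inter / (sizeI + sizeJ - inter)

def dice(inter: Double, sizeI: Double, sizeJ: Double): Double =
  2.0 * inter / (sizeI + sizeJ)

// Example: two binary columns with 3 and 4 non-zeros, intersecting in 2 rows.
val (sizeI, sizeJ) = (3.0, 4.0)
val cosine = 2.0 / math.sqrt(sizeI * sizeJ)                          // what dimsum reports
val inter  = unnormalize(cosine, math.sqrt(sizeI), math.sqrt(sizeJ)) // back to 2.0
val jac    = jaccard(inter, sizeI, sizeJ)                            // 2 / (3 + 4 - 2) = 0.4
```

This is why the cosine-only output is not a hard limitation for binary data: the norms are cheap to compute in one pass, and the other measures are arithmetic on (dot product, norms).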
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it sooner.
>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR ? Let me pull it in and test it on our dataset...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am adding cosine similarity computation but I do want to compute all-pair similarities...
>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does it make sense to add all-pair similarities as well, alongside the dimsum-based similarity ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336 though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app since it's not much code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.
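For context on the sampling scheme in the paper linked above: the core idea is to keep a contribution with probability proportional to gamma over the product of the column norms, clamped at 1, and divide by that probability so the estimate stays unbiased. A toy single-machine, per-pair variant of the idea (the real DIMSUM samples per matrix entry and runs distributed; this compressed sketch is only to show the unbiasing and the clamp):

```scala
import scala.util.Random

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

// Keep pair (i, j) with probability p = min(1, gamma / (||ci|| * ||cj||)),
// then divide by p so the expected value equals the true cosine.
// With gamma = PositiveInfinity every p clamps to 1: exact, no sampling.
def sampledCosine(cols: Array[Array[Double]], gamma: Double, rng: Random): Map[(Int, Int), Double] =
  (for {
    i <- cols.indices
    j <- (i + 1) until cols.length
    p = math.min(1.0, gamma / (norm(cols(i)) * norm(cols(j))))
    if rng.nextDouble() < p
  } yield (i, j) -> dot(cols(i), cols(j)) / (norm(cols(i)) * norm(cols(j)) * p)).toMap

val cols = Array(
  Array(1.0, 1.0, 0.0),
  Array(1.0, 0.0, 1.0),
  Array(1.0, 1.0, 1.0)
)
// gamma = +infinity: every pair is kept with p = 1, i.e. exact brute force.
val exact = sampledCosine(cols, Double.PositiveInfinity, new Random(42))
```

Lowering gamma trades accuracy for work: pairs of low-norm columns get dropped first, which is exactly what makes the thresholded all-pairs variant tractable at 10M columns.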
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs though, so it might be worth looking at the literature for recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks.
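The naive pipeline Andrew describes (all pairs, score each, top-k per user) can be sketched on plain Scala collections; on real data this would be the cartesian product of an RDD with itself, and the cosine scorer here is just one possible similarity function:

```scala
// Cosine similarity between two users' attribute vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na  = math.sqrt(a.map(x => x * x).sum)
  val nb  = math.sqrt(b.map(x => x * x).sum)
  dot / (na * nb)
}

// All pairs -> (user, (score, otherUser)) -> top-k per user by score.
def topKSimilar(users: Map[String, Array[Double]], k: Int): Map[String, List[(Double, String)]] =
  (for {
    (u, uAttrs) <- users.toList
    (v, vAttrs) <- users.toList
    if u != v
  } yield (u, (cosine(uAttrs, vAttrs), v)))
    .groupBy(_._1)
    .map { case (u, pairs) => u -> pairs.map(_._2).sortBy(-_._1).take(k) }

val users = Map(
  "a" -> Array(1.0, 0.0),
  "b" -> Array(1.0, 0.1),
  "c" -> Array(0.0, 1.0)
)
val top1 = topKSimilar(users, 1)
```

The quadratic pair generation is exactly what blows up at 1M users (the 1T scores above), which is what motivates the sampling and thresholding discussed earlier in the thread.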