I am still a bit confused whether numbers like these can be aggregated as double:
iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)) It should be aggregated using something like List[iVal*jVal, colMags(i), colMags(j)] I am not sure Algebird can aggregate deterministically over Double... On Thu, Sep 18, 2014 at 2:08 PM, Reza Zadeh <[email protected]> wrote: > Hi Deb, > I am currently seeding the algorithm to be pseudo-random, this is an issue > being addressed in the PR. If you pull the current version it will be > deterministic, but not potentially not pseudo-random. The PR will updated > today. > Best, > Reza > > On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das <[email protected]> > wrote: > >> Hi Reza, >> >> Have you tested if different runs of the algorithm produce different >> similarities (basically if the algorithm is deterministic) ? >> >> This number does not look like a Monoid aggregation...iVal * jVal / >> (math.min(sg, colMags(i)) * math.min(sg, colMags(j)) >> >> I am noticing some weird behavior as different runs are changing the >> results... >> >> Also can columnMagnitudes produce non-deterministic results ? >> >> Thanks. >> >> Deb >> >> On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <[email protected]> wrote: >> >>> Hi Deb, >>> >>> I am not templating RowMatrix/CoordinateMatrix since that would be a big >>> deviation from the PR. We can add jaccard and other similarity measures in >>> later PRs. >>> >>> In the meantime, you can un-normalize the cosine similarities to get the >>> dot product, and then compute the other similarity measures from the dot >>> product. >>> >>> Best, >>> Reza >>> >>> >>> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <[email protected]> >>> wrote: >>> >>>> Hi Reza, >>>> >>>> In similarColumns, it seems with cosine similarity I also need other >>>> numbers such as intersection, jaccard and other measures... >>>> >>>> Right now I modified the code to generate jaccard but I had to run it >>>> twice due to the design of RowMatrix / CoordinateMatrix...I feel we should >>>> modify RowMatrix and CoordinateMatrix to be templated on the value... >>>> >>>> Are you considering this in your design ? >>>> >>>> Thanks. >>>> Deb >>>> >>>> >>>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <[email protected]> wrote: >>>> >>>>> Better to do it in a PR of your own, it's not sufficiently related to >>>>> dimsum >>>>> >>>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <[email protected] >>>>> > wrote: >>>>> >>>>>> Cool...can I add loadRowMatrix in your PR ? >>>>>> >>>>>> Thanks. >>>>>> Deb >>>>>> >>>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Deb, >>>>>>> >>>>>>> Did you mean to message me instead of Xiangrui? >>>>>>> >>>>>>> For TS matrices, dimsum with positiveinfinity and computeGramian >>>>>>> have the same cost, so you can do either one. For dense matrices with >>>>>>> say, >>>>>>> 1m columns this won't be computationally feasible and you'll want to >>>>>>> start >>>>>>> sampling with dimsum. >>>>>>> >>>>>>> It would be helpful to have a loadRowMatrix function, I would use it. >>>>>>> >>>>>>> Best, >>>>>>> Reza >>>>>>> >>>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Hi Xiangrui, >>>>>>>> >>>>>>>> For tall skinny matrices, if I can pass a similarityMeasure to >>>>>>>> computeGrammian, I could re-use the SVD's computeGrammian for >>>>>>>> similarity >>>>>>>> computation as well... >>>>>>>> >>>>>>>> Do you recommend using this approach for tall skinny matrices or >>>>>>>> just use the dimsum's routines ? >>>>>>>> >>>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the >>>>>>>> one available in LabeledPoint...should I add one ? I want to export the >>>>>>>> matrix out from my stable code and then test dimsum... >>>>>>>> >>>>>>>> Thanks. >>>>>>>> Deb >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I will add dice, overlap, and jaccard similarity in a future PR, >>>>>>>>> probably still for 1.2 >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Awesome...Let me try it out... >>>>>>>>>> >>>>>>>>>> Any plans of putting other similarity measures in future (jaccard >>>>>>>>>> is something that will be useful) ? I guess it makes sense to add >>>>>>>>>> some >>>>>>>>>> similarity measures in mllib... >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Yes you're right, calling dimsum with gamma as PositiveInfinity >>>>>>>>>>> turns it into the usual brute force algorithm for cosine >>>>>>>>>>> similarity, there >>>>>>>>>>> is no sampling. This is by design. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is >>>>>>>>>>>> generating the brute force... >>>>>>>>>>>> >>>>>>>>>>>> Basically dimsum with gamma as PositiveInfinity will produce >>>>>>>>>>>> the exact same result as doing catesian products of RDD[(product, >>>>>>>>>>>> vector)] >>>>>>>>>>>> and computing similarities or there will be some approximation ? >>>>>>>>>>>> >>>>>>>>>>>> Sorry I have not read your paper yet. Will read it over the >>>>>>>>>>>> weekend. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <[email protected] >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>>> For 60M x 10K brute force and dimsum thresholding should be >>>>>>>>>>>>> fine. >>>>>>>>>>>>> >>>>>>>>>>>>> For 60M x 10M probably brute force won't work depending on the >>>>>>>>>>>>> cluster's power, and dimsum thresholding should work with >>>>>>>>>>>>> appropriate >>>>>>>>>>>>> threshold. >>>>>>>>>>>>> >>>>>>>>>>>>> Dimensionality reduction should help, and how effective it is >>>>>>>>>>>>> will depend on your application and domain, it's worth trying if >>>>>>>>>>>>> the direct >>>>>>>>>>>>> computation doesn't work. >>>>>>>>>>>>> >>>>>>>>>>>>> You can also try running KMeans clustering (perhaps after >>>>>>>>>>>>> dimensionality reduction) if your goal is to find batches of >>>>>>>>>>>>> similar points >>>>>>>>>>>>> instead of all pairs above a threshold. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Also for tall and wide (rows ~60M, columns 10M), I am >>>>>>>>>>>>>> considering running a matrix factorization to reduce the >>>>>>>>>>>>>> dimension to say >>>>>>>>>>>>>> ~60M x 50 and then run all pair similarity... >>>>>>>>>>>>>> >>>>>>>>>>>>>> Did you also try similar ideas and saw positive results ? >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Ok...just to make sure I have RowMatrix[SparseVector] where >>>>>>>>>>>>>>> rows are ~ 60M and columns are 10M say with billion data >>>>>>>>>>>>>>> points... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I have another version that's around 60M and ~ 10K... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I guess for the second one both all pair and dimsum will run >>>>>>>>>>>>>>> fine... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But for tall and wide, what do you suggest ? can dimsum >>>>>>>>>>>>>>> handle it ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I might need jaccard as well...can I plug that in the PR ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> You might want to wait until Wednesday since the interface >>>>>>>>>>>>>>>> will be changing in that PR before Wednesday, probably over >>>>>>>>>>>>>>>> the weekend, so >>>>>>>>>>>>>>>> that you don't have to redo your code. Your call if you need >>>>>>>>>>>>>>>> it before a >>>>>>>>>>>>>>>> week. >>>>>>>>>>>>>>>> Reza >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR >>>>>>>>>>>>>>>>> ? Let me pull it in and test on our dataset... >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> Deb >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Deb, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via >>>>>>>>>>>>>>>>>> dimsum in this PR: >>>>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/1778 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> Reza >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Reza, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Have you compared with the brute force algorithm for >>>>>>>>>>>>>>>>>>> similarity computation with something like the following in >>>>>>>>>>>>>>>>>>> Spark ? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I am adding cosine similarity computation but I do want >>>>>>>>>>>>>>>>>>> to compute an all pair similarities... >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes >>>>>>>>>>>>>>>>>>> to matrix factorization) so I don't think joining and >>>>>>>>>>>>>>>>>>> group-by on >>>>>>>>>>>>>>>>>>> (product,product) will be a big issue for me... >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Does it make sense to add all pair similarities as well >>>>>>>>>>>>>>>>>>> with dimsum based similarity ? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>> Deb >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Xiaoli, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via >>>>>>>>>>>>>>>>>>>> the sampling scheme described in this paper: >>>>>>>>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336 >>>>>>>>>>>>>>>>>>>> though it will need refactoring given the recent changes >>>>>>>>>>>>>>>>>>>> to matrix >>>>>>>>>>>>>>>>>>>> interface in MLlib. You may implement the sampling scheme >>>>>>>>>>>>>>>>>>>> for your own app >>>>>>>>>>>>>>>>>>>> since it's much code. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>> Reza >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I >>>>>>>>>>>>>>>>>>>>> used 8 nodes and every node has 8G memory. The program >>>>>>>>>>>>>>>>>>>>> just stopped at a >>>>>>>>>>>>>>>>>>>>> stage for about several hours without any further >>>>>>>>>>>>>>>>>>>>> information. Maybe I need >>>>>>>>>>>>>>>>>>>>> to find >>>>>>>>>>>>>>>>>>>>> out a more efficient way. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their >>>>>>>>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with >>>>>>>>>>>>>>>>>>>>>> itself. Run the >>>>>>>>>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), >>>>>>>>>>>>>>>>>>>>>> map to (user, >>>>>>>>>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach >>>>>>>>>>>>>>>>>>>>>> with the 1T pairs though, so it might be worth looking >>>>>>>>>>>>>>>>>>>>>> at the literature >>>>>>>>>>>>>>>>>>>>>> for recommender systems to see what else is out there. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have >>>>>>>>>>>>>>>>>>>>>>> one million users. I need to compute the similarity >>>>>>>>>>>>>>>>>>>>>>> between each pair of >>>>>>>>>>>>>>>>>>>>>>> users using some user's attributes. For each user, I >>>>>>>>>>>>>>>>>>>>>>> need to get top k >>>>>>>>>>>>>>>>>>>>>>> most similar users. What is the best way to implement >>>>>>>>>>>>>>>>>>>>>>> this? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
