Re: Huge matrix

Debasish Das Thu, 18 Sep 2014 14:33:59 -0700

I am still a bit confused whether numbers like these can be aggregated as
double:


iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))

It should be aggregated using something like List[iVal*jVal, colMags(i),
colMags(j)]

I am not sure Algebird can aggregate deterministically over Double...







On Thu, Sep 18, 2014 at 2:08 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
> I am currently seeding the algorithm to be pseudo-random, this is an issue
> being addressed in the PR. If you pull the current version it will be
> deterministic, but not potentially not pseudo-random. The PR will updated
> today.
> Best,
> Reza
>
> On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Hi Reza,
>>
>> Have you tested if different runs of the algorithm produce different
>> similarities (basically if the algorithm is deterministic) ?
>>
>> This number does not look like a Monoid aggregation...iVal * jVal /
>> (math.min(sg, colMags(i)) * math.min(sg, colMags(j))
>>
>> I am noticing some weird behavior as different runs are changing the
>> results...
>>
>> Also can columnMagnitudes produce non-deterministic results ?
>>
>> Thanks.
>>
>> Deb
>>
>> On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Hi Deb,
>>>
>>> I am not templating RowMatrix/CoordinateMatrix since that would be a big
>>> deviation from the PR. We can add jaccard and other similarity measures in
>>> later PRs.
>>>
>>> In the meantime, you can un-normalize the cosine similarities to get the
>>> dot product, and then compute the other similarity measures from the dot
>>> product.
>>>
>>> Best,
>>> Reza
>>>
>>>
>>> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Reza,
>>>>
>>>> In similarColumns, it seems with cosine similarity I also need other
>>>> numbers such as intersection, jaccard and other measures...
>>>>
>>>> Right now I modified the code to generate jaccard but I had to run it
>>>> twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
>>>> modify RowMatrix and CoordinateMatrix to be templated on the value...
>>>>
>>>> Are you considering this in your design ?
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>>
>>>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> Better to do it in a PR of your own, it's not sufficiently related to
>>>>> dimsum
>>>>>
>>>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Cool...can I add loadRowMatrix in your PR ?
>>>>>>
>>>>>> Thanks.
>>>>>> Deb
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Deb,
>>>>>>>
>>>>>>> Did you mean to message me instead of Xiangrui?
>>>>>>>
>>>>>>> For TS matrices, dimsum with positiveinfinity and computeGramian
>>>>>>> have the same cost, so you can do either one. For dense matrices with 
>>>>>>> say,
>>>>>>> 1m columns this won't be computationally feasible and you'll want to 
>>>>>>> start
>>>>>>> sampling with dimsum.
>>>>>>>
>>>>>>> It would be helpful to have a loadRowMatrix function, I would use it.
>>>>>>>
>>>>>>> Best,
>>>>>>> Reza
>>>>>>>
>>>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <
>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Xiangrui,
>>>>>>>>
>>>>>>>> For tall skinny matrices, if I can pass a similarityMeasure to
>>>>>>>> computeGrammian, I could re-use the SVD's computeGrammian for 
>>>>>>>> similarity
>>>>>>>> computation as well...
>>>>>>>>
>>>>>>>> Do you recommend using this approach for tall skinny matrices or
>>>>>>>> just use the dimsum's routines ?
>>>>>>>>
>>>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the
>>>>>>>> one available in LabeledPoint...should I add one ? I want to export the
>>>>>>>> matrix out from my stable code and then test dimsum...
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Deb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will add dice, overlap, and jaccard similarity in a future PR,
>>>>>>>>> probably still for 1.2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <
>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Awesome...Let me try it out...
>>>>>>>>>>
>>>>>>>>>> Any plans of putting other similarity measures in future (jaccard
>>>>>>>>>> is something that will be useful) ? I guess it makes sense to add 
>>>>>>>>>> some
>>>>>>>>>> similarity measures in mllib...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes you're right, calling dimsum with gamma as PositiveInfinity
>>>>>>>>>>> turns it into the usual brute force algorithm for cosine 
>>>>>>>>>>> similarity, there
>>>>>>>>>>> is no sampling. This is by design.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <
>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is
>>>>>>>>>>>> generating the brute force...
>>>>>>>>>>>>
>>>>>>>>>>>> Basically dimsum with gamma as PositiveInfinity will produce
>>>>>>>>>>>> the exact same result as doing catesian products of RDD[(product, 
>>>>>>>>>>>> vector)]
>>>>>>>>>>>> and computing similarities or there will be some approximation ?
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry I have not read your paper yet. Will read it over the
>>>>>>>>>>>> weekend.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> For 60M x 10K brute force and dimsum thresholding should be
>>>>>>>>>>>>> fine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For 60M x 10M probably brute force won't work depending on the
>>>>>>>>>>>>> cluster's power, and dimsum thresholding should work with 
>>>>>>>>>>>>> appropriate
>>>>>>>>>>>>> threshold.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dimensionality reduction should help, and how effective it is
>>>>>>>>>>>>> will depend on your application and domain, it's worth trying if 
>>>>>>>>>>>>> the direct
>>>>>>>>>>>>> computation doesn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can also try running KMeans clustering (perhaps after
>>>>>>>>>>>>> dimensionality reduction) if your goal is to find batches of 
>>>>>>>>>>>>> similar points
>>>>>>>>>>>>> instead of all pairs above a threshold.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <
>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also for tall and wide (rows ~60M, columns 10M), I am
>>>>>>>>>>>>>> considering running a matrix factorization to reduce the 
>>>>>>>>>>>>>> dimension to say
>>>>>>>>>>>>>> ~60M x 50 and then run all pair similarity...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did you also try similar ideas and saw positive results ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <
>>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok...just to make sure I have RowMatrix[SparseVector] where
>>>>>>>>>>>>>>> rows are ~ 60M and columns are 10M say with billion data 
>>>>>>>>>>>>>>> points...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have another version that's around 60M and ~ 10K...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess for the second one both all pair and dimsum will run
>>>>>>>>>>>>>>> fine...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But for tall and wide, what do you suggest ? can dimsum
>>>>>>>>>>>>>>> handle it ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I might need jaccard as well...can I plug that in the PR ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <
>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You might want to wait until Wednesday since the interface
>>>>>>>>>>>>>>>> will be changing in that PR before Wednesday, probably over 
>>>>>>>>>>>>>>>> the weekend, so
>>>>>>>>>>>>>>>> that you don't have to redo your code. Your call if you need 
>>>>>>>>>>>>>>>> it before a
>>>>>>>>>>>>>>>> week.
>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <
>>>>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR
>>>>>>>>>>>>>>>>> ? Let me pull it in and test on our dataset...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <
>>>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via
>>>>>>>>>>>>>>>>>> dimsum in this PR:
>>>>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <
>>>>>>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Have you compared with the brute force algorithm for
>>>>>>>>>>>>>>>>>>> similarity computation with something like the following in 
>>>>>>>>>>>>>>>>>>> Spark ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am adding cosine similarity computation but I do want
>>>>>>>>>>>>>>>>>>> to compute an all pair similarities...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes
>>>>>>>>>>>>>>>>>>> to matrix factorization) so I don't think joining and 
>>>>>>>>>>>>>>>>>>> group-by on
>>>>>>>>>>>>>>>>>>> (product,product) will be a big issue for me...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Does it make sense to add all pair similarities as well
>>>>>>>>>>>>>>>>>>> with dimsum based similarity ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <
>>>>>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via
>>>>>>>>>>>>>>>>>>>> the sampling scheme described in this paper:
>>>>>>>>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336
>>>>>>>>>>>>>>>>>>>> though it will need refactoring given the recent changes 
>>>>>>>>>>>>>>>>>>>> to matrix
>>>>>>>>>>>>>>>>>>>> interface in MLlib. You may implement the sampling scheme 
>>>>>>>>>>>>>>>>>>>> for your own app
>>>>>>>>>>>>>>>>>>>> since it's much code.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I
>>>>>>>>>>>>>>>>>>>>> used 8 nodes and every node has 8G memory. The program 
>>>>>>>>>>>>>>>>>>>>> just stopped at a
>>>>>>>>>>>>>>>>>>>>> stage for about several hours without any further 
>>>>>>>>>>>>>>>>>>>>> information. Maybe I need
>>>>>>>>>>>>>>>>>>>>> to find
>>>>>>>>>>>>>>>>>>>>> out a more efficient way.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <
>>>>>>>>>>>>>>>>>>>>> and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with 
>>>>>>>>>>>>>>>>>>>>>> itself.  Run the
>>>>>>>>>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), 
>>>>>>>>>>>>>>>>>>>>>> map to (user,
>>>>>>>>>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach
>>>>>>>>>>>>>>>>>>>>>> with the 1T pairs though, so it might be worth looking 
>>>>>>>>>>>>>>>>>>>>>> at the literature
>>>>>>>>>>>>>>>>>>>>>> for recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have
>>>>>>>>>>>>>>>>>>>>>>> one million users. I need to compute the similarity 
>>>>>>>>>>>>>>>>>>>>>>> between each pair of
>>>>>>>>>>>>>>>>>>>>>>> users using some user's attributes.  For each user, I 
>>>>>>>>>>>>>>>>>>>>>>> need to get top k
>>>>>>>>>>>>>>>>>>>>>>> most similar users. What is the best way to implement 
>>>>>>>>>>>>>>>>>>>>>>> this?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Huge matrix

Reply via email to