Parquet runs out of memory when reading in a huge matrix

2015-12-05 Thread AlexG
f rows to aid with shuffling ... if this is the case, any suggestions on how to ameliorate this situation? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-runs-out-of-memory-when-reading-in-a-huge-matrix-tp25590.html Sent from the Apache Spark User List m

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs. In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity

Re: Huge matrix

2014-09-18 Thread Debasish Das
Yup that's what I did for now... On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs. In the meantime,

Re: Huge matrix

2014-09-18 Thread Debasish Das
Hi Reza, Have you tested whether different runs of the algorithm produce different similarities (basically, whether the algorithm is deterministic)? This number does not look like a monoid aggregation: iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))) I am noticing some weird behavior

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am currently seeding the algorithm to be pseudo-random; this is an issue being addressed in the PR. If you pull the current version it will be deterministic, but potentially not pseudo-random. The PR will be updated today. Best, Reza On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das

Re: Huge matrix

2014-09-18 Thread Debasish Das
I am still a bit confused about whether numbers like these can be aggregated as a double: iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))) It should be aggregated using something like List[iVal*jVal, colMags(i), colMags(j)]. I am not sure Algebird can aggregate deterministically over
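The concern above can be made concrete in plain Scala (no Spark, no Algebird; `Entry`, `sg`, and `colMags` are illustrative stand-ins, not the MLlib code): summing the raw products iVal * jVal is an associative, commutative aggregation, and the min-based normalization can be applied once at the end.

```scala
// Sketch of why the ratio is better computed after aggregation.
// All names here are illustrative assumptions, not MLlib internals.
case class Entry(i: Int, j: Int, iVal: Double, jVal: Double)

val colMags = Array(3.0, 4.0) // toy column magnitudes
val sg = 4.5                  // toy sampling threshold

// Several contributions to the same column pair (0, 1)
val entries = Seq(
  Entry(0, 1, 1.0, 2.0),
  Entry(0, 1, 0.5, 1.0),
  Entry(0, 1, 2.0, 0.5)
)

// Monoid-friendly: sum raw products first, normalize once at the end.
val dot = entries.map(e => e.iVal * e.jVal).sum
val sim = dot / (math.min(sg, colMags(0)) * math.min(sg, colMags(1)))

// Dividing per term happens to agree here only because the
// denominator is constant for a fixed (i, j) pair.
val perTerm = entries.map { e =>
  e.iVal * e.jVal / (math.min(sg, colMags(e.i)) * math.min(sg, colMags(e.j)))
}.sum
```

The two values agree because the denominator depends only on the pair (i, j); any remaining run-to-run variation would come from floating-point summation order, not from the aggregation itself.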

Re: Huge matrix

2014-09-17 Thread Debasish Das
Hi Reza, In similarColumns, it seems that with cosine similarity I also need other measures such as intersection, jaccard, and so on... Right now I modified the code to generate jaccard but I had to run it twice due to the design of RowMatrix / CoordinateMatrix...I feel we should modify

Re: Huge matrix

2014-09-09 Thread Debasish Das
Hi Xiangrui, For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for similarity computation as well... Do you recommend using this approach for tall skinny matrices, or just using dimsum's routines? Right now RowMatrix does

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Hi Deb, Did you mean to message me instead of Xiangrui? For TS (tall-skinny) matrices, dimsum with positiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum. It

Re: Huge matrix

2014-09-09 Thread Debasish Das
Cool...can I add loadRowMatrix in your PR? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Better to do it in a PR of your own; it's not sufficiently related to dimsum. On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com wrote: Cool...can I add loadRowMatrix in your PR? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb,

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark? https://github.com/echen/scaldingale I am adding cosine similarity computation but I do want to compute all-pairs similarities... Note that the data is sparse for

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778 Your question wasn't entirely clear - does this answer it? Best, Reza On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, Have you

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ohh cool...all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset... Thanks. Deb On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote: Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR:

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
You might want to wait until Wednesday, since the interface in that PR will be changing before Wednesday, probably over the weekend, so that you don't have to redo your code. Your call if you need it within the week. Reza On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com wrote:

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ok...just to make sure: I have RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with say a billion data points... I have another version that's around 60M x ~10K... I guess for the second one both all-pairs and dimsum will run fine... But for tall and wide, what do you suggest?

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50 and then running all-pairs similarity... Did you also try similar ideas and see positive results? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
For 60M x 10K, brute force and dimsum thresholding should be fine. For 60M x 10M, brute force probably won't work, depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold. Dimensionality reduction should help, and how effective it is will depend on your

Re: Huge matrix

2014-09-05 Thread Debasish Das
I looked at the code: similarColumns(Double.posInf) is generating the brute force... Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation? Sorry I

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Yes you're right, calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design. On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com wrote: I looked at the code:
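For readers following along, here is a plain-Scala sketch (no Spark; toy data) of the exact quantity brute force computes for every column pair, which dimsum returns unchanged when gamma is PositiveInfinity:

```scala
// Exact all-pairs column cosine similarity on a tiny dense matrix.
// Rows are plain arrays; this is the math, not the MLlib code path.
val rows = Seq(
  Array(1.0, 0.0, 2.0),
  Array(0.0, 3.0, 1.0),
  Array(2.0, 1.0, 0.0)
)
val nCols = 3

// Column magnitudes ||c_j||
val colMags = (0 until nCols).map { j =>
  math.sqrt(rows.map(r => r(j) * r(j)).sum)
}

// Cosine similarity for every column pair (i < j):
// dot(c_i, c_j) / (||c_i|| * ||c_j||)
val sims: Map[(Int, Int), Double] =
  (for {
    i <- 0 until nCols
    j <- (i + 1) until nCols
  } yield {
    val dot = rows.map(r => r(i) * r(j)).sum
    (i, j) -> dot / (colMags(i) * colMags(j))
  }).toMap
```

Sampling in dimsum approximates exactly these numbers; with an infinite gamma nothing is dropped, so the output matches the cartesian-product computation.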

Re: Huge matrix

2014-09-05 Thread Debasish Das
Awesome...Let me try it out... Any plans for other similarity measures in the future (jaccard is something that would be useful)? I guess it makes sense to add some similarity measures to mllib... On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote: Yes you're right,

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com wrote: Awesome...Let me try it out... Any plans of putting other similarity measures in future (jaccard is something that will be

Re: Huge matrix

2014-04-14 Thread Guillaume Pitel
On 04/12/2014 06:35 PM, Xiaoli Li wrote: Hi Guillaume, This sounds like a good idea to me. I am a newbie here. Could you further explain how you will determine which clusters to keep? According to the distance between each

Re: Huge matrix

2014-04-14 Thread Xiaoli Li
Hi Guillaume, Thanks for your explanation. It helps me a lot. I will try it. Xiaoli

Re: Huge matrix

2014-04-12 Thread Xiaoli Li
Hi Reza, Thank you for your information. I will try it. On Fri, Apr 11, 2014 at 11:21 PM, Reza Zadeh r...@databricks.com wrote: Hi Xiaoli, There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf The PR is

Re: Huge matrix

2014-04-12 Thread Guillaume Pitel
Hi, I'm doing this here for multiple tens of millions of elements (and the goal is to reach multiple billions), on a relatively small cluster (7 nodes, 4 cores, 32GB RAM). We use multiprobe KLSH. All you have to do is run a K-means on your data, then compute the
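A toy, single-probe sketch of the idea (plain Scala; fixed centroids stand in for the K-means result, and real multiprobe KLSH keeps several probe clusters per point rather than one):

```scala
// Clustering-based candidate reduction: assign each point to its
// nearest centroid and only compare points that share a bucket.
// Centroids are hard-coded here; in practice they come from K-means.
def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

val centroids = Seq(Array(0.0, 0.0), Array(10.0, 10.0))

val points = Seq(
  Array(0.1, 0.0), Array(0.0, 0.2),   // near centroid 0
  Array(9.9, 10.1), Array(10.2, 9.8)  // near centroid 1
)

// Bucket each point by its nearest centroid
val buckets: Map[Int, Seq[Array[Double]]] =
  points.groupBy(p => centroids.indices.minBy(i => dist(p, centroids(i))))

// Candidate pairs come only from within a bucket: here 1 + 1 = 2
// pairs instead of the 6 all-pairs comparisons.
val candidatePairs = buckets.values.toSeq.flatMap { ps =>
  for (i <- ps.indices; j <- (i + 1) until ps.length) yield (ps(i), ps(j))
}
```

The trade-off is recall: a true neighbor that falls in a different bucket is missed, which is why the multiprobe variant keeps several candidate clusters per element.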

Re: Huge matrix

2014-04-12 Thread Xiaoli Li
Hi Guillaume, This sounds like a good idea to me. I am a newbie here. Could you further explain how you will determine which clusters to keep? According to the distance between each element and each cluster center? Will you keep several clusters for each element for searching nearest neighbours?

Re: Huge matrix

2014-04-12 Thread Tom V
The last writer is suggesting using the triangle inequality to cut down the search space. If c is the centroid of cluster C, then no point in C can be closer to x than ||x-c|| - r(C), where r(C) is the (precomputed) radius of the cluster, i.e. the distance of the farthest point in C from c. Whether
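A minimal sketch of that pruning rule in plain Scala (`Cluster`, `lowerBound`, and `nearest` are illustrative names, not library code):

```scala
// Triangle-inequality pruning for nearest-neighbour search:
// skip a whole cluster when its lower bound cannot beat the best hit.
def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

case class Cluster(centroid: Array[Double], radius: Double,
                   points: Seq[Array[Double]])

// No point in c can be closer to x than ||x - centroid|| - radius
def lowerBound(x: Array[Double], c: Cluster): Double =
  math.max(0.0, dist(x, c.centroid) - c.radius)

// Scan clusters, searching inside one only if its bound beats the best
def nearest(x: Array[Double], clusters: Seq[Cluster]): (Array[Double], Double) = {
  var best: (Array[Double], Double) =
    (clusters.head.points.head, Double.PositiveInfinity)
  for (c <- clusters if lowerBound(x, c) < best._2; p <- c.points) {
    val d = dist(x, p)
    if (d < best._2) best = (p, d)
  }
  best
}

val clusters = Seq(
  Cluster(Array(0.0, 0.0), 1.0, Seq(Array(0.0, 0.0), Array(1.0, 0.0))),
  Cluster(Array(10.0, 0.0), 1.0, Seq(Array(10.0, 0.0), Array(9.0, 0.0)))
)
val (nearestPoint, nearestDist) = nearest(Array(0.2, 0.0), clusters)
```

In the example the second cluster's bound (8.8) exceeds the best distance already found (0.2), so its points are never examined.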

Huge matrix

2014-04-11 Thread Xiaoli Li
Hi all, I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this? Thanks.

Re: Huge matrix

2014-04-11 Thread Andrew Ash
The naive way would be to put all the users and their attributes into an RDD, then take the cartesian product of that with itself. Run the similarity score on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and take the .top(k) for each user. I doubt that you'll be able to take this
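The same naive scheme on plain Scala collections, at toy scale (the user ids and the cosine scorer are illustrative; with 1M users the 1T pairs make this impractical, which is the point of the thread):

```scala
// Naive all-pairs top-k: score every ordered pair, keep k per user.
val users: Map[String, Array[Double]] = Map(
  "a" -> Array(1.0, 0.0),
  "b" -> Array(0.9, 0.1),
  "c" -> Array(0.0, 1.0)
)

def cosine(x: Array[Double], y: Array[Double]): Double = {
  val dot = x.zip(y).map { case (a, b) => a * b }.sum
  val nx = math.sqrt(x.map(v => v * v).sum)
  val ny = math.sqrt(y.map(v => v * v).sum)
  dot / (nx * ny)
}

val k = 1
// "Cartesian product with itself", minus self-pairs, then top-k
val topK: Map[String, Seq[(String, Double)]] =
  users.keys.toSeq.map { u =>
    val scored = users.keys.toSeq.filter(_ != u)
      .map(v => (v, cosine(users(u), users(v))))
      .sortBy(-_._2)
      .take(k)
    u -> scored
  }.toMap
```

The quadratic blow-up lives in the pair generation, which is exactly what the dimsum sampling discussed later in the thread avoids.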

Re: Huge matrix

2014-04-11 Thread Xiaoli Li
Hi Andrew, Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8GB memory. The program just stalled at one stage for several hours without any further information. Maybe I need to find out a more efficient way. On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash

Re: Huge matrix

2014-04-11 Thread Reza Zadeh
Hi Xiaoli, There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf The PR is at https://github.com/apache/spark/pull/336 though it will need refactoring given the recent changes to matrix interface in MLlib. You