Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case, but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Sabarish Sasidharan
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing the similarities computation, so a rowSimilarities() would be a good fit; looking forward to it. In the meanwhile I will try to see if I can further limit the number of similarities computed through some other fashion or
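To make the SVD-then-similarities workflow concrete, here is a minimal local sketch in plain NumPy (an assumed stand-in for the not-yet-available Spark rowSimilarities(): reduce rows to a k-dimensional latent space with a truncated SVD, then take cosine similarities between the projected rows):

```python
import numpy as np

def row_similarities_via_svd(A, k):
    """Toy sketch: truncated SVD, then cosine similarity between rows
    in the latent space. A is a dense (n_rows x n_cols) array."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    rows = U[:, :k] * s[:k]                     # latent representation of each row
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.clip(norms, 1e-12, None)   # guard against zero rows
    return unit @ unit.T                        # (n_rows x n_rows) cosine matrix
```

For a matrix whose first two rows are identical, the sketch returns similarity 1.0 between them, as expected.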

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab, The current method is optimized for having many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Your case is very common, so I will put some time into building it. In the

Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
I am trying to compute column similarities on a 30x1000 RowMatrix of DenseVectors. The size of the input RDD is 3.1MB and it's all in one partition. I am running on a single node of 15G, giving the driver 1G and the executor 9G. This is on single-node Hadoop. In the first attempt the

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Sorry, I actually meant a 30 x 10000 matrix (missed a 0). Regards, Sab

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab, In this dense case, the output will contain 10000 x 10000 entries, i.e. 100 million doubles, which doesn't fit in 1GB with overheads. For a dense matrix, columnSimilarities() scales quadratically in the number of columns, so you need more memory across the cluster. Reza On Sun, Mar 1, 2015
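A quick back-of-the-envelope check of those numbers (8 bytes per double is standard; JVM object overhead comes on top and is not modeled here):

```python
# Dense case: column similarities on a matrix with n columns emit
# on the order of n * n similarity scores.
n_cols = 10_000
entries = n_cols * n_cols        # 100 million doubles
raw_bytes = entries * 8          # 8 bytes per raw double, no JVM overhead
raw_gib = raw_bytes / 2**30      # ~0.745 GiB -- already near a 1GB heap
```

So even before any JVM or shuffle overhead, the raw output alone nearly fills a 1GB executor.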

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Hi Reza, I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled by the threshold. But assuming a simple 30 x 10K matrix, that means I would need at least 12GB memory per executor for the flat map just for these pairs, excluding any other overhead.
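The 12GB figure is consistent with JVM object overhead on the emitted pairs. A rough model (the ~120 bytes-per-pair figure is an assumption for boxed Scala tuples on the heap; actual overhead depends on the JVM and serializer):

```python
n_cols = 10_000
pairs = n_cols * n_cols          # worst case: every column pair emitted
bytes_per_pair = 120             # assumed: ((Int, Int), Double) as boxed
                                 # tuples, far above the 16 bytes of raw payload
total_gb = pairs * bytes_per_pair / 10**9
```

100 million pairs at ~120 bytes each lands at roughly 12GB, matching the estimate above.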

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the number of columns is mild (10K, 100K; we actually scaled it to 1.5M columns, but that really stress-tests the shuffle and you need to tune the shuffle parameters)... You can either use DIMSUM sampling or come up with your own threshold based on your application that
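To make the threshold idea concrete, here is a minimal pure-Python sketch of thresholded column cosine similarities. It brute-forces all column pairs, which is exactly the quadratic cost DIMSUM's sampling is designed to cut down; the sampling itself is not implemented here:

```python
import math

def thresholded_column_similarities(matrix, threshold=0.0):
    """matrix: list of dense rows. Returns {(i, j): cosine} for column
    pairs i < j whose cosine similarity meets the threshold."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    sims = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):          # quadratic in n_cols
            dot = sum(matrix[r][i] * matrix[r][j] for r in range(n_rows))
            if norms[i] and norms[j]:
                cos = dot / (norms[i] * norms[j])
                if cos >= threshold:
                    sims[(i, j)] = cos
    return sims
```

With a threshold, dissimilar pairs are simply never emitted, which bounds the output size even when the pair count is quadratic.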

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish, Works fine for me with less than those settings (30x1000 dense matrix, 1GB driver, 1GB executor): bin/spark-shell --driver-memory 1G --executor-memory 1G Then running the following finished without trouble and in a few seconds. Are you sure your driver is actually getting the RAM