Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case, but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Sabarish Sasidharan
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing the similarities computation, so a rowSimilarities() would be a good fit; looking forward to it. In the meanwhile I will try to see if I can further limit the number of similarities computed through some other fashion or
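To make the SVD-then-similarities workflow concrete, here is a minimal local sketch in plain NumPy (an assumed stand-in for the not-yet-available Spark rowSimilarities(): reduce rows to a k-dimensional latent space with a truncated SVD, then take cosine similarities between the projected rows):

```python
import numpy as np

def row_similarities_via_svd(A, k):
    """Toy sketch: truncated SVD, then cosine similarity between rows
    in the latent space. A is a dense (n_rows x n_cols) array."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    rows = U[:, :k] * s[:k]                     # latent representation of each row
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.clip(norms, 1e-12, None)   # guard against zero rows
    return unit @ unit.T                        # (n_rows x n_rows) cosine matrix
```

For a matrix whose first two rows are identical, the sketch returns similarity 1.0 between them, as expected.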

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab, The current method is optimized for having many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Your case is very common, so I will put some time into building it. In the

Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
I am trying to compute column similarities on a 30x1000 RowMatrix of DenseVectors. The size of the input RDD is 3.1MB and it's all in one partition. I am running on a single node of 15G, giving the driver 1G and the executor 9G. This is on single-node Hadoop. In the first attempt the

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Sorry, I actually meant a 30 x 10000 matrix (missed a 0). Regards, Sab

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab, In this dense case, the output will contain 10000 x 10000 entries, i.e. 100 million doubles, which doesn't fit in 1GB with overheads. For a dense matrix, columnSimilarities() scales quadratically in the number of columns, so you need more memory across the cluster. Reza On Sun, Mar 1, 2015
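A quick back-of-the-envelope check of those numbers (8 bytes per double is standard; JVM object overhead comes on top and is not modeled here):

```python
# Dense case: column similarities on a matrix with n columns emit
# on the order of n * n similarity scores.
n_cols = 10_000
entries = n_cols * n_cols        # 100 million doubles
raw_bytes = entries * 8          # 8 bytes per raw double, no JVM overhead
raw_gib = raw_bytes / 2**30      # ~0.745 GiB -- already near a 1GB heap
```

So even before any JVM or shuffle overhead, the raw output alone nearly fills a 1GB executor.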

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Hi Reza, I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled by the threshold. But assuming a simple 30 x 10K matrix, that means I would need at least 12GB memory per executor for the flat map just for these pairs, excluding any other overhead.
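The 12GB figure is consistent with JVM object overhead on the emitted pairs. A rough model (the ~120 bytes-per-pair figure is an assumption for boxed Scala tuples on the heap; actual overhead depends on the JVM and serializer):

```python
n_cols = 10_000
pairs = n_cols * n_cols          # worst case: every column pair emitted
bytes_per_pair = 120             # assumed: ((Int, Int), Double) as boxed
                                 # tuples, far above the 16 bytes of raw payload
total_gb = pairs * bytes_per_pair / 10**9
```

100 million pairs at ~120 bytes each lands at roughly 12GB, matching the estimate above.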

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the number of columns is mild (10K, 100K; we actually scaled it to 1.5M columns, but that really stress-tests the shuffle and you need to tune the shuffle parameters)... You can either use DIMSUM sampling or come up with your own threshold based on your application that
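To make the threshold idea concrete, here is a minimal pure-Python sketch of thresholded column cosine similarities. It brute-forces all column pairs, which is exactly the quadratic cost DIMSUM's sampling is designed to cut down; the sampling itself is not implemented here:

```python
import math

def thresholded_column_similarities(matrix, threshold=0.0):
    """matrix: list of dense rows. Returns {(i, j): cosine} for column
    pairs i < j whose cosine similarity meets the threshold."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    sims = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):          # quadratic in n_cols
            dot = sum(matrix[r][i] * matrix[r][j] for r in range(n_rows))
            if norms[i] and norms[j]:
                cos = dot / (norms[i] * norms[j])
                if cos >= threshold:
                    sims[(i, j)] = cos
    return sims
```

With a threshold, dissimilar pairs are simply never emitted, which bounds the output size even when the pair count is quadratic.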

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish, Works fine for me with less than those settings (30x1000 dense matrix, 1GB driver, 1GB executor): bin/spark-shell --driver-memory 1G --executor-memory 1G Then running the following finished without trouble and in a few seconds. Are you sure your driver is actually getting the RAM