Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

Pat Ferrel Mon, 02 Mar 2015 17:42:54 -0800

Sab, not sure what you require for the similarity metric or your use case but 
you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) 
here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html 
<http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html>.  
These are optimized for LLR based “similarity” which is very simple to 
calculate since you don’t use either the item weight or the entire row or 
column vector values. Downsampling is done by number of values per column (or 
row) and by LLR strength. This keeps it to O(n)


They run pretty fast and only use memory if you use the version that attaches 
application IDs to the rows and columns. Using SimilarityAnalysis.cooccurrence 
may help. It’s in the Spark/Scala part of Mahout.

On Mar 2, 2015, at 12:56 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Sab,
The current method is optimized for having many rows and few columns. In your 
case it is exactly the opposite. We are working on your case, tracked by this 
JIRA: https://issues.apache.org/jira/browse/SPARK-4823 
<https://issues.apache.org/jira/browse/SPARK-4823>
Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider using 
K-means - it will get you clusters of similar rows with euclidean distance.

Best,
Reza


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan 
<sabarish.sasidha...@manthan.com <mailto:sabarish.sasidha...@manthan.com>> 
wrote:

Hi Reza

I see that ((int, int), double) pairs are generated for any combination that 
meets the criteria controlled by the threshold. But assuming a simple 1x10K 
matrix that means I would need atleast 12GB memory per executor for the flat 
map just for these pairs excluding any other overhead. Is that correct? How can 
we make this scale for even larger n (when m stays small) like 100 x 5 
million. One is by using higher thresholds. The other is that I use a 
SparseVector to begin with. Are there any other optimizations I can take 
advantage of?




Thanks
Sab

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

Reply via email to