[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

Clive Cox (JIRA) Tue, 02 Sep 2014 11:00:45 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Clive Cox updated SPARK-2885:
-----------------------------
    Attachment: SimilarItemsSmallTest.java

This is a test I used. Sorry its not cleaned up, but hopefully you can 
reproduce the error.

> All-pairs similarity via DIMSUM
> -------------------------------
>
>                 Key: SPARK-2885
>                 URL: https://issues.apache.org/jira/browse/SPARK-2885
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Reza Zadeh
>            Assignee: Reza Zadeh
>         Attachments: SimilarItemsSmallTest.java
>
>
> Build all-pairs similarity algorithm via DIMSUM. 
> Given a dataset of sparse vector data, the all-pairs similarity problem is to 
> find all similar vector pairs according to a similarity function such as 
> cosine similarity, and a given similarity score threshold. Sometimes, this 
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it 
> scales quadratically. For example, for a million vectors, it is not feasible 
> to check all roughly trillion pairs to see if they are above the similarity 
> threshold. Having said that, there exist clever sampling techniques to focus 
> the computational effort on those pairs that are above the similarity 
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
> computation vs accuracy from low computation to high accuracy. For a very 
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several 
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see 
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
>  . Twitter also open-sourced their version in scalding: 
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the 
> normalized gramian matrix, which is useful in RowMatrix alongside the 
> computeGramianMatrix method already in RowMatrix
> More details about usage at Twitter: 
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
> For correctness proof, see Theorem 4.3 in 
> http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

Reply via email to