[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Clive Cox updated SPARK-2885: ----------------------------- Attachment: SimilarItemsSmallTest.java This is a test I used. Sorry its not cleaned up, but hopefully you can reproduce the error. > All-pairs similarity via DIMSUM > ------------------------------- > > Key: SPARK-2885 > URL: https://issues.apache.org/jira/browse/SPARK-2885 > Project: Spark > Issue Type: New Feature > Reporter: Reza Zadeh > Assignee: Reza Zadeh > Attachments: SimilarItemsSmallTest.java > > > Build all-pairs similarity algorithm via DIMSUM. > Given a dataset of sparse vector data, the all-pairs similarity problem is to > find all similar vector pairs according to a similarity function such as > cosine similarity, and a given similarity score threshold. Sometimes, this > problem is called a “similarity join”. > The brute force approach of considering all pairs quickly breaks, since it > scales quadratically. For example, for a million vectors, it is not feasible > to check all roughly trillion pairs to see if they are above the similarity > threshold. Having said that, there exist clever sampling techniques to focus > the computational effort on those pairs that are above the similarity > threshold, which makes the problem feasible. > DIMSUM has a single parameter (called gamma) to tradeoff computation time vs > accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of > computation vs accuracy from low computation to high accuracy. For a very > large gamma, all cosine similarities are computed exactly with no sampling. > Current PR: > https://github.com/apache/spark/pull/1778 > Justification for adding to MLlib: > - All-pairs similarity is missing from MLlib and has been requested several > times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see > https://github.com/apache/spark/pull/1778#issuecomment-51300825) > - Algorithm is used in large-scale production at Twitter. e.g. see > https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco > . Twitter also open-sourced their version in scalding: > https://github.com/twitter/scalding/pull/833 > - When used with the gamma parameter set high, this algorithm becomes the > normalized gramian matrix, which is useful in RowMatrix alongside the > computeGramianMatrix method already in RowMatrix > More details about usage at Twitter: > https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum > For correctness proof, see Theorem 4.3 in > http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org