[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reza Zadeh updated SPARK-2885:
------------------------------

    Description: 
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.
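
As a concrete reading of the problem statement, the following is a minimal
Python sketch of the naive similarity join. It is illustrative only and is not
taken from the PR: the dict-based sparse vector layout and the names `cosine`
and `similarity_join` are assumptions made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter vector when computing the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(val * v.get(idx, 0.0) for idx, val in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_join(vectors, threshold):
    """Return (i, j, score) for every pair i < j with cosine score >= threshold."""
    result = []
    # combinations() enumerates all n * (n - 1) / 2 unordered pairs.
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        score = cosine(u, v)
        if score >= threshold:
            result.append((i, j, score))
    return result
```

Calling `similarity_join(vectors, 0.9)` on a list of such dicts returns only
the pairs whose cosine similarity is at least 0.9.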

The brute-force approach of considering all pairs quickly breaks down, since
it scales quadratically. For example, for a million vectors, it is not
feasible to check all of the roughly half a trillion pairs to see whether they
are above the similarity threshold. However, clever sampling techniques can
focus the computational effort on those pairs that are above the similarity
threshold, which makes the problem feasible.
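
To illustrate how sampling can cut the work, here is a hedged Python sketch of
one norm-based sampling scheme. It is not the implementation in the PR and may
differ from it in important ways. It assumes the vectors being compared are the
columns of a sparse matrix stored row by row, and the function name
`sampled_column_similarities` and the oversampling parameter `gamma` are
hypothetical. Each co-occurring pair of entries is kept with probability
min(1, gamma / (norm_j * norm_k)) and reweighted by that probability, so the
estimated dot product is unbiased while pairs of large-norm columns are
examined far less often.

```python
import math
import random
from collections import defaultdict


def sampled_column_similarities(rows, gamma, rng=None):
    """Estimate cosine similarity between columns of a sparse matrix.

    rows  : list of {column_index: value} dicts, one dict per matrix row
    gamma : hypothetical oversampling parameter; larger values keep more
            pairs and give more accurate estimates
    rng   : optional random.Random instance, for reproducibility
    """
    rng = rng or random.Random(0)

    # Pass 1: Euclidean norm of every column.
    squares = defaultdict(float)
    for row in rows:
        for j, x in row.items():
            squares[j] += x * x
    norm = {j: math.sqrt(s) for j, s in squares.items()}

    # Pass 2: sample pairs of nonzero entries that share a row.
    sums = defaultdict(float)
    for row in rows:
        cols = sorted(j for j, x in row.items() if x != 0.0)
        for a in range(len(cols)):
            for b in range(a + 1, len(cols)):
                j, k = cols[a], cols[b]
                p = min(1.0, gamma / (norm[j] * norm[k]))
                if rng.random() < p:
                    # Dividing by p keeps the dot-product estimate unbiased.
                    sums[(j, k)] += row[j] * row[k] / p

    # Normalize the estimated dot products into cosine similarities.
    return {(j, k): s / (norm[j] * norm[k]) for (j, k), s in sums.items()}
```

With a very large `gamma` every pair is kept and the result is exact; lowering
`gamma` trades accuracy for fewer emitted pairs. A threshold filter can then be
applied to the returned scores.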

Current PR:
https://github.com/apache/spark/pull/1778

  was:
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute-force approach of considering all pairs quickly breaks down, since
it scales quadratically. For example, for a million vectors, it is not
feasible to check all of the roughly half a trillion pairs to see whether they
are above the similarity threshold. However, clever sampling techniques can
focus the computational effort on those pairs that are above the similarity
threshold, which makes the problem feasible.

Current PR for this is WIP:
https://github.com/apache/spark/pull/1778


> All-pairs similarity via DIMSUM
> -------------------------------
>
>                 Key: SPARK-2885
>                 URL: https://issues.apache.org/jira/browse/SPARK-2885
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Reza Zadeh
>
> Build all-pairs similarity algorithm via DIMSUM. 
> Given a dataset of sparse vector data, the all-pairs similarity problem is to 
> find all similar vector pairs according to a similarity function such as 
> cosine similarity, and a given similarity score threshold. Sometimes, this 
> problem is called a “similarity join”.
> The brute-force approach of considering all pairs quickly breaks down, since
> it scales quadratically. For example, for a million vectors, it is not
> feasible to check all of the roughly half a trillion pairs to see whether
> they are above the similarity threshold. However, clever sampling techniques
> can focus the computational effort on those pairs that are above the
> similarity threshold, which makes the problem feasible.
> Current PR:
> https://github.com/apache/spark/pull/1778



