[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637337#comment-14637337
 ] 

Maruf Aytekin edited comment on SPARK-5992 at 7/31/15 11:33 AM:
----------------------------------------------------------------

In addition to Charikar's scheme for cosine [~karlhigley]  pointed out, LSH 
schemes for the other known similarity/distance measures are  as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

-3. Jaccard distance:-
-Mining Massive Data Sets chapter#3-
_spark-hash package referenced above already implements this_

4. Cosine distance and Earth movers distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf




was (Author: cepnyci):
In addition to Charikar's scheme for cosine [~karlhigley]  pointed out, LSH 
schemes for the other known similarity/distance measures are  as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

3. Jaccard distance:
Mining Massive Data Sets chapter#3: 
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

4. Cosine distance and Earth movers distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf



> Locality Sensitive Hashing (LSH) for MLlib
> ------------------------------------------
>
>                 Key: SPARK-5992
>                 URL: https://issues.apache.org/jira/browse/SPARK-5992
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to