[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186 ]

Yun Ni edited comment on SPARK-5992 at 8/23/16 5:50 AM:

Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

Thanks,
Yun Ni

was (Author: yunn):

Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML. It would be
> great to discuss some possible algorithms here, choose an API, and make a PR
> for an initial algorithm.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751 ]

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:43 PM:

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started. Is this going to be a Scala, Java, or Python codebase?

Any updates from the Uber community?

was (Author: snehil.w):

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started.

Any updates from the Uber community?
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751 ]

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:28 PM:

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started and an idea of where we are right now with this feature design.

Any updates from the Uber community?

was (Author: snehil.w):

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started and an idea of where we are right now with this feature design.
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751 ]

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:31 PM:

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started.

Any updates from the Uber community?

was (Author: snehil.w):

Hello, just curious to know if I can contribute to this project too, although I am new at it. I could use some pointers to get started and an idea of where we are right now with this feature design.

Any updates from the Uber community?
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389321#comment-15389321 ]

tony edited comment on SPARK-5992 at 7/22/16 11:08 AM:

Hi, I am new to Spark. How can I contribute to the Spark LSH implementation? I am a Python programmer. I am interested in contributing to the random-projection LSH technique for approximating cosine distance.

Can you guys keep me in the loop?

was (Author: tonygrey):

Hi, I am new to Spark. How can I contribute to the Spark LSH implementation? I am a Python programmer. I am interested in contributing to the random-projection LSH technique for approximating cosine distance.

I can you guys keep me in loop.
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389321#comment-15389321 ]

tony edited comment on SPARK-5992 at 7/22/16 11:03 AM:

Hi, I am new to Spark. How can I contribute to the Spark LSH implementation? I am a Python programmer. I am interested in contributing to the random-projection LSH technique for approximating cosine distance.

I can you guys keep me in loop.

was (Author: tonygrey):

Hi, I am new to Spark. How can I contribute to the Spark LSH implementation? I am a Python programmer. I am interested in contributing to the random-projection LSH technique for approximating cosine distance.

I can you guys keep me in loop also.
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029318#comment-15029318 ]

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:

I'm a bit confused by this section of the design doc:

{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two types at least. One is to calculate hash value. The other is to calculate a similarity between a feature(vector) and another one.

For example, random projection algorithm is a type of calculating a similarity. It is designed to approximate the cosine distance between vectors. On the other hand, min hash algorithm is a type of calculating a hash value. The hash function maps a d dimensional vector onto a set of integers.
{quote}

Sign-random-projection LSH does calculate a hash value (essentially a Bitset) for each feature vector, and the Hamming distance between two hash values is used to estimate the cosine similarity between the corresponding feature vectors. The two "types" of LSH mentioned here seem more like two kinds of operations which are sometimes applied sequentially. Maybe this distinction makes more sense for other types of LSH?

was (Author: karlhigley):

I'm a bit confused by this section of the design doc:

{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two types at least. One is to calculate hash value. The other is to calculate a similarity between a feature(vector) and another one.

For example, random projection algorithm is a type of calculating a similarity. It is designed to approximate the cosine distance between vectors. On the other hand, min hash algorithm is a type of calculating a hash value. The hash function maps a d dimensional vector onto a set of integers.
{quote}

Sign-random-projection LSH does calculate a hash value (essentially a Bitset) for each feature vector, and the Hamming distance between two hash values is used to estimate the cosine similarity between the corresponding vectors. The two "types" of LSH mentioned here seem more like two kinds of operations which are sometimes applied sequentially. Maybe this distinction makes more sense for other types of LSH?
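To make Karl Higley's point about sign-random-projection LSH concrete, here is a minimal Python sketch of the two operations he describes applied in sequence: first compute a bit signature for each feature vector, then use the Hamming distance between two signatures to estimate cosine similarity. This is an illustration only, not code from the design doc or from Spark. The function names (`make_hyperplanes`, `srp_signature`, `hamming_distance`, `estimate_cosine`), the seed, and the signature length are assumptions chosen for the example.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian directions, one per signature bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def srp_signature(vector, hyperplanes):
    """Hash step: one bit per hyperplane, set when the vector lies on its non-negative side."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming_distance(sig_a, sig_b):
    """Number of bit positions where the two signatures disagree."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimate_cosine(sig_a, sig_b):
    """Similarity step: the fraction of differing bits approximates angle / pi."""
    theta = math.pi * hamming_distance(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


# Usage: orthogonal vectors have true cosine similarity 0.
planes = make_hyperplanes(num_bits=2048, dim=2, seed=42)
sig_x = srp_signature([1.0, 0.0], planes)
sig_y = srp_signature([0.0, 1.0], planes)
print(estimate_cosine(sig_x, sig_y))
```

The signature depends only on the signs of the projections, so rescaling a vector leaves its hash unchanged, which is why the scheme targets cosine similarity rather than Euclidean distance. Longer signatures give a tighter estimate at the cost of more storage per vector.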
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637337#comment-14637337 ]

Maruf Aytekin edited comment on SPARK-5992 at 7/31/15 11:33 AM:

In addition to Charikar's scheme for cosine [~karlhigley] pointed out, LSH schemes for the other known similarity/distance measures are as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

-3. Jaccard distance:-
-Mining Massive Data Sets chapter#3-
_spark-hash package referenced above already implements this_

4. Cosine distance and Earth mover's distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

was (Author: cepnyci):

In addition to Charikar's scheme for cosine [~karlhigley] pointed out, LSH schemes for the other known similarity/distance measures are as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB (1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

3. Jaccard distance:
Mining Massive Data Sets chapter#3:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

4. Cosine distance and Earth mover's distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

Locality Sensitive Hashing (LSH) for MLlib
--

Key: SPARK-5992
URL: https://issues.apache.org/jira/browse/SPARK-5992
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm.
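As an illustration of the Lp-norm entry in Maruf Aytekin's list, the following Python sketch shows the general shape of a p-stable hash for Euclidean (L2) distance: project the vector onto a random Gaussian direction, add a random offset, and divide the line into buckets of fixed width. It is a rough sketch under stated assumptions, not an implementation taken from the cited paper or from Spark. The class name `PStableHash`, the helper `collision_rate`, the bucket width of 4.0, and the trial count are all hypothetical choices for the example.

```python
import math
import random


class PStableHash:
    """One hash function h(v) = floor((a . v + b) / w) for L2 distance (hypothetical name)."""

    def __init__(self, dim, bucket_width, seed=0):
        rng = random.Random(seed)
        # Gaussian entries are 2-stable, which is what ties the projection to L2 distance.
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Random offset in [0, w) so bucket boundaries are not fixed at multiples of w.
        self.b = rng.uniform(0.0, bucket_width)
        self.w = bucket_width

    def __call__(self, vector):
        projection = sum(a_i * v_i for a_i, v_i in zip(self.a, vector))
        return math.floor((projection + self.b) / self.w)


def collision_rate(u, v, dim, bucket_width, trials=500):
    """Fraction of independently seeded hash functions that put u and v in the same bucket."""
    hits = 0
    for seed in range(trials):
        h = PStableHash(dim, bucket_width, seed=seed)
        if h(u) == h(v):
            hits += 1
    return hits / trials


# Usage: a nearby pair should collide far more often than a distant pair.
origin = [0.0, 0.0, 0.0]
print(collision_rate(origin, [0.1, 0.0, 0.0], dim=3, bucket_width=4.0))
print(collision_rate(origin, [100.0, 0.0, 0.0], dim=3, bucket_width=4.0))
```

The bucket width trades recall against selectivity: wider buckets make near neighbors collide more reliably but also admit more distant points, so in practice several such functions are usually combined.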