[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni edited comment on SPARK-5992 at 8/23/16 5:50 AM:


Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know whether it meets your requirements.

Thanks,
Yun Ni


was (Author: yunn):
Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-30 Thread snehil suresh wakchaure (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751
 ] 

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:43 PM:
-

Hello, I'm curious whether I can contribute to this project too, although I am 
new to it. I could use some pointers to get started. Is this going to be a 
Scala, Java, or Python codebase?

Any updates from the Uber community? 


was (Author: snehil.w):
Hello, just curious to know if I can contribute to this project too although I 
am new at it. I Can use some pointers to get started.

Any updates from the Uber community? 




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-30 Thread snehil suresh wakchaure (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751
 ] 

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:28 PM:
-

Hello, I'm curious whether I can contribute to this project too, although I am 
new to it. I could use some pointers on getting started and on where this 
feature design currently stands.

Any updates from the Uber community? 


was (Author: snehil.w):
Hello, just curious to know if I can contribute to this project too although I 
am new at it. I Can use some pointers to get started & where we are at right 
now with this feature design.




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-30 Thread snehil suresh wakchaure (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751
 ] 

snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:31 PM:
-

Hello, I'm curious whether I can contribute to this project too, although I am 
new to it. I could use some pointers to get started.

Any updates from the Uber community? 


was (Author: snehil.w):
Hello, just curious to know if I can contribute to this project too although I 
am new at it. I Can use some pointers to get started & where we are at right 
now with this feature design.

Any updates from the Uber community? 




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-22 Thread tony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389321#comment-15389321
 ] 

tony edited comment on SPARK-5992 at 7/22/16 11:08 AM:
---

Hi, 
I am new to Spark. How can I contribute to the Spark LSH implementation? I 
am a Python programmer, and I am interested in contributing to the LSH 
technique of random projection for approximating cosine distance. Can you guys 
keep me in the loop?


was (Author: tonygrey):
Hi, 
I am new to spark. How can I make a contribution to Spark LSH implementation. I 
am a Python programmer. I am interested to make a contribution in the LSH 
technique of random projection for approximating cosine distance. I can you 
guys keep me in loop.




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-22 Thread tony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389321#comment-15389321
 ] 

tony edited comment on SPARK-5992 at 7/22/16 11:03 AM:
---

Hi, 
I am new to Spark. How can I contribute to the Spark LSH implementation? I 
am a Python programmer, and I am interested in contributing to the LSH 
technique of random projection for approximating cosine distance. Can you guys 
keep me in the loop?


was (Author: tonygrey):
Hi, 
I am new to spark. How can I make a contribution to Spark LSH implementation. I 
am a Python programmer. I am interested to make a contribution in the LSH 
technique of random projection for approximating cosine distance. I can you 
guys keep me in loop also.




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029318#comment-15029318
 ] 

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:
---

I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding feature 
vectors. The two "types" of LSH mentioned here seem more like two kinds of 
operations which are sometimes applied sequentially. Maybe this distinction 
makes more sense for other types of LSH?
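
A minimal sketch of the two operations described above, assuming Gaussian random hyperplanes (Charikar-style sign random projection). The function names are hypothetical and are not part of Spark or of any proposed MLlib API. The first step hashes each vector to a bit signature; the second, separate step turns the Hamming distance between two signatures into a cosine estimate.

```python
import math
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def srp_signature(vector, hyperplanes):
    """Operation 1: hash a vector to a list of bits, one per hyperplane."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming_distance(sig_a, sig_b):
    """Count the positions where two equal-length signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimate_cosine(sig_a, sig_b):
    """Operation 2: estimate cosine similarity from two signatures.

    For a random hyperplane, the two bits differ with probability theta / pi,
    where theta is the angle between the original vectors.
    """
    theta = math.pi * hamming_distance(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)
```

Both vectors must be hashed with the same set of hyperplanes, and the estimate tightens as the number of bits grows.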


was (Author: karlhigley):
I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?




[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-07-31 Thread Maruf Aytekin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637337#comment-14637337
 ] 

Maruf Aytekin edited comment on SPARK-5992 at 7/31/15 11:33 AM:


In addition to Charikar's scheme for cosine similarity that [~karlhigley] 
pointed out, LSH schemes for other known similarity/distance measures are as 
follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

-3. Jaccard distance:-
-Mining Massive Data Sets chapter#3-
_spark-hash package referenced above already implements this_

4. Cosine distance and earth mover's distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
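
To make the families listed above concrete, here is a hedged sketch of one hash function from three of them: bit sampling for Hamming distance (Gionis et al.), a p-stable projection for the L2 norm (Datar et al.), and MinHash for Jaccard similarity (MMDS chapter 3). All names and parameters (`bucket_width`, `prime`) are illustrative assumptions, not Spark or spark-hash APIs, and the MinHash variant uses a random linear hash as a stand-in for a true random permutation.

```python
import math
import random


def make_bit_sampling_hash(dim, rng):
    """Hamming distance: the hash is the value of one randomly chosen coordinate."""
    index = rng.randrange(dim)
    return lambda bits: bits[index]


def make_p_stable_hash(dim, bucket_width, rng):
    """L2 norm: h(v) = floor((a . v + b) / w), with Gaussian a and uniform offset b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)

    def h(vector):
        projection = sum(ai * vi for ai, vi in zip(a, vector))
        return math.floor((projection + b) / bucket_width)

    return h


def make_minhash(prime, rng):
    """Jaccard: minimum of (a * x + b) mod prime over a set of ints below prime."""
    a = rng.randrange(1, prime)
    b = rng.randrange(prime)
    return lambda items: min((a * x + b) % prime for x in items)


def estimate_jaccard(set_a, set_b, hashes):
    """Fraction of MinHash functions on which the two sets collide."""
    matches = sum(h(set_a) == h(set_b) for h in hashes)
    return matches / len(hashes)
```

In practice several such functions are concatenated within a table and repeated across tables to tune the collision probability; that amplification step is omitted here.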




was (Author: cepnyci):
In addition to Charikar's scheme for cosine [~karlhigley]  pointed out, LSH 
schemes for the other known similarity/distance measures are  as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

3. Jaccard distance:
Mining Massive Data Sets chapter#3: 
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

4. Cosine distance and Earth movers distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf



> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.


