[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581416#comment-15581416
 ] 

Yun Ni commented on SPARK-5992:
---

Yes, I have implemented cosine, Jaccard, Euclidean and Hamming distance. Please 
check:
https://github.com/apache/spark/pull/15148

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581412#comment-15581412
 ] 

Yun Ni commented on SPARK-5992:
---

Yes, I do have a comparison with a full scan. kNN with a single hash function 
usually does not work well, so I enabled OR-amplification with multiple hash 
functions. You can set the number of hash functions using "outputDim".
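
As a rough illustration of OR-amplification (a sketch under assumptions, not the code in the PR): each hash function below is a bucketed random projection, and a point becomes a candidate if it shares a bucket with the query under at least one function. The names `make_bucket_hash` and `candidates`, the bucket width, and the random data are all hypothetical.

```python
import math
import random

def make_bucket_hash(dim, width, rng):
    """One hash function: project onto a random direction, then bucket by width."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / width)

def candidates(query, points, hashes):
    """OR-amplification: a point is a candidate if ANY hash matches the query's."""
    q_sig = [h(query) for h in hashes]
    return [i for i, p in enumerate(points)
            if any(h(p) == q for h, q in zip(hashes, q_sig))]

rng = random.Random(0)
dim = 4
points = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(300)]
query = points[0]

hashes = [make_bucket_hash(dim, width=2.0, rng=rng) for _ in range(5)]
with_one = candidates(query, points, hashes[:1])   # a single hash function
with_five = candidates(query, points, hashes)      # OR over five functions
```

Adding hash functions can only grow the candidate set, which raises recall at the cost of more exact distance computations afterwards.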




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581366#comment-15581366
 ] 

Debasish Das commented on SPARK-5992:
-

Also, do you have a hash function for Euclidean distance? We use cosine, Jaccard 
and Euclidean with SPARK-4823. For the kNN comparison we can use an overlap 
metric: pick k, then compare the overlap between the LSH-based approximate kNN 
and brute-force kNN. Let me know if you need help running the benchmarks.
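
The overlap metric described above could be computed along these lines. This is a minimal sketch: the helper names `brute_force_knn` and `overlap_at_k` are hypothetical, the points are toy data, and the "approximate" result is hard-coded rather than produced by a real LSH index.

```python
def squared_distance(u, v):
    """Squared Euclidean distance; enough for ranking neighbors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def brute_force_knn(query, points, k):
    """Exact k nearest neighbors by full scan, returned as point indices."""
    order = sorted(range(len(points)), key=lambda i: squared_distance(query, points[i]))
    return order[:k]

def overlap_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result also contains."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [6.0, 5.0]]
exact = brute_force_knn([0.1, 0.0], points, k=3)   # true neighbors: 0, 1, 2
approx = [0, 1, 3]                                 # stand-in for an LSH result
score = overlap_at_k(approx, exact)                # 2 of 3 recovered
```

Averaging this score over many queries gives a single recall-style number that can be compared across hash families and parameter settings.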




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581361#comment-15581361
 ] 

Debasish Das commented on SPARK-5992:
-

Did you compare with brute-force kNN? Normally LSH does not work well for NN 
queries, and that's why hybrid spill trees and other ideas came along. I can 
run some comparisons using SPARK-4823.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581345#comment-15581345
 ] 

Yun Ni commented on SPARK-5992:
---

Some tests on the WEX Open Dataset, with steps and results:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503971#comment-15503971
 ] 

Apache Spark commented on SPARK-5992:
-

User 'Yunni' has created a pull request for this issue:
https://github.com/apache/spark/pull/15148




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-19 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502493#comment-15502493
 ] 

Yun Ni commented on SPARK-5992:
---

Hi Joseph,

I have made an initial PR based on the design doc: 
https://github.com/apache/spark/pull/15148
Please take a look and I will test on the open dataset in the meantime.

Thanks very much,
Yun Ni




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-09 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478135#comment-15478135
 ] 

Yun Ni commented on SPARK-5992:
---

Thank you very much for reviewing it, Joseph!

I will work on the first commit according to the doc and send a PR soon. :)




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478125#comment-15478125
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

The design doc LGTM!  Thanks for updating it.  Shall we proceed with an initial 
PR?




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-31 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453980#comment-15453980
 ] 

Yun Ni commented on SPARK-5992:
---

Thanks very much for reviewing, Joseph!

Based on your comments, I have made the following major changes:
(1) Change the API Design to use Estimator instead of UnaryTransformer.
(2) Add a Testing section that elaborates on the design of the unit tests and 
scale tests.

Please let me know if you have any further comments, so that we can go forward.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451091#comment-15451091
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

Awesome, thanks!  I made some comments, but it looks like a good approach 
overall.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni commented on SPARK-5992:
---

Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-30 Thread snehil suresh wakchaure (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400751#comment-15400751
 ] 

snehil suresh wakchaure commented on SPARK-5992:


Hello, just curious to know if I can contribute to this project too, although I 
am new to it. I could use some pointers on getting started and on where we are 
right now with this feature design.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-07-22 Thread tony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389321#comment-15389321
 ] 

tony commented on SPARK-5992:
-

Hi, 
I am new to Spark. How can I contribute to the Spark LSH implementation? I am a 
Python programmer, and I am interested in contributing to the random projection 
LSH technique for approximating cosine distance. Could you keep me in the loop 
as well?




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-06-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345890#comment-15345890
 ] 

Xiangrui Meng commented on SPARK-5992:
--

FYI, I had an offline discussion with Kelvin Chu and Erran Li from Uber after 
Spark Summit, where they talked about the large-scale LSH implementation at 
Uber. I asked them to join the discussion and possibly contribute their API 
design and implementation to MLlib.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-02-23 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159679#comment-15159679
 ] 

Karl Higley commented on SPARK-5992:


I've been working on [a Spark package for approximate nearest 
neighbors|https://github.com/karlhigley/spark-neighbors] that implements 
several LSH flavors for different distance measures behind a unified interface. 
Currently, the package supports Hamming, Jaccard, Euclidean, and cosine 
distance. It's still a work in progress, but maybe it will provide some food 
for thought on how to proceed with the implementation for MLlib.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029318#comment-15029318
 ] 

Karl Higley commented on SPARK-5992:


I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?
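
To make the "hash, then compare" sequence concrete, here is a small sketch of sign-random-projection LSH. It relies on the standard property that two signature bits differ with probability angle/pi, so the Hamming distance between signatures yields an estimate of the cosine. The function names, the number of bits, and the example vectors are assumptions chosen for illustration, not part of the design doc.

```python
import math
import random

def srp_signature(v, planes):
    """Step 1 (hash): one bit per random hyperplane, 1 if v is on its positive side."""
    return [1 if sum(a * b for a, b in zip(p, v)) >= 0 else 0 for p in planes]

def hamming(s, t):
    """Number of positions where two signatures disagree."""
    return sum(x != y for x, y in zip(s, t))

def estimated_cosine(s, t):
    """Step 2 (compare): bits differ with probability angle/pi, so invert that."""
    return math.cos(math.pi * hamming(s, t) / len(s))

rng = random.Random(42)
num_bits, dim = 2048, 3
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

u, v = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]    # true cosine is 1/sqrt(2)
sig_u, sig_v = srp_signature(u, planes), srp_signature(v, planes)
est = estimated_cosine(sig_u, sig_v)
```

The estimate tightens as `num_bits` grows; the signatures themselves can also serve directly as bucket keys.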




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-18 Thread Jonas Seiler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010635#comment-15010635
 ] 

Jonas Seiler commented on SPARK-5992:
-

Very nice!

Hope there will soon be a common interface so that we can use all these methods.

When deciding on methods, please take the executor memory consumption into 
account. If I have to store a matrix of (number of original dimensions) x 
(number of reduced dimensions) on each executor, this might already be too much 
(at least in our case).
If you use sparse random projections, e.g.
Ping Li, T. Hastie and K. W. Church, 2006, “Very Sparse Random Projections”.
https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf 
you can store a sparse matrix and have a parameter to tune the memory 
consumption.
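
A sketch of the memory trade-off, following the construction in the cited paper: each entry of the projection matrix is sqrt(s) times +1 or -1 with probability 1/(2s) each, and 0 otherwise, so only about a 1/s fraction of entries needs to be stored. The dictionary storage format and the names `sparse_projection` and `project` are assumptions made for this example, and the usual 1/sqrt(k) normalization is omitted.

```python
import math
import random

def sparse_projection(orig_dim, reduced_dim, s, rng):
    """Draw the matrix and keep only nonzero entries as {(row, col): value}."""
    scale = math.sqrt(s)
    entries = {}
    for i in range(orig_dim):
        for j in range(reduced_dim):
            u = rng.random()
            if u < 1.0 / (2 * s):
                entries[(i, j)] = scale       # +sqrt(s) with probability 1/(2s)
            elif u < 1.0 / s:
                entries[(i, j)] = -scale      # -sqrt(s) with probability 1/(2s)
    return entries

def project(v, entries, reduced_dim):
    """Multiply v by the sparse matrix, touching only the stored entries."""
    out = [0.0] * reduced_dim
    for (i, j), r in entries.items():
        out[j] += v[i] * r
    return out

rng = random.Random(1)
orig_dim, reduced_dim = 400, 16
s = math.sqrt(orig_dim)     # the "very sparse" setting s = sqrt(D), here 20
matrix = sparse_projection(orig_dim, reduced_dim, s, rng)
density = len(matrix) / (orig_dim * reduced_dim)   # close to 1/s on average
reduced = project([1.0] * orig_dim, matrix, reduced_dim)
```

Here `s` is the tuning parameter: a larger value stores fewer entries per executor, at the price of a noisier projection.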




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-09-23 Thread Özgür Demir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904500#comment-14904500
 ] 

Özgür Demir commented on SPARK-5992:


Hi, we just open sourced an LSH top-k implementation for Spark. It's an 
implementation for cosine similarity only. For search and recommendation tasks 
this is the most used similarity metric, as it normalizes popularity. 

Influenced by the 'dimsum' implementation, its input is a row-based matrix where 
each row represents one item. The output is a similarity matrix. This allowed 
us to easily switch between both implementations. 

This might be the most basic interface for other LSH join (all-pairs) 
implementations?

you'll find the code here:

https://github.com/soundcloud/cosine-lsh-join-spark

cheers

demir




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-08-29 Thread Maruf Aytekin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721066#comment-14721066
 ] 

Maruf Aytekin commented on SPARK-5992:
--

I have developed a Spark implementation of LSH using Charikar's scheme for 
collections of vectors. It is published here: 
https://github.com/marufaytekin/lsh-spark. The details are documented in the 
Readme.md file. I'd really appreciate it if you could check it out and provide 
feedback.





[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-07-22 Thread Maruf Aytekin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637337#comment-14637337
 ] 

Maruf Aytekin commented on SPARK-5992:
--

In addition to Charikar's scheme for cosine that [~karlhigley] pointed out, LSH 
schemes for the other known similarity/distance measures are as follows:

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

3. Jaccard distance:
Mining Massive Data Sets chapter#3: 
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

4. Cosine distance and Earth mover's distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
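
As one concrete instance from the list above, MinHash for Jaccard (item 3) can be sketched as follows. Explicit random permutations of a small universe are used because they make the estimator exact in expectation; practical implementations replace them with hash functions. The function names, universe size, and example sets are assumptions made for this illustration.

```python
import random

def make_minhash(universe_size, num_hashes, rng):
    """Build a signature function from explicit random permutations."""
    perms = []
    for _ in range(num_hashes):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)

    def signature(items):
        # For each permutation, keep the smallest permuted rank among the items.
        return [min(perm[x] for x in items) for perm in perms]

    return signature

def estimated_jaccard(sig_a, sig_b):
    """Two sets agree on a min-hash with probability equal to their Jaccard index."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(7)
signature = make_minhash(universe_size=150, num_hashes=1024, rng=rng)
a = set(range(0, 100))
b = set(range(50, 150))     # exact Jaccard = 50 / 150 = 1/3
est = estimated_jaccard(signature(a), signature(b))
```

The other schemes in the list follow the same pattern of a randomized hash whose collision probability tracks the target similarity.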



 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574933#comment-14574933
 ] 

Yu Ishikawa commented on SPARK-5992:


h2. Initial version of design doc

https://github.com/yu-iskw/SPARK-5992-LSH-design-doc/blob/master/design-doc.md




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575002#comment-14575002
 ] 

Karl Higley commented on SPARK-5992:


To make it easier to define a common interface, it might help to restrict 
consideration to methods that produce hash signatures.  For cosine similarity, 
sign-random-projection LSH would probably fit the bill.  See Section 3 of 
"Similarity Estimation Techniques from Rounding Algorithms":
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574750#comment-14574750
 ] 

Yu Ishikawa commented on SPARK-5992:


[~debasish83] Thank you for your comment. I haven't compared Algebird's LSH with 
Spark's. And I don't know whether adding Algebird to MLlib is OK. However, as 
you're suggesting, I think we should use it in Spark Streaming. 




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575079#comment-14575079
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

I'm just noting here that [~yuu.ishik...@gmail.com] and I discussed the design 
doc.  The current plan is to have these implemented under ml.feature, with 2 
types of UnaryTransformers:
* Vector -> T: hash feature vector to some other type (Vector, Int, etc.)
* Vector -> Double: compare feature vector with a fixed base Vector, and return 
similarity
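
The two shapes can be sketched as plain functions. This is only an illustration of the signatures, not the ml.feature API: the bit-sampling hash (suited to binary vectors and Hamming distance), the cosine comparison, and all names below are assumptions.

```python
import math
import random
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]

def make_bit_sampling_hash(dim: int, num_samples: int,
                           seed: int = 0) -> Callable[[Vector], Tuple[float, ...]]:
    """Vector -> T: hash a vector to a tuple of a few sampled coordinates."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(dim), num_samples))
    return lambda v: tuple(v[i] for i in idx)

def make_cosine_to_base(base: Vector) -> Callable[[Vector], float]:
    """Vector -> Double: cosine similarity of each input to one fixed base vector."""
    base_norm = math.sqrt(sum(x * x for x in base))

    def similarity(v: Vector) -> float:
        dot = sum(x * y for x, y in zip(base, v))
        return dot / (base_norm * math.sqrt(sum(x * x for x in v)))

    return similarity

hasher = make_bit_sampling_hash(dim=6, num_samples=3)
to_base = make_cosine_to_base([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

row = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
bucket = hasher(row)    # a 3-tuple of coordinates taken from the row
sim = to_base(row)      # dot = 1, norms are 1 and 2, so the cosine is 0.5
```

Both functions are fixed once constructed, which is what lets each be applied column-wise as a unary transformation.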




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512843#comment-14512843
 ] 

Debasish Das commented on SPARK-5992:
-

Did someone compare Algebird's LSH with the Spark MinHash linked above? Unless 
Algebird is slow (which I found to be the case for the TopK monoid), we should 
use it the same way HLL is being used in Spark Streaming. Is it OK to add 
Algebird to MLlib?




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-03 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395415#comment-14395415
 ] 

Yu Ishikawa commented on SPARK-5992:


Yes, I will try to write a design doc. Thank you for your continued support.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-02 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392366#comment-14392366
 ] 

Yu Ishikawa commented on SPARK-5992:


As I said when we talked at Spark Summit East, I think we should first design 
the public API and implement only the random projection method. 
I am also interested in the following paper, although I haven't read it in 
detail yet. 

Bahman Bahmani, Ashish Goel, and Rajendra Shinde. 2012. Efficient distributed 
locality sensitive hashing. In Proceedings of the 21st ACM international 
conference on Information and knowledge management (CIKM '12). ACM, New York, 
NY, USA, 2174-2178. DOI=10.1145/2396761.2398596
http://web.stanford.edu/~ashishg/papers/darpa_unpub2.pdf
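
For readers unfamiliar with the random projection method mentioned above, here 
is a minimal Python sketch of one common variant (sign random projections, 
which approximate cosine similarity). It is an illustration only, not the API 
proposed in this issue and not the distributed scheme from the paper. The 
function names make_hyperplanes, signature and hamming are hypothetical.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of differing bits; fewer differences suggest a smaller angle."""
    return sum(x != y for x, y in zip(sig_a, sig_b))


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, num_bits=16, seed=42)
    v = [1.0, 2.0, 3.0]
    near = [1.1, 2.0, 2.9]     # small angle to v
    far = [-1.0, -2.0, -3.0]   # opposite direction to v
    print(hamming(signature(v, planes), signature(near, planes)))
    print(hamming(signature(v, planes), signature(far, planes)))
```

Vectors separated by a small angle agree on most bits, so signatures (or 
slices of them) can serve as bucket keys for candidate generation. The number 
of bits trades precision against the number of candidates returned.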





[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357335#comment-14357335
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

I'm not sure what the most important or useful ones would be.  I've heard the 
most about random projections.  Are you familiar with the others?  Perhaps we 
can make a list of algorithms and their properties to try to get a reasonable 
coverage of use cases.




[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-03-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356507#comment-14356507
 ] 

Yu Ishikawa commented on SPARK-5992:


Which LSH algorithms should we support first?

- Bit sampling for Hamming distance
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- Min Hash
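
To make one of the candidates above concrete, here is a minimal Python sketch 
of MinHash, which estimates Jaccard similarity between sets. It is only an 
illustration under stated assumptions, not a proposed MLlib API: elements are 
assumed to be non-negative integers smaller than PRIME, sets are assumed to be 
non-empty, and the random linear hash functions stand in for min-wise 
independent permutations. The names PRIME, make_hash_params, 
minhash_signature and estimate_jaccard are hypothetical.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hashes)
    ]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(num_hashes=128, seed=1)
    a = {1, 2, 3, 4, 5, 6}
    b = {4, 5, 6, 7, 8, 9}  # true Jaccard similarity with a is 3 / 9
    sig_a = minhash_signature(a, params)
    sig_b = minhash_signature(b, params)
    print(estimate_jaccard(sig_a, sig_b))  # an estimate of 3 / 9, not exact
```

More hash functions give a tighter estimate at the cost of a longer 
signature. The other listed methods target different distances (for example, 
bit sampling targets Hamming distance), so the choice depends on the metric 
the use case needs.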
