[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-02-23 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159679#comment-15159679
 ] 

Karl Higley commented on SPARK-5992:


I've been working on [a Spark package for approximate nearest 
neighbors|https://github.com/karlhigley/spark-neighbors] that implements 
several LSH flavors for different distance measures behind a unified interface. 
Currently, the package supports Hamming, Jaccard, Euclidean, and cosine 
distances. It's still a work in progress, but maybe it will provide some food 
for thought on how to proceed with the implementation for MLlib.
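
To make the idea of a unified interface concrete, here is a minimal Scala 
sketch of what such an abstraction might look like. The trait and class names 
are hypothetical, taken neither from the spark-neighbors package nor from 
MLlib; the point is only that a signature-producing hash function and a 
distance measure can be swapped behind a single entry point.
{code:scala}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical names, for illustration only: a hash function maps a feature
// vector to a fixed-length signature, and a distance measure compares the
// original vectors when ranking candidates found in the same hash bucket.
trait HashFunction extends Serializable {
  def signature(v: Vector): Array[Int]
}

trait DistanceMeasure extends Serializable {
  def distance(a: Vector, b: Vector): Double
}

// One entry point regardless of LSH flavor: the caller picks a hash
// function / distance pair (sign-random-projection with cosine, minhash
// with Jaccard, etc.) and asks for approximate nearest neighbors.
class ApproximateNearestNeighbors(
    hash: HashFunction,
    measure: DistanceMeasure,
    numNeighbors: Int) extends Serializable {

  def neighbors(points: RDD[(Long, Vector)]): RDD[(Long, Array[(Long, Double)])] = {
    // Bucket points by signature, compare candidates only within buckets,
    // and keep the closest numNeighbors per point. Body omitted; this
    // sketch only shows the shape of the interface.
    ???
  }
}
{code}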

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318
 ] 

Karl Higley commented on SPARK-5992:


I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?
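
To illustrate the point, here is a small, self-contained Scala sketch 
(hypothetical helper names, not MLlib API) of sign-random-projection: the 
hashing step produces a bit signature per vector, and the similarity step 
turns the Hamming distance between two signatures into an estimate of the 
cosine similarity, per the fraction-of-differing-bits argument in Charikar's 
rounding-algorithms paper.
{code:scala}
import scala.util.Random

object SignRandomProjectionSketch {

  // Draw one random hyperplane per signature bit.
  def randomHyperplanes(dim: Int, bits: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(bits)(Array.fill(dim)(rng.nextGaussian()))
  }

  // Hashing step: one bit per hyperplane, the sign of the dot product.
  def signature(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
    planes.map { p => p.zip(v).map { case (pi, vi) => pi * vi }.sum >= 0.0 }

  // Similarity step: the fraction of differing bits estimates the angle
  // between the vectors, and its cosine estimates the cosine similarity.
  def estimatedCosine(a: Array[Boolean], b: Array[Boolean]): Double = {
    val hamming = a.zip(b).count { case (x, y) => x != y }
    math.cos(math.Pi * hamming / a.length)
  }

  def main(args: Array[String]): Unit = {
    val planes = randomHyperplanes(dim = 3, bits = 256, seed = 42L)
    val x = Array(1.0, 0.0, 0.0)
    val y = Array(1.0, 1.0, 0.0)
    val est = estimatedCosine(signature(x, planes), signature(y, planes))
    println(f"estimated cosine: $est%.2f (exact: ${1.0 / math.sqrt(2.0)}%.2f)")
  }
}
{code}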

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318
 ] 

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:
---

I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding feature 
vectors. The two "types" of LSH mentioned here seem more like two kinds of 
operations which are sometimes applied sequentially. Maybe this distinction 
makes more sense for other types of LSH?


was (Author: karlhigley):
I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding vectors. The 
two "types" of LSH mentioned here seem more like two kinds of operations which 
are sometimes applied sequentially. Maybe this distinction makes more sense for 
other types of LSH?

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2015-11-26 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029442#comment-15029442
 ] 

Karl Higley commented on SPARK-8614:


After reading the description and the referenced code, I think I can see how 
this issue would happen. From the report, it sounds like reproducing it 
requires "multiple executors/machines".

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Jan Luts
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
> dropped before calling the methods from RowMatrix. For example for 
> IndexedRowMatrix.computeSVD:
>val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>val mat = toRowMatrix().multiply(B).
> After computing these results, they are zipped with the original indices, 
> e.g. for IndexedRowMatrix.computeSVD
>val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> and for IndexedRowMatrix.multiply:
>
>val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> I have experienced that for IndexedRowMatrix.computeSVD().U and 
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row 
> indices can get mixed (when running Spark jobs with multiple 
> executors/machines): i.e. the vectors and indices of the result do not seem 
> to correspond anymore. 
> To me it looks like this is caused by zipping RDDs that have a different 
> ordering?
> For the IndexedRowMatrix.multiply I have observed that ordering within 
> partitions is preserved, but that it seems to get mixed up between 
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply:
> val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no 
> longer there.
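
For illustration only (this is not the fix that was applied in Spark), one way 
to sidestep the fragile zip is to keep each row's index attached to its vector 
while the multiplication happens, so there is nothing to re-pair afterwards. A 
minimal Scala sketch, with a hypothetical method name:
{code:scala}
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object IndexedMultiplySketch {
  // Hypothetical alternative: rather than dropping the indices, multiplying
  // via RowMatrix, and zipping the indices back on (which relies on both RDDs
  // having identical partitioning and element ordering), carry each row's
  // index through the multiplication itself.
  def multiplyKeepingIndices(mat: IndexedRowMatrix, b: Matrix): IndexedRowMatrix = {
    val bArr = b.toArray   // column-major; b has mat.numCols rows
    val n = b.numRows
    val k = b.numCols
    val rows = mat.rows.map { case IndexedRow(i, v) =>
      val a = v.toArray
      val out = Array.tabulate(k) { j =>
        var s = 0.0
        var l = 0
        while (l < n) { s += a(l) * bArr(j * n + l); l += 1 }
        s
      }
      IndexedRow(i, Vectors.dense(out))   // the index i travels with its vector
    }
    new IndexedRowMatrix(rows, mat.numRows(), k)
  }
}
{code}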






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575002#comment-14575002
 ] 

Karl Higley commented on SPARK-5992:


To make it easier to define a common interface, it might help to restrict 
consideration to methods that produce hash signatures.  For cosine similarity, 
sign-random-projection LSH would probably fit the bill.  See Section 3 of 
"Similarity Estimation Techniques from Rounding Algorithms":
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros

2015-06-03 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571641#comment-14571641
 ] 

Karl Higley commented on SPARK-7857:


That does seem like the intent of the test.  I started scratching my head when 
I noticed that [the minDocFreq is set to 
1|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala#L73],
 and that none of the terms seem to be filtered out.

> IDF w/ minDocFreq on SparseVectors results in literal zeros
> ---
>
> Key: SPARK-7857
> URL: https://issues.apache.org/jira/browse/SPARK-7857
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Karl Higley
>Priority: Minor
>
> When the IDF model's minDocFreq parameter is set to a non-zero threshold, the 
> IDF for any feature below that threshold is set to zero.  When the model is 
> used to transform a set of SparseVectors containing that feature, the 
> resulting SparseVectors contain entries whose values are zero.  The zero 
> entries should be omitted in order to simplify downstream processing.
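
A small, self-contained Scala sketch of the behavior described above, using 
only the public Vectors API; the compaction step at the end is one way to 
illustrate the proposed fix, not necessarily the patch that was eventually 
applied.
{code:scala}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

object ExplicitZerosSketch {
  def main(args: Array[String]): Unit = {
    // Suppose minDocFreq filtered out term 1, so its IDF weight is 0.0.
    val idf = Array(1.2, 0.0, 0.7)

    // A document containing all three terms, as a sparse TF vector.
    val tf = Vectors.sparse(3, Array(0, 1, 2), Array(2.0, 5.0, 1.0)).asInstanceOf[SparseVector]

    // Naive transform: scale each stored value by its IDF weight.
    // Term 1 remains in the result as an explicitly stored 0.0.
    val scaledValues = tf.indices.zip(tf.values).map { case (i, v) => v * idf(i) }
    val withZeros = Vectors.sparse(tf.size, tf.indices, scaledValues)
    println(withZeros)   // (3,[0,1,2],[2.4,0.0,0.7])

    // Compacted transform: drop entries whose scaled value is zero, so
    // downstream consumers never see literal zeros in a "sparse" vector.
    val kept = tf.indices.zip(scaledValues).filter { case (_, v) => v != 0.0 }
    val compact = Vectors.sparse(tf.size, kept.map(_._1), kept.map(_._2))
    println(compact)     // (3,[0,2],[2.4,0.7])
  }
}
{code}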






[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros

2015-06-03 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571771#comment-14571771
 ] 

Karl Higley commented on SPARK-7857:


Ah, okay, that makes more sense.  Seems like a small fix -- maybe a decent 
starter task?  I'd be interested in giving it a try.

> IDF w/ minDocFreq on SparseVectors results in literal zeros
> ---
>
> Key: SPARK-7857
> URL: https://issues.apache.org/jira/browse/SPARK-7857
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Karl Higley
>Priority: Minor
>
> When the IDF model's minDocFreq parameter is set to a non-zero threshold, the 
> IDF for any feature below that threshold is set to zero.  When the model is 
> used to transform a set of SparseVectors containing that feature, the 
> resulting SparseVectors contain entries whose values are zero.  The zero 
> entries should be omitted in order to simplify downstream processing.






[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros

2015-06-03 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571220#comment-14571220
 ] 

Karl Higley commented on SPARK-7857:


Agreed. numNonZeros works for my use case, but I had to go find it first. 
Vectors.sparse(...).toSparse() looks a little strange, but probably produces 
a less surprising result.
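
For reference, a quick sketch of the distinction being drawn here, with 
behavior as documented for the 1.x Vector API (worth verifying against the 
targeted Spark version):
{code:scala}
import org.apache.spark.mllib.linalg.Vectors

// A "sparse" vector with an explicitly stored zero at index 1.
val v = Vectors.sparse(3, Array(0, 1, 2), Array(2.4, 0.0, 0.7))

v.numActives   // 3: counts stored entries, including the explicit zero
v.numNonzeros  // 2: counts only entries whose value is nonzero
v.toSparse     // documented as returning a sparse vector with explicit zeros removed
{code}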

I'd submit a PR to make that change, but I'm confused by the minDocFreq test. 
It appears to test that filtering out terms occurring in less than one document 
works as expected. Maybe I'm misreading?

> IDF w/ minDocFreq on SparseVectors results in literal zeros
> ---
>
> Key: SPARK-7857
> URL: https://issues.apache.org/jira/browse/SPARK-7857
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Karl Higley
>Priority: Minor
>
> When the IDF model's minDocFreq parameter is set to a non-zero threshold, the 
> IDF for any feature below that threshold is set to zero.  When the model is 
> used to transform a set of SparseVectors containing that feature, the 
> resulting SparseVectors contain entries whose values are zero.  The zero 
> entries should be omitted in order to simplify downstream processing.






[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros

2015-06-01 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567520#comment-14567520
 ] 

Karl Higley commented on SPARK-7857:


This is addressed by the addition of numNonZeros in SPARK-6756.

> IDF w/ minDocFreq on SparseVectors results in literal zeros
> ---
>
> Key: SPARK-7857
> URL: https://issues.apache.org/jira/browse/SPARK-7857
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Karl Higley
>Priority: Minor
>
> When the IDF model's minDocFreq parameter is set to a non-zero threshold, the 
> IDF for any feature below that threshold is set to zero.  When the model is 
> used to transform a set of SparseVectors containing that feature, the 
> resulting SparseVectors contain entries whose values are zero.  The zero 
> entries should be omitted in order to simplify downstream processing.


