[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906133
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
- * a perfect hash function:
- *    `h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*`
- *
- * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
+ * Model produced by [[MinHashLSH]], where multiple hash functions are stored. Each hash function is
+ * a perfect hash function for a specific set `S` with cardinality equal to a half of `numEntries`:
--- End diff --

I'm not following exactly why the cardinality of `S` is _half_ of `numEntries`. Actually, why is the threshold for feature dimensionality `prime / 2`?
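For reference, here is a minimal sketch of the hash family as this file implements it (my own illustration, not code from the PR; `k` stands in for one entry of `randCoefficients`):

```scala
// Sketch only: each hash keeps the minimum of
// h_k(x) = (1 + x) * k mod prime mod numEntries over the non-zero indices x.
val prime = 2038074743L  // MinHashLSH.prime
def minHash(k: Int, numEntries: Int)(indices: Seq[Int]): Double =
  indices.map(x => (1 + x).toLong * k % prime % numEntries).min.toDouble
```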





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906309
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
     override val uid: String,
-    @Since("2.1.0") val numEntries: Int,
-    @Since("2.1.0") val randCoefficients: Array[Int])
+    @Since("2.1.0") private[ml] val numEntries: Int,
+    @Since("2.1.0") private[ml] val randCoefficients: Array[Int])
   extends LSHModel[MinHashModel] {
 
   @Since("2.1.0")
-  override protected[ml] val hashFunction: Vector => Vector = {
-    elems: Vector =>
+  override protected[ml] val hashFunction: Vector => Array[Vector] = {
+    elems: Vector => {
       require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
       val elemsList = elems.toSparse.indices.toList
       val hashValues = randCoefficients.map({ randCoefficient: Int =>
-        elemsList.map({elem: Int =>
-          (1 + elem) * randCoefficient.toLong % MinHash.prime % numEntries
-        }).min.toDouble
+        elemsList.map({ elem: Int =>
--- End diff --

redundant brackets. Just use `elemsList.map { elem: Int =>`
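A sketch of what that would look like here (same logic as the diff, only the brace style changed):

```scala
// Idiomatic single-argument closure: braces without wrapping parentheses
elemsList.map { elem: Int =>
  (1 + elem) * randCoefficient.toLong % MinHashLSH.prime % numEntries
}.min.toDouble
```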





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87844941
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * transformed data when necessary.
    *
    * This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
--- End diff --

"Probing only one bucket"





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906709
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
 override val uid: String,
-@Since("2.1.0") val numEntries: Int,
-@Since("2.1.0") val randCoefficients: Array[Int])
+@Since("2.1.0") private[ml] val numEntries: Int,
--- End diff --

No `@Since` tags for private values.
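For instance, the constructor would become (a sketch, not the final signature):

```scala
class MinHashModel private[ml] (
    override val uid: String,
    private[ml] val numEntries: Int,              // private: drop the @Since tag
    private[ml] val randCoefficients: Array[Int]) // private: drop the @Since tag
  extends LSHModel[MinHashModel]
```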





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87874869
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   self: T =>
 
   /**
-   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * The hash function of LSH, mapping an input feature to multiple vectors
--- End diff --

"mapping an input feature vector to multiple hash vectors."





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87878252
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Also, the comment above says:

 * ... For example,
 *    `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
 * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
 * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are
 * treated as binary "1" values.

Can we change it to:

 * ... For example,
 *    `Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
 * means there are 10 elements in the space. This set contains non-zero values at indices 2, 3,
 * and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values
 * are treated as binary "1" values.
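As a quick sanity check, the corrected example compiles and runs as-is (sketch using `org.apache.spark.ml.linalg`):

```scala
import org.apache.spark.ml.linalg.Vectors

// A set over a space of 10 elements, with binary "1" values at indices 2, 3, and 5
val v = Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))
assert(v.numNonzeros == 3)
```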






[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87908012
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
 
   @Since("2.1.0")
   override protected[ml] def createRawLSHModel(inputDim: Int): MinHashModel = {
-    require(inputDim <= MinHash.prime / 2,
-      s"The input vector dimension $inputDim exceeds the threshold ${MinHash.prime / 2}.")
+    require(inputDim <= MinHashLSH.prime / 2,
+      s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.prime / 2}.")
     val rand = new Random($(seed))
     val numEntry = inputDim * 2
-    val randCoofs: Array[Int] = Array.fill($(outputDim))(1 + rand.nextInt(MinHash.prime - 1))
+    val randCoofs: Array[Int] = Array.fill($(numHashTables))(1 + rand.nextInt(MinHashLSH.prime - 1))
--- End diff --

Typo: `randCoofs` should be `randCoefs`.
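That is, the line would read (sketch with only the rename applied):

```scala
val randCoefs: Array[Int] = Array.fill($(numHashTables))(1 + rand.nextInt(MinHashLSH.prime - 1))
```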





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87922281
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
       inputName: String,
       explodeCols: Seq[String]): Dataset[_] = {
     require(explodeCols.size == 2, "explodeCols must be two strings.")
-    val vectorToMap = udf((x: Vector) => x.asBreeze.iterator.toMap,
-      MapType(DataTypes.IntegerType, DataTypes.DoubleType))
     val modelDataset: DataFrame = if (!dataset.columns.contains($(outputCol))) {
       transform(dataset)
     } else {
       dataset.toDF()
     }
     modelDataset.select(
-      struct(col("*")).as(inputName),
-      explode(vectorToMap(col($(outputCol)))).as(explodeCols))
+      struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
--- End diff --

Well here's a fun one. When I run this test:

```scala
test("memory leak test") {
  val numDim = 50
  val data = {
    for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2))
      yield Vectors.sparse(numDim, Seq((i, j.toDouble)))
  }
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys")

  // Project the 50-dimensional input using 10 hash tables
  val brp = new BucketedRandomProjectionLSH()
    .setNumHashTables(10)
    .setInputCol("keys")
    .setOutputCol("values")
    .setBucketLength(2.5)
    .setSeed(12345)
  val model = brp.fit(df)
  val joined = model.approxSimilarityJoin(df, df, Double.MaxValue, "distCol")
  joined.show()
}
```

I get the following error:

```
[info] - BucketedRandomProjectionLSH with high dimension data: test of LSH property *** FAILED *** (7 seconds, 568 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 205, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 33816576 bytes, TID = 205
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:295)
[info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]  at java.lang.Thread.run(Thread.java:745)
```
Could you run the same test and see if you get an error?





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87904353
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala ---
@@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.Dataset
 
-class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class MinHashLSHSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
--- End diff --

Looking at the code for LSH, I see a few requires on input to some of the public methods, but there aren't tests for these edge cases. Specifically, we should add (see the sketch after this list):

**MinHash**
* tests for empty vectors (or all-zero vectors)
* tests for `inputDim > prime / 2`

**LSH**
* a test for `numNearestNeighbors < 0`
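A rough sketch of the first of these, assuming the usual suite helpers are in scope; the test name, setup, and expected exception type are illustrative only:

```scala
import org.apache.spark.ml.linalg.Vectors

test("MinHashLSH: input vectors must have at least one non-zero entry") {
  val data = Seq(Vectors.sparse(10, Seq((1, 1.0)))).map(Tuple1.apply)
  val df = spark.createDataFrame(data).toDF("keys")
  val model = new MinHashLSH()
    .setInputCol("keys")
    .setOutputCol("values")
    .fit(df)
  // hashFunction requires numNonzeros > 0, so an all-zero vector should throw
  intercept[IllegalArgumentException] {
    model.hashFunction(Vectors.sparse(10, Seq.empty[(Int, Double)]))
  }
}
```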





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87875688
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
--- End diff --

Not specifically related to this PR: I checked, and the default random uids used in the ML library never contain spaces. For more complex uids, it seems more common to use camel case, but I do see some with hyphens. Can we make the default uids `"mh-lsh"` and `"brp-lsh"` or similar?
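For instance, the no-arg constructors could be (a sketch of the proposal using `org.apache.spark.ml.util.Identifiable`; the exact strings are still open):

```scala
// In MinHashLSH:
def this() = this(Identifiable.randomUID("mh-lsh"))

// In BucketedRandomProjectionLSH:
def this() = this(Identifiable.randomUID("brp-lsh"))
```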





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87928721
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala ---
@@ -89,23 +90,25 @@ class RandomProjectionModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
   }
 
   @Since("2.1.0")
   override def copy(extra: ParamMap): this.type = defaultCopy(extra)
 
   @Since("2.1.0")
-  override def write: MLWriter = new RandomProjectionModel.RandomProjectionModelWriter(this)
+  override def write: MLWriter = {
+    new BucketedRandomProjectionModel.BucketedRandomProjectionModelWriter(this)
+  }
 }
 
 /**
  * :: Experimental ::
  *
- * This [[RandomProjection]] implements Locality Sensitive Hashing functions for Euclidean
- * distance metrics.
+ * This [[BucketedRandomProjectionLSH]] implements Locality Sensitive Hashing functions for
+ * Euclidean distance metrics.
  *
  * The input is dense or sparse vectors, each of which represents a point in the Euclidean
  * distance space. The output will be vectors of configurable dimension. Hash value in the same
--- End diff --

"Hash values in the same dimension are calculated"





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87876322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Change the model names to reflect the new estimator names.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87871105
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +106,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * transformed data when necessary.
    *
    * This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
+   *  - Multi-probe: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *
+   * Currently it is made private since more discussion is needed for Multi-probe
--- End diff --

I don't understand the point here. Are you trying to make the 
`approxNearestNeighbors` method completely private? There is still a public 
overload of this method - which now shows up as the only method in the docs and 
just says "overloaded method for approxNearestNeighbors". This doc above does 
not show up. 

As a general rule, we should always generate and closely inspect the docs 
to make sure that they are what we intend and that they make sense from an end 
user's perspective.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87874663
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -35,26 +35,26 @@ private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
   /**
    * Param for the dimension of LSH OR-amplification.
    *
-   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
-   * higher the dimension is, the lower the false negative rate.
+   * LSH OR-amplification can be used to reduce the false negative rate. The higher the dimension
--- End diff --

We are still using the word "dimension" here. It might also be useful to add that reducing false negatives comes at the cost of added computation. How does this sound?

   * Param for the number of hash tables used in LSH OR-amplification.
   *
   * LSH OR-amplification can be used to reduce the false negative rate. Higher values for this
   * param lead to a reduced false negative rate, at the expense of added computational complexity.
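As a sketch, the param declaration might then read (an assumed shape; the real definition lives in `LSHParams` and may differ):

```scala
import org.apache.spark.ml.param.{IntParam, ParamValidators}

// Hypothetical declaration matching the suggested doc string
val numHashTables: IntParam = new IntParam(this, "numHashTables",
  "number of hash tables used in LSH OR-amplification; higher values reduce the " +
    "false negative rate at the expense of added computational complexity",
  ParamValidators.gt(0))
```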






[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87910679
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
+    x.zip(y).map(vectorPair =>
--- End diff --

At this point, I'm quite unsure, but this does not look to me like what was discussed [here](https://github.com/apache/spark/pull/15800#event-857283655). @jkbradley Can you confirm this is what you wanted?





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87875995
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
--- End diff --

This is likely to confuse future developers. Let's just link it to a JIRA 
and note that it may be changed.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87844308
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
 }
 
 @Since("2.1.0")
-object MinHash extends DefaultParamsReadable[MinHash] {
+object MinHashLSH extends DefaultParamsReadable[MinHashLSH] {
   // A large prime smaller than sqrt(2^63 − 1)
   private[ml] val prime = 2038074743
--- End diff --

We typically use all caps for constants like these. I prefer 
`MinHashLSH.HASH_PRIME` or `MinHashLSH.PRIME_MODULUS`. 
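For example, with the first option (a sketch; the name choice is still open):

```scala
@Since("2.1.0")
object MinHashLSH extends DefaultParamsReadable[MinHashLSH] {
  // A large prime smaller than sqrt(2^63 - 1), used by the MinHash family
  private[ml] val HASH_PRIME = 2038074743
}
```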

