[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906133

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
- * a perfect hash function:
- *    `h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*`
- *
- * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
+ * Model produced by [[MinHashLSH]], where multiple hash functions are stored. Each hash function is
+ * a perfect hash function for a specific set `S` with cardinality equal to a half of `numEntries`:
--- End diff --

I'm not following exactly why the cardinality of `S` is _half_ of `numEntries`. Actually, why is the threshold for feature dimensionality `prime / 2`?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
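For readers skimming the thread, the hash family under discussion can be sketched outside Spark. This is an illustrative reconstruction, not the patch's code; `prime` is the constant from the `MinHashLSH` companion object, and the `(1 + elem)` shift matches the implementation quoted elsewhere in this thread.

```scala
object MinHashSketch {
  // The large prime modulus used by MinHashLSH (from its companion object).
  val prime = 2038074743L

  // One min-hash value: h(S) = min over x in S of ((1 + x) * k mod prime) mod numEntries,
  // where k is a random coefficient and S is the set of non-zero indices.
  def minHash(indices: Seq[Int], k: Long, numEntries: Int): Long =
    indices.map(i => (1L + i) * k % prime % numEntries).min
}
```

For example, `MinHashSketch.minHash(Seq(2, 3, 5), 3L, 20)` evaluates the three candidates 9, 12 and 18 and returns 9.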
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906309

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
     override val uid: String,
-    @Since("2.1.0") val numEntries: Int,
-    @Since("2.1.0") val randCoefficients: Array[Int])
+    @Since("2.1.0") private[ml] val numEntries: Int,
+    @Since("2.1.0") private[ml] val randCoefficients: Array[Int])
   extends LSHModel[MinHashModel] {

   @Since("2.1.0")
-  override protected[ml] val hashFunction: Vector => Vector = {
-    elems: Vector =>
+  override protected[ml] val hashFunction: Vector => Array[Vector] = {
+    elems: Vector => {
       require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
       val elemsList = elems.toSparse.indices.toList
       val hashValues = randCoefficients.map({ randCoefficient: Int =>
-        elemsList.map({ elem: Int =>
-          (1 + elem) * randCoefficient.toLong % MinHash.prime % numEntries
-        }).min.toDouble
+        elemsList.map({ elem: Int =>
--- End diff --

Redundant brackets. Just use `elemsList.map { elem: Int =>`.
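Applied to the whole expression, the suggested brace style would read roughly as follows. This is a standalone sketch with plain parameters in place of the model's fields and Spark's `Vector`, not the actual patch:

```scala
// Hypothetical standalone version of the hashValues computation, using the
// idiomatic `map { ... }` form instead of the redundant `map({ ... })`.
def hashValues(elemsList: List[Int], randCoefficients: Array[Int],
               prime: Long, numEntries: Int): Array[Double] =
  randCoefficients.map { randCoefficient =>
    elemsList.map { elem =>
      (1 + elem) * randCoefficient.toLong % prime % numEntries
    }.min.toDouble
  }
```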
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844941

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * transformed data when necessary.
    *
    * This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
--- End diff --

"Probing only one bucket"
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906709

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
     override val uid: String,
-    @Since("2.1.0") val numEntries: Int,
-    @Since("2.1.0") val randCoefficients: Array[Int])
+    @Since("2.1.0") private[ml] val numEntries: Int,
--- End diff --

No `@Since` tags for private values.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874869

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   self: T =>

   /**
-   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * The hash function of LSH, mapping an input feature to multiple vectors
--- End diff --

"mapping an input feature vector to multiple hash vectors."
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87878252

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Also, the comment above says:

 * ... For example,
 *    `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
 * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
 * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are treated
 * as binary "1" values.

Can we change it to:

 * ... For example,
 *    `Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
 * means there are 10 elements in the space. This set contains non-zero values at indices 2, 3, and
 * 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are
 * treated as binary "1" values.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87908012

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
   @Since("2.1.0")
   override protected[ml] def createRawLSHModel(inputDim: Int): MinHashModel = {
-    require(inputDim <= MinHash.prime / 2,
-      s"The input vector dimension $inputDim exceeds the threshold ${MinHash.prime / 2}.")
+    require(inputDim <= MinHashLSH.prime / 2,
+      s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.prime / 2}.")
     val rand = new Random($(seed))
     val numEntry = inputDim * 2
-    val randCoofs: Array[Int] = Array.fill($(outputDim))(1 + rand.nextInt(MinHash.prime - 1))
+    val randCoofs: Array[Int] = Array.fill($(numHashTables))(1 + rand.nextInt(MinHashLSH.prime - 1))
--- End diff --

`randCoefs`
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87922281

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
       inputName: String,
       explodeCols: Seq[String]): Dataset[_] = {
     require(explodeCols.size == 2, "explodeCols must be two strings.")
-    val vectorToMap = udf((x: Vector) => x.asBreeze.iterator.toMap,
-      MapType(DataTypes.IntegerType, DataTypes.DoubleType))
     val modelDataset: DataFrame = if (!dataset.columns.contains($(outputCol))) {
       transform(dataset)
     } else {
       dataset.toDF()
     }
     modelDataset.select(
-      struct(col("*")).as(inputName),
-      explode(vectorToMap(col($(outputCol)))).as(explodeCols))
+      struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
--- End diff --

Well here's a fun one. When I run this test:

```scala
test("memory leak test") {
  val numDim = 50
  val data = {
    for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2))
      yield Vectors.sparse(numDim, Seq((i, j.toDouble)))
  }
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys")

  // Project from 50 dimensional Euclidean space to 10 dimensions
  val brp = new BucketedRandomProjectionLSH()
    .setNumHashTables(10)
    .setInputCol("keys")
    .setOutputCol("values")
    .setBucketLength(2.5)
    .setSeed(12345)
  val model = brp.fit(df)

  val joined = model.approxSimilarityJoin(df, df, Double.MaxValue, "distCol")
  joined.show()
}
```

I get the following error:

    [info] - BucketedRandomProjectionLSH with high dimension data: test of LSH property *** FAILED *** (7 seconds, 568 milliseconds)
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 205, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 33816576 bytes, TID = 205
    [info]   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:295)
    [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    [info]   at java.lang.Thread.run(Thread.java:745)

Could you run the same test and see if you get an error?
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87904353

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala ---
@@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.Dataset

-class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class MinHashLSHSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
--- End diff --

Looking at the code for LSH, I see a few requires on input to some of the public methods, but there aren't tests for these edge cases. Specifically, we should add:

**MinHash**
* tests for empty vectors (or all-zero vectors)
* tests for `inputDim > prime / 2`

**LSH**
* a test for `numNearestNeighbors < 0`
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875688

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
--- End diff --

Not specifically related to this PR: I checked, and the default random uids used in the ML library never contain spaces. For more complex uids, it seems more common to use camel case, but I do see some with hyphens. Can we make the default uids `"mh-lsh"` and `"brp-lsh"` or similar?
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87928721

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala ---
@@ -89,23 +90,25 @@ class RandomProjectionModel private[ml] (
   }

   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
   }

   @Since("2.1.0")
   override def copy(extra: ParamMap): this.type = defaultCopy(extra)

   @Since("2.1.0")
-  override def write: MLWriter = new RandomProjectionModel.RandomProjectionModelWriter(this)
+  override def write: MLWriter = {
+    new BucketedRandomProjectionModel.BucketedRandomProjectionModelWriter(this)
+  }
 }

 /**
  * :: Experimental ::
  *
- * This [[RandomProjection]] implements Locality Sensitive Hashing functions for Euclidean
- * distance metrics.
+ * This [[BucketedRandomProjectionLSH]] implements Locality Sensitive Hashing functions for
+ * Euclidean distance metrics.
  *
  * The input is dense or sparse vectors, each of which represents a point in the Euclidean
  * distance space. The output will be vectors of configurable dimension. Hash value in the same
--- End diff --

"Hash values in the same dimension are calculated"
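The new `hashDistance` in this hunk reduces a pair of hash-vector sequences to the minimum squared distance over corresponding entries. A standalone sketch of that behavior, using plain arrays in place of Spark's `Vector` and `Vectors.sqdist` (an illustrative reconstruction, not the patch's code):

```scala
// Minimum squared Euclidean distance over corresponding hash vectors.
// With OR-amplification, two points are candidate neighbors if ANY one
// of their hash tables puts them close together, hence the final `.min`.
def hashDistance(x: Seq[Array[Double]], y: Seq[Array[Double]]): Double =
  x.zip(y).map { case (a, b) =>
    a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum
  }.min
```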
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87876322

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Change the model names to reflect the new estimator names.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87871105

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +106,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * transformed data when necessary.
    *
    * This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
+   *  - Multi-probe: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *
+   * Currently it is made private since more discussion is needed for Multi-probe
--- End diff --

I don't understand the point here. Are you trying to make the `approxNearestNeighbors` method completely private? There is still a public overload of this method, which now shows up as the only method in the docs and just says "overloaded method for approxNearestNeighbors". The doc above does not show up.

As a general rule, we should always generate and closely inspect the docs to make sure that they are what we intend and that they make sense from an end user's perspective.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874663

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -35,26 +35,26 @@ private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
   /**
    * Param for the dimension of LSH OR-amplification.
    *
-   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
-   * higher the dimension is, the lower the false negative rate.
+   * LSH OR-amplification can be used to reduce the false negative rate. The higher the dimension
--- End diff --

We are still using the word "dimension" here. It might also be useful to add that reducing false negatives comes at the cost of added computation. How does this sound?

 * Param for the number of hash tables used in LSH OR-amplification.
 *
 * LSH OR-amplification can be used to reduce the false negative rate. Higher values for this
 * param lead to a reduced false negative rate, at the expense of added computational complexity.
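The suggested wording can be made concrete: if each of the hash tables independently collides two similar points with probability p, OR-amplification misses the pair only when every table misses it, which is why more tables mean fewer false negatives at a proportionally higher hashing cost. A small illustrative sketch, not part of the patch:

```scala
// P(a similar pair never lands in the same bucket across numHashTables
// independent tables) = (1 - p)^numHashTables, where p is the per-table
// collision probability. More tables => fewer false negatives.
def missProbability(p: Double, numHashTables: Int): Double =
  math.pow(1.0 - p, numHashTables)
```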
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87910679

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }

   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
+    x.zip(y).map(vectorPair =>
--- End diff --

At this point I'm quite unsure, but this does not look to me like what was discussed [here](https://github.com/apache/spark/pull/15800#event-857283655). @jkbradley Can you confirm this is what you wanted?
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875995

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }

   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
--- End diff --

This is likely to confuse future developers. Let's just link it to a JIRA and note that it may be changed.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844308

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
 }

 @Since("2.1.0")
-object MinHash extends DefaultParamsReadable[MinHash] {
+object MinHashLSH extends DefaultParamsReadable[MinHashLSH] {

   // A large prime smaller than sqrt(2^63 - 1)
   private[ml] val prime = 2038074743
--- End diff --

We typically use all caps for constants like these. I prefer `MinHashLSH.HASH_PRIME` or `MinHashLSH.PRIME_MODULUS`.
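As a side note on why the constant sits below sqrt(2^63 - 1): it guarantees that the product of any two residues modulo `prime` fits in a signed 64-bit `Long`, so expressions like `(1 + elem) * randCoefficient.toLong % prime` elsewhere in this thread cannot overflow. A quick standalone check (illustrative, using the constant quoted in the diff):

```scala
object PrimeCheck {
  val prime = 2038074743L

  // prime * prime must not exceed Long.MaxValue = 2^63 - 1, so any product of
  // two values in [0, prime) is safe in 64-bit integer arithmetic.
  val productIsSafe: Boolean = prime <= Long.MaxValue / prime
}
```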