huaxingao commented on a change in pull request #26415: [SPARK-18409][ML] LSH approxNearestNeighbors should use approxQuantile instead of sort
URL: https://github.com/apache/spark/pull/26415#discussion_r347146766
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
 ##########
 @@ -137,14 +139,23 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
       val hashDistUDF = udf((x: Seq[Vector]) => hashDistance(x, keyHash), DataTypes.DoubleType)
       val hashDistCol = hashDistUDF(col($(outputCol)))
 
-      // Compute threshold to get exact k elements.
-      // TODO: SPARK-18409: Use approxQuantile to get the threshold
 -      val modelDatasetSortedByHash = modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
-      val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
-      val hashThreshold = thresholdDataset.take(1).head.getDouble(0)
-
-      // Filter the dataset where the hash value is less than the threshold.
-      modelDataset.filter(hashDistCol <= hashThreshold)
+      val modelDatasetWithDist = modelDataset.withColumn(distCol, hashDistCol)
+      var filtered: DataFrame = null
+      var requestedNum = numNearestNeighbors
+      do {
+        requestedNum *= 2
+        if (requestedNum > modelDataset.count()) {
+          requestedNum = modelDataset.count().toInt
+        }
+        var quantile = requestedNum.toDouble / modelDataset.count()
+        var hashThreshold = modelDatasetWithDist.stat
+          .approxQuantile(distCol, Array(quantile), 0.001)
+
+        // Filter the dataset where the hash value is less than the threshold.
+        filtered = modelDatasetWithDist.filter(hashDistCol <= hashThreshold(0))
 
 Review comment:
   I thought it over. It seems to me that it's safe not to have the second loop, and we might not need to double the quantile at the beginning.
   
   If the relative error is set to zero, the exact quantile is computed. Filtering at this exact quantile gives exactly the ```numNearestNeighbors``` items we ask for. Of course, this could be very expensive. If users want to trade off accuracy for performance, they can increase the relative error. In that case, using the approximate quantile as the threshold may return fewer items than ```numNearestNeighbors```, but this is OK since the docs say ```approximately find at most k items```. I tried some tests, but it seems hard to find a universal relative error that works for all cases, since the approximate quantile only guarantees a rank in the range
   
   ```
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
   ```
   In the extreme unlucky case, the returned quantile has rank ```(p - err) * N``` and the elements that meet the requirement happen to fall in the range ```(p - err) * N``` to ```p * N```, so up to roughly ```err * N``` of the requested items are missed. It seems to me the appropriate value of the relative error depends on the distribution of the dataset, and users may need to adjust the relative error for their own dataset.
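   To put numbers on that worst case, here is a small standalone Scala sketch (no Spark needed; the object and method names are illustrative, not Spark API) that evaluates the rank bounds above for a concrete ```N```, ```k```, and relative error:

```scala
object RankBoundDemo {
  // Rank bounds guaranteed for probability p, relative error err, N rows:
  //   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
  // The lower bound is clamped at 0, since a rank cannot be negative.
  def rankBounds(p: Double, err: Double, n: Long): (Long, Long) = {
    val lower = math.max(0L, math.floor((p - err) * n).toLong)
    val upper = math.ceil((p + err) * n).toLong
    (lower, upper)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000L          // dataset size (illustrative)
    val k = 100               // requested number of neighbors
    val p = k.toDouble / n    // quantile corresponding to exactly k items
    val err = 0.001           // the relative error used in the patch

    val (lo, hi) = rankBounds(p, err, n)
    // With err = 0.001 and p = 1e-4, the guaranteed lower bound collapses
    // to 0: in the unlucky case the filter may return far fewer than k rows.
    println(s"rank of threshold lies in [$lo, $hi], but we asked for k = $k")

    // With err = 0, the bound pins the rank to exactly k.
    val (exactLo, exactHi) = rankBounds(p, 0.0, n)
    println(s"exact quantile: rank in [$exactLo, $exactHi]")
  }
}
```

   Under these assumptions, once ```err``` is much larger than ```p``` the lower bound becomes vacuous, which matches the observation that no single relative error works for all cases.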
   
   So I guess 1) we don't need the loop, and 2) we should expose the relative error to the user? 
   I am OK with either doubling the quantile or not doubling it. 
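   For reference, the doubling loop in the patch can be simulated in plain Scala (no Spark; an exact quantile stands in for ```approxQuantile```, and the names here are illustrative, not the PR's API):

```scala
object DoublingLoopDemo {
  // Mirrors the patch's do/while: double the requested count (capped at N),
  // take the corresponding quantile of the distances as a threshold, filter,
  // and repeat until at least k items pass or the whole dataset is requested.
  def neighbors(dists: Seq[Double], k: Int): Seq[Double] = {
    require(k > 0 && k <= dists.size)
    val sorted = dists.sorted
    val n = dists.size
    var requested = k
    var filtered: Seq[Double] = Seq.empty
    var continue = true
    while (continue) {
      requested = math.min(requested * 2, n)
      val p = requested.toDouble / n
      // Exact p-quantile: the value at (1-based) rank ceil(p * n).
      val threshold = sorted(math.min(n, math.ceil(p * n).toInt) - 1)
      filtered = dists.filter(_ <= threshold)
      continue = filtered.size < k && requested < n
    }
    filtered
  }
}
```

   With an exact quantile the loop exits on the first pass, which is the reviewer's point: the retry only matters because the approximate threshold can undershoot.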

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 