Hi there,

I am using spark-mllib_2.11-2.1.0 and am facing an issue where BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns exactly one result.
The dataset looks like:

  +----+--------------------+-------------+------------------------+----------------------+
  |  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|
  +----+--------------------+-------------+------------------------+----------------------+
  |1045|(16384,[196,11016...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|
  |1041|(16384,[4110,1065...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [-...|
  +----+--------------------+-------------+------------------------+----------------------+

Execution code:

  Dataset<Row> approximatedDS = (Dataset<Row>) ((BucketedRandomProjectionLSHModel) model)
      .approxNearestNeighbors(dataset, vectorToCalculateAgainst, numberOfResults,
          false, MLFlowConstants.THEMES_PREDICTION_COLUMNS.distance.name());

where:

  numberOfResults = 2
  vectorToCalculateAgainst = the first vector in the predictionVectorFeatures column

approximatedDS looks as follows:

  +----+--------------------+-------------+------------------------+----------------------+------------------+
  |  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|          distance|
  +----+--------------------+-------------+------------------------+----------------------+------------------+
  |1061|(16384,[196,11016...|            1|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|0.8536603178950374|
  +----+--------------------+-------------+------------------------+----------------------+------------------+

I suspect that in LSH.scala the last filter below does the wrong filtering, but I may be mistaken (I do not know Scala):

    // Compute threshold to get exact k elements.
    // TODO: SPARK-18409: Use approxQuantile to get the threshold
    val modelDatasetSortedByHash = modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
    val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
    val hashThreshold = thresholdDataset.take(1).head.getDouble(0)

    // Filter the dataset where the hash value is less than the threshold.
    modelDataset.filter(hashDistCol <= hashThreshold)
  }
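For what it is worth, the threshold-then-filter step from that Scala snippet can be sketched in plain Java (the class and method names here are made up for illustration). On its own, the logic keeps at least k rows whenever at least k rows have a computed hash distance (ties at the threshold can make it keep more, never fewer), so if that reasoning carries over, a single result would point at the candidate set being small before this step rather than at the filter itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ThresholdSketch {
    // Mimics the LSH.scala logic: take the k-th smallest hash distance
    // as the threshold, then keep every row at or below that threshold.
    static List<Double> filterByTopKThreshold(List<Double> hashDistances, int k) {
        double threshold = hashDistances.stream()
                .sorted()                      // sort(hashDistCol)
                .limit(k)                      // .limit(numNearestNeighbors)
                .max(Double::compare)          // select(max(hashDistCol))
                .orElse(Double.NEGATIVE_INFINITY);
        return hashDistances.stream()
                .filter(d -> d <= threshold)   // filter(hashDistCol <= hashThreshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Distinct distances: exactly k candidates survive.
        System.out.println(filterByTopKThreshold(Arrays.asList(0.3, 0.1, 0.9, 0.5), 2).size()); // 2
        // All distances tied: more than k survive, never fewer.
        System.out.println(filterByTopKThreshold(Arrays.asList(0.1, 0.1, 0.1), 2).size());      // 3
    }
}
```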
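Separately, a toy sketch of the bucketed-random-projection hash itself (class, vectors, and values are invented for illustration) shows why the model's bucket length matters for how many rows even become candidates: h(x) = floor((x . v) / bucketLength), so a small bucket length spreads nearby points into different buckets:

```java
public class BrpHashSketch {
    // Bucketed random projection hash: h(x) = floor((x . v) / bucketLength).
    // A larger bucketLength puts more points into the same bucket,
    // giving approxNearestNeighbors more candidates to rank.
    static int hash(double[] x, double[] v, double bucketLength) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) dot += x[i] * v[i];
        return (int) Math.floor(dot / bucketLength);
    }

    public static void main(String[] args) {
        double[] v = {0.6, 0.8};   // a fixed "random" projection vector
        double[] a = {1.0, 0.0};   // dot product with v = 0.6
        double[] b = {0.0, 1.0};   // dot product with v = 0.8
        // Small bucket: the two points land in different buckets.
        System.out.println(hash(a, v, 0.25) == hash(b, v, 0.25)); // false
        // Larger bucket: they collide, so both become candidates.
        System.out.println(hash(a, v, 1.0) == hash(b, v, 1.0));   // true
    }
}
```

If a too-small candidate pool turns out to be the cause, the usual knobs to try on the BucketedRandomProjectionLSH estimator are setBucketLength and setNumHashTables (both exist in Spark 2.1).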
Can anyone help me understand how to make BucketedRandomProjectionLSHModel.approxNearestNeighbors return multiple "nearest" vectors?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/BucketedRandomProjectionLSHModel-algorithm-details-tp28578.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.