[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879539#comment-15879539 ]
Mingjie Tang commented on SPARK-18454: -------------------------------------- [~yunn] I leave some comments on the document. the build index over the input data would be very useful, if we do not shuffle the input data table. > Changes to improve Nearest Neighbor Search for LSH > -------------------------------------------------- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement > Components: ML > Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org