[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832487#comment-15832487 ]
Yun Ni commented on SPARK-18392: -------------------------------- Yes, comparing if the hash signature equals is faster than computing the distance between the key and every instance. > LSH API, algorithm, and documentation follow-ups > ------------------------------------------------ > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML > Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org