Joseph K. Bradley created SPARK-18392: -----------------------------------------
Summary: LSH API, algorithm, and documentation follow-ups Key: SPARK-18392 URL: https://issues.apache.org/jira/browse/SPARK-18392 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley This JIRA summarizes discussions from the initial LSH PR [https://github.com/apache/spark/pull/15148] as well as the follow-up for hash distance [https://github.com/apache/spark/pull/15800]. This will be broken into subtasks: * API changes (targeted for 2.1) * algorithmic fixes (targeted for 2.1) * documentation improvements (ideally 2.1, but could slip) The major issues we have mentioned are as follows: * OR vs AND amplification ** Need to make API flexible enough to support both types of amplification in the future ** Need to clarify which we support, including in each model function (transform, similarity join, neighbors) * Need to clarify which algorithms we have implemented, improve docs and references, and fix the algorithms if needed. These major issues are broken down into detailed issues below. h3. LSH abstraction * Rename {{outputDim}} to something indicative of OR-amplification. ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used in the future for AND amplification (Thanks [~mlnick]!) * transform ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This is the "raw" output of all hash functions, i.e., with no aggregation for amplification. ** Clarify meaning of output in terms of multiple hash functions and amplification. ** Note: We will _not_ worry about users using this output for dimensionality reduction; if anything, that use case can be explained in the User Guide. * Documentation ** Clarify terminology used everywhere *** hash function {{h_i}}: basic hash function without amplification *** hash value {{h_i(key)}}: output of a hash function *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with AND-amplification using K base hash functions *** compound hash function value {{g(key)}}: vector-valued output *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with OR-amplification using L compound hash functions *** hash table value {{H(key)}}: output of array of vectors *** This terminology is largely pulled from Wang et al.'s survey and the multi-probe LSH paper. ** Link clearly to documentation (Wikipedia or papers) which matches our terminology and what we implemented h3. RandomProjection (or P-Stable Distributions) * Rename {{RandomProjection}} ** Options include: {{ScalarRandomProjectionLSH}}, {{BucketedRandomProjectionLSH}}, {{PStableLSH}} * API privacy ** Make randUnitVectors private * hashFunction ** Currently, this uses OR-amplification for single probing, as we intended. ** It does *not* do multiple probing, at least not in the sense of the original MPLSH paper. We should fix that or at least document its behavior. * Documentation ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia ** Also link to the multi-probe LSH paper since that explains this method very clearly. ** Clarify hash function and distance metric h3. MinHash * Rename {{MinHash}} -> {{MinHashLSH}} * API privacy ** Make randCoefficients, numEntries private * hashDistance (used in approxNearestNeighbors) ** Update to use average of indicators of hash collisions [SPARK-18334] ** See [Wikipedia | https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a reference h3. All references I'm just listing references I looked at. Papers * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe LSH paper Wikipedia * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org