Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148

I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter.

Thinking more about it, this is a tough question. What is the intended use of the output column generated by `transform`? Is it an alternative set of features with decreased dimensionality? When/if we use AND/OR amplification, we could go a couple of different routes. Say for d = 3 and L = 3, we could first apply our hashing scheme to the input to obtain:

| features             | g1               | g2               | g3               |
|----------------------|------------------|------------------|------------------|
| [12.5609584702036... | [112.0,1.0,12.0] | [1.0,120.0,16.0] | [102.0,1.0,14.0] |
| ...                  | ...              | ...              | ...              |

Then we generate `g1(q), g2(q), g3(q)`, where q is the query point, and select all points where `g1(q) == g1(x_i) OR g2(q) == g2(x_i) OR ...`.

In spark-neighbors, by contrast, the output dataframe has `L * N` rows, where N is the number of rows in the input dataframe. You can then join on the hashed column plus a "table identifier" (the index l in the range [1, L]). Still, this creates a temporary dataframe inside the near-neighbors or approx-join algorithms, and I'm not sure the output schema of `transform` needs to carry all `L` hashed values. We could store `randUnitVectors: Array[Array[Vector]]` and have `transform` output the hashed value for only the first sequence of random vectors, but that seems a bit strange to me. Thoughts?
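To make the OR-amplification route concrete, here is a minimal, self-contained Scala sketch (not the PR's API; the names `compoundHash` and `candidates` and the toy random-projection hash are all hypothetical). Each compound function g_l is the "AND" concatenation of d base hashes, and a point x is returned as a candidate for query q if any of the L compound keys collides exactly, i.e. `g_l(q) == g_l(x)` for some l:

```scala
// Hypothetical sketch of AND/OR-amplified LSH lookup, with d base hashes
// concatenated per compound function ("AND") and L compound functions
// combined by taking the union of exact-key collisions ("OR").
object AmplifiedLSHSketch {
  type Point = Array[Double]

  // "AND" step: concatenate d base hash values into one compound key.
  def compoundHash(baseHashes: Seq[Point => Int])(p: Point): Seq[Int] =
    baseHashes.map(h => h(p))

  // "OR" step: x is a candidate for q if ANY of the L compound keys collides.
  def candidates(gs: Seq[Point => Seq[Int]],
                 data: Seq[Point],
                 q: Point): Seq[Point] = {
    val qKeys = gs.map(g => g(q))
    data.filter(x => gs.zip(qKeys).exists { case (g, qk) => g(x) == qk })
  }

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val dims = 4; val d = 3; val L = 3

    // Toy base hash: a floored random projection (illustration only).
    def randomHash(): Point => Int = {
      val w = Array.fill(dims)(rng.nextGaussian())
      p => math.floor(p.zip(w).map { case (a, b) => a * b }.sum).toInt
    }

    // L compound functions, each built from d fresh base hashes.
    val gs: Seq[Point => Seq[Int]] =
      Seq.fill(L)(compoundHash(Seq.fill(d)(randomHash())) _)

    val data = Seq.fill(20)(Array.fill(dims)(rng.nextGaussian()))
    val q = data.head // querying with an indexed point always collides
    println(s"candidates found: ${candidates(gs, data, q).size}")
  }
}
```

The spark-neighbors-style alternative would instead explode each point into L rows keyed by the pair (l, g_l(x)) and equi-join the exploded query against them; that trades the per-query OR scan above for a standard join at the cost of an `L * N`-row intermediate dataframe.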