Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148

I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter.

Thinking more about it, this is a tough question. What is the intended use of the output column generated by `transform`? Is it an alternative set of features with decreased dimensionality? When/if we use AND/OR amplification, we could go a couple of different routes. Say for d = 3 and L = 3, we could first apply our hashing scheme to the input to obtain:

| features             | g1               | g2               | g3               |
|----------------------|------------------|------------------|------------------|
| [12.5609584702036... | [112.0,1.0,12.0] | [1.0,120.0,16.0] | [102.0,1.0,14.0] |
| ...                  | ...              | ...              | ...              |

Then we generate `g1(q), g2(q), g3(q)`, where q is the query point, and select all points where `g1(q) == g1(x_i) OR g2(q) == g2(x_i) OR ...`.

In spark-neighbors, by contrast, the output dataframe has `L * N` rows, where N is the number of rows in the input dataframe. You can then join on the hashed column plus a "table identifier" (the index l in the range [1, L]). Still, this creates a temporary dataframe inside the near-neighbors or approx-join algorithms, and I'm not sure the output schema of `transform` needs to carry all `L` hashed values. We could store `randUnitVectors: Array[Array[Vector]]` and have `transform` output the hashed value for only the first sequence of random vectors, but that seems a bit strange to me. Thoughts?
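To make the OR-amplification route concrete, here is a minimal, self-contained Scala sketch (not the PR's API; the names `compoundHash` and `candidates` and the toy random-projection hash are all hypothetical). Each compound function g_l is the "AND" concatenation of d base hashes, and a point x is returned as a candidate for query q if any of the L compound keys collides exactly, i.e. `g_l(q) == g_l(x)` for some l:

```scala
// Hypothetical sketch of AND/OR-amplified LSH lookup, with d base hashes
// concatenated per compound function ("AND") and L compound functions
// combined by taking the union of exact-key collisions ("OR").
object AmplifiedLSHSketch {
  type Point = Array[Double]

  // "AND" step: concatenate d base hash values into one compound key.
  def compoundHash(baseHashes: Seq[Point => Int])(p: Point): Seq[Int] =
    baseHashes.map(h => h(p))

  // "OR" step: x is a candidate for q if ANY of the L compound keys collides.
  def candidates(gs: Seq[Point => Seq[Int]],
                 data: Seq[Point],
                 q: Point): Seq[Point] = {
    val qKeys = gs.map(g => g(q))
    data.filter(x => gs.zip(qKeys).exists { case (g, qk) => g(x) == qk })
  }

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val dims = 4; val d = 3; val L = 3

    // Toy base hash: a floored random projection (illustration only).
    def randomHash(): Point => Int = {
      val w = Array.fill(dims)(rng.nextGaussian())
      p => math.floor(p.zip(w).map { case (a, b) => a * b }.sum).toInt
    }

    // L compound functions, each built from d fresh base hashes.
    val gs: Seq[Point => Seq[Int]] =
      Seq.fill(L)(compoundHash(Seq.fill(d)(randomHash())) _)

    val data = Seq.fill(20)(Array.fill(dims)(rng.nextGaussian()))
    val q = data.head // querying with an indexed point always collides
    println(s"candidates found: ${candidates(gs, data, q).size}")
  }
}
```

The spark-neighbors-style alternative would instead explode each point into L rows keyed by the pair (l, g_l(x)) and equi-join the exploded query against them; that trades the per-query OR scan above for a standard join at the cost of an `L * N`-row intermediate dataframe.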