Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    @MLnick  I agree with most of your comments.  A few responses:
    
    > In terms of transform - I disagree somewhat that the main use case is 
"dimensionality reduction". Perhaps there are common examples of using the hash 
signatures as a lower-dim representation as a feature in some model (e.g. in a 
similar way to say a PCA transform), but I haven't seen that.
    
    This use is very common in academic research and the literature, though it 
may be less common in industry.  I'm fine with not considering it for now.
    
    >  I also don't see why randUnitVectors or randCoefficients needs to be 
public
    
    You mentioned people using LSH outside of Spark for serving.  In order to 
do that, we will need to expose randUnitVectors and randCoefficients so that 
users can compute hash values for query points.  That said, I'm fine with 
making those private for now and ruling out this use case for one release while 
we stabilize the API.
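
As a rough illustration of the serving use case above, here is a minimal sketch of hashing a query point outside of Spark once the random unit vectors are available. It assumes the common random-projection hash floor((x . v) / w); the names `make_rand_unit_vectors`, `hash_query`, and `bucket_length` are hypothetical and are not taken from this PR.

```python
import math
import random


def make_rand_unit_vectors(num_tables, dim, seed):
    """Stand-in for the model's random unit vectors (hypothetical helper).

    Draws one Gaussian vector per hash table and normalizes it to unit length.
    """
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def hash_query(point, rand_unit_vectors, bucket_length):
    """Compute one hash value per unit vector for a single query point.

    Assumed hash form: floor(dot(point, v) / bucket_length).
    """
    return [
        math.floor(sum(p * c for p, c in zip(point, v)) / bucket_length)
        for v in rand_unit_vectors
    ]


if __name__ == "__main__":
    vectors = make_rand_unit_vectors(num_tables=3, dim=4, seed=42)
    print(hash_query([0.5, -1.0, 2.0, 0.0], vectors, bucket_length=2.0))
```

A serving system would load the fitted model's vectors rather than generate new ones; the generator here only keeps the sketch self-contained.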
    
    > One issue I have is that currently we would output a 1 x L set of hash 
values. But it actually should be L x 1 i.e. a set of signatures of length 1. I 
guess we can leave it as is, but document what the output actually is.
    
    What about outputting a Matrix instead of an Array of Vectors?  That will 
make it easy to change in the future, without us having weird Vectors of length 
1.
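
To make the shape question concrete, here is a small sketch that lays out L hash values as an L x k matrix, with k = 1 today. The function name `to_signature_matrix` and the nested-list representation are assumptions for illustration only, not the Spark `Matrix` type.

```python
def to_signature_matrix(flat_hashes, signature_length=1):
    """Arrange a flat list of hash values as rows of length signature_length.

    With signature_length == 1 this gives an L x 1 layout; a longer signature
    later changes only the column count, not the container type.
    """
    if signature_length < 1 or len(flat_hashes) % signature_length != 0:
        raise ValueError("hash count must be a multiple of signature_length")
    return [
        flat_hashes[i:i + signature_length]
        for i in range(0, len(flat_hashes), signature_length)
    ]


if __name__ == "__main__":
    print(to_signature_matrix([4, -1, 7]))        # L = 3, k = 1
    print(to_signature_matrix([4, -1, 7, 2], 2))  # L = 2, k = 2
```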
    
    > Finally, my understanding was results from some performance testing would 
be posted. I don't believe we've seen this yet.
    
    You can see some results linked from the JIRA.

