I'm looking into this a bit further, thanks for bringing it up! Right now the LSH implementation only uses OR-amplification. The practical consequence of this is that it will select too many candidates when doing approximate near neighbor search and approximate similarity join. When we add AND-amplification I think it will become significantly more usable. In the meantime, I will also investigate scalability issues.
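To make the OR-only vs. AND/OR distinction above concrete, here is a small standalone Python sketch (not the Spark implementation) of the standard banding analysis: signatures are split into b bands of r hashes each, and two items become candidates if they agree on all r hashes of some band (AND within a band, OR across bands). With OR-only amplification (r = 1), even fairly dissimilar pairs collide in at least one table, which is exactly why too many candidates get selected.

```python
def candidate_probability(s, r, b):
    """Probability that two items whose per-hash collision probability
    is s become a candidate pair under b bands of r hashes each."""
    return 1.0 - (1.0 - s ** r) ** b

# OR-only amplification (r = 1, 20 tables): a pair with per-hash
# collision probability 0.3 almost always becomes a candidate.
or_only = candidate_probability(0.3, r=1, b=20)

# AND/OR amplification (same 20 hashes, grouped as 5 bands of 4):
# dissimilar pairs are suppressed, similar pairs still collide.
and_or_low = candidate_probability(0.3, r=4, b=5)
and_or_high = candidate_probability(0.9, r=4, b=5)

print(f"s=0.3, OR-only (b=20):    {or_only:.3f}")    # ~0.999
print(f"s=0.3, AND/OR (r=4,b=5):  {and_or_low:.3f}") # ~0.040
print(f"s=0.9, AND/OR (r=4,b=5):  {and_or_high:.3f}")# ~0.995
```

The numbers show the sharpened threshold: with the same 20 hashes, grouping them into bands drops the false-candidate rate for s = 0.3 from ~99.9% to ~4% while keeping truly similar pairs (s = 0.9) above 99%.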
Can you please provide every parameter you used? It will be very helpful :) For instance, the similarity threshold, the number of hash tables, the bucket width, etc. Thanks!

On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath <[email protected]> wrote:

> The original Uber authors provided this performance test result:
> https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
>
> This was for MinHash only though, so it's not clear what the
> scalability is for the other metric types.
>
> The SignRandomProjectionLSH is not yet in Spark master (see
> https://issues.apache.org/jira/browse/SPARK-18082). It could be that there
> are some implementation details that would make a difference here.
>
> By the way, what join threshold do you use in the approx join?
>
> Could you perhaps create a JIRA ticket with the details in order to track
> this?
>
> On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan <[email protected]> wrote:
>
>> After all, I switched back to the LSH implementation that I used before
>> (https://github.com/karlhigley/spark-neighbors). I can run on my dataset
>> now. If anyone has any suggestions, please let me know.
>> Thanks.
>>
>> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <[email protected]>:
>>
>> Hi Timur,
>> 1) Our data is already transformed into a dataset of Vectors.
>> 2) If I use RandomSignProjectionLSH, the job dies after I call
>> approxSimilarityJoin. I tried to use MinHash instead, but the job is
>> still slow. I don't think the problem is related to GC: the time spent in
>> GC is small compared with the time spent in computation. Here are some
>> screenshots of my job.
>> Thanks
>>
>> 2017-02-12 8:01 GMT+07:00 Timur Shenkao <[email protected]>:
>>
>> Hello,
>>
>> 1) Are you sure that your data is "clean"? No unexpected missing values?
>> No strings in an unusual encoding? No additional or missing columns?
>> 2) How long does your job run? What about garbage collector parameters?
>> Have you checked what happens with jconsole / jvisualvm?
>>
>> Sincerely yours, Timur
>>
>> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan <[email protected]> wrote:
>>
>> Hi Nick,
>> Because we use *RandomSignProjectionLSH*, the only parameter for the LSH
>> is the number of hashes. I tried with a small number of hashes (2), but
>> the error still happens, and it happens when I call the similarity join.
>> After transformation, the size of the dataset is about 4G.
>>
>> 2017-02-11 3:07 GMT+07:00 Nick Pentreath <[email protected]>:
>>
>> What other params are you using for the LSH transformer?
>>
>> Are the issues occurring during transform or during the similarity join?
>>
>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <[email protected]> wrote:
>>
>> Hi Das,
>> In general, I will apply this to larger datasets, so I want to use LSH,
>> which is more scalable than the approaches you suggested. Have you tried
>> LSH in Spark 2.1.0 before? If yes, how did you set the
>> parameters/configuration to make it work?
>> Thanks.
>>
>> 2017-02-10 19:21 GMT+07:00 Debasish Das <[email protected]>:
>>
>> If it is 7M rows and 700K features (or say 1M features), brute-force row
>> similarity will run fine as well... check out SPARK-4823... you can
>> compare quality with the approximate variant...
>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" <[email protected]> wrote:
>>
>> Hi everyone,
>> Since Spark 2.1.0 introduces LSH
>> (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
>> we want to use LSH to find approximate nearest neighbors. Basically, we
>> have a dataset with about 7M rows. We want to use cosine distance to
>> measure the similarity between items, so we use *RandomSignProjectionLSH*
>> (https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
>> of *BucketedRandomProjectionLSH*. I tried to tune some configurations
>> such as serialization, memory fraction, executor memory (~6G), number of
>> executors (~20), memory overhead, etc., but nothing works.
>> I often get the error
>> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
>> this implementation was done by engineers at Uber, but I don't know the
>> right configuration to run the algorithm at scale. Does it need very
>> large memory to run?
>>
>> Any help would be appreciated.
>> Thanks
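For context on the method the thread keeps coming back to: sign random projection (SimHash) hashes a vector to the sign of its dot product with each of several random hyperplanes, and the fraction of differing sign bits between two signatures estimates the angle between the vectors, hence their cosine similarity. Below is a minimal standalone Python sketch of that idea; it is not the gist's Spark implementation, and all names are illustrative.

```python
import math
import random

def sign_projection_signature(vec, planes):
    # One bit per random hyperplane: which side of the plane vec lies on.
    return [1 if sum(p_i * v_i for p_i, v_i in zip(p, vec)) >= 0 else 0
            for p in planes]

def estimated_cosine(sig_a, sig_b):
    # Fraction of differing bits estimates angle/pi; convert back to cosine.
    mismatches = sum(a != b for a, b in zip(sig_a, sig_b))
    angle = math.pi * mismatches / len(sig_a)
    return math.cos(angle)

random.seed(42)
num_planes = 2000   # more planes -> lower variance, but bigger signatures
dim = 3
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]   # orthogonal to a: true cosine similarity is 0
sig_a = sign_projection_signature(a, planes)
sig_b = sign_projection_signature(b, planes)

print(f"estimated cosine(a, b) = {estimated_cosine(sig_a, sig_b):.3f}")  # near 0
print(f"estimated cosine(a, a) = {estimated_cosine(sig_a, sig_a):.3f}")  # exactly 1
```

Note the memory implication relevant to the OOM above: every row carries a signature of `num_planes` bits, and an OR-only similarity join then materializes every candidate pair that collides in any table, so on ~7M rows the candidate set itself can exhaust executor heap long before the final threshold filter is applied.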

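Since the Uber performance document cited above covers MinHash only, a compact illustration of the MinHash idea itself may also be useful: the probability that two sets share the same minimum under a random hash equals their Jaccard similarity. This standalone Python sketch is illustrative only (it is not the Spark MinHashLSH code) and uses the common universal-hash form h(x) = (a*x + b) mod p.

```python
import random

def minhash_signature(items, hash_params, prime=2_147_483_647):
    # One min-hash per (a, b) pair: h(x) = (a*x + b) mod prime.
    return [min((a * x + b) % prime for x in items) for a, b in hash_params]

def estimated_jaccard(sig_a, sig_b):
    # P[min-hashes agree] equals the Jaccard similarity of the two sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

random.seed(7)
num_hashes = 400
hash_params = [(random.randrange(1, 2_147_483_647),
                random.randrange(0, 2_147_483_647))
               for _ in range(num_hashes)]

A = set(range(1, 101))    # {1..100}
B = set(range(51, 151))   # {51..150}; true Jaccard = 50/150 = 1/3
sig_a = minhash_signature(A, hash_params)
sig_b = minhash_signature(B, hash_params)

print(f"estimated Jaccard = {estimated_jaccard(sig_a, sig_b):.3f}")  # close to 1/3
```

As with sign projection, the estimate's variance shrinks with the number of hashes, which is the same accuracy-versus-memory trade-off being tuned in the thread above.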