Hi Seth,
Here are the parameters that I used in my experiments:
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: 1G -> 2G -> 3G
- Similarity threshold: 0.6

MinHash:
- Number of hash tables: 2

SignedRandomProjection:
- Number of hash tables: 2
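A side note on the hash-table count: as Seth notes below, Spark's LSH currently applies only OR-amplification, so a pair becomes a join candidate if any of the b tables collide, which inflates the candidate set; AND-amplification over r hash functions has the opposite effect. A minimal plain-Python sketch of the trade-off (p is an assumed single-hash collision probability; this is an illustration, not Spark's implementation):

```python
def or_amplified(p, b):
    """OR-amplification over b hash tables: candidates collide if ANY table matches."""
    return 1 - (1 - p) ** b

def and_amplified(p, r):
    """AND-amplification over r hash functions: candidates collide only if ALL match."""
    return p ** r

# Even a fairly dissimilar pair (p = 0.3) almost always becomes a
# candidate under OR-only amplification with 10 tables (~0.97),
# while AND-amplification suppresses it to near zero.
print(or_amplified(0.3, 10))
print(and_amplified(0.3, 10))
```

This is why an OR-only implementation tends to select far too many candidate pairs in approximate similarity join, with the memory pressure that follows.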
2017-02-23 0:13 GMT+07:00 Seth Hendrickson <seth.hendrickso...@gmail.com>:

> I'm looking into this a bit further, thanks for bringing it up! Right now
> the LSH implementation only uses OR-amplification. The practical
> consequence of this is that it will select too many candidates when doing
> approximate nearest neighbor search and approximate similarity join. When
> we add AND-amplification I think it will become significantly more usable.
> In the meantime, I will also investigate the scalability issues.
>
> Can you please provide every parameter you used? It will be very helpful
> :) For instance, the similarity threshold, the number of hash tables, the
> bucket width, etc.
>
> Thanks!
>
> On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> The original Uber authors provided this performance test result:
>> https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
>>
>> This was for MinHash only though, so it's not clear what the
>> scalability is like for the other metric types.
>>
>> SignRandomProjectionLSH is not yet in Spark master (see
>> https://issues.apache.org/jira/browse/SPARK-18082). It could be that
>> there are some implementation details that make a difference here.
>>
>> By the way, what join threshold do you use in the approx join?
>>
>> Could you perhaps create a JIRA ticket with the details in order to
>> track this?
>>
>> On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan <newvalu...@gmail.com> wrote:
>>
>>> After all, I switched back to the LSH implementation that I used before
>>> (https://github.com/karlhigley/spark-neighbors). I can run it on my
>>> dataset now. If someone has any suggestions, please tell me.
>>> Thanks.
>>>
>>> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <newvalu...@gmail.com>:
>>>
>>> Hi Timur,
>>> 1) Our data is already transformed to a dataset of Vectors.
>>> 2) If I use RandomSignProjectionLSH, the job dies after I call
>>> approxSimilarityJoin.
>>> I tried to use MinHash instead, but the job is still slow. I don't
>>> think the problem is related to GC: the time spent on GC is small
>>> compared with the time for computation. Here are some screenshots of
>>> my job.
>>> Thanks
>>>
>>> 2017-02-12 8:01 GMT+07:00 Timur Shenkao <t...@timshenkao.su>:
>>>
>>> Hello,
>>>
>>> 1) Are you sure that your data is "clean"? No unexpected missing
>>> values? No strings in unusual encodings? No additional or missing
>>> columns?
>>> 2) How long does your job run? What about garbage collector parameters?
>>> Have you checked what happens with jconsole / jvisualvm?
>>>
>>> Sincerely yours, Timur
>>>
>>> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan <newvalu...@gmail.com> wrote:
>>>
>>> Hi Nick,
>>> Because we use *RandomSignProjectionLSH*, the only parameter for LSH
>>> is the number of hashes. I tried with a small number of hashes (2),
>>> but the error still happens, and it happens when I call the similarity
>>> join. After transformation, the size of the dataset is about 4G.
>>>
>>> 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>
>>> What other params are you using for the LSH transformer?
>>>
>>> Are the issues occurring during the transform or during the similarity
>>> join?
>>>
>>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <newvalu...@gmail.com> wrote:
>>>
>>> Hi Das,
>>> In general, I will apply them to larger datasets, so I want to use LSH,
>>> which is more scalable than the approaches you suggested. Have you
>>> tried LSH in Spark 2.1.0 before? If yes, how did you set the
>>> parameters/configuration to make it work?
>>> Thanks.
>>>
>>> 2017-02-10 19:21 GMT+07:00 Debasish Das <debasish.da...@gmail.com>:
>>>
>>> If it is 7M rows and 700K features (or say 1M features), brute-force
>>> row similarity will run fine as well... check out SPARK-4823... you
>>> can compare quality with the approximate variant...
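The brute-force alternative Debasish mentions (the SPARK-4823 work he cites) boils down to an all-pairs row-similarity computation. A local plain-Python illustration of the idea for cosine similarity, not Spark's API or a distributed implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

rows = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# All-pairs similarity: O(n^2) comparisons, so no candidate selection
# is needed, but cost grows quadratically with the number of rows.
pairs = [(i, j, cosine(rows[i], rows[j]))
         for i in range(len(rows)) for j in range(i + 1, len(rows))]
print(pairs)
```

The trade-off is exactness versus cost: no OOM-prone candidate explosion from loose LSH buckets, but quadratic comparisons, which is why it is only viable up to moderate row counts.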
>>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/
>>> latest/ml-features.html#locality-sensitive-hashing), we want to use
>>> LSH to find approximate nearest neighbors. Basically, we have a
>>> dataset with about 7M rows. We want to use cosine distance to measure
>>> the similarity between items, so we use *RandomSignProjectionLSH*
>>> (https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db)
>>> instead of *BucketedRandomProjectionLSH*. I tried to tune some
>>> configurations such as serialization, memory fraction, executor memory
>>> (~6G), number of executors (~20), memory overhead..., but nothing
>>> works. I often get the error "java.lang.OutOfMemoryError: Java heap
>>> space" while running. I know that this implementation was done by
>>> engineers at Uber, but I don't know the right configuration to run the
>>> algorithm at scale. Does it need very large memory to run?
>>>
>>> Any help would be appreciated.
>>> Thanks
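For context on the scheme discussed above: sign random projection hashes each vector to the signs of its dot products with random Gaussian hyperplanes, and two vectors' bits agree with probability 1 - θ/π, where θ is the angle between them, which is what makes it an LSH family for cosine distance. A minimal plain-Python sketch of that property (illustrative names only, not the API of the linked gist):

```python
import math
import random

random.seed(42)

def sign_hash(planes, x):
    """One bit per random hyperplane: whether <plane, x> is non-negative."""
    return tuple(sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 for p in planes)

dim, n_bits = 3, 512
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = [1.0, 0.0, 0.0]
b = [1.0, 1.0, 0.0]  # at a 45-degree angle to a

ha, hb = sign_hash(planes, a), sign_hash(planes, b)
agree = sum(x == y for x, y in zip(ha, hb)) / n_bits

# P[bits agree] = 1 - theta/pi, so cosine is recoverable from collisions.
est_cos = math.cos(math.pi * (1 - agree))
print(round(est_cos, 2))  # estimate; the true value is cos(45°) ≈ 0.707
```

More hyperplanes tighten the estimate but enlarge the signatures; grouping bits into tables is where the OR/AND-amplification trade-off discussed earlier in the thread comes in.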