Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-23 Thread nguyen duc Tuan
I do a self-join. I tried to cache the transformed dataset before joining, but it didn't help either.
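For context, the pattern being described looks roughly like the following. This is a minimal sketch, assuming a DataFrame `df` with a "features" column and Spark 2.1.0's MinHashLSH; the column names and threshold are illustrative:

    import org.apache.spark.ml.feature.MinHashLSH

    val model = new MinHashLSH()
      .setNumHashTables(2)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(df)

    // Pre-hash and cache so the self-join does not recompute the
    // transform for both sides of the join.
    val transformed = model.transform(df).cache()
    transformed.count() // force materialization

    // Self-join: both sides are the same cached, pre-transformed dataset.
    val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6)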

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
And to be clear, are you doing a self-join for approx similarity? Or joining to another dataset?

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread nguyen duc Tuan
Hi Seth,
Here are the parameters that I used in my experiments:
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: 1G -> 2G -> 3G
- Similarity threshold: 0.6
MinHash:
- number of hash tables: 2
SignedRandomProjection: ...
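For reference, a hedged sketch of how these settings map onto a spark-submit invocation (the jar and class names are hypothetical; only the resource flags come from the list above):

    spark-submit \
      --num-executors 16 \
      --executor-memory 2g \
      --executor-cores 2 \
      --driver-memory 2g \
      --class com.example.LshExperiment \
      lsh-experiment.jar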

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Seth Hendrickson
I'm looking into this a bit further, thanks for bringing it up! Right now the LSH implementation only uses OR-amplification. The practical consequence of this is that it will select too many candidates when doing approximate nearest neighbor search and approximate similarity join. When we add ...
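For readers following along, a toy Scala sketch of the distinction (illustrative only, not Spark's actual implementation): with OR-amplification a pair becomes a candidate if any hash table collides, while AND-amplification would require every table to collide, pruning candidates much more aggressively.

    // Signatures: one hash value per table for each point.
    def orAmplified(a: Array[Int], b: Array[Int]): Boolean =
      a.zip(b).exists { case (ha, hb) => ha == hb }  // candidate if ANY table matches

    def andAmplified(a: Array[Int], b: Array[Int]): Boolean =
      a.zip(b).forall { case (ha, hb) => ha == hb }  // candidate only if ALL tables match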

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
The original Uber authors provided this performance test result: https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro This was for MinHash only, though, so it's not clear what the scalability is for the other metric types. The SignRandomProjectionLSH is not yet in ...

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-12 Thread nguyen duc Tuan
In the end, I switched back to the LSH implementation that I used before (https://github.com/karlhigley/spark-neighbors). I can run on my dataset now. If anyone has any suggestions, please tell me. Thanks.
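For anyone else trying it, a sketch adapted from that repository's README (parameter values are illustrative; `points` is assumed to be an RDD[(Long, SparseVector)] of mllib vectors):

    import com.github.karlhigley.spark.neighbors.ANN

    val annModel = new ANN(dimensions = 1000, measure = "cosine")
      .setTables(4)
      .setSignatureLength(64)
      .train(points)

    // 10 nearest neighbors for every point in the dataset
    val neighbors = annModel.neighbors(10)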

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-11 Thread Timur Shenkao
Hello,
1) Are you sure that your data is "clean"? No unexpected missing values? No strings in unusual encodings? No additional or missing columns?
2) How long does your job run? What about garbage collector parameters? Have you checked what happens with jconsole / jvisualvm?
Sincerely yours,
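On the GC point, one hedged way to expose collector behavior on the executors (standard JVM flags passed through Spark's extraJavaOptions; the exact flag set is illustrative):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...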

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Hi Nick, Because we use *RandomSignProjectionLSH*, the only LSH parameter is the number of hashes. I tried with a small number of hashes (2) but the error still happens, and it happens when I call the similarity join. After transformation, the size of the dataset is about 4G.
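For context, sign random projection has a single knob because each hash is just the sign of a dot product with one random hyperplane, so the number of hashes equals the number of hyperplanes. A toy Scala sketch (illustrative, not the implementation referenced above):

    import scala.util.Random

    // One bit per random hyperplane: the sign of <r, x>.
    def srpSignature(x: Array[Double], planes: Array[Array[Double]]): Array[Int] =
      planes.map { r =>
        val dot = r.zip(x).map { case (ri, xi) => ri * xi }.sum
        if (dot >= 0) 1 else 0
      }

    val rng = new Random(42)
    val numHashes = 2 // the single parameter mentioned above
    val planes = Array.fill(numHashes, 4)(rng.nextGaussian()) // dim = 4 for illustration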

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the LSH transformer? Are the issues occurring during transform or during the similarity join?

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Hi Das, In general, I will apply them to larger datasets, so I want to use LSH, which is more scalable than the approaches you suggested. Have you tried LSH in Spark 2.1.0 before? If yes, how did you set the parameters/configuration to make it work? Thanks.

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7M rows and 700k features (or say 1M features), brute-force row similarity will run fine as well... check out SPARK-4823... you can compare quality with the approximate variant...
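For reference, SPARK-4823 tracks a row-similarity API for RowMatrix; the brute-force primitive that exists in MLlib today is columnSimilarities (DIMSUM), which compares columns, so rows would need to be transposed into columns first. A minimal sketch, assuming `rows` is an RDD of mllib Vectors:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)

    // DIMSUM: all-pairs similarities between COLUMNS; pairs whose
    // similarity falls below the threshold may be dropped for speed.
    val sims = mat.columnSimilarities(0.6)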

Practical configuration to run LSH in Spark 2.1.0

2017-02-08 Thread nguyen duc Tuan
Hi everyone, Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing), we want to use LSH to find approximate nearest neighbors. Basically, we have a dataset with about 7M rows. We want to use cosine distance to measure the similarity ...
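For what it's worth, Spark 2.1.0 ships MinHashLSH (Jaccard) and BucketedRandomProjectionLSH (Euclidean) but no cosine LSH. A hedged workaround sketch: for L2-normalized vectors, Euclidean distance is monotone in cosine distance (||a-b||^2 = 2(1 - cos(a,b))), so BucketedRandomProjectionLSH over normalized features can stand in. It assumes a DataFrame `df` with a "features" column; the bucket length and threshold values are illustrative:

    import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, Normalizer}

    val normalized = new Normalizer()
      .setInputCol("features")
      .setOutputCol("normFeatures")
      .setP(2.0)
      .transform(df)

    val brp = new BucketedRandomProjectionLSH()
      .setInputCol("normFeatures")
      .setOutputCol("hashes")
      .setNumHashTables(2)
      .setBucketLength(2.0)
      .fit(normalized)

    // The Euclidean threshold corresponding to a cosine-distance cutoff d is sqrt(2 * d).
    val pairs = brp.approxSimilarityJoin(normalized, normalized, math.sqrt(2 * 0.4))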