I do a self-join. I tried to cache the transformed dataset before joining,
but that didn't help either.
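
For reference, the self-join looks roughly like this (a sketch against the
Spark 2.1 MinHashLSH API; our RandomSignProjectionLSH variant is invoked
the same way):

  import org.apache.spark.ml.feature.MinHashLSH

  val mh = new MinHashLSH()
    .setNumHashTables(2)
    .setInputCol("features")
    .setOutputCol("hashes")

  val model = mh.fit(dataset)
  val transformed = model.transform(dataset).cache()
  transformed.count()  // materialize the cache before joining

  // self-join: both sides are the same cached, transformed dataset
  val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6)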

2017-02-23 13:25 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>:

> And to be clear, are you doing a self-join for approx similarity? Or
> joining to another dataset?
>
>
>
> On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan <newvalu...@gmail.com>
> wrote:
>
>> Hi Seth,
>> Here are the parameters that I used in my experiments:
>> - Number of executors: 16
>> - Executor memory: varied from 1G -> 2G -> 3G
>> - Number of cores per executor: 1 -> 2
>> - Driver memory: 1G -> 2G -> 3G
>> - Similarity threshold: 0.6
>> MinHash:
>> - Number of hash tables: 2
>> SignedRandomProjection:
>> - Number of hash tables: 2
>>
>> 2017-02-23 0:13 GMT+07:00 Seth Hendrickson <seth.hendrickso...@gmail.com>
>> :
>>
>> I'm looking into this a bit further, thanks for bringing it up! Right now
>> the LSH implementation only uses OR-amplification. The practical
>> consequence of this is that it will select too many candidates when doing
>> approximate near neighbor search and approximate similarity join. When we
>> add AND-amplification I think it will become significantly more usable. In
>> the meantime, I will also investigate scalability issues.
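>>
>> To illustrate why OR-amplification over-selects (a conceptual sketch in
>> Scala, not the actual Spark internals): a pair becomes a candidate if its
>> hashes collide in *any* table, whereas AND-amplification would require a
>> collision in *every* table.
>>
>>   // signatures: one hash value per table for each point
>>   def orCandidate(a: Array[Int], b: Array[Int]): Boolean =
>>     a.zip(b).exists { case (x, y) => x == y }  // any collision -> candidate
>>
>>   def andCandidate(a: Array[Int], b: Array[Int]): Boolean =
>>     a.zip(b).forall { case (x, y) => x == y }  // all must collide -> far fewer pairs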
>>
>> Can you please provide every parameter you used? It would be very helpful
>> :) For instance, the similarity threshold, the number of hash tables, the
>> bucket width, etc.
>>
>> Thanks!
>>
>> On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>> The original Uber authors provided this performance test result:
>> https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
>>
>> This was for MinHash only, though, so it's not clear what the
>> scalability is for the other metric types.
>>
>> The SignRandomProjectionLSH is not yet in Spark master (see
>> https://issues.apache.org/jira/browse/SPARK-18082). It could be that there
>> are some implementation details that make a difference here.
>>
>> By the way, what is the join threshold you use in approx join?
>>
>> Could you perhaps create a JIRA ticket with the details in order to track
>> this?
>>
>>
>> On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan <newvalu...@gmail.com>
>> wrote:
>>
>> In the end, I switched back to the LSH implementation that I used before (
>> https://github.com/karlhigley/spark-neighbors ). I can run it on my dataset
>> now. If anyone has any suggestions, please tell me.
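>> The usage is roughly the following, based on that project's README (treat
>> the exact names as approximate, since I'm quoting them from memory):
>>
>>   import com.github.karlhigley.spark.neighbors.ANN
>>
>>   // points: RDD[(Long, SparseVector)] of id -> feature vector
>>   val annModel = new ANN(dimensions = 700000, measure = "cosine")
>>     .setTables(2)
>>     .setSignatureLength(64)
>>     .train(points)
>>
>>   val neighbors = annModel.neighbors(10)  // 10 approximate neighbors per point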
>> Thanks.
>>
>> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <newvalu...@gmail.com>:
>>
>> Hi Timur,
>> 1) Our data is already transformed into a dataset of Vectors.
>> 2) If I use RandomSignProjectionLSH, the job dies after I call
>> approxSimilarityJoin. I tried to use MinHash instead, but the job is still
>> slow. I don't think the problem is related to GC: the time spent in GC is
>> small compared with the time spent on computation. Here are some
>> screenshots of my job.
>> Thanks
>>
>> 2017-02-12 8:01 GMT+07:00 Timur Shenkao <t...@timshenkao.su>:
>>
>> Hello,
>>
>> 1) Are you sure that your data is "clean"? No unexpected missing values?
>> No strings in unusual encodings? No additional or missing columns?
>> 2) How long does your job run? What about garbage collector parameters?
>> Have you checked what happens with jconsole / jvisualvm?
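>>
>> If you cannot attach jconsole / jvisualvm, you can at least turn on GC
>> logging on the executors, e.g. (a sketch; standard JVM flags):
>>
>>   import org.apache.spark.SparkConf
>>
>>   // print GC activity in the executor logs so you can see how much
>>   // time the collector really takes
>>   val conf = new SparkConf()
>>     .set("spark.executor.extraJavaOptions",
>>          "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")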
>>
>> Sincerely yours, Timur
>>
>> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan <newvalu...@gmail.com>
>> wrote:
>>
>> Hi Nick,
>> Because we use *RandomSignProjectionLSH*, the only LSH parameter is the
>> number of hashes. I tried with a small number of hashes (2), but the
>> error still happens, and it happens when I call the similarity join.
>> After transformation, the size of the dataset is about 4G.
>>
>> 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>
>> What other params are you using for the LSH transformer?
>>
>> Are the issues occurring during transform or during the similarity join?
>>
>>
>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <newvalu...@gmail.com>
>> wrote:
>>
>> Hi Das,
>> In general, I will apply this to larger datasets, so I want to use LSH,
>> which is more scalable than the approaches you suggested. Have you
>> tried LSH in Spark 2.1.0 before? If yes, how did you set the
>> parameters/configuration to make it work?
>> Thanks.
>>
>> 2017-02-10 19:21 GMT+07:00 Debasish Das <debasish.da...@gmail.com>:
>>
>> If it is 7M rows and 700K features (or say 1M features), brute-force row
>> similarity will run fine as well... check out SPARK-4823... you can compare
>> quality with the approximate variant...
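>>
>> A sketch of the DIMSUM-style call (columnSimilarities is the existing API,
>> so put the items on the columns; SPARK-4823 tracks the row-wise variant):
>>
>>   import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>
>>   // itemsAsColumns: RDD[Vector], the matrix transposed so that each
>>   // item is a column
>>   val mat = new RowMatrix(itemsAsColumns)
>>   val sims = mat.columnSimilarities(0.1)  // threshold trades accuracy for speed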
>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote:
>>
>> Hi everyone,
>> Since Spark 2.1.0 introduces LSH (
>> http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing
>> ), we want to use LSH to find approximate nearest neighbors. Basically, we
>> have a dataset with about 7M rows and we want to use cosine distance to
>> measure the similarity between items, so we use *RandomSignProjectionLSH* (
>> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
>> of *BucketedRandomProjectionLSH*. I tried to tune some configurations such
>> as serialization, memory fraction, executor memory (~6G), number of
>> executors (~20), memory overhead, etc., but nothing worked. I often get the
>> error "java.lang.OutOfMemoryError: Java heap space" while running. I know
>> that this implementation was done by engineers at Uber, but I don't know
>> the right configuration to run the algorithm at scale. Do they need very
>> big memory to run it?
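>>
>> For reference, these are the kinds of settings I have been tuning (a
>> sketch; exact values varied between runs):
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>     .set("spark.memory.fraction", "0.8")
>>     .set("spark.executor.memory", "6g")
>>     .set("spark.executor.instances", "20")
>>     .set("spark.yarn.executor.memoryOverhead", "2048")  // in MB; YARN-only key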
>>
>> Any help would be appreciated.
>> Thanks
>>
