I do a self-join. I tried to cache the transformed dataset before joining,
but it didn't help either.
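Roughly, the pattern is this (a sketch only; model is the fitted LSH model, df
is the input DataFrame, and the names/threshold are illustrative):

    val transformed = model.transform(df).cache()
    transformed.count()  // materialize the cache before joining

    // self-join: the same cached dataset on both sides, threshold 0.6
    val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6, "distCol")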
2017-02-23 13:25 GMT+07:00 Nick Pentreath :
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?
On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan wrote:
Hi Seth,
Here are the parameters I used in my experiments (also sketched as Spark
configuration below).
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: 1G -> 2G -> 3G
- The similarity threshold: 0.6
MinHash:
- number of hash tables: 2
SignedRandomProjection:
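For reference, one of these configurations written out in code (a sketch only;
the memory settings normally have to be passed on the spark-submit command
line rather than set in code, and the column names are illustrative):

    import org.apache.spark.ml.feature.MinHashLSH
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("approx-similarity-join")
      .config("spark.executor.instances", "16")
      .config("spark.executor.cores", "2")
      .config("spark.executor.memory", "2g")
      .config("spark.driver.memory", "2g")  // only effective if set before the driver JVM starts
      .getOrCreate()

    // MinHash setting above: 2 hash tables.
    val mh = new MinHashLSH()
      .setNumHashTables(2)
      .setInputCol("features")
      .setOutputCol("hashes")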
I'm looking into this a bit further, thanks for bringing it up! Right now
the LSH implementation only uses OR-amplification. The practical
consequence of this is that it will select too many candidates when doing
approximate near neighbor search and approximate similarity join. When we
add
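As a toy illustration of the difference between the two amplification schemes
(hash values are made up, not the actual internals):

    // One hash value per hash table, for two points.
    val hashesA = Array(12, 7, 33)
    val hashesB = Array(12, 9, 41)

    // OR-amplification (current behaviour): candidate pair if ANY table collides.
    val orCandidate = hashesA.zip(hashesB).exists { case (a, b) => a == b }   // true

    // AND-amplification would require ALL tables to collide, pruning far more pairs.
    val andCandidate = hashesA.zip(hashesB).forall { case (a, b) => a == b }  // false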
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
This was for MinHash only though, so it's not clear what the scalability is
for the other metric types.
The SignRandomProjectionLSH is not yet in
In the end, I switched back to the LSH implementation that I used before (
https://github.com/karlhigley/spark-neighbors ). I can run it on my dataset
now. If anyone has any suggestions, please tell me.
Thanks.
2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
> Hi Timur,
> 1) Our data
Hello,
1) Are you sure that your data is "clean"? No unexpected missing values?
No strings in unusual encoding? No additional or missing columns? (A quick
null check is sketched below.)
2) How long does your job run? What about garbage collector parameters?
Have you checked what happens with jconsole / jvisualvm?
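For point (1), one quick way to count missing values per column (a sketch; df
is your input DataFrame):

    import org.apache.spark.sql.functions.{col, count, when}

    val nullCounts = df.select(df.columns.map(c =>
      count(when(col(c).isNull, c)).alias(c)): _*)
    nullCounts.show()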
Sincerely yours,
Hi Nick,
Because we use *RandomSignProjectionLSH*, the only LSH parameter is the
number of hashes. I tried with a small number of hashes (2) but the error
still happens, and it happens when I call the similarity join. After
transformation, the size of the dataset is about 4G.
2017-02-11 3:07
What other params are you using for the lsh transformer?
Are the issues occurring during transform or during the similarity join?
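One way to check is to force each stage separately (a sketch, assuming a
fitted model and an input DataFrame df):

    val transformed = model.transform(df)
    transformed.count()   // forces only the transform stage

    val joined = model.approxSimilarityJoin(transformed, transformed, 0.6)
    joined.count()        // forces the similarity join stage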
On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote:
Hi Das,
In general, I will apply this to larger datasets, so I want to use LSH,
which is more scalable than the approaches you suggested. Have you
tried LSH in Spark 2.1.0 before? If yes, how did you set the
parameters/configuration to make it work?
Thanks.
2017-02-10 19:21 GMT+07:00
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well... check out SPARK-4823... you can compare
quality with the approximate variant...
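A minimal brute-force baseline for such a comparison (not the SPARK-4823 code,
just a quadratic cosine cross join, so only practical on a sample):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    val cosine = udf { (a: Vector, b: Vector) =>
      val x = a.toArray; val y = b.toArray
      val dot  = x.zip(y).map { case (u, v) => u * v }.sum
      val norm = math.sqrt(x.map(u => u * u).sum) * math.sqrt(y.map(v => v * v).sum)
      if (norm == 0.0) 0.0 else dot / norm
    }

    val pairs = df.as("a").crossJoin(df.as("b"))
      .withColumn("cosine", cosine(col("a.features"), col("b.features")))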
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
Hi everyone,
Since Spark 2.1.0 introduces LSH (
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
we want to use LSH to find approximate nearest neighbors. Basically, we
have a dataset with about 7M rows. We want to use cosine distance to measure
the similarity
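For reference, the 2.1.0 API in question looks roughly like this (a sketch
with illustrative parameter values; note that 2.1.0 ships MinHashLSH for
Jaccard and BucketedRandomProjectionLSH for Euclidean distance, not a
cosine/sign-random-projection variant, and df / queryVector are placeholders):

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    val lsh = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = lsh.fit(df)

    // k nearest neighbours of one query vector, and an approximate similarity join.
    val neighbors = model.approxNearestNeighbors(df, queryVector, 10)
    val similar   = model.approxSimilarityJoin(df, df, 1.5)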