Optimizing pairwise similarity computation, or how to avoid the RDD.cartesian operation?

Jaonary Rabarisoa Fri, 17 Oct 2014 03:43:53 -0700

Hi all,

I need to compute a similarity between the elements of two large sets of
high-dimensional feature vectors.
Naively, I create all possible pairs of vectors with
*features1.cartesian(features2)* and then map the resulting pair RDD with
my similarity function.
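
To make the naive version concrete, here is a rough plain-Python sketch of
the same pattern. It is a local stand-in, not Spark code: itertools.product
plays the role of features1.cartesian(features2), and cosine similarity is
only an assumed example, since the actual similarity function is not
specified here. The names cosine_similarity and all_pairs_similarity are
hypothetical.

```python
import itertools
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed example metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def all_pairs_similarity(features1, features2, similarity=cosine_similarity):
    """Form every (u, v) pair, then map the similarity function over the pairs."""
    # Local analogue of the cartesian step: every vector is paired with every other.
    pairs = itertools.product(features1, features2)
    return [similarity(u, v) for u, v in pairs]


# Tiny hypothetical inputs, just to show the shape of the computation.
features1 = [[1.0, 0.0], [0.0, 2.0]]
features2 = [[3.0, 0.0], [1.0, 1.0]]
scores = all_pairs_similarity(features1, features2)
```

With n vectors on one side and m on the other, this produces n * m pairs,
each carrying two full vectors.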


The problem is that the cartesian operation takes a lot of time, more time
than computing the similarity itself. If I save each of my feature vectors
to disk, form a list of file name pairs, and compute the similarity by
reading the files, it runs significantly faster.
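
The file-based workaround can be sketched roughly as follows, again in
plain Python rather than Spark. The JSON file format, the dot-product
similarity, and the helper names (save_vectors, load_vector,
similarity_from_files) are all assumptions made for illustration.

```python
import itertools
import json
import os
import tempfile


def save_vectors(vectors, directory, prefix):
    """Write each vector to its own JSON file and return the file paths."""
    paths = []
    for i, vec in enumerate(vectors):
        path = os.path.join(directory, f"{prefix}_{i}.json")
        with open(path, "w") as f:
            json.dump(vec, f)
        paths.append(path)
    return paths


def load_vector(path):
    """Read one vector back from disk."""
    with open(path) as f:
        return json.load(f)


def dot_similarity(u, v):
    """Placeholder similarity (plain dot product), assumed for this sketch."""
    return sum(a * b for a, b in zip(u, v))


def similarity_from_files(paths1, paths2, similarity=dot_similarity):
    """Pair up file names only, then read the vectors when scoring each pair."""
    results = {}
    for p1, p2 in itertools.product(paths1, paths2):
        key = (os.path.basename(p1), os.path.basename(p2))
        results[key] = similarity(load_vector(p1), load_vector(p2))
    return results


# Tiny hypothetical inputs written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths1 = save_vectors([[1.0, 2.0], [0.0, 1.0]], tmp, "a")
    paths2 = save_vectors([[3.0, 4.0]], tmp, "b")
    file_scores = similarity_from_files(paths1, paths2)
```

Here only the file names are paired up front; the vectors themselves are
read on demand when each pair is scored.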

Any ideas would be helpful.

Cheers,

Jao
