Optimizing pairwise similarity computation, or how to avoid the RDD.cartesian operation?

2014-10-17 Thread Jaonary Rabarisoa
Hi all, I need to compute a similarity between the elements of two large sets of high-dimensional feature vectors. Naively, I create all possible pairs of vectors with features1.cartesian(features2) and then map the resulting paired RDD with my similarity function. The problem is that the cartesian operation is prohibitively expensive.
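For reference, a minimal sketch of the naive approach described above. The Array[Double] feature type and the dot-product similarity are placeholder assumptions standing in for the real data and metric:

```scala
import org.apache.spark.rdd.RDD

// Placeholder for the similarity function; a dot product is used here
// only so the sketch runs -- the actual metric is application-specific.
def similarity(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Naive all-pairs scoring: cartesian materializes
// |features1| * |features2| pairs, which is what gets slow at scale.
def allPairScores(
    features1: RDD[Array[Double]],
    features2: RDD[Array[Double]]): RDD[Double] =
  features1.cartesian(features2).map { case (a, b) => similarity(a, b) }
```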

Re: Optimizing pairwise similarity computation, or how to avoid the RDD.cartesian operation?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there is a way you can reduce the problem space so that you only join subsets with each other, that may help. If you explain your problem in more detail, people on the list can come up with more suggestions. Best, Sonal
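One common way to realize this suggestion is blocking: assign each vector a coarse bucket key and join only within matching buckets. A hedged sketch, where bucketOf is a hypothetical blocking function (for example an LSH band or a coarse cluster id) that must be chosen for the data at hand:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical blocking function: vectors that could plausibly be
// similar should receive the same key. Left unimplemented here.
def bucketOf(v: Array[Double]): Int = ???

// Join only within matching buckets instead of the full cross product.
def blockedPairs(
    features1: RDD[Array[Double]],
    features2: RDD[Array[Double]]): RDD[(Array[Double], Array[Double])] = {
  val keyed1 = features1.keyBy(bucketOf)
  val keyed2 = features2.keyBy(bucketOf)
  keyed1.join(keyed2).values
}
```

The effectiveness of this approach depends entirely on how well the bucket key separates dissimilar vectors; if most vectors fall into one bucket, the join degenerates back toward the cartesian product.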

Re: Optimizing pairwise similarity computation, or how to avoid the RDD.cartesian operation?

2014-10-17 Thread Jaonary Rabarisoa
Hi Reza, Thank you for the suggestion. The number of points is not that large, about 1,000 in each set, so I have 1,000 x 1,000 pairs. But my similarity function is learned with a metric-learning-to-rank approach, so from Spark's point of view it is a black box. My idea was therefore just to distribute the computation of the pairwise similarities.
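At this scale (~1,000 vectors per side), one set easily fits in memory, so a broadcast-based formulation avoids the cartesian shuffle entirely: collect the smaller set on the driver, broadcast it, and score it against each element of the other set. A sketch under those assumptions, with similarity again standing in for the black-box learned metric:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in for the learned, black-box metric.
def similarity(a: Array[Double], b: Array[Double]): Double = ???

// features2 is small enough (~1,000 vectors) to collect and broadcast,
// so every executor scores it locally against its slice of features1.
def broadcastScores(
    sc: SparkContext,
    features1: RDD[Array[Double]],
    features2: RDD[Array[Double]]): RDD[Double] = {
  val small = sc.broadcast(features2.collect())
  features1.flatMap(a => small.value.map(b => similarity(a, b)))
}
```

This keeps the parallelism of the larger RDD while paying only one broadcast of the smaller set, instead of the shuffle that cartesian would incur.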