Hi all,
I need to compute a similarity between elements of two large sets of
high-dimensional feature vectors.
Naively, I create all possible pairs of vectors with
*features1.cartesian(features2)* and then map the resulting pair RDD with
my similarity function.
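To make the naive approach concrete, here is a minimal sketch in Python. It assumes each RDD holds (id, vector) tuples, which is my own choice of layout, and it uses cosine similarity purely as a placeholder for the real similarity function. The names cosine_similarity and all_pair_similarities are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Placeholder similarity (an assumption); swap in the real function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def all_pair_similarities(features1, features2, similarity=cosine_similarity):
    """Score every pair drawn from two RDDs of (id, vector) tuples.

    Returns an RDD of ((id1, id2), score). The (id, vector) layout is
    assumed for illustration only.
    """
    # cartesian() yields one element per pair: ((id1, vec1), (id2, vec2))
    pairs = features1.cartesian(features2)
    return pairs.map(
        lambda p: ((p[0][0], p[1][0]), similarity(p[0][1], p[1][1]))
    )
```

With two RDDs built by sc.parallelize(...), calling all_pair_similarities(features1, features2).collect() would return the scored pairs to the driver.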
The problem is that the cartesian
Cartesian joins of large datasets are usually going to be slow. If there
is a way you can reduce the problem space to make sure you only join
subsets with each other, that may be helpful. Maybe if you explain your
problem in more detail, people on the list can come up with more
suggestions.
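As a rough illustration of joining only subsets with each other, here is a sketch in Python. The blocking rule (sign pattern of the first few coordinates) is invented for the example and is not something from your setup; any rule that puts likely-similar vectors in the same block would take its place. It again assumes RDDs of (id, vector) tuples, and block_key and blocked_pairs are hypothetical names. Note that pairs falling in different blocks are never compared, so this only helps if missing those pairs is acceptable.

```python
def block_key(vector, num_dims=2):
    """Hypothetical blocking rule: sign pattern of the first num_dims coordinates."""
    return tuple(1 if x >= 0 else 0 for x in vector[:num_dims])


def blocked_pairs(features1, features2, num_dims=2):
    """Pair up only records that share a block key.

    features1, features2: RDDs of (id, vector) tuples (assumed layout).
    Returns an RDD of ((id1, vec1), (id2, vec2)) for same-block pairs.
    """
    keyed1 = features1.map(lambda rec: (block_key(rec[1], num_dims), rec))
    keyed2 = features2.map(lambda rec: (block_key(rec[1], num_dims), rec))
    # join() matches records with equal keys: (key, (rec1, rec2))
    return keyed1.join(keyed2).values()
```

The similarity function can then be mapped over the result of blocked_pairs instead of over the full cartesian product.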
Best
Hi Reza,
Thank you for the suggestion. The number of points is not that large, about
1000 for each set, so I have 1000x1000 pairs. But my similarity is
obtained using metric learning to rank, and from Spark's point of view it
is a black box. So my idea was just to distribute the computation of the