Hi all, I need to compute a similarity between the elements of two large sets of high-dimensional feature vectors. Naively, I create all possible pairs of vectors with *features1.cartesian(features2)* and then map the resulting paired RDD with my similarity function.
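To make the all-pairs pattern concrete, here is a minimal local sketch in plain Python, with no Spark involved. It is only an illustration under assumptions: `itertools.product` stands in for the RDD `cartesian` call, cosine similarity stands in for my actual similarity function, and the names `cosine_similarity` and `all_pairs_similarity` plus the tiny sample vectors are made up for this example.

```python
from itertools import product
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def all_pairs_similarity(features1, features2, similarity):
    """Score every (i, j) pair; product() plays the role of cartesian()."""
    pairs = product(enumerate(features1), enumerate(features2))
    # The comprehension corresponds to the map over the paired RDD.
    return [((i, j), similarity(u, v)) for (i, u), (j, v) in pairs]


# Hypothetical toy data: two sets of 2-dimensional feature vectors.
features1 = [[1.0, 0.0], [0.0, 2.0]]
features2 = [[3.0, 0.0], [1.0, 1.0]]

scores = all_pairs_similarity(features1, features2, cosine_similarity)
for (i, j), score in scores:
    print(i, j, round(score, 4))
```

The result has len(features1) * len(features2) entries, so the number of pairs grows quadratically with the set sizes.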
The problem is that the cartesian operation takes a lot of time, more than computing the similarity itself. If I instead save each of my feature vectors to disk, form a list of file name pairs, and compute the similarity by reading the files, it runs significantly faster. Any ideas would be helpful.

Cheers,
Jao