One idea is to avoid materializing the pairs of points before computing the distances between them. You could do that using the LSH signatures by building (Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute the distances between them. The points still have to be shuffled over the network, but at least the shuffle doesn't create multiple copies of each point (like a join by point ids would).
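A minimal sketch of that group-by-signature idea, assuming a hypothetical `signature` function (computing an LSH signature per vector) and a `measure` distance function — neither is defined in this thread, so both are placeholders:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

def pairwiseDistances(
    points: RDD[(Int, Vector)],
    signature: Vector => String,        // assumed: maps a vector to its LSH signature
    measure: (Vector, Vector) => Double // assumed: distance measure between two vectors
): RDD[((Int, Int), Double)] = {
  points
    // Build (Signature, (Int, Vector)) tuples
    .map { case (id, v) => (signature(v), (id, v)) }
    // One shuffle; each point is sent over the network exactly once
    .groupByKey()
    // Iterate pairwise within each bucket to compute distances
    .flatMap { case (_, bucket) =>
      val pts = bucket.toArray
      for {
        i <- pts.indices.iterator
        j <- (i + 1 until pts.length).iterator
      } yield ((pts(i)._1, pts(j)._1), measure(pts(i)._2, pts(j)._2))
    }
}
```

Only points that collide on a signature are compared, and the single groupByKey shuffle moves each point once, unlike a join keyed by point ids, which replicates a point once per pair it appears in.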
Here's an implementation of that idea in the context of finding nearest neighbors:

https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34

Best,
Karl

On Wed, Apr 27, 2016 at 10:22 PM nguyen duc tuan <newvalu...@gmail.com> wrote:
> Hi all,
> Currently, I'm working on implementing LSH on Spark. This leads to the
> following problem: I have an RDD[(Int, Int)] that stores all pairs of ids of
> vectors whose distances need to be computed, and another RDD[(Int, Vector)]
> that stores all vectors with their ids. Can anyone suggest an efficient way
> to compute the distances? The simple version I tried first is below, but it's
> inefficient because it requires shuffling a lot of data over the network:
>
> val rdd1: RDD[(Int, Int)] = ..
> val rdd2: RDD[(Int, Vector)] = ...
> val distances = rdd2.cartesian(rdd2)
>   .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
>   .join(rdd1.map(x => (x, 1)))
>   .mapValues(x => {
>     measure.compute(x._1._1, x._1._2)
>   })
>
> Thanks for any suggestion.