One idea is to avoid materializing the pairs of points before computing the
distances between them. You could do that using the LSH signatures by
building (Signature, (Int, Vector)) tuples, grouping by signature, and then
iterating pairwise over the resulting lists of points to compute the
distances between them. The points still have to be shuffled over the
network, but at least the shuffle doesn't create multiple copies of each
point (like a join by point ids would).
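
For example, here's a rough sketch of that approach (not the exact code from
the repo linked below). It assumes the signatures have already been computed
as (signature, (id, vector)) tuples, and it uses Euclidean distance as a
stand-in for whatever measure you actually need:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object PairwiseBySignature {
  // Placeholder signature type; a real LSH implementation might use an
  // array of hash values or a bit vector here.
  type Signature = String

  // Stand-in distance measure (Euclidean); swap in your own measure.
  def distance(a: Vector, b: Vector): Double =
    math.sqrt(Vectors.sqdist(a, b))

  // points: (signature, (id, vector)) tuples produced by hashing each vector.
  // Grouping by signature shuffles each point across the network once; the
  // candidate pairs are only formed locally, inside each bucket.
  def pairwiseDistances(
      points: RDD[(Signature, (Int, Vector))]): RDD[((Int, Int), Double)] = {
    points
      .groupByKey()
      .flatMap { case (_, bucket) =>
        val pts = bucket.toIndexedSeq
        for {
          i <- pts.indices
          j <- (i + 1) until pts.length
        } yield {
          val (idA, vecA) = pts(i)
          val (idB, vecB) = pts(j)
          ((idA, idB), distance(vecA, vecB))
        }
      }
  }
}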

Here's an implementation of that idea in the context of finding nearest
neighbors:
https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34

Best,
Karl



On Wed, Apr 27, 2016 at 10:22 PM nguyen duc tuan <newvalu...@gmail.com>
wrote:

> Hi all,
> Currently, I'm working on implementing LSH on Spark, which leads to the
> following problem. I have an RDD[(Int, Int)] that stores all pairs of ids of
> vectors for which I need to compute distances, and another RDD[(Int, Vector)]
> that stores all vectors with their ids. Can anyone suggest an efficient way
> to compute the distances? The simple version I tried first is as follows,
> but it's inefficient because it requires a lot of shuffling of data over the
> network.
>
> val rdd1: RDD[(Int, Int)] = ...
> val rdd2: RDD[(Int, Vector)] = ...
> // cartesian() pairs every point with every other point, keyed by id pair;
> // the join keeps only the requested pairs before computing each distance
> val distances = rdd2.cartesian(rdd2)
>       .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
>       .join(rdd1.map(x => (x, 1)))
>       .mapValues(x => measure.compute(x._1._1, x._1._2))
>
> Thanks for any suggestions.
>
