Hi Reza,

Thank you for the suggestion. The number of points is not that large: about 1,000 in each set, so I have 1000x1000 pairs. But my similarity is obtained with metric learning to rank, so from Spark's point of view it is a black box. My idea was simply to distribute the computation of the 1000x1000 similarities.
After some investigation, I managed to make it run faster. My feature vectors are obtained from a join operation, and I didn't cache the result of that join before the cartesian operation. Caching the result of the join makes my code run amazingly faster. So I think the real problem is my lack of good Spark programming practice.

Best,

Jao

On Fri, Oct 17, 2014 at 11:08 PM, Reza Zadeh <r...@databricks.com> wrote:
> Hi Jaonary,
>
> What are the numbers, i.e. the number of points you're trying to do all-pairs
> on, and the dimension of each?
>
> Have you tried the new implementation of columnSimilarities in RowMatrix?
> Setting the threshold high enough (potentially above 1.0) might solve your
> problem; here is an example
> <https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala>.
>
> This implements the DIMSUM sampling scheme, recently merged into master
> <https://github.com/apache/spark/pull/1778>.
>
> Best,
> Reza
>
> On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I need to compute a similarity between the elements of two large sets of
>> high-dimensional feature vectors.
>> Naively, I create all possible pairs of vectors with
>> *features1.cartesian(features2)* and then map the resulting paired RDD
>> with my similarity function.
>>
>> The problem is that the cartesian operation takes a lot of time, more time
>> than computing the similarity itself. If I save each feature vector
>> to disk, form a list of file-name pairs, and compute the similarity by
>> reading the files, it runs significantly faster.
>>
>> Any ideas would be helpful.
>>
>> Cheers,
>>
>> Jao
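[Editor's note: the caching fix described above can be sketched roughly as follows. This is a hedged, minimal sketch, not Jao's actual pipeline: the feature data, the placeholder dot-product similarity, and the names `features1`/`features2` are all stand-ins (the real similarity is a learned metric, treated here as a black box). The key point is calling `cache()` on each side before `cartesian`, so the lineage that produced the features is materialized once instead of being re-executed for every task of the cartesian product.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedCartesianSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cached-cartesian").setMaster("local[*]"))

    // Hypothetical stand-ins for the feature vectors that, in the real
    // pipeline, come out of a join operation: (id, featureVector) pairs.
    val ids = sc.parallelize(1 to 1000)
    val features1 = ids.map(i => (i, Array.fill(16)(i.toDouble)))
    val features2 = ids.map(i => (i, Array.fill(16)(i.toDouble + 1.0)))

    // Without cache(), the computation feeding each side would be
    // re-run for every partition pair of the cartesian product.
    val f1 = features1.cache()
    val f2 = features2.cache()

    // Placeholder similarity; the real one is a learned metric.
    def similarity(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    val sims = f1.cartesian(f2).map {
      case ((i, a), (j, b)) => ((i, j), similarity(a, b))
    }
    println(sims.count()) // 1000 x 1000 = 1000000 pairs
    sc.stop()
  }
}
```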