Hi Reza,
Thank you for the suggestion. The number of points is not that large, about
1,000 in each set, so I have 1000x1000 pairs. But my similarity is obtained
with metric learning to rank, so from Spark's point of view it is a black
box. My idea was simply to distribute the computation of the 1000x1000
similarities.
After some investigation, I managed to make it run faster. My feature
vectors are produced by a join operation, and I wasn't caching the result
of that join before the cartesian operation. Caching the join result makes
my code run dramatically faster. So I think my real problem was a lack of
good Spark programming practice.
Best
Jao
On Fri, Oct 17, 2014 at 11:08 PM, Reza Zadeh r...@databricks.com wrote:
Hi Jaonary,
What are the numbers, i.e. the number of points you're trying to do
all-pairs on, and the dimension of each?
Have you tried the new implementation of columnSimilarities in RowMatrix?
Setting the threshold high enough (potentially above 1.0) might solve your
problem; here is an example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
This implements the DIMSUM sampling scheme, recently merged into master
https://github.com/apache/spark/pull/1778.
Best,
Reza
On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all,
I need to compute a similarity between the elements of two large sets of
high-dimensional feature vectors.
Naively, I create all possible pairs of vectors with
* features1.cartesian(features2)* and then map the resulting paired RDD
with my similarity function.
The problem is that the cartesian operation takes a lot of time, more time
than computing the similarities themselves. If I save each feature vector
to disk, form a list of file-name pairs, and compute the similarities by
reading the files, it runs significantly faster.
Any ideas would be helpful,
Cheers,
Jao