Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation?

2014-10-17 Thread Jaonary Rabarisoa
Hi all,

I need to compute a similarity between elements of two large sets of
high-dimensional feature vectors.
Naively, I create all possible pairs of vectors with
*features1.cartesian(features2)* and then map the resulting paired RDD with
my similarity function.
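Concretely, the naive version looks roughly like this (just a sketch;
similarity stands in for my black-box scoring function):

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def naiveAllPairs(features1: RDD[(Long, Vector)],
                    features2: RDD[(Long, Vector)],
                    similarity: (Vector, Vector) => Double): RDD[((Long, Long), Double)] = {
    // Build every (vector, vector) pair, then score each pair.
    features1.cartesian(features2).map { case ((id1, v1), (id2, v2)) =>
      ((id1, id2), similarity(v1, v2))
    }
  }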

The problem is that the cartesian operation takes a lot of time, more time
than computing the similarity itself. If I save each of my feature vectors
to disk, form a list of file-name pairs, and compute the similarities by
reading the files, it runs significantly faster.

Any ideas would be helpful.

Cheers,

Jao


Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there
is a way you can reduce the problem space to make sure you only join
subsets with each other, that may be helpful. Maybe if you explain your
problem in more detail, people on the list can come up with more
suggestions.
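For example, if you can assign each vector a coarse blocking key so that
only vectors sharing a key need to be compared, a keyed join produces far
fewer pairs than a full cartesian. An untested sketch, where bucketOf is
whatever cheap bucketing function fits your data:

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def blockedPairs(features1: RDD[(Long, Vector)],
                   features2: RDD[(Long, Vector)],
                   bucketOf: Vector => Int): RDD[((Long, Vector), (Long, Vector))] = {
    val keyed1 = features1.map { case (id, v) => (bucketOf(v), (id, v)) }
    val keyed2 = features2.map { case (id, v) => (bucketOf(v), (id, v)) }
    // Only vectors that fall in the same bucket are paired up.
    keyed1.join(keyed2).values
  }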

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Fri, Oct 17, 2014 at 4:13 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Hi all,

 I need to compute a similarity between elements of two large sets of
 high-dimensional feature vectors.
 Naively, I create all possible pairs of vectors with
 *features1.cartesian(features2)* and then map the resulting paired RDD
 with my similarity function.

 The problem is that the cartesian operation takes a lot of time, more time
 than computing the similarity itself. If I save each of my feature vectors
 to disk, form a list of file-name pairs, and compute the similarities by
 reading the files, it runs significantly faster.

 Any ideas would be helpful.

 Cheers,

 Jao






Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation?

2014-10-17 Thread Jaonary Rabarisoa
Hi Reza,

Thank you for the suggestion. The number of points is not that large, about
1000 for each set, so I have 1000x1000 pairs. But my similarity is obtained
using metric learning to rank, and from Spark it is viewed as a black box.
So my idea was just to distribute the computation of the 1000x1000
similarities.

After some investigation, I managed to make it run faster. My feature
vectors are obtained from a join operation, and I didn't cache the result of
this operation before the cartesian operation. Caching the result of the
join makes my code run amazingly faster. So I think the real problem I have
is a lack of good practice in Spark programming.
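In code terms, the change amounts to this (sketch; the join outputs and the
similarity function are stand-ins for my actual pipeline):

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def allPairs(joined1: RDD[(Long, Vector)],
               joined2: RDD[(Long, Vector)],
               similarity: (Vector, Vector) => Double): RDD[((Long, Long), Double)] = {
    // Cache the join outputs so the cartesian product does not recompute
    // the joins once per partition pair.
    val f1 = joined1.cache()
    val f2 = joined2.cache()
    f1.cartesian(f2).map { case ((id1, v1), (id2, v2)) =>
      ((id1, id2), similarity(v1, v2))
    }
  }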

Best
Jao

On Fri, Oct 17, 2014 at 11:08 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Jaonary,

 What are the numbers, i.e. number of points you're trying to do all-pairs
 on, and the dimension of each?

 Have you tried the new implementation of columnSimilarities in RowMatrix?
 Setting the threshold high enough (potentially above 1.0) might solve your
 problem; here is an example:
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
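 Roughly, usage looks like this (a minimal sketch; rows is assumed to be an
 RDD[Vector] in which each of your items is a column of the matrix):

   import org.apache.spark.mllib.linalg.Vector
   import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
   import org.apache.spark.rdd.RDD

   def approxSimilarities(rows: RDD[Vector], threshold: Double): RDD[MatrixEntry] = {
     val mat = new RowMatrix(rows)
     // columnSimilarities(threshold) uses DIMSUM sampling: pairs whose
     // similarity falls below the threshold may be estimated less accurately,
     // in exchange for much less computation and shuffle.
     mat.columnSimilarities(threshold).entries
   }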

 This implements the DIMSUM sampling scheme, recently merged into master
 https://github.com/apache/spark/pull/1778.

 Best,
 Reza

 On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi all,

 I need to compute a similarity between elements of two large sets of
 high-dimensional feature vectors.
 Naively, I create all possible pairs of vectors with
 *features1.cartesian(features2)* and then map the resulting paired RDD
 with my similarity function.

 The problem is that the cartesian operation takes a lot of time, more time
 than computing the similarity itself. If I save each of my feature vectors
 to disk, form a list of file-name pairs, and compute the similarities by
 reading the files, it runs significantly faster.

 Any ideas would be helpful.

 Cheers,

 Jao