The general approach to the Cartesian problem is to first block or index
your rows so that similar items fall in the same bucket, and then join
within each bucket. Is that possible in your case?
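A minimal sketch of that idea, using the Java 8 / Spark 1.6 API you mention below. The bucketOf function and the sample data are placeholders, not your actual logic; in practice the blocking key would come from your domain, e.g. an LSH signature over whatever data the Pearson step compares:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class BlockedJoinSketch {

  // Placeholder blocking key -- a coarse quantile of the value. Replace
  // with an LSH signature or a domain-specific key so that elements
  // likely to be "close" land in the same bucket.
  static int bucketOf(double value) {
    return (int) Math.floor(value * 10);
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("blocked-join").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Stand-in for your <String, Double> RDD.
    JavaPairRDD<String, Double> rows = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 0.11), new Tuple2<>("b", 0.12), new Tuple2<>("c", 0.95)));

    // Key every element by its bucket.
    JavaPairRDD<Integer, Tuple2<String, Double>> bucketed =
        rows.mapToPair(r -> new Tuple2<>(bucketOf(r._2()), r));

    // Self-join within buckets: candidate pairs never cross buckets, so
    // the pair count stays far below the full n^2 cartesian.
    bucketed.join(bucketed)
        // the matrix is symmetric, so keep only one triangle
        .filter(p -> p._2()._1()._1().compareTo(p._2()._2()._1()) < 0)
        .values()
        .collect()
        .forEach(System.out::println);

    sc.stop();
  }
}

The trade-off is recall: pairs that fall in different buckets are never compared, so the blocking key has to be chosen so that genuinely similar items collide.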

On Friday, August 5, 2016, Paschalis Veskos <ves...@gmail.com> wrote:

> Hello everyone,
>
> I am interested in running an application on Spark that at some point
> needs to compare all elements of an RDD against all others to create a
> distance matrix. The RDD is of type <String, Double> and the Pearson
> correlation is applied to each element against all the others,
> generating a matrix with the distance between every pair of elements.
>
> I have implemented this by taking the cartesian product of the RDD
> with itself, filtering half of the matrix away since it is symmetric,
> and then doing a combineByKey to gather, for each element, all the
> others it needs to be compared with. I then map the comparison
> function, the Pearson correlation, over the output (sketched below).
>
> You can probably guess this is dead slow. I use Spark 1.6.2, and the
> code is written in Java 8. At the rate it is processing on a cluster
> of 4 nodes with 16 cores and 56 GB RAM each, for a list of ~15000
> elements split into 512 partitions, the cartesian operation alone is
> estimated to take about 3000 hours (all cores are maxed out on all
> machines)!
>
> Is there a way to avoid the cartesian product when calculating what I
> want? Would a DataFrame join be faster? Or is this an operation that
> simply requires a much larger cluster?
>
> Thank you,
>
> Paschalis
>
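For reference, the cartesian approach described above boils down to something like the following (again Java 8 / Spark 1.6; score is a stand-in for the Pearson step, and the combineByKey gathering stage is elided):

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

class CartesianSketch {

  // Stand-in scorer; the real code applies the Pearson correlation to
  // whatever data has been gathered for each pair.
  static double score(double a, double b) {
    return Math.abs(a - b);
  }

  // Full cartesian product, keep one triangle of the symmetric matrix,
  // then score every surviving pair.
  static JavaPairRDD<Tuple2<String, String>, Double> distances(
      JavaPairRDD<String, Double> rows) {
    return rows.cartesian(rows)
        .filter(p -> p._1()._1().compareTo(p._2()._1()) < 0)
        .mapToPair(p -> new Tuple2<>(
            new Tuple2<>(p._1()._1(), p._2()._1()),
            score(p._1()._2(), p._2()._2())));
  }
}

Even after dropping half of the matrix, ~15000 elements leave roughly 112 million pairs to shuffle and score, which is why blocking first pays off.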

-- 
Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
<http://in.linkedin.com/in/sonalgoyal>
