Hello everyone, I am a Spark novice facing a nontrivial problem.
I have an RDD consisting of many elements (say, 60K), where each element is a d-dimensional vector. I want to implement an iterative algorithm that does the following: at each iteration, apply an operation to *pairs* of elements (say, compute their dot product). The number of possible pairs is of course huge, but at each iteration I only need to consider a small random subset of them.

To minimize communication between nodes, I am willing to partition my RDD by key (assigning each element a random key) and to consider only pairs of elements that belong to the same partition (i.e., that share the same key). But I am not sure how to sample the pairs, how to apply the operation to them, and how to make sure that the computation for each pair is indeed done by the node holding the corresponding elements.

Any help would be greatly appreciated. Thanks a lot!

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pairwise-computations-within-partition-tp22436.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
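To make the question concrete, here is a rough sketch of what I have in mind, in PySpark. The per-partition pair sampling is a plain Python generator that could be handed to `mapPartitions`, so each node would only ever touch the vectors it already holds. All names and parameters here (`sample_pair_dot_products`, `num_pairs`, the number of partitions `P`) are my own assumptions, not an existing API:

```python
import random

def sample_pair_dot_products(partition, num_pairs, seed=None):
    """Given an iterator over (key, vector) elements of ONE partition,
    sample up to num_pairs random distinct pairs of vectors and yield
    ((i, j), dot_product). Everything happens locally on the node that
    holds the partition, so no vectors cross the network."""
    rng = random.Random(seed)
    # Materialize only this partition's vectors (fine if partitions are small).
    vectors = [v for _, v in partition]
    n = len(vectors)
    if n < 2:
        return  # not enough elements in this partition to form a pair
    for _ in range(num_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct local indices
        dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        yield ((i, j), dot)

# Hypothetical Spark usage (illustrative, untested):
# P = 100  # number of partitions / random keys
# keyed = rdd.map(lambda v: (random.randrange(P), v)).partitionBy(P).cache()
# results = keyed.mapPartitions(
#     lambda part: sample_pair_dot_products(part, num_pairs=100))
```

My understanding is that after `partitionBy`, `mapPartitions` runs on each partition where it lives, which is what would keep the pairwise computation on the node holding the data, but I am not sure this is the right or idiomatic way to do it.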