Hi Sean,

Thanks a lot for your reply. The problem is that I need to sample random *independent* pairs. If I draw two samples and build all n*(n-1) pairs, the pairs are strongly dependent, because each sampled element appears in many of them. My current solution is not satisfactory either, because some pairs (those whose elements are closest within a partition) have a much higher probability of being sampled. I am not sure how to fix this.

Aurelien
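
One possible way to get pairs that are independent of each other within a partition is to build each pair from two fresh uniform draws over the partition's elements. The sketch below is only an illustration, not something proposed in the thread: the helper name randomPairs and its parameters k and seed are hypothetical, and it assumes a partition fits in memory. It is written on plain Scala iterators so it can be tried without a cluster.

```scala
import scala.util.Random

// Hypothetical helper (not from the thread): draws k pairs from one partition.
// Each pair is built from two independent uniform draws, with replacement,
// so a pair may contain the same element twice.
def randomPairs[T](part: Iterator[T], k: Int, seed: Long): Iterator[(T, T)] = {
  val elems = part.toIndexedSeq // assumption: the partition fits in memory
  if (elems.isEmpty) Iterator.empty
  else {
    val rng = new Random(seed)
    Iterator.fill(k)((elems(rng.nextInt(elems.size)), elems(rng.nextInt(elems.size))))
  }
}
```

With Spark this could presumably be plugged in as rdd.mapPartitionsWithIndex((i, it) => randomPairs(it, 40, 42L + i)), using the partition index to vary the seed; the values 40 and 42L are arbitrary.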

On 16/04/2015 20:44, Sean Owen wrote:
Use mapPartitions, and then take two random samples of the elements in
the partition, and return an iterator over all pairs of them? Should
be pretty simple assuming your sample size n is smallish since you're
returning ~n^2 pairs.
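
A rough sketch of what this suggestion might look like is given below. The helper name allPairs and its parameters n and seed are hypothetical, the two samples are drawn with replacement, and the partition is assumed to fit in memory. It returns n * n pairs per partition, which is why n should stay small.

```scala
import scala.util.Random

// Hypothetical helper (not from the thread): takes two random samples of size n
// from one partition and returns an iterator over all n * n cross pairs.
def allPairs[T](part: Iterator[T], n: Int, seed: Long): Iterator[(T, T)] = {
  val elems = part.toIndexedSeq // assumption: the partition fits in memory
  if (elems.isEmpty) Iterator.empty
  else {
    val rng = new Random(seed)
    // Two samples of n elements each, drawn uniformly with replacement
    val s1 = IndexedSeq.fill(n)(elems(rng.nextInt(elems.size)))
    val s2 = IndexedSeq.fill(n)(elems(rng.nextInt(elems.size)))
    for (a <- s1.iterator; b <- s2.iterator) yield (a, b)
  }
}
```

In Spark this would be passed to mapPartitions, for example rdd.mapPartitionsWithIndex((i, it) => allPairs(it, 10, 42L + i)), where 10 and 42L are arbitrary values.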

On Thu, Apr 16, 2015 at 7:00 PM, abellet
<aurelien.bel...@telecom-paristech.fr> wrote:
Hi everyone,

I have a large RDD and I am trying to create an RDD containing a random
sample of pairs of elements from it. The two elements of a pair should come
from the same partition for efficiency. The idea I've come up with is to
take two random samples and then use zipPartitions to pair the i-th element
of the first sample with the i-th element of the second sample. Here is
some sample code illustrating the idea:

-----------
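// 60,000 integers spread over 16 partitions
val rdd = sc.parallelize(1 to 60000, 16)

// Two samples drawn with replacement (fraction 0.01), with different seeds
val sample1 = rdd.sample(true, 0.01, 42)
val sample2 = rdd.sample(true, 0.01, 43)

// Pairs the i-th element of s1 with the i-th element of s2 in one partition
def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] =
{
   var res = List[String]()
   while (s1.hasNext && s2.hasNext)
   {
     val x = s"${s1.next()} ${s2.next()}"
     res ::= x
   }
   res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)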
-------------

However, I am not happy with this solution, because each element is most
likely to be paired with elements that are close by in the partition. This
is because sample returns an "ordered" Iterator.

Any idea how to fix this? So far I have not found a way to efficiently
shuffle the random sample.
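
One option that might reduce the ordering effect is to shuffle one of the two samples inside each partition before zipping. The sketch below is an assumption-laden illustration: the helper name shuffledPairs and its seed parameter are hypothetical, and the second sample of a partition is assumed to be small enough to hold in memory. It does not by itself make the resulting pairs independent.

```scala
import scala.util.Random

// Hypothetical helper (not from the thread): shuffles the second sample of a
// partition, then pairs it position by position with the first sample.
def shuffledPairs[A, B](s1: Iterator[A], s2: Iterator[B], seed: Long): Iterator[(A, B)] = {
  // Materialize and shuffle only the second sample
  val shuffled = new Random(seed).shuffle(s2.toVector)
  // zip stops at the shorter of the two iterators, like the original while loop
  s1.zip(shuffled.iterator)
}
```

It could presumably replace myfunc in the call above, as in sample1.zipPartitions(sample2)((a, b) => shuffledPairs(a, b, 42L)). Note that this reuses the same seed in every partition, which may or may not be acceptable.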

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-pairs-RDD-order-tp22529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

