Hi everyone, I have a large RDD and I am trying to create an RDD of random pairs of elements sampled from it. For efficiency, the two elements of a pair should come from the same partition. The idea I've come up with is to take two random samples and then use zipPartitions to pair the i-th element of the first sample with the i-th element of the second sample. Here is some sample code illustrating the idea:
-----------
val rdd = sc.parallelize(1 to 60000, 16)
val sample1 = rdd.sample(true, 0.01, 42)
val sample2 = rdd.sample(true, 0.01, 43)

def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
  var res = List[String]()
  while (s1.hasNext && s2.hasNext) {
    val x = s1.next + " " + s2.next
    res ::= x
  }
  res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)
-------------

However, I am not happy with this solution, because each element is most likely to be paired with elements that are "close by" in the partition. This is because sample preserves the original order of the elements within each partition, so both samples come back as "ordered" iterators.

Any idea how to fix this? So far I have not found a way to shuffle the random samples efficiently.

Thanks a lot!
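P.S. For reference, here is a rough sketch of the kind of per-partition shuffle I had in mind (using mapPartitionsWithIndex and scala.util.Random.shuffle, with arbitrary per-partition seeds). It assumes each partition's sample fits comfortably in memory, and I am not sure it is efficient enough for large partitions:

-----------
import scala.util.Random

// Shuffle each partition's sampled elements in memory before zipping,
// so pairs are no longer limited to nearby elements.
val shuffled1 = sample1.mapPartitionsWithIndex { (idx, iter) =>
  val rng = new Random(42 + idx)   // arbitrary per-partition seed
  rng.shuffle(iter.toVector).iterator
}
val shuffled2 = sample2.mapPartitionsWithIndex { (idx, iter) =>
  val rng = new Random(43 + idx)
  rng.shuffle(iter.toVector).iterator
}

val pairs2 = shuffled1.zipPartitions(shuffled2)(myfunc)
-------------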