Hi Aurelien,

Sean's solution is nice, but maybe not completely order-free, since pairs will come from the same partition.

The easiest / fastest way to do it, in my opinion, is to use a random key instead of zipWithIndex. Of course you won't be able to ensure that each element appears only once across the pairs, but maybe you don't care since you're sampling with replacement already?

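// nextInt(k) yields a key in [0, k); a plain "rand() % k" could be negative
val a = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
val b = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
val pairs = a.join(b).values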

k should be roughly the number of elements you're sampling. Key collisions will skew the distribution (some elements get paired several times, others not at all), but I don't think it should hurt too much.
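
To get a feel for the collision effect without a cluster, here is a minimal local sketch of the same random-key pairing on plain Scala collections. It is only an illustration: the names RandomKeyPairing, withRandomKeys and joinOnKey are made up for this example, and joinOnKey is a local stand-in for an inner join on the key, not Spark code.

```scala
import scala.util.Random

object RandomKeyPairing {
  // Tag each element with a random key in [0, k), like the map step on the RDD.
  def withRandomKeys[A](xs: Seq[A], k: Int, rng: Random): Seq[(Int, A)] =
    xs.map(x => (rng.nextInt(k), x))

  // Hypothetical local stand-in for an inner join on the key: every left
  // element is paired with every right element that drew the same key.
  def joinOnKey[A](left: Seq[(Int, A)], right: Seq[(Int, A)]): Seq[(A, A)] = {
    val rightByKey: Map[Int, Seq[A]] =
      right.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }
    for {
      (key, x) <- left
      y <- rightByKey.getOrElse(key, Seq.empty)
    } yield (x, y)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val sampleA = (1 to 1000).map(i => s"a$i")
    val sampleB = (1 to 1000).map(i => s"b$i")
    val k = sampleA.size // k ~ number of sampled elements
    val pairs =
      joinOnKey(withRandomKeys(sampleA, k, rng), withRandomKeys(sampleB, k, rng))
    // Compare the number of pairs with the number of distinct left elements
    // that found at least one partner.
    println(s"pairs: ${pairs.size}, distinct left: ${pairs.map(_._1).distinct.size}")
  }
}
```

With n elements on each side and k = n, the expected number of pairs is n * n / k = n, but an element finds no partner at all with probability (1 - 1/k)^n, which is about 37% for large n, while other elements show up in several pairs. That is the skew I mean.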

Guillaume
Hi everyone,
However I am not happy with this solution because each element is most
likely to be paired with elements that are "closeby" in the partition. This
is because sample returns an "ordered" Iterator.



--
eXenSa

        
*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705
