Hi Aurelien,
Sean's solution is nice, but maybe not completely order-free, since
pairs will come from the same partition.
The easiest / fastest way to do it, in my opinion, is to use a random key
instead of zipWithIndex. Of course you won't be able to ensure
uniqueness of each element of the pairs, but maybe you don't care since
you're sampling with replacement already?
val a = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
val b = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
k should be roughly the number of elements you're sampling. Joining a and b
on the key will give you a skewed distribution due to collisions (some
elements get several partners, some get none), but I don't think it should
hurt too much.
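To get a feel for the collisions, here is a minimal sketch of the random-key pairing on plain Scala collections rather than RDDs, so it runs without a cluster. The names RandomKeyPairing and pairByRandomKey are made up for the illustration, and the inner join stands in for what I assume would be a.join(b).values on the keyed RDDs. With n elements on each side, every (x, y) combination matches with probability 1/k, so you get about n*n/k pairs on average; with k = n, a given element finds no partner with probability (1 - 1/n)^n, roughly 37%.

```scala
import scala.util.Random

object RandomKeyPairing {
  // Pair elements of two samples by joining them on a random key in [0, k).
  // Plain-collection stand-in (an assumption, not Spark code) for joining
  // two keyed RDDs on that key.
  def pairByRandomKey[T](left: Seq[T], right: Seq[T], k: Int, rng: Random): Seq[(T, T)] = {
    val a = left.map(x => (rng.nextInt(k), x))
    val b = right.map(x => (rng.nextInt(k), x))

    // Index the right side by key so the join is a simple lookup.
    val bByKey: Map[Int, Seq[T]] =
      b.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

    // Inner join: each left element meets every right element sharing its key.
    for {
      (key, x) <- a
      y <- bByKey.getOrElse(key, Seq.empty[T])
    } yield (x, y)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val n = 1000
    val sampleA = Seq.fill(n)(rng.nextInt(1000000))
    val sampleB = Seq.fill(n)(rng.nextInt(1000000))

    // k is set to the sample size, as suggested above.
    val pairs = pairByRandomKey(sampleA, sampleB, n, rng)
    println(s"pairs produced: ${pairs.size} (about $n expected on average)")
  }
}
```

A larger k gives fewer collisions but also fewer pairs overall, so if you need a given number of pairs you may have to oversample a bit.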
Guillaume
Hi everyone,
However, I am not happy with this solution because each element is most
likely to be paired with elements that are "close by" in the partition. This
is because sample returns an "ordered" Iterator.