Hi Imran,

Thanks for the suggestion! Unfortunately the types do not match. But I could write my own function that shuffles the sample.
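
For instance, something along these lines might work (just an untested
sketch): wrapping the shuffle so that mapPartitions gets back an Iterator.
It still materializes each partition's sample in memory, and since
scala.util.Random.shuffle uses the global generator, the shuffle itself is
not seeded.

// shuffle each partition's sample in memory before zipping the two samples
val sample1 = rdd.sample(true, 0.01, 42)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)
val sample2 = rdd.sample(true, 0.01, 43)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)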

On 4/17/15 9:34 PM, Imran Rashid wrote:
If you can store the entire sample for one partition in memory, I think
you just want:

val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(scala.util.Random.shuffle)
val sample2 = rdd.sample(true, 0.01, 43).mapPartitions(scala.util.Random.shuffle)

...



On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet
<aurelien.bel...@telecom-paristech.fr> wrote:

    Hi Sean,

    Thanks a lot for your reply. The problem is that I need to sample
    random *independent* pairs. If I draw two samples and build all
    n*(n-1) pairs, there is a lot of dependency between them. My current
    solution is also not satisfying, because some pairs (the closest ones
    within a partition) have a much higher probability of being sampled.
    I'm not sure how to fix this.

    Aurelien


    On 16/04/2015 20:44, Sean Owen wrote:

        Use mapPartitions, and then take two random samples of the elements
        in the partition, and return an iterator over all pairs of them?
        Should be pretty simple assuming your sample size n is smallish,
        since you're returning ~n^2 pairs.
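
        Roughly, an untested sketch (the fixed per-partition sample size n
        and the string pairs are just illustrative; it assumes each
        partition's elements fit in memory):

        val pairs = rdd.mapPartitions { it =>
          val elems = it.toArray
          if (elems.isEmpty) Iterator.empty
          else {
            val rng = new scala.util.Random()
            val n = 10  // per-partition sample size, adjust as needed
            // draw two samples with replacement from this partition
            val s1 = Array.fill(n)(elems(rng.nextInt(elems.length)))
            val s2 = Array.fill(n)(elems(rng.nextInt(elems.length)))
            // emit all n*n cross pairs
            (for (a <- s1; b <- s2) yield s"$a $b").iterator
          }
        }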

        On Thu, Apr 16, 2015 at 7:00 PM, abellet
        <aurelien.bel...@telecom-paristech.fr> wrote:

            Hi everyone,

            I have a large RDD and I am trying to create an RDD of random
            pairs of elements sampled from it. The elements composing a pair
            should come from the same partition for efficiency. The idea I
            have come up with is to take two random samples and then use
            zipPartitions to pair the i-th element of the first sample with
            the i-th element of the second sample. Here is some sample code
            illustrating the idea:

            -----------
            val rdd = sc.parallelize(1 to 60000, 16)

            val sample1 = rdd.sample(true, 0.01, 42)
            val sample2 = rdd.sample(true, 0.01, 43)

            // zip the i-th element of each partition of sample1 with the
            // i-th element of the corresponding partition of sample2
            def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
              var res = List[String]()
              while (s1.hasNext && s2.hasNext) {
                val x = s1.next + " " + s2.next
                res ::= x
              }
              res.iterator
            }

            val pairs = sample1.zipPartitions(sample2)(myfunc)
            -------------

            However I am not happy with this solution, because each element
            is most likely to be paired with elements that are close by in
            the partition. This is because sample returns an "ordered"
            Iterator.

            Any idea how to fix this? So far I have not found an efficient
            way to shuffle the random sample.

            Thanks a lot!



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
