You can do something like this:

import scala.util.Random
import org.apache.spark.rdd.PartitionPruningRDD

val myRdd = ...

// Keep each partition with probability 0.1, i.e. sample roughly 10% of the partitions.
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)

// Take the first 10 elements out of each surviving partition.
rddSampledByPartition.mapPartitions { iter => iter.take(10) }
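
For comparison, Sean's mapPartitionsWithIndex suggestion (quoted below) would look roughly like this. This is just a sketch; the "keep every 10th partition" rule and the name sampledByIndex are placeholders, not anything from the thread:

// Decide per partition index whether to process it at all; skip the rest.
val sampledByIndex = myRdd.mapPartitionsWithIndex { (i, iter) =>
  if (i % 10 == 0) iter.take(10) else Iterator.empty
}

One difference to keep in mind: PartitionPruningRDD drops the skipped partitions from the RDD entirely, so no tasks are scheduled for them, whereas mapPartitionsWithIndex still runs a (cheap, empty) task per skipped partition.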



On Thu, May 21, 2015 at 11:36 AM, Sean Owen <so...@cloudera.com> wrote:

> If sampling whole partitions is sufficient (or a part of a partition),
> sure you could mapPartitionsWithIndex and decide whether to process a
> partition at all based on its # and skip the rest. That's much faster.
>
> On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > I don't need to be 100% random. How about randomly picking a few
> > partitions and returning all docs in those partitions? Is
> > rdd.mapPartitionsWithIndex() the right method to use to process just a
> > small portion of the partitions?
> >
> > Ningjun
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
