You can do something like this:

import scala.util.Random
import org.apache.spark.rdd.PartitionPruningRDD

val myRdd = ...

// Keep each partition with probability 0.1, i.e. sample roughly 10% of the partitions.
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)

// Take the first 10 elements out of each surviving partition.
rddSampledByPartition.mapPartitions { iter => iter.take(10) }
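
For comparison, Sean's mapPartitionsWithIndex suggestion (quoted below) would look roughly like this. This is just a sketch; the "keep every 10th partition" rule and the name sampledByIndex are placeholders, not anything from the thread:

// Decide per partition index whether to process it at all; skip the rest.
val sampledByIndex = myRdd.mapPartitionsWithIndex { (i, iter) =>
  if (i % 10 == 0) iter.take(10) else Iterator.empty
}

One difference to keep in mind: PartitionPruningRDD drops the skipped partitions from the RDD entirely, so no tasks are scheduled for them, whereas mapPartitionsWithIndex still runs a (cheap, empty) task per skipped partition.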



On Thu, May 21, 2015 at 11:36 AM, Sean Owen <so...@cloudera.com> wrote:

> If sampling whole partitions is sufficient (or a part of a partition),
> sure you could mapPartitionsWithIndex and decide whether to process a
> partition at all based on its # and skip the rest. That's much faster.
>
> On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > I don't need to be 100% random. How about randomly picking a few
> > partitions and returning all docs in those partitions? Is
> > rdd.mapPartitionsWithIndex() the right method to use to process just a
> > small portion of the partitions?
> >
> > Ningjun
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
