You can do something like this:

    import scala.util.Random
    import org.apache.spark.rdd.PartitionPruningRDD

    val myRdd = ...

    // Keep roughly 10% of the partitions.
    val rddSampledByPartition =
      PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)

    // Take the first 10 elements out of each surviving partition.
    rddSampledByPartition.mapPartitions { iter => iter.take(10) }

On Thu, May 21, 2015 at 11:36 AM, Sean Owen <so...@cloudera.com> wrote:
> If sampling whole partitions is sufficient (or a part of a partition),
> sure, you could mapPartitionsWithIndex and decide whether to process a
> partition at all based on its # and skip the rest. That's much faster.
>
> On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > I don't need it to be 100% random. How about randomly picking a few
> > partitions and returning all docs in those partitions? Is
> > rdd.mapPartitionsWithIndex() the right method to use to process just a
> > small portion of the partitions?
> >
> > Ningjun
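The partition-skipping idea Sean describes can be illustrated with plain Scala
collections (a minimal sketch, no Spark needed: `partitions` here is a toy
stand-in for an RDD's partitions, and the 30% rate and fixed seed are just for
illustration):

    import scala.util.Random

    // Toy stand-in for an RDD: 10 partitions of 100 elements each.
    val partitions: Seq[Seq[Int]] =
      Seq.tabulate(10)(p => Seq.tabulate(100)(e => p * 100 + e))

    // Decide up front (on the "driver", with a fixed seed for reproducibility)
    // which partition indices to keep -- roughly 30% of them here.
    val rng = new Random(7L)
    val keep: Set[Int] =
      partitions.indices.filter(_ => rng.nextDouble() < 0.3).toSet

    // Analogue of rdd.mapPartitionsWithIndex: skip unselected partitions
    // entirely, and take only the first 10 elements of each selected one.
    val sampled: Seq[Int] = partitions.zipWithIndex.flatMap { case (part, idx) =>
      if (keep.contains(idx)) part.take(10) else Seq.empty
    }

In real Spark code the same shape applies: compute `keep` once from
`rdd.partitions.length`, then in `mapPartitionsWithIndex` return the iterator
unchanged for kept indices and `Iterator.empty` otherwise, so skipped
partitions do no per-element work.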