You can do something like this:
val myRdd = ...
// Keep a partition only when a random draw falls below 0.1, i.e. sample
// roughly 10% of the partitions without touching the others.
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)
// Take the first 10 elements out of each surviving partition.
rddSampledByPartition.mapPartitions { iter => iter.take(10) }
...@cloudera.com]
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
I guess the fundamental issue is that these aren't stored in a way that allows
random access to an individual Document.
Underneath, Hadoop has a concept of a MapFile, which does support keyed random
access via an index.
Access to these files is inherently sequential. There isn't, in general, a way
to know where record N is in a file like this and jump to it, so the records
must all be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM
...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
If sampling whole partitions (or part of each partition) is sufficient, sure,
you could use mapPartitionsWithIndex and decide whether to process a partition
at all based on its index, skipping the rest. That's much faster.
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Hi