Re: rdd.sample() methods very slow

2015-05-22 Thread Reynold Xin
You can do something like this:
  import org.apache.spark.rdd.PartitionPruningRDD
  import scala.util.Random
  val myRdd = ...
  // this samples 10% of the partitions
  val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)
  // take the first 10 elements out of each partition
  rddSampledByPartition.mapPartitions { iter => iter.take(10) }

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
...@cloudera.com]
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
I guess the fundamental issue is that these aren't stored in a way that allows random access to a Document. Underneath, Hadoop has a concept of a MapFile

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
The way these files are accessed is inherently sequential. There isn't, in general, a way to know where record N is in a file like this and jump to it. So they must be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
The way these files are accessed is inherently sequential. There isn't, in general, a way to know where record N is in a file like

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
If sampling whole partitions (or part of each partition) is sufficient, sure, you could use mapPartitionsWithIndex, decide from each partition's index whether to process it at all, and skip the rest. That's much faster.
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com
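
The mapPartitionsWithIndex suggestion could be sketched roughly as follows. This is only an illustration, not code from the thread: the names PartitionSampling, choosePartitions, sampleByPartition, perPartition and seed are made up here, and picking the kept partitions on the driver with a seeded Random is an assumption made so the result is reproducible.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

object PartitionSampling {
  // Hypothetical helper: keep each partition index with probability `fraction`.
  // Runs on the driver; a fixed seed makes the choice repeatable.
  def choosePartitions(numPartitions: Int, fraction: Double, seed: Long): Set[Int] = {
    val rng = new Random(seed)
    (0 until numPartitions).filter(_ => rng.nextDouble() < fraction).toSet
  }

  // Hypothetical helper: return at most `perPartition` elements from each
  // chosen partition and nothing from the others.
  def sampleByPartition[T: ClassTag](
      rdd: RDD[T],
      fraction: Double,
      perPartition: Int,
      seed: Long): RDD[T] = {
    val keep = choosePartitions(rdd.partitions.length, fraction, seed)
    rdd.mapPartitionsWithIndex { (index, iter) =>
      // The input iterator is never consumed for partitions that are skipped.
      if (keep.contains(index)) iter.take(perPartition) else Iterator.empty
    }
  }
}
```

A call such as PartitionSampling.sampleByPartition(myRdd, 0.1, 10, 42L) would then take up to 10 elements from roughly 10% of the partitions. As far as I understand, a task is still scheduled for every partition with this approach, whereas PartitionPruningRDD drops the unwanted partitions before any task is created.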

Re: rdd.sample() methods very slow

2015-05-19 Thread Sean Owen
The way these files are accessed is inherently sequential. There isn't, in general, a way to know where record N is in a file like this and jump to it. So they must be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Hi
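
To illustrate the sequential-access point, here is a small self-contained sketch. It assumes a made-up layout of length-prefixed records (a 4-byte length followed by the payload), which is not the actual on-disk format discussed in the thread; SequentialRecords, encode and readRecord are hypothetical names. Because records vary in length, the offset of record N is only known after walking the headers of every record before it.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object SequentialRecords {
  // Encode records back to back as [4-byte length][payload] (assumed layout).
  def encode(records: Seq[String]): Array[Byte] = {
    val payloads = records.map(_.getBytes(UTF_8))
    val buf = ByteBuffer.allocate(payloads.map(_.length + 4).sum)
    payloads.foreach { p => buf.putInt(p.length); buf.put(p) }
    buf.array()
  }

  // Return record n (0-based) together with the number of length headers
  // that had to be read to locate it.
  def readRecord(data: Array[Byte], n: Int): (String, Int) = {
    val buf = ByteBuffer.wrap(data)
    var headersRead = 0
    var i = 0
    while (i < n) {
      val len = buf.getInt()             // header of a record we do not want
      headersRead += 1
      buf.position(buf.position() + len) // skip its payload
      i += 1
    }
    val len = buf.getInt()               // header of the record we do want
    headersRead += 1
    val payload = new Array[Byte](len)
    buf.get(payload)
    (new String(payload, UTF_8), headersRead)
  }
}
```

For example, after encoding Seq("a", "bcd", "ef"), reading record 2 returns "ef" but only after three headers have been read. Without a separate index of offsets, the cost of reaching a record grows with its position in the file.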