You can do something like this:
val myRdd = ...
// Keep a partition only when a random draw falls below 0.1, i.e. sample
// roughly 10% of the partitions without touching the others.
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)
// Take the first 10 elements out of each surviving partition.
rddSampledByPartition.mapPartitions { iter => iter.take(10) }
...@cloudera.com]
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
I guess the fundamental issue is that these aren't stored in a way that allows
random access to an individual Document.
Underneath, Hadoop has a concept of a MapFile, which does support keyed random
access via an index.
Access to these files is inherently sequential. There isn't, in general, a way
to know where record N is in a file like this and jump to it, so the records
must all be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM
...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
If sampling whole partitions (or part of each partition) is sufficient, sure,
you could use mapPartitionsWithIndex and decide whether to process a partition
at all based on its index, skipping the rest. That's much faster.
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Hi