Re: How to process one partition at a time?

Hemant Bhanawat Wed, 06 Apr 2016 22:23:27 -0700

Apparently, there is another way to do it. You can try creating a
PartitionPruningRDD and pass a partition filter function to it. This RDD
will do the same thing that I suggested in my mail and you will not have to
create a new RDD.


Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Wed, Apr 6, 2016 at 5:35 PM, Sun, Rui <rui....@intel.com> wrote:

> Maybe you can try SparkContext.submitJob:
>
> *def **submitJob**[T, U, R](rdd: RDD
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>[T],
>  processPartition:
> (Iterator[T]) **⇒** U, partitions: Seq[Int], resultHandler: (Int, U) **⇒** 
> Unit, resultFunc:
> **⇒** R): SimpleFutureAction
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SimpleFutureAction.html>[R]*
>
>
>
>
>
> *From:* Hemant Bhanawat [mailto:hemant9...@gmail.com]
> *Sent:* Wednesday, April 6, 2016 7:16 PM
> *To:* Andrei <faithlessfri...@gmail.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: How to process one partition at a time?
>
>
>
> Instead of doing it in compute, you could rather override getPartitions
> method of your RDD and return only the target partitions. This way tasks
> for only target partitions will be created. Currently in your case, tasks
> for all the partitions are getting created.
>
> I hope it helps. I would like to hear if you take some other approach.
>
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>
> www.snappydata.io
>
>
>
> On Wed, Apr 6, 2016 at 3:49 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
> I'm writing a kind of sampler which in most cases will require only 1
> partition, sometimes 2 and very rarely more. So it doesn't make sense to
> process all partitions in parallel. What is the easiest way to limit
> computations to one partition only?
>
>
>
> So far the best idea I came to is to create a custom partition whose
> `compute` method looks something like:
>
>
>
> def compute(split: Partition, context: TaskContext) = {
>
>     if (split.index == targetPartition) {
>
>         // do computation
>
>     } else {
>
>        // return empty iterator
>
>     }
>
> }
>
>
>
>
>
> But it's quite ugly and I'm unlikely to be the first person with such a
> need. Is there easier way to do it?
>
>
>
>
>
>
>

Re: How to process one partition at a time?

Reply via email to