Re: How to process one partition at a time?

2016-04-07 Thread Andrei
Thanks everyone, both `submitJob` and `PartitionPruningRDD` work for me.

On Thu, Apr 7, 2016 at 8:22 AM, Hemant Bhanawat <hemant9...@gmail.com>
wrote:

> Apparently, there is another way to do it. You can try creating a
> PartitionPruningRDD and passing a partition filter function to it. This RDD
> does the same thing I suggested in my earlier mail, but you will not have to
> write your own RDD.
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Wed, Apr 6, 2016 at 5:35 PM, Sun, Rui <rui@intel.com> wrote:
>
>> Maybe you can try SparkContext.submitJob:
>>
>> def submitJob[T, U, R](rdd: RDD
>> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>[T],
>> processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int],
>> resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R): SimpleFutureAction
>> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SimpleFutureAction.html>[R]
>>
>>
>>
>>
>>
>> From: Hemant Bhanawat [mailto:hemant9...@gmail.com]
>> Sent: Wednesday, April 6, 2016 7:16 PM
>> To: Andrei <faithlessfri...@gmail.com>
>> Cc: user <user@spark.apache.org>
>> Subject: Re: How to process one partition at a time?
>>
>>
>>
>> Instead of doing it in compute, you could override the getPartitions
>> method of your RDD and return only the target partitions. That way, tasks
>> are created only for the target partitions. Currently, in your case, tasks
>> are created for all partitions.
>>
>> I hope it helps. I would like to hear if you take some other approach.
>>
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>>
>> www.snappydata.io
>>
>>
>>
>> On Wed, Apr 6, 2016 at 3:49 PM, Andrei <faithlessfri...@gmail.com> wrote:
>>
>> I'm writing a kind of sampler which in most cases will require only 1
>> partition, sometimes 2 and very rarely more. So it doesn't make sense to
>> process all partitions in parallel. What is the easiest way to limit
>> computations to one partition only?
>>
>>
>>
>> So far the best idea I have come up with is to create a custom RDD whose
>> `compute` method looks something like:
>>
>>
>>
>> def compute(split: Partition, context: TaskContext) = {
>>   if (split.index == targetPartition) {
>>     // do computation
>>   } else {
>>     // return empty iterator
>>   }
>> }
>>
>>
>>
>>
>>
>> But it's quite ugly, and I'm unlikely to be the first person with such a
>> need. Is there an easier way to do it?


Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Apparently, there is another way to do it. You can try creating a
PartitionPruningRDD and passing a partition filter function to it. This RDD
does the same thing I suggested in my earlier mail, but you will not have to
write your own RDD.
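
For illustration, a minimal sketch of that approach, assuming an existing rdd
and a hypothetical targetPartition index (note that PartitionPruningRDD is part
of Spark's developer API):

import org.apache.spark.rdd.PartitionPruningRDD

// Keep only the partitions whose index passes the filter; here a single one.
// Nothing is moved or recomputed, the other partitions are simply pruned,
// so no tasks are scheduled for them.
val pruned = PartitionPruningRDD.create(rdd, idx => idx == targetPartition)

// Any action now launches a task only for the surviving partition.
val sample = pruned.collect()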

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Wed, Apr 6, 2016 at 5:35 PM, Sun, Rui <rui@intel.com> wrote:

> Maybe you can try SparkContext.submitJob:
>
> def submitJob[T, U, R](rdd: RDD
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>[T],
> processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int],
> resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R): SimpleFutureAction
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SimpleFutureAction.html>[R]
>
>
>
>
>
> From: Hemant Bhanawat [mailto:hemant9...@gmail.com]
> Sent: Wednesday, April 6, 2016 7:16 PM
> To: Andrei <faithlessfri...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How to process one partition at a time?
>
>
>
> Instead of doing it in compute, you could override the getPartitions
> method of your RDD and return only the target partitions. That way, tasks
> are created only for the target partitions. Currently, in your case, tasks
> are created for all partitions.
>
> I hope it helps. I would like to hear if you take some other approach.
>
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>
> www.snappydata.io
>
>
>
> On Wed, Apr 6, 2016 at 3:49 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
> I'm writing a kind of sampler which in most cases will require only 1
> partition, sometimes 2 and very rarely more. So it doesn't make sense to
> process all partitions in parallel. What is the easiest way to limit
> computations to one partition only?
>
>
>
> So far the best idea I have come up with is to create a custom RDD whose
> `compute` method looks something like:
>
>
>
> def compute(split: Partition, context: TaskContext) = {
>   if (split.index == targetPartition) {
>     // do computation
>   } else {
>     // return empty iterator
>   }
> }
>
>
>
>
>
> But it's quite ugly, and I'm unlikely to be the first person with such a
> need. Is there an easier way to do it?


RE: How to process one partition at a time?

2016-04-06 Thread Sun, Rui
Maybe you can try SparkContext.submitJob:
def submitJob[T, U, R](rdd: 
RDD<http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>[T],
 processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int], resultHandler: 
(Int, U) ⇒ Unit, resultFunc: ⇒ R): 
SimpleFutureAction<http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SimpleFutureAction.html>[R]
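
For example, a rough sketch of using it to count the elements of just one
partition, assuming rdd is an RDD[String], sc is its SparkContext, and
targetPartition is a hypothetical partition index:

import scala.collection.mutable
import scala.concurrent.Await
import scala.concurrent.duration._

// Collect each processed partition's result by partition index (on the driver).
val results = mutable.Map[Int, Long]()

val futureAction = sc.submitJob[String, Long, Map[Int, Long]](
  rdd,
  (it: Iterator[String]) => it.size.toLong,   // processPartition: runs inside the task
  Seq(targetPartition),                       // only this partition gets a task
  (index, count) => results(index) = count,   // resultHandler: called on the driver per partition
  results.toMap                               // resultFunc: overall result once the job finishes
)

// SimpleFutureAction is a scala.concurrent.Future, so it can be awaited.
val counts: Map[Int, Long] = Await.result(futureAction, 10.minutes)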


From: Hemant Bhanawat [mailto:hemant9...@gmail.com]
Sent: Wednesday, April 6, 2016 7:16 PM
To: Andrei <faithlessfri...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: How to process one partition at a time?

Instead of doing it in compute, you could override the getPartitions method of
your RDD and return only the target partitions. That way, tasks are created only
for the target partitions. Currently, in your case, tasks are created for all
partitions.
I hope it helps. I would like to hear if you take some other approach.

Hemant Bhanawat<https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io<http://www.snappydata.io>

On Wed, Apr 6, 2016 at 3:49 PM, Andrei 
<faithlessfri...@gmail.com<mailto:faithlessfri...@gmail.com>> wrote:
I'm writing a kind of sampler which in most cases will require only 1 
partition, sometimes 2 and very rarely more. So it doesn't make sense to 
process all partitions in parallel. What is the easiest way to limit 
computations to one partition only?

So far the best idea I have come up with is to create a custom RDD whose
`compute` method looks something like:

def compute(split: Partition, context: TaskContext) = {
  if (split.index == targetPartition) {
    // do computation
  } else {
    // return empty iterator
  }
}


But it's quite ugly, and I'm unlikely to be the first person with such a need.
Is there an easier way to do it?





Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Instead of doing it in compute, you could override the getPartitions
method of your RDD and return only the target partitions. That way, tasks
are created only for the target partitions. Currently, in your case, tasks
are created for all partitions.
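
For reference, a minimal sketch of that idea, assuming a parent RDD and a
hypothetical targetPartition index; the kept partition is re-indexed to 0 so
the new RDD's partition indices stay contiguous:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper that re-indexes the kept parent partition to 0.
case class KeptPartition(index: Int, parentSplit: Partition) extends Partition

// Exposes a single partition of `parent`, so only one task is ever created.
class SinglePartitionRDD[T: ClassTag](parent: RDD[T], targetPartition: Int)
  extends RDD[T](parent) {

  override protected def getPartitions: Array[Partition] =
    Array(KeptPartition(0, parent.partitions(targetPartition)))

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split.asInstanceOf[KeptPartition].parentSplit, context)
}

PartitionPruningRDD, suggested elsewhere in this thread, packages essentially
the same pattern (including the dependency bookkeeping), so in practice it is
usually simpler to use it than to hand-roll a wrapper like this.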

I hope it helps. I would like to hear if you take some other approach.


Hemant Bhanawat 
www.snappydata.io

On Wed, Apr 6, 2016 at 3:49 PM, Andrei  wrote:

> I'm writing a kind of sampler which in most cases will require only 1
> partition, sometimes 2 and very rarely more. So it doesn't make sense to
> process all partitions in parallel. What is the easiest way to limit
> computations to one partition only?
>
> So far the best idea I have come up with is to create a custom RDD whose
> `compute` method looks something like:
>
> def compute(split: Partition, context: TaskContext) = {
>   if (split.index == targetPartition) {
>     // do computation
>   } else {
>     // return empty iterator
>   }
> }
>
>
>
> But it's quite ugly, and I'm unlikely to be the first person with such a
> need. Is there an easier way to do it?
>
>
>


Re: How to process one partition at a time?

2016-04-06 Thread Andrei
I'm writing a kind of sampler which in most cases will require only 1
partition, sometimes 2 and very rarely more. So it doesn't make sense to
process all partitions in parallel. What is the easiest way to limit
computations to one partition only?

So far the best idea I have come up with is to create a custom RDD whose
`compute` method looks something like:

def compute(split: Partition, context: TaskContext) = {
  if (split.index == targetPartition) {
    // do computation
  } else {
    // return empty iterator
  }
}



But it's quite ugly, and I'm unlikely to be the first person with such a
need. Is there an easier way to do it?