Thanks TD and Ashish.
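For anyone finding this thread later, here is a rough sketch of TD's two options in Scala. `callService`, the pool size, and the timeout are placeholders, not the real REST clients:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for one of the REST calls; replace with a real client.
def callService(record: String): String = s"result-for-$record"

// Option 1: create a threadpool on demand inside foreachPartition, hand the
// REST calls off to it as Futures, wait for them, and clean the pool up
// before the partition finishes.
def processPartition(records: Iterator[String]): List[String] = {
  val pool = Executors.newFixedThreadPool(8) // pool size is a placeholder
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // Fire all calls concurrently, then block until every future completes.
    val futures = records.map(r => Future(callService(r))).toList
    Await.result(Future.sequence(futures), 60.seconds)
  } finally {
    pool.shutdown() // clean up the threadpool before finishing
  }
}

// Option 2: a lazily created singleton pool, reused across batches on each
// executor JVM, which avoids paying the pool-creation cost every batch.
object RestCallPool {
  lazy val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
}

println(processPartition(Iterator("a", "b", "c")))
```

In a real job, `processPartition` would be the body passed to `rdd.foreachPartition`, so everything it closes over has to be serializable or created on the executor (which is what the lazy singleton in option 2 gives you).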

On Mon, Oct 5, 2015 at 9:14 PM, Tathagata Das <t...@databricks.com> wrote:

> You could create a threadpool on demand within the foreachPartition
> function, then hand off the REST calls to that threadpool, get back the
> futures, and wait for them to finish. Should be pretty straightforward. Make
> sure that your foreachPartition function cleans up the threadpool before
> finishing. Alternatively, you can create an on-demand singleton threadpool
> that is reused across batches, which will reduce the cost of creating
> threadpools every time.
>
> On Mon, Oct 5, 2015 at 6:07 PM, Ashish Soni <asoni.le...@gmail.com> wrote:
>
>> Need more details, but you might want to filter the data first (create
>> multiple RDDs) and then process them.
>>
>>
>> > On Oct 5, 2015, at 8:35 PM, Chen Song <chen.song...@gmail.com> wrote:
>> >
>> > We have a use case with the following design in Spark Streaming.
>> >
>> > Within each batch,
>> > * data is read and partitioned by some key
>> > * foreachPartition is used to process the entire partition
>> > * within each partition, there are several REST clients created to
>> connect to different REST services
>> > * for the list of records within each partition, it will call these
>> services; each service call is independent of the others. Records are just
>> pre-partitioned to make these calls more efficient.
>> >
>> > I have a question:
>> > * Since each call is time-consuming, and to prevent the calls from being
>> executed sequentially, how can I parallelize the service calls within the
>> processing of each partition? Can I just use Scala Futures within
>> foreachPartition (or mapPartitions)?
>> >
>> > Any suggestions greatly appreciated.
>> >
>> > Chen
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Chen Song
