Thanks TD and Ashish.

On Mon, Oct 5, 2015 at 9:14 PM, Tathagata Das <t...@databricks.com> wrote:
> You could create a threadpool on demand within the foreachPartition
> function, then hand off the REST calls to that threadpool, get back the
> futures and wait for them to finish. Should be pretty straightforward. Make
> sure that your foreachPartition function cleans up the threadpool before
> finishing. Alternatively, you can create an on-demand singleton threadpool
> that is reused across batches, which will reduce the cost of creating
> threadpools every time.
>
> On Mon, Oct 5, 2015 at 6:07 PM, Ashish Soni <asoni.le...@gmail.com> wrote:
>
>> Need more details, but you might want to filter the data first (create
>> multiple RDDs) and then process.
>>
>> > On Oct 5, 2015, at 8:35 PM, Chen Song <chen.song...@gmail.com> wrote:
>> >
>> > We have a use case with the following design in Spark Streaming.
>> >
>> > Within each batch,
>> > * data is read and partitioned by some key
>> > * foreachPartition is used to process the entire partition
>> > * within each partition, several REST clients are created to connect
>> >   to different REST services
>> > * for the list of records within each partition, it will call these
>> >   services; each service call is independent of the others, and records
>> >   are just pre-partitioned to make these calls more efficient.
>> >
>> > I have a question
>> > * Since each call is time-consuming, and to prevent the calls from being
>> >   executed sequentially, how can I parallelize the service calls within
>> >   the processing of each partition? Can I just use Scala Futures within
>> >   foreachPartition (or mapPartitions)?
>> >
>> > Any suggestions greatly appreciated.
>> >
>> > Chen

--
Chen Song
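
A minimal sketch of the on-demand threadpool approach TD describes, written in plain Scala so it runs without a Spark cluster. The names `PartitionCalls`, `processPartition`, `ServiceCall`, `poolSize` and `timeout` are hypothetical, and each REST client is stood in for by a simple `String => String` function; none of this comes from the thread itself.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PartitionCalls {
  // Hypothetical stand-in for a REST client: takes a record, returns a response.
  type ServiceCall = String => String

  // Intended to be called from inside foreachPartition (or mapPartitions).
  // Assumption: the partition is small enough to buffer in memory.
  def processPartition(
      records: Iterator[String],
      services: Seq[ServiceCall],
      poolSize: Int = 8,
      timeout: Duration = 5.minutes): Seq[String] = {
    // Threadpool created on demand for this partition only.
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Hand off one independent call per (record, service) pair.
      val futures = for {
        record <- records.toVector
        call   <- services
      } yield Future(call(record))
      // Block until every call has finished (or the timeout expires).
      Await.result(Future.sequence(futures), timeout)
    } finally {
      // Clean up the threadpool before the partition function returns.
      pool.shutdown()
    }
  }
}
```

Inside a streaming job this would presumably be invoked as `rdd.foreachPartition { records => PartitionCalls.processPartition(records, services) }`, with the REST clients built inside that closure so they are not shipped from the driver. The sketch buffers the whole partition with `toVector`; for large partitions, processing `records.grouped(n)` in chunks would bound memory. For the singleton alternative TD mentions, the pool could live in a lazily initialized `object` field and the `shutdown()` call would be dropped.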