[ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang closed SPARK-11175.
--------------------------------
    Resolution: Not A Problem

> Concurrent execution of JobSet within a batch in Spark streaming
> ----------------------------------------------------------------
>
>                 Key: SPARK-11175
>                 URL: https://issues.apache.org/jira/browse/SPARK-11175
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent Input DStreams (such 
> as from different Kafka topics) that results in multiple independent jobs for 
> each batch. These jobs should better be run concurrently to maximally take 
> advantage of available resources. The current behavior is that these jobs end 
> up in an invisible job queue to be submitted one by one.
> I went through a few hacks:
> 1.  launch the rdd action into a new thread from the function passed to 
> foreachRDD. However, it will mess up with streaming statistics since the 
> batch will finish immediately even the jobs it launched are still running in 
> another thread. This can further affect resuming from checkpoint, since all 
> batches are completed right away even the actual threaded jobs may fail and 
> checkpoint only resume the last batch.
> 2. It's possible by just using foreachRDD and the available APIs to block the 
> Jobset to wait for all threads to join, but doing this would mess up with 
> closure serialization, and make checkpoint not usable.
> 3. Instead of running multiple Dstreams in one streaming context, just run 
> them in separate streaming context (separate Spark applications). Putting 
> aside the extra deployment overhead, when working with Spark standalone 
> cluster which only has FIFO scheduler across applications, the resource has 
> to be set in advance and it won't automatically adjust with resizing the 
> cluster.
> Therefore, I think there is a good use case to make the default behavior just 
> run all jobs of the current batch concurrently, and mark batch completion 
> when all the jobs complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to