Yongjia Wang created SPARK-11175:
------------------------------------

             Summary: Concurrent execution of JobSet within a batch in Spark 
Streaming
                 Key: SPARK-11175
                 URL: https://issues.apache.org/jira/browse/SPARK-11175
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
            Reporter: Yongjia Wang


A Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should run concurrently to take full advantage 
of the available resources.
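For concreteness, a minimal sketch of this setup, assuming the Kafka direct API 
of this Spark era; the broker address, topic names, and output actions are all 
placeholders:

{code:scala}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("multi-topic"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder

// Two independent input DStreams, one per Kafka topic.
val topicA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicA"))
val topicB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicB"))

// Each output action becomes an independent job in every batch's JobSet;
// today the JobScheduler runs those jobs one after another.
topicA.foreachRDD(rdd => println(s"topicA count: ${rdd.count()}"))
topicB.foreachRDD(rdd => println(s"topicB count: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
{code}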
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD to achieve concurrency (sketched below, after this list). However, 
this corrupts the streaming statistics, since the batch finishes immediately 
even though the jobs it launched are still running in other threads. This in 
turn affects recovery from a checkpoint: all batches are marked complete right 
away even though the threaded jobs may fail, and the checkpoint only resumes 
the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet until all the threads have joined, but doing so interferes with closure 
serialization and makes checkpointing unusable.
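Roughly, the two hacks look like this (all names, such as doWork and pending, 
are illustrative; doWork stands in for the real output action):

{code:scala}
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object BatchConcurrencyHacks {
  def doWork(rdd: RDD[String]): Unit = rdd.foreach(_ => ())

  // Hack 1: fire-and-forget. foreachRDD returns immediately, so the batch
  // is reported complete while the job may still be running, which skews
  // the streaming statistics and checkpoint recovery.
  def fireAndForget(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      new Thread(new Runnable {
        override def run(): Unit = doWork(rdd)
      }).start()
    }

  // Hack 2: track the threads and join them from the last job of the
  // JobSet, so the batch only completes when every job does. The shared
  // queue is captured by the foreachRDD closures, which is what breaks
  // closure serialization and makes checkpointing unusable.
  val pending = new ConcurrentLinkedQueue[Thread]()

  def launchTracked(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      val t = new Thread(new Runnable {
        override def run(): Unit = doWork(rdd)
      })
      pending.add(t)
      t.start()
    }

  def barrier(lastStream: DStream[String]): Unit =
    lastStream.foreachRDD { _ =>
      var t = pending.poll()
      while (t != null) { t.join(); t = pending.poll() }
    }
}
{code}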
Therefore, I propose making the default behavior to run all jobs of the 
current batch concurrently, and to mark the batch complete only when all of 
its jobs have completed.
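For reference, a partial workaround exists today, assuming I am reading 
JobScheduler correctly: the undocumented spark.streaming.concurrentJobs 
setting sizes the job-executor thread pool (default 1), so raising it already 
lets jobs run in parallel, but without the completion semantics proposed 
above, since jobs from different batches can also overlap:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Undocumented knob: the number of threads in the JobScheduler's job
// executor. With 4, up to 4 jobs run concurrently, but jobs from different
// batches may also overlap, unlike the proposal above.
val conf = new SparkConf()
  .setAppName("multi-topic")
  .set("spark.streaming.concurrentJobs", "4")
val ssc = new StreamingContext(conf, Seconds(10))
{code}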


