[ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
---------------------------------
    Description: 
Spark StreamingContext can register multiple independent Input DStreams (such 
as from different Kafka topics) that results in multiple independent jobs for 
each batch. These jobs should better be run concurrently to maximally take 
advantage of available resources. The current behavior is that these jobs end 
up in an invisible job queue to be submitted one by one.
I went through a few hacks:
1.  launch the rdd action into a new thread from the function passed to 
foreachRDD. However, it will mess up with streaming statistics since the batch 
will finish immediately even the jobs it launched are still running in another 
thread. This can further affect resuming from checkpoint, since all batches are 
completed right away even the actual threaded jobs may fail and checkpoint only 
resume the last batch.
2. It's possible by just using foreachRDD and the available APIs to block the 
Jobset to wait for all threads to join, but doing this would mess up with 
closure serialization, and make checkpoint not usable.
3. Instead of running multiple Dstreams in one streaming context, just run them 
in separate streaming context (separate Spark applications). Putting aside the 
extra deployment overhead, when working with Spark standalone cluster which 
only has FIFO scheduler across applications, the resource has to be set in 
advance and it won't automatically adjust with resizing the cluster.

Therefore, I think there is a good use case to make the default behavior just 
run all jobs of the current batch concurrently, and mark batch completion when 
all the jobs complete.

  was:
Spark StreamingContext can register multiple independent Input DStreams (such 
as from different Kafka topics) that results in multiple independent jobs for 
each batch. These jobs should better be run concurrently to maximally take 
advantage of available resources. The current behavior is that these jobs end 
up in an invisible job queue to be submitted one by one.
I went through a few hacks:
1.  launch the rdd action into a new thread from the function passed to 
foreachRDD. However, it will mess up with streaming statistics since the batch 
will finish immediately even the jobs it launched are still running in another 
thread. This can further affect resuming from checkpoint, since all batches are 
completed right away even the actual threaded jobs may fail and checkpoint only 
resume the last batch.
2. It's possible by just using foreachRDD and the available APIs to block the 
Jobset to wait for all threads to join, but doing this would mess up with 
closure serialization, and make checkpoint not usable.
Therefore, I would propose to make the default behavior to just run all jobs of 
the current batch concurrently, and mark batch completion when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> ----------------------------------------------------------------
>
>                 Key: SPARK-11175
>                 URL: https://issues.apache.org/jira/browse/SPARK-11175
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent Input DStreams (such 
> as from different Kafka topics) that results in multiple independent jobs for 
> each batch. These jobs should better be run concurrently to maximally take 
> advantage of available resources. The current behavior is that these jobs end 
> up in an invisible job queue to be submitted one by one.
> I went through a few hacks:
> 1.  launch the rdd action into a new thread from the function passed to 
> foreachRDD. However, it will mess up with streaming statistics since the 
> batch will finish immediately even the jobs it launched are still running in 
> another thread. This can further affect resuming from checkpoint, since all 
> batches are completed right away even the actual threaded jobs may fail and 
> checkpoint only resume the last batch.
> 2. It's possible by just using foreachRDD and the available APIs to block the 
> Jobset to wait for all threads to join, but doing this would mess up with 
> closure serialization, and make checkpoint not usable.
> 3. Instead of running multiple Dstreams in one streaming context, just run 
> them in separate streaming context (separate Spark applications). Putting 
> aside the extra deployment overhead, when working with Spark standalone 
> cluster which only has FIFO scheduler across applications, the resource has 
> to be set in advance and it won't automatically adjust with resizing the 
> cluster.
> Therefore, I think there is a good use case to make the default behavior just 
> run all jobs of the current batch concurrently, and mark batch completion 
> when all the jobs complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to