[ 
https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renjie Liu updated SPARK-11308:
-------------------------------
    Priority: Major  (was: Minor)

> Change Spark Streaming's job scheduler logic to ensure guaranteed order of 
> batch processing
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11308
>                 URL: https://issues.apache.org/jira/browse/SPARK-11308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.5.1
>            Reporter: Renjie Liu
>
> In the current implementation, Spark Streaming runs the jobs generated for 
> each batch interval on a thread pool, and their order is not guaranteed. For 
> example, if the jobs generated at time 1 take longer than the batch duration, 
> the jobs for time 2 begin to execute anyway, and either batch may finish 
> first. This behavior is a problem in practice because it can require much 
> more storage. For example, to make a streaming word count both accurate and 
> idempotent, we need to store the records of every batch in the database 
> rather than just the word counts. If the processing order of batches were 
> guaranteed, storing the last update time alongside each word count would be 
> enough for idempotence. Simply setting the thread pool size to 1 may make the 
> system inefficient when there is more than one output stream. This feature 
> could be implemented by giving each output stream its own thread, so that the 
> jobs of each output stream are executed in order.
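
The proposal in the description (one thread per output stream, with the jobs of each stream run in order) can be sketched roughly as follows. This is only an illustration in plain Python, not Spark's scheduler code: the class `PerStreamScheduler`, the function `run_demo`, and the stream names `wordcount` and `alerts` are all hypothetical, and the batch-1 delay is artificial. It relies on the fact that a `ThreadPoolExecutor` with a single worker drains its queue in FIFO order.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


class PerStreamScheduler:
    """Hypothetical scheduler: one single-threaded executor per output stream."""

    def __init__(self, stream_ids):
        # One worker per stream: streams run in parallel with each other,
        # while jobs within a stream are serialized.
        self._executors = {
            sid: ThreadPoolExecutor(max_workers=1) for sid in stream_ids
        }

    def submit(self, stream_id, batch_time, job):
        # A single worker takes jobs from its queue in FIFO order, so the
        # batches of one stream run strictly in submission order.
        return self._executors[stream_id].submit(job, stream_id, batch_time)

    def shutdown(self):
        # Wait for every queued job to finish before returning.
        for executor in self._executors.values():
            executor.shutdown(wait=True)


def run_demo():
    completed = []
    lock = threading.Lock()

    def job(stream_id, batch_time):
        # Batch 1 is deliberately the slowest. In a shared pool with
        # several workers it could finish after batches 2 and 3.
        time.sleep(0.05 if batch_time == 1 else 0.0)
        with lock:
            completed.append((stream_id, batch_time))

    scheduler = PerStreamScheduler(["wordcount", "alerts"])
    for batch_time in (1, 2, 3):
        for stream_id in ("wordcount", "alerts"):
            scheduler.submit(stream_id, batch_time, job)
    scheduler.shutdown()
    return completed


if __name__ == "__main__":
    print(run_demo())
```

Within each stream the batches complete in the order 1, 2, 3 even though batch 1 is the slowest, while the interleaving between the two streams is left unspecified. That per-stream guarantee is what would let a sink store only a last update time next to each word count.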



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

