[ 
https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renjie Liu updated SPARK-11308:
-------------------------------
    Description: In the current implementation, Spark Streaming uses a thread 
pool to run the jobs generated in each batch interval, and the execution order 
is not guaranteed: if the jobs generated at time 1 take longer than the batch 
duration, the jobs for time 2 start executing anyway, so the finish order is 
not guaranteed. This is inconvenient in practice because it can require much 
more storage. For example, to make a streaming word count accurate and 
idempotent, we need to store the records of each batch in the database rather 
than just the word counts. If the processing order of batches were guaranteed, 
storing the last update time alongside each word count would be enough to make 
the update idempotent. Simply setting the thread pool size to 1 can make the 
system inefficient when there is more than one output stream. This feature 
could be implemented by giving each output stream its own thread, so that the 
jobs of each output stream are executed in order.  (was: In current implementation, 
spark streaming uses a thread pool to run jobs generated in each time interval 
and orders are not guaranteed, i.e., if jobs generated in time 1 takes time 
longer than the batch duration, jobs 2 will begin to execute and the finish 
order is not guaranteed. This implementation is not quite useful in practice 
since it may cost much more storage. For example, when we do a word count in 
spark streaming, to be accurate we need to store records for each batch rather 
than just word count in database. But if the processing order of each batch is 
guaranteed, we just need to store the last update time with word count in 
database to be idempotent. Just simply set the thread pool size to 1 may cause 
the system to be inefficient when there are more than one output streams.  This 
feature can be implemented by giving each output stream a thread and jobs of 
each output stream are executed in order.)
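
The last-update-time idea can be sketched roughly as follows. This is a minimal Python illustration, not Spark code: `WordCountStore` and `apply_batch` are hypothetical names, and an in-memory dict stands in for the database. It assumes batches reach the store in increasing time order, which is the guarantee requested in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class WordCountStore:
    """Hypothetical in-memory stand-in for the database in the description."""

    # word -> (running count, time of the last batch applied to this word)
    counts: dict = field(default_factory=dict)

    def apply_batch(self, batch_time, batch_counts):
        """Add one batch's counts, skipping words already updated at or after batch_time."""
        for word, n in batch_counts.items():
            count, last_time = self.counts.get(word, (0, None))
            if last_time is not None and batch_time <= last_time:
                # This batch (or a later one) was already applied: ignore the replay.
                continue
            self.counts[word] = (count + n, batch_time)


# Usage: batches arrive in order, and batch 2 is replayed after a retry.
store = WordCountStore()
store.apply_batch(1, {"a": 2, "b": 1})
store.apply_batch(2, {"a": 3})
store.apply_batch(2, {"a": 3})  # replay, has no effect
```

Replaying batch 2 leaves the counts unchanged. Applying batch 1 after batch 2 would instead drop batch 1 for every word the two batches share, which is why this scheme depends on ordered processing.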

> Change spark streaming's job scheduler logic to ensure guaranteed order of 
> batch processing
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11308
>                 URL: https://issues.apache.org/jira/browse/SPARK-11308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.5.1
>            Reporter: Renjie Liu
>            Priority: Minor
>
> In the current implementation, Spark Streaming uses a thread pool to run the 
> jobs generated in each batch interval, and the execution order is not 
> guaranteed: if the jobs generated at time 1 take longer than the batch 
> duration, the jobs for time 2 start executing anyway, so the finish order is 
> not guaranteed. This is inconvenient in practice because it can require much 
> more storage. For example, to make a streaming word count accurate and 
> idempotent, we need to store the records of each batch in the database rather 
> than just the word counts. If the processing order of batches were 
> guaranteed, storing the last update time alongside each word count would be 
> enough to make the update idempotent. Simply setting the thread pool size to 
> 1 can make the system inefficient when there is more than one output stream. 
> This feature could be implemented by giving each output stream its own 
> thread, so that the jobs of each output stream are executed in order.
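
The proposed per-output-stream ordering could look roughly like the sketch below. It is plain Python, not Spark's job scheduler: `OrderedJobScheduler`, the stream names, and the sleep-based jobs are all hypothetical. Each output stream gets its own single-worker executor, so its jobs run in submission order while different streams can still run concurrently.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class OrderedJobScheduler:
    """Hypothetical sketch: one single-threaded executor per output stream."""

    def __init__(self):
        self._executors = {}
        self._lock = threading.Lock()

    def submit(self, stream_id, job, *args):
        """Queue a job behind all earlier jobs of the same output stream."""
        with self._lock:
            executor = self._executors.get(stream_id)
            if executor is None:
                # One worker per stream: its jobs run one at a time, in FIFO order.
                executor = ThreadPoolExecutor(max_workers=1)
                self._executors[stream_id] = executor
        return executor.submit(job, *args)

    def shutdown(self):
        """Wait for all queued jobs of every stream to finish."""
        for executor in self._executors.values():
            executor.shutdown(wait=True)


def run_demo():
    """Submit a slow batch 1 and a fast batch 2 for two streams; return finish order."""
    finished = []
    finished_lock = threading.Lock()

    def job(stream_id, batch_time, duration):
        time.sleep(duration)  # simulate processing time
        with finished_lock:
            finished.append((stream_id, batch_time))

    scheduler = OrderedJobScheduler()
    for stream_id in ("counts", "alerts"):
        scheduler.submit(stream_id, job, stream_id, 1, 0.05)  # slow batch
        scheduler.submit(stream_id, job, stream_id, 2, 0.0)   # fast batch
    scheduler.shutdown()
    return finished
```

With a shared pool of two or more threads, the fast batch 2 could finish before the slow batch 1. Here batch 2 of a stream cannot start until batch 1 of that stream has finished, and a slow stream does not block the others.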



