Renjie Liu created SPARK-11308:
----------------------------------

             Summary: Change Spark Streaming's job scheduler logic to ensure 
guaranteed order of batch processing
                 Key: SPARK-11308
                 URL: https://issues.apache.org/jira/browse/SPARK-11308
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 1.5.1
            Reporter: Renjie Liu
            Priority: Minor


In the current implementation, Spark Streaming uses a thread pool to run the
jobs generated in each batch interval, and their order is not guaranteed. For
example, if the jobs generated at time 1 take longer than the batch duration,
the jobs for time 2 begin to execute and the finish order is not guaranteed.
This is inconvenient in practice because it may require much more storage. For
example, when we do a word count in Spark Streaming, to be accurate we need to
store the records of each batch rather than just the word counts in the
database. But if the processing order of batches is guaranteed, we only need
to store the last update time alongside each word count in the database to
make updates idempotent. Simply setting the thread pool size to 1 may make the
system inefficient when there is more than one output stream. This feature can
be implemented by giving each output stream its own thread, so that the jobs
of each output stream are executed in order.
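
The proposal above can be sketched roughly as follows. This is not Spark's
scheduler code; it is a minimal Python illustration of the idea, assuming one
single-threaded executor per output stream and a word-count store that keeps
the last update time next to each count. The names PerStreamOrderedScheduler,
apply_batch, run_demo and the stream ids "wordcount" and "audit" are all
hypothetical and chosen only for this example.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class PerStreamOrderedScheduler:
    """Hypothetical scheduler: one single-threaded executor per output stream.

    Jobs of the same stream run in submission order; different streams
    run on different threads and do not block each other.
    """

    def __init__(self) -> None:
        self._executors: Dict[str, ThreadPoolExecutor] = {}
        self._lock = threading.Lock()

    def submit(self, stream_id: str, job: Callable[[], None]) -> Future:
        # Enqueue under the lock so queue order matches call order.
        with self._lock:
            executor = self._executors.get(stream_id)
            if executor is None:
                executor = ThreadPoolExecutor(max_workers=1)
                self._executors[stream_id] = executor
            return executor.submit(job)

    def shutdown(self) -> None:
        with self._lock:
            executors = list(self._executors.values())
        for executor in executors:
            executor.shutdown(wait=True)  # wait for queued jobs to finish


def apply_batch(store: Dict[str, Tuple[int, int]],
                batch_time: int, counts: Dict[str, int]) -> None:
    """Add a batch's counts, skipping words already updated at or after
    batch_time. This guard is only safe if batches are applied in order."""
    for word, n in counts.items():
        total, last_update = store.get(word, (0, 0))
        if batch_time > last_update:
            store[word] = (total + n, batch_time)


def run_demo() -> Tuple[Dict[str, List[int]], Dict[str, Tuple[int, int]]]:
    scheduler = PerStreamOrderedScheduler()
    finish_order: Dict[str, List[int]] = {"wordcount": [], "audit": []}
    store: Dict[str, Tuple[int, int]] = {}

    def wordcount_job(batch_time: int, counts: Dict[str, int], seconds: float):
        def job() -> None:
            time.sleep(seconds)  # simulate processing time
            apply_batch(store, batch_time, counts)
            finish_order["wordcount"].append(batch_time)
        return job

    def audit_job(batch_time: int):
        def job() -> None:
            finish_order["audit"].append(batch_time)
        return job

    # Batch 1 is slower than batch 2; the third entry replays batch 1,
    # as a retry might, and should leave the store unchanged.
    batches = [(1, {"a": 2}, 0.05), (2, {"a": 1, "b": 1}, 0.0), (1, {"a": 2}, 0.0)]
    for batch_time, counts, seconds in batches:
        scheduler.submit("wordcount", wordcount_job(batch_time, counts, seconds))
        scheduler.submit("audit", audit_job(batch_time))
    scheduler.shutdown()
    return finish_order, store
```

In this sketch the slow batch 1 still finishes before batch 2 within the
"wordcount" stream, while the "audit" stream proceeds on its own thread. The
replayed batch 1 is ignored because its time is not newer than the stored
last update time, which is the idempotence argument made above.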



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
