Renjie Liu created SPARK-11308:
----------------------------------

Summary: Change spark streaming's job scheduler logic to ensure guaranteed order of batch processing
Key: SPARK-11308
URL: https://issues.apache.org/jira/browse/SPARK-11308
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.5.1
Reporter: Renjie Liu
Priority: Minor

In the current implementation, Spark Streaming uses a thread pool to run the jobs generated in each batch interval, and the order of execution is not guaranteed. That is, if the jobs generated at time 1 take longer than the batch duration, the jobs for time 2 will begin to execute, and the order in which they finish is not guaranteed.

This behavior is a problem in practice because it can require much more storage. For example, for a word count in Spark Streaming to be accurate, we need to store the records of each batch in the database rather than just the word counts. But if the processing order of batches were guaranteed, we would only need to store the last update time alongside each word count for the update to be idempotent.

Simply setting the thread pool size to 1 may make the system inefficient when there is more than one output stream. This feature could instead be implemented by giving each output stream its own thread, so that the jobs of each output stream are executed in order.
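
The proposed scheduling could be sketched roughly as follows. This is only an illustration in plain Python, not Spark's actual job scheduler: the class `PerStreamJobScheduler`, the function `run_demo`, and the stream names "a" and "b" are all hypothetical. It assumes that one single-threaded executor per output stream is enough to keep each stream's batches in order while different streams still run concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class PerStreamJobScheduler:
    """Hypothetical sketch: one single-threaded executor per output stream."""

    def __init__(self, stream_ids):
        # A pool with max_workers=1 runs its jobs one at a time,
        # in the order they were submitted.
        self._executors = {
            sid: ThreadPoolExecutor(max_workers=1) for sid in stream_ids
        }

    def submit(self, stream_id, job, *args):
        # Jobs of the same output stream are serialized on that stream's
        # thread; jobs of different streams can overlap.
        return self._executors[stream_id].submit(job, *args)

    def shutdown(self):
        # Wait for every queued job to finish before returning.
        for executor in self._executors.values():
            executor.shutdown(wait=True)


def run_demo():
    finished = {"a": [], "b": []}

    def job(stream_id, batch_time, cost_seconds):
        time.sleep(cost_seconds)  # simulated processing time
        finished[stream_id].append(batch_time)

    scheduler = PerStreamJobScheduler(["a", "b"])
    # Batch 1 is deliberately the slowest, mimicking a batch that
    # overruns the batch duration.
    for batch_time, cost_seconds in [(1, 0.05), (2, 0.0), (3, 0.01)]:
        for sid in ("a", "b"):
            scheduler.submit(sid, job, sid, batch_time, cost_seconds)
    scheduler.shutdown()
    return finished
```

Calling `run_demo()` should return each stream's batch times in submission order even though batch 1 takes the longest, which is the property that would let a sink keep only a last update time per word count.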