[ https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Renjie Liu updated SPARK-11308:
-------------------------------
    Priority: Major  (was: Minor)

> Change Spark Streaming's job scheduler logic to ensure guaranteed order of
> batch processing
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11308
>                 URL: https://issues.apache.org/jira/browse/SPARK-11308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.5.1
>            Reporter: Renjie Liu
>
> In the current implementation, Spark Streaming uses a thread pool to run the
> jobs generated in each batch interval, and ordering is not guaranteed: if the
> jobs generated at time 1 take longer than the batch duration, the jobs for
> time 2 begin executing anyway, and the order in which they finish is not
> guaranteed. This is inconvenient in practice because it can require much more
> storage. For example, for a word count in Spark Streaming to be accurate and
> idempotent, we need to store the records of every batch in the database
> rather than just the word counts. If the processing order of batches were
> guaranteed, we would only need to store the last update time alongside each
> word count to be idempotent.
>
> Simply setting the thread pool size to 1 may make the system inefficient when
> there is more than one output stream. This feature can be implemented by
> giving each output stream its own thread, so that the jobs of each output
> stream are executed in order.
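
The per-output-stream idea at the end of the description could be sketched roughly as follows. This is only an illustration in plain Java, not Spark's JobScheduler: the class name PerStreamOrderedScheduler, its submit and shutdownAndAwait methods, and the string stream ids are all hypothetical. Each output stream gets its own single-threaded executor, so that stream's jobs run in submission (batch) order, while different output streams can still run concurrently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one worker thread per output stream.
// Not part of Spark; names and structure are assumptions for illustration.
class PerStreamOrderedScheduler {
    // Maps an output stream id to its dedicated single-threaded executor.
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    // Queue a job for the given output stream. Jobs submitted for the same
    // stream run one at a time, in the order they were submitted.
    void submit(String outputStreamId, Runnable job) {
        executors
            .computeIfAbsent(outputStreamId, id -> Executors.newSingleThreadExecutor())
            .submit(job);
    }

    // Stop accepting new jobs and wait for the queued ones to finish.
    // Returns true only if every executor terminated within the timeout.
    boolean shutdownAndAwait(long timeoutSeconds) {
        for (ExecutorService executor : executors.values()) {
            executor.shutdown();
        }
        boolean allDone = true;
        try {
            for (ExecutorService executor : executors.values()) {
                allDone &= executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
        return allDone;
    }
}
```

With ordering guaranteed per output stream, a slow batch 1 delays batch 2 of the same stream instead of racing with it, which is what would let a sink keep only the last update time next to each word count. The cost is that a slow output stream builds up its own backlog, so this trades latency on that stream for ordering.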