Nan Zhu created SPARK-19233:
-------------------------------

             Summary: Inconsistent Behaviour of Spark Streaming Checkpoint
                 Key: SPARK-19233
                 URL: https://issues.apache.org/jira/browse/SPARK-19233
             Project: Spark
          Issue Type: Improvement
          Components: DStreams
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Nan Zhu
While checking one of our application logs, we found the following behaviour (simplified):

1. The Spark application recovers from a checkpoint constructed at timestamp 1000 ms.
2. The log shows that the application recovers RDDs generated at timestamps 2000 and 3000.

The root cause is that generateJobs events are pushed to the event queue by a separate thread (RecurringTimer); before the corresponding doCheckpoint event is pushed to the queue, multiple generateJobs events may already have been processed. As a result, when the doCheckpoint event for timestamp 1000 is processed, the generatedRDDs data structure already contains the RDDs generated at 2000 and 3000, and they are serialized as part of the checkpoint for 1000.

This brings overhead when debugging and when coordinating our offset-management strategy with Spark Streaming's checkpoint strategy, as we are developing a new type of DStream that integrates Spark Streaming with an internal message middleware.

The proposed fix is to filter generatedRDDs according to the checkpoint timestamp when serializing it.
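The proposed fix can be sketched as follows. This is an illustrative Python sketch, not Spark's actual Scala code: the function name `filter_generated_rdds` and the dict-based representation of generatedRDDs are assumptions made for the example; the point is simply that a checkpoint for a given timestamp should only capture RDDs generated at or before that timestamp.

```python
def filter_generated_rdds(generated_rdds, checkpoint_time):
    """Keep only entries generated at or before the checkpoint timestamp.

    generated_rdds: dict mapping batch timestamp (ms) -> RDD
                    (any placeholder object in this sketch).
    """
    return {t: rdd for t, rdd in generated_rdds.items() if t <= checkpoint_time}


# With the behaviour from the report: without filtering, a checkpoint taken
# for timestamp 1000 would also serialize the RDDs generated at 2000 and 3000.
generated = {1000: "rdd@1000", 2000: "rdd@2000", 3000: "rdd@3000"}
print(sorted(filter_generated_rdds(generated, 1000)))  # → [1000]
```

Applying the filter at serialization time (rather than mutating generatedRDDs itself) keeps the running job generation unaffected and only changes what is persisted in the checkpoint.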