Nan Zhu created SPARK-19233:
-------------------------------

             Summary: Inconsistent Behaviour of Spark Streaming Checkpoint
                 Key: SPARK-19233
                 URL: https://issues.apache.org/jira/browse/SPARK-19233
             Project: Spark
          Issue Type: Improvement
          Components: DStreams
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Nan Zhu


When checking one of our application logs, we found the following behavior 
(simplified)

1. Spark application recovers from checkpoint constructed at timestamp 1000ms

2. The log shows that Spark application can recover RDDs generated at timestamp 
2000, 3000

The root cause is that generateJobs event is pushed to the queue by a separate 
thread (RecurTimer), before doCheckpoint event is pushed to the queue, there 
might have been multiple generatedJobs being processed. As a result, when 
doCheckpoint for timestamp 1000 is processed, the generatedRDDs data structure 
containing RDDs generated at 2000, 3000 is serialized as part of checkpoint of 
1000.

It brings overhead for debugging and coordinate our offset management strategy 
with Spark Streaming's checkpoint strategy when we are developing a new type of 
DStream which integrates Spark Streaming with an internal message middleware.

The proposed fix is to filter generatedRDDs according to checkpoint timestamp 
when serializing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to