Nan Zhu created SPARK-18905:
-------------------------------

             Summary: Potential Issue of Semantics of BatchCompleted
                 Key: SPARK-18905
                 URL: https://issues.apache.org/jira/browse/SPARK-18905
             Project: Spark
          Issue Type: Bug
          Components: DStreams
    Affects Versions: 2.0.2, 2.0.1, 2.0.0
            Reporter: Nan Zhu


The current implementation of Spark Streaming considers a batch completed regardless of whether its jobs succeeded or failed
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
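
For reference, a paraphrased and abridged sketch of the logic at that line (not verbatim; see the linked file for the exact code):

    private def handleJobCompletion(job: Job, completedTime: Long) {
      val jobSet = jobSets.get(job.time)
      jobSet.handleJobCompletion(job)
      // (job end time and per-job listener notifications omitted)
      if (jobSet.hasCompleted) {
        // the JobSet is removed and the batch is reported as completed based
        // only on all jobs having *finished*, not on all jobs having *succeeded*
        jobSets.remove(jobSet.time)
        jobGenerator.onBatchCompletion(jobSet.time)
        listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
      }
      // the failure of an individual job is only handled afterwards
      job.result match {
        case Failure(e) => reportError("Error running job " + job, e)
        case _ =>
      }
    }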

Let's consider the following case:

A micro batch contains two jobs that read from two different Kafka topics. 
One of these jobs fails due to some problem in the user-defined logic.
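
For concreteness, a minimal sketch of such an application (topic names, the checkpoint path, and the Kafka parameters are hypothetical; this uses the spark-streaming-kafka-0-10 API):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    object TwoTopicExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("two-topic-example")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/example-checkpoint")  // hypothetical checkpoint dir

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "example")

        val streamA = KafkaUtils.createDirectStream[String, String](
          ssc, LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("topicA"), kafkaParams))
        val streamB = KafkaUtils.createDirectStream[String, String](
          ssc, LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("topicB"), kafkaParams))

        // Each output operation becomes one job of the micro batch.
        streamA.foreachRDD { rdd => rdd.count() }          // job 1: succeeds
        streamB.foreachRDD { rdd =>                        // job 2: fails
          rdd.foreach { _ => throw new RuntimeException("problem in user-defined logic") }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }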

1. The main thread in the Spark Streaming application executes the line 
mentioned above, removing the JobSet even though one of its jobs failed,

2. and another thread (the checkpoint writer) makes a checkpoint file 
immediately after this line is executed.

3. Then, due to the current error-handling mechanism in Spark Streaming, the 
StreamingContext will be closed 
(https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)

4. The user recovers from the checkpoint file. Because the JobSet containing 
the failed job was removed (treated as completed) before the checkpoint was 
constructed, the data being processed by the failed job would never be 
reprocessed.
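
If I read the checkpointing code correctly, the reason is that a checkpoint only records the batch times still tracked by the scheduler (paraphrased, abridged):

    // from streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala (abridged)
    class Checkpoint(ssc: StreamingContext, val checkpointTime: Time) extends Serializable {
      // getPendingTimes() returns the keys of JobScheduler.jobSets; a JobSet
      // removed by handleJobCompletion, even if one of its jobs failed, is no
      // longer pending, so recovery will not regenerate that batch
      val pendingTimes = ssc.scheduler.getPendingTimes().toArray
      // (graph, configuration, and other checkpointed fields omitted)
    }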


I might have missed something in the checkpoint thread or in 
handleJobCompletion()... or this is a potential bug.


