[jira] [Commented] (FLINK-35669) Release Testing: Verify FLIP-383: Support Job Recovery from JobMaster Failures for Batch Jobs

xingbe (Jira) Fri, 05 Jul 2024 03:46:33 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863200#comment-17863200
 ]


xingbe commented on FLINK-35669:
--------------------------------

I have tested both scenarios as instructed:

1. I started a batch job consisting of multiple stages in high availability 
(HA) mode using ZooKeeper. After some stages had completed, I killed the 
JobManager process. The job resumed execution, and the already completed 
stages were not re-executed.
2. I used a custom FileSource with the SupportsBatchSnapshot interface 
implemented. While the job was running, I killed the JobManager and waited 
for it to recover, and I observed that the job resumed from its earlier 
progress. Note that if no task finishes during this period, a snapshot is 
taken every 3 minutes by default. This interval can be adjusted by setting 
the configuration option `execution.batch.job-recovery.snapshot.min-pause`.
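
For anyone repeating these tests, the options involved can be sketched as 
below. This is only an illustration under stated assumptions, not the exact 
setup used here: the ZooKeeper quorum, the HA storage path, and the "1 min" 
value are placeholders, and `execution.batch.job-recovery.enabled` is taken 
from the feature documentation rather than from this comment. The class and 
method names are hypothetical. The program uses no Flink classes; it only 
prints flat `key: value` lines that could be copied into the Flink 
configuration file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchRecoveryConfigSketch {

    // Collects the options as plain key/value pairs (hypothetical helper).
    static Map<String, String> recoveryOptions(String minPause) {
        Map<String, String> options = new LinkedHashMap<>();
        // Assumption: ZooKeeper HA as in test 1; quorum and path are placeholders.
        options.put("high-availability.type", "zookeeper");
        options.put("high-availability.zookeeper.quorum", "zk-host:2181");
        options.put("high-availability.storageDir", "hdfs:///flink/ha");
        // Assumption: enable switch as described in the feature documentation.
        options.put("execution.batch.job-recovery.enabled", "true");
        // Option named in the comment: minimum pause between snapshots.
        options.put("execution.batch.job-recovery.snapshot.min-pause", minPause);
        return options;
    }

    // Renders the options as flat "key: value" lines, one per option.
    static String render(Map<String, String> options) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "1 min" is an example value chosen to shorten the snapshot pause.
        System.out.print(render(recoveryOptions("1 min")));
    }
}
```

A shorter pause makes it easier to observe source progress being restored 
when the JobManager is killed before any task has finished.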

> Release Testing: Verify FLIP-383: Support Job Recovery from JobMaster 
> Failures for Batch Jobs
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35669
>                 URL: https://issues.apache.org/jira/browse/FLINK-35669
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Junrui Li
>            Assignee: xingbe
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.20.0
>
>
> In 1.20, we introduced a batch job recovery mechanism to enable batch jobs to 
> recover as much progress as possible after a JobMaster failover, avoiding the 
> need to rerun tasks that have already been finished.
> More information about this feature and how to enable it can be found at: 
> [https://nightlies.apache.org/flink/flink-docs-master/docs/ops/batch/recovery_from_job_master_failure/]
> We may need the following tests:
>  # Start a batch job with High Availability (HA) enabled, and after it has 
> progressed to a certain point, kill the JobManager (jm), then observe whether 
> the job recovers its progress normally.
>  # Use a custom source and ensure that its SplitEnumerator implements the 
> SupportsBatchSnapshot interface, submit the job, and after it has progressed 
> to a certain point, kill the JobManager (jm), then observe whether the job 
> recovers its progress normally.
>  
> This is a follow-up to the test for https://issues.apache.org/jira/browse/FLINK-33892



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
