[jira] [Updated] (FLINK-35603) Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink hybrid shuffle integration with Apache Celeborn

Weijie Guo (Jira) Mon, 24 Jun 2024 22:39:30 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Weijie Guo updated FLINK-35603:
-------------------------------
    Description: 
Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533

In 1.20, we introduced a batch job recovery mechanism to enable batch jobs to 
recover as much progress as possible after a JobMaster failover, avoiding the 
need to rerun tasks that have already been finished.

More information about this feature and how to enable it can be found at:
[https://nightlies.apache.org/flink/flink-docs-master/docs/ops/batch/recovery_from_job_master_failure/]

We may need the following tests:
 # Start a batch job with High Availability (HA) enabled. After it has
progressed to a certain point, kill the JobManager (JM) and observe whether
the job recovers its progress normally.
 # Use a custom source whose SplitEnumerator implements the
SupportsBatchSnapshot interface. Submit the job and, after it has progressed
to a certain point, kill the JobManager (JM) and observe whether the job
recovers its progress normally.
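
A minimal configuration sketch for the tests above, assuming a
ZooKeeper-based HA setup. The quorum address and storage directory are
placeholders, not values from this issue, and the option names should be
checked against the linked documentation page before use:

```yaml
# Enable High Availability so that a new JobMaster can take over
# after the JobManager process is killed.
high-availability.type: zookeeper
# Placeholder ZooKeeper quorum; replace with the real address.
high-availability.zookeeper.quorum: localhost:2181
# Placeholder durable storage path for HA metadata.
high-availability.storageDir: file:///tmp/flink-ha

# Run the job in batch mode and turn on batch job recovery,
# which is disabled by default.
execution.runtime-mode: BATCH
execution.batch.job-recovery.enabled: true
```

With these options set, killing the JobManager process partway through the
job should trigger a JobMaster failover, after which the progress of tasks
that had already finished can be compared before and after the restart.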

 

Follow-up test for https://issues.apache.org/jira/browse/FLINK-33892

  was:Follow up the test for https://issues.apache.org/jira/browse/FLINK-35533


> Release Testing Instructions: Verify FLINK-35533(FLIP-459): Support Flink 
> hybrid shuffle integration with Apache Celeborn
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35603
>                 URL: https://issues.apache.org/jira/browse/FLINK-35603
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>            Reporter: Rui Fan
>            Assignee: Yuxin Tan
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.20.0
>
>
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-35533
> In 1.20, we introduced a batch job recovery mechanism to enable batch jobs to 
> recover as much progress as possible after a JobMaster failover, avoiding the 
> need to rerun tasks that have already been finished.
> More information about this feature and how to enable it can be found at:
> [https://nightlies.apache.org/flink/flink-docs-master/docs/ops/batch/recovery_from_job_master_failure/]
> We may need the following tests:
>  # Start a batch job with High Availability (HA) enabled. After it has
> progressed to a certain point, kill the JobManager (JM) and observe whether
> the job recovers its progress normally.
>  # Use a custom source whose SplitEnumerator implements the
> SupportsBatchSnapshot interface. Submit the job and, after it has progressed
> to a certain point, kill the JobManager (JM) and observe whether the job
> recovers its progress normally.
>  
> Follow-up test for https://issues.apache.org/jira/browse/FLINK-33892



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
