[flink] branch release-1.13 updated: [FLINK-24155][documentation] Sync Chinese version of documentation to configure CheckpointFailureManager (#17173) (#17257)

yuanmei Mon, 13 Sep 2021 20:12:02 -0700

This is an automated email from the ASF dual-hosted git repository.

yuanmei pushed a commit to branch release-1.13
in repository https://gitbox.apache.org/repos/asf/flink.git



The following commit(s) were added to refs/heads/release-1.13 by this push:
     new 36a33e6  [FLINK-24155][documentation] Sync Chinese version of documentation to configure CheckpointFailureManager (#17173) (#17257)
36a33e6 is described below

commit 36a33e6fefc2c2a361bff755d7f31339a5a784d0
Author: Yuan Mei <yuanmei.w...@gmail.com>
AuthorDate: Tue Sep 14 11:10:53 2021 +0800

    [FLINK-24155][documentation] Sync Chinese version of documentation to configure CheckpointFailureManager (#17173) (#17257)
---
 .../datastream/fault-tolerance/checkpointing.md    | 35 +++++++++++++++-------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/docs/content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md b/docs/content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md
index e7e53e6..e687696 100644
--- a/docs/content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md
+++ b/docs/content.zh/docs/dev/datastream/fault-tolerance/checkpointing.md
@@ -59,6 +59,9 @@ Checkpoint 其他的属性包括：
     
     注意这个值也意味着并发 checkpoint 的数目是*一*。
 
+  - *checkpoint 可容忍连续失败次数*：该属性定义可容忍多少次连续的 checkpoint 失败。超过这个阈值之后会触发作业错误 fail over。
+    默认次数为“0”，这意味着不容忍 checkpoint 失败，作业将在第一次 checkpoint 失败时fail over。
+    
  - *并发 checkpoint 的数目*: 默认情况下，在上一个 checkpoint 未完成（失败或者成功）的情况下，系统不会触发另一个 checkpoint。这确保了拓扑不会在 checkpoint 上花费太多时间，从而影响正常的处理流程。
     不过允许多个 checkpoint 并行进行是可行的，对于有确定的处理延迟（例如某方法所调用比较耗时的外部服务），但是仍然想进行频繁的 checkpoint 去最小化故障后重跑的 pipelines 来说，是有意义的。
     
@@ -66,9 +69,6 @@ Checkpoint 其他的属性包括：
     
  - *externalized checkpoints*: 你可以配置周期存储 checkpoint 到外部系统中。Externalized checkpoints 将他们的元数据写到持久化存储上并且在 job 失败的时候*不会*被自动删除。
     这种方式下，如果你的 job 失败，你将会有一个现有的 checkpoint 去恢复。更多的细节请看 [Externalized checkpoints 的部署文档]({{< ref "docs/ops/state/checkpoints" >}}#externalized-checkpoints)。
-  
-  - *在 checkpoint 出错时使 task 失败或者继续进行 task*：他决定了在 task checkpoint 的过程中发生错误时，是否使 task 也失败，使失败是默认的行为。
-     或者禁用它时，这个任务将会简单的把 checkpoint 错误信息报告给 checkpoint coordinator 并继续运行。
      
  - *优先从 checkpoint 恢复（prefer checkpoint for recovery）*：该属性确定 job 是否在最新的 checkpoint 回退，即使有更近的 savepoint 可用，这可以潜在地减少恢复时间（checkpoint 恢复比 savepoint 恢复更快）。
 
@@ -91,11 +91,15 @@ env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
 // Checkpoint 必须在一分钟内完成，否则就会被抛弃
 env.getCheckpointConfig().setCheckpointTimeout(60000);
 
+// 允许两个连续的 checkpoint 错误
+env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2)
+        
 // 同一时间只允许一个 checkpoint 进行
 env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
 
-// 开启在 job 中止后仍然保留的 externalized checkpoints
-env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
+// 使用 externalized checkpoints，这样 checkpoint 在作业取消后仍就会被保留
+env.getCheckpointConfig().enableExternalizedCheckpoints(
+        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
 
 // 开启实验性的 unaligned checkpoints
 env.getCheckpointConfig().enableUnalignedCheckpoints();
@@ -119,12 +123,16 @@ env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
 // Checkpoint 必须在一分钟内完成，否则就会被抛弃
 env.getCheckpointConfig.setCheckpointTimeout(60000)
 
-// 如果 task 的 checkpoint 发生错误，会阻止 task 失败，checkpoint 仅仅会被抛弃
-env.getCheckpointConfig.setFailTasksOnCheckpointingErrors(false)
+// 允许两个连续的 checkpoint 错误
+env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2)
 
 // 同一时间只允许一个 checkpoint 进行
 env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
 
+// 使用 externalized checkpoints，这样 checkpoint 在作业取消后仍就会被保留
+env.getCheckpointConfig().enableExternalizedCheckpoints(
+  ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
+
 // 开启实验性的 unaligned checkpoints
 env.getCheckpointConfig.enableUnalignedCheckpoints()
 ```
@@ -147,14 +155,21 @@ env.get_checkpoint_config().set_min_pause_between_checkpoints(500)
 # Checkpoint 必须在一分钟内完成，否则就会被抛弃
 env.get_checkpoint_config().set_checkpoint_timeout(60000)
 
+# 允许两个连续的 checkpoint 错误
+env.get_checkpoint_config().set_tolerable_checkpoint_failure_number(2)
+
 # 同一时间只允许一个 checkpoint 进行
 env.get_checkpoint_config().set_max_concurrent_checkpoints(1)
 
-# 开启在 job 中止后仍然保留的 externalized checkpoints
-env.get_checkpoint_config().enable_externalized_checkpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
-
 # 允许在有更近 savepoint 时回退到 checkpoint
 env.get_checkpoint_config().set_prefer_checkpoint_for_recovery(True)
+
+# 使用 externalized checkpoints，这样 checkpoint 在作业取消后仍就会被保留
+env.get_checkpoint_config().enable_externalized_checkpoints(
+    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
+    
+# 开启实验性的 unaligned checkpoints
+env.get_checkpoint_config().enable_unaligned_checkpoints()
 ```
 {{< /tab >}}
 {{< /tabs >}}
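
Note on the semantics documented by this change: the new bullet describes a threshold on *consecutive* checkpoint failures, with a default of 0 meaning the job fails over on the first failed checkpoint. The following is a minimal plain-Python sketch of that rule as worded in the diff. It is a toy model and not Flink's implementation; `ConsecutiveFailureTracker` and `first_failover_index` are hypothetical names, and resetting the counter on a successful checkpoint is an assumption drawn from the word "consecutive".

```python
class ConsecutiveFailureTracker:
    """Toy model of a 'tolerable consecutive checkpoint failures' threshold.

    Hypothetical helper for illustration only; not part of Flink or PyFlink.
    """

    def __init__(self, tolerable_failures=0):
        if tolerable_failures < 0:
            raise ValueError("tolerable_failures must be >= 0")
        self.tolerable_failures = tolerable_failures
        self.consecutive_failures = 0

    def on_checkpoint_success(self):
        # Assumption: a completed checkpoint resets the consecutive count.
        self.consecutive_failures = 0

    def on_checkpoint_failure(self):
        """Record a failure; return True once the threshold is exceeded."""
        self.consecutive_failures += 1
        return self.consecutive_failures > self.tolerable_failures


def first_failover_index(outcomes, tolerable_failures=0):
    """Return the index of the checkpoint that triggers failover, or None.

    `outcomes` is a sequence of booleans: True for a completed checkpoint,
    False for a failed one.
    """
    tracker = ConsecutiveFailureTracker(tolerable_failures)
    for index, succeeded in enumerate(outcomes):
        if succeeded:
            tracker.on_checkpoint_success()
        elif tracker.on_checkpoint_failure():
            return index
    return None
```

With the value 2 used in the examples above, two failures in a row are tolerated and only a third consecutive failure triggers failover; with the default 0, the first failure does.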
