[ https://issues.apache.org/jira/browse/FLINK-39738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-39738:
-----------------------------------
Labels: pull-request-available (was: )
> React to checkpoint ACK RPC failures
> ------------------------------------
>
> Key: FLINK-39738
> URL: https://issues.apache.org/jira/browse/FLINK-39738
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / RPC
> Affects Versions: 2.3.0
> Reporter: Roman Khachatryan
> Assignee: Roman Khachatryan
> Priority: Major
> Labels: pull-request-available
>
> Flink TaskManagers (TMs) send checkpoint ACK messages with “fire-and-forget”
> semantics. As a result, a failure to deliver an ACK (e.g. a message exceeding
> the Pekko framesize) does not fail the checkpoint; instead, the checkpoint
> expires once the checkpoint timeout elapses, unless the job is restarted for
> some other reason before that.
> This can increase e2e latency and makes such problems difficult to debug and
> alert on.
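
To illustrate the difference described above, here is a minimal Java sketch contrasting a fire-and-forget ACK with one whose send result is observed. It is not Flink code: `AckGateway`, `FailureHandler`, `sizeLimitedGateway`, and `MAX_FRAME_SIZE` are hypothetical names invented for this example, and the real RPC gateway and the eventual fix in the pull request may look quite different.

```java
import java.util.concurrent.CompletableFuture;

final class CheckpointAckSketch {

    /** Hypothetical stand-in for the TM -> JM RPC gateway; not a Flink interface. */
    interface AckGateway {
        CompletableFuture<Void> acknowledgeCheckpoint(long checkpointId, byte[] payload);
    }

    /** Hypothetical hook for reacting to a failed ACK, e.g. by declining the checkpoint. */
    interface FailureHandler {
        void onAckFailure(long checkpointId, Throwable cause);
    }

    /** Illustrative limit only; it stands in for the RPC frame size. */
    static final int MAX_FRAME_SIZE = 16;

    /** A fake gateway that rejects payloads larger than MAX_FRAME_SIZE. */
    static AckGateway sizeLimitedGateway() {
        return (checkpointId, payload) -> {
            if (payload.length > MAX_FRAME_SIZE) {
                return CompletableFuture.failedFuture(
                        new IllegalStateException("ACK payload exceeds frame size"));
            }
            return CompletableFuture.completedFuture(null);
        };
    }

    /** Fire-and-forget: the returned future is dropped, so a failure goes unnoticed. */
    static void fireAndForget(AckGateway gateway, long checkpointId, byte[] payload) {
        gateway.acknowledgeCheckpoint(checkpointId, payload);
    }

    /** Reacting variant: the send result is observed and failures reach the handler. */
    static CompletableFuture<Void> sendAndReact(
            AckGateway gateway, long checkpointId, byte[] payload, FailureHandler handler) {
        return gateway.acknowledgeCheckpoint(checkpointId, payload)
                .whenComplete((ignored, failure) -> {
                    if (failure != null) {
                        handler.onAckFailure(checkpointId, failure);
                    }
                });
    }

    public static void main(String[] args) {
        AckGateway gateway = sizeLimitedGateway();
        byte[] oversized = new byte[64];

        // Nothing is reported here; the checkpoint would only expire on timeout.
        fireAndForget(gateway, 1L, oversized);

        // Here the failure is surfaced right away and could fail the checkpoint early.
        sendAndReact(gateway, 2L, oversized, (id, cause) ->
                System.out.println("ACK for checkpoint " + id + " failed: " + cause.getMessage()));
    }
}
```

In the second variant the failure is visible as soon as the send fails, so a caller could fail or decline the checkpoint at that point and emit a log line or metric to alert on, rather than waiting for the checkpoint timeout.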
--
This message was sent by Atlassian Jira
(v8.20.10#820010)