Roman Khachatryan created FLINK-39738:
-----------------------------------------
Summary: React to checkpoint ACK RPC failures
Key: FLINK-39738
URL: https://issues.apache.org/jira/browse/FLINK-39738
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing, Runtime / RPC
Affects Versions: 2.3.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
Flink TaskManagers (TMs) send checkpoint ACK messages with “fire-and-forget” semantics.
As a result, a failure to deliver an ACK (e.g. exceeding the Pekko framesize) does not
fail the checkpoint; instead, the checkpoint expires once the checkpoint timeout elapses,
unless the job is restarted for some other reason before that.
This can increase e2e latency and makes such problems difficult to debug and
alert on.
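
To illustrate the difference, here is a minimal sketch in plain Java, not Flink code. The names AckGateway, FailureHandler, fireAndForget and ackWithFailureHandling are hypothetical and exist only for this example; the ticket does not say how the change will actually be implemented. The sketch only assumes that the send returns a future that completes exceptionally when delivery fails.

```java
import java.util.concurrent.CompletableFuture;

public class CheckpointAckSketch {

    /** Hypothetical stand-in for the RPC gateway that receives checkpoint ACKs. */
    public interface AckGateway {
        CompletableFuture<Void> sendAck(long checkpointId);
    }

    /** Hypothetical callback invoked when an ACK could not be delivered. */
    public interface FailureHandler {
        void onAckFailure(long checkpointId, Throwable cause);
    }

    /** Fire-and-forget: the returned future is dropped, so a send failure goes unnoticed. */
    public static void fireAndForget(AckGateway gateway, long checkpointId) {
        gateway.sendAck(checkpointId);
    }

    /** Reacting variant: a failed send is reported to the handler right away. */
    public static CompletableFuture<Void> ackWithFailureHandling(
            AckGateway gateway, long checkpointId, FailureHandler handler) {
        return gateway.sendAck(checkpointId).whenComplete((ignored, error) -> {
            if (error != null) {
                // Assumption: the handler would decline or fail the checkpoint here
                // instead of leaving it to expire on the checkpoint timeout.
                handler.onAckFailure(checkpointId, error);
            }
        });
    }

    public static void main(String[] args) {
        // A gateway whose send always fails, imitating an oversized message.
        AckGateway failing = id -> {
            CompletableFuture<Void> result = new CompletableFuture<>();
            result.completeExceptionally(new IllegalStateException("message exceeds frame size"));
            return result;
        };

        fireAndForget(failing, 1L); // the failure is silently lost
        ackWithFailureHandling(failing, 2L, (id, cause) ->
                System.err.println("ACK for checkpoint " + id + " failed: " + cause.getMessage()));
    }
}
```

With the second variant the failure surfaces as soon as the send fails, which is also a natural place to log it or raise a metric for alerting.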