Lorenzo Nicora created FLINK-36319:
--------------------------------------
Summary: FAIL behavior on non-retriable write errors causes an
infinite loop when restarting from checkpoint
Key: FLINK-36319
URL: https://issues.apache.org/jira/browse/FLINK-36319
Project: Flink
Issue Type: Sub-task
Reporter: Lorenzo Nicora
The {{FAIL}} (default) error handling behavior for write requests rejected
as non-retriable ({{onPrometheusNonRetriableError}}) causes the job to fail
and restart.
Restarting from a checkpoint causes some out-of-order (duplicate) writes, which
by default Prometheus rejects as non-retriable.
As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any
restart from a checkpoint puts the job in an infinite loop.
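The loop can be illustrated with a toy simulation. This is only a sketch
under simplifying assumptions, not connector code: {{ReplaySimulation}},
{{restartsUntilDone}} and the sample timestamps are hypothetical, the remote
store is modelled as rejecting any timestamp that is not newer than the last
accepted one, and no new checkpoint completes between restarts.

```java
// Hypothetical names throughout; this does not use the connector's real classes.
enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

final class ReplaySimulation {

    /**
     * Replays a toy stream from a fixed checkpoint against a toy remote store.
     * Returns the number of restarts needed to drain the stream, or -1 when
     * maxRestarts is exceeded (standing in for the infinite loop).
     */
    static int restartsUntilDone(OnErrorBehavior behavior, int maxRestarts) {
        long[] timestamps = {1, 2, 3, 4, 5};
        int checkpointedPosition = 2; // checkpoint says: samples 0 and 1 are done
        long lastAccepted = 4;        // but timestamps up to 4 were written before the crash
        int restarts = 0;

        while (true) {
            boolean failed = false;
            // Every (re)start resumes from the same checkpointed position.
            for (int i = checkpointedPosition; i < timestamps.length; i++) {
                if (timestamps[i] > lastAccepted) {
                    lastAccepted = timestamps[i]; // write accepted
                } else if (behavior == OnErrorBehavior.FAIL) {
                    failed = true;                // rejected write fails the job
                    break;
                }
                // DISCARD_AND_CONTINUE: the rejected sample is dropped and processing moves on.
            }
            if (!failed) {
                return restarts;
            }
            restarts++;
            if (restarts > maxRestarts) {
                return -1;
            }
        }
    }
}
```

In this model, {{FAIL}} hits the restart cap because the first replayed sample
is always rejected, while {{DISCARD_AND_CONTINUE}} drops the replayed samples
and completes without restarting.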
Changes:
1. default {{onPrometheusNonRetriableError}} should be {{DISCARD_AND_CONTINUE}}
2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}
We can keep the rest of the implementation as-is for the moment and just
prevent {{FAIL}} from being set for this behaviour, as we may later extend the
handling of this error with a different behaviour.
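One possible shape for the two changes above, as a minimal sketch only. The
class and method names ({{ErrorHandlingConfig}}, {{Builder}},
{{getOnPrometheusNonRetriableError}}) are assumptions made for illustration
and may not match the connector's actual configuration classes. The enum keeps
the {{FAIL}} value, so the rest of the implementation can stay as-is, and only
the setter rejects it.

```java
// Hypothetical configuration classes; names are illustrative assumptions.
enum OnErrorBehavior { FAIL, DISCARD_AND_CONTINUE }

final class ErrorHandlingConfig {
    private final OnErrorBehavior onPrometheusNonRetriableError;

    private ErrorHandlingConfig(OnErrorBehavior onPrometheusNonRetriableError) {
        this.onPrometheusNonRetriableError = onPrometheusNonRetriableError;
    }

    OnErrorBehavior getOnPrometheusNonRetriableError() {
        return onPrometheusNonRetriableError;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Change 1: the default is DISCARD_AND_CONTINUE instead of FAIL.
        private OnErrorBehavior behavior = OnErrorBehavior.DISCARD_AND_CONTINUE;

        Builder onPrometheusNonRetriableError(OnErrorBehavior behavior) {
            if (behavior == null) {
                throw new IllegalArgumentException("behavior must not be null");
            }
            // Change 2: FAIL stays in the enum but cannot be selected for this error type.
            if (behavior == OnErrorBehavior.FAIL) {
                throw new IllegalArgumentException(
                        "FAIL is not supported for non-retriable errors: "
                                + "restarting from a checkpoint would loop forever");
            }
            this.behavior = behavior;
            return this;
        }

        ErrorHandlingConfig build() {
            return new ErrorHandlingConfig(behavior);
        }
    }
}
```

With this shape, {{ErrorHandlingConfig.builder().build()}} yields
{{DISCARD_AND_CONTINUE}}, and passing {{FAIL}} to the setter fails fast at
configuration time rather than at runtime.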
--
This message was sent by Atlassian Jira
(v8.20.10#820010)