[ https://issues.apache.org/jira/browse/FLINK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora reassigned FLINK-39967:
----------------------------------
Assignee: Santwana Verma
> FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited
> retries") never retries and fails immediately on first error
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39967
> URL: https://issues.apache.org/jira/browse/FLINK-39967
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Santwana Verma
> Assignee: Santwana Verma
> Priority: Major
>
> {{FlinkStateSnapshotSpec.backoffLimit}} defaults to -1, which is documented
> as meaning unlimited retries:
> {{/**}}
> {{ * Maximum number of retries before the snapshot is considered as failed. Set to -1 for}}
> {{ * unlimited or 0 for no retries.}}
> {{ */}}
> {{ private int backoffLimit = -1;}}
> However, the retry decision in {{FlinkStateSnapshotController}} does a plain
> numeric comparison:
>
> {{if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit())}}
> {{{ }}
> {{ // give up, .withNoRetry() }}
> {{}}}
> With the default backoffLimit = -1, getFailures() is 1 after the very first
> failure, so 1 > -1 evaluates to true and the snapshot is immediately marked
> as failed with no retry. This is the exact opposite of the documented
> behavior. More generally, neither the sentinel -1 nor any other negative
> backoffLimit is handled specially, so the documented "unlimited retries"
> contract is never honored.
> Steps to reproduce:
> 1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting
> backoffLimit (defaults to -1).
> 2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient
> error).
> 3. Observe the snapshot is marked failed with "won't be retried as failure
> count exceeded the backoff limit" instead of retrying.
> Expected behavior: With backoffLimit = -1, the snapshot should be retried
> indefinitely (with the existing exponential backoff). backoffLimit = 0 should
> mean no retries; backoffLimit = N should allow up to N retries.
> Actual behavior: Snapshot fails immediately after the first error and is
> never retried.
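
For illustration only, here is a minimal, self-contained Java sketch contrasting the comparison quoted above with one possible sentinel-aware check. The class and method names (BackoffLimitSketch, givesUpAsReported, givesUpSentinelAware) are hypothetical and not part of the operator. Treating every negative value as "unlimited", rather than only -1, is an assumption, not the confirmed fix.

```java
final class BackoffLimitSketch {

    /** Comparison as quoted in the report: a plain numeric check. */
    static boolean givesUpAsReported(int failures, int backoffLimit) {
        return failures > backoffLimit;
    }

    /**
     * Hypothetical sentinel-aware variant (assumption: any negative limit
     * means unlimited retries, so the snapshot is never given up on).
     */
    static boolean givesUpSentinelAware(int failures, int backoffLimit) {
        if (backoffLimit < 0) {
            return false; // unlimited: always retry
        }
        // 0 means no retries; N allows up to N retries after the first failure.
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        int[] limits = {-1, 0, 2};
        for (int limit : limits) {
            for (int failures = 1; failures <= 3; failures++) {
                System.out.printf(
                        "backoffLimit=%d failures=%d reported=%b sentinelAware=%b%n",
                        limit,
                        failures,
                        givesUpAsReported(failures, limit),
                        givesUpSentinelAware(failures, limit));
            }
        }
    }
}
```

With backoffLimit = -1 the reported comparison gives up on the first failure, while the sentinel-aware variant keeps retrying. For backoffLimit = 0 and backoffLimit = N the two agree, which matches the expected behavior described above.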
--
This message was sent by Atlassian Jira
(v8.20.10#820010)