[ https://issues.apache.org/jira/browse/FLINK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Santwana Verma updated FLINK-39967:
-----------------------------------
Description:
`FlinkStateSnapshotSpec.backoffLimit` defaults to -1, which is documented as
meaning unlimited retries:

    /**
     * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
     * unlimited or 0 for no retries.
     */
    private int backoffLimit = -1;

However, the retry decision in FlinkStateSnapshotController does a plain
numeric comparison:

    if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
        // give up, .withNoRetry()
    }

With the default backoffLimit = -1, getFailures() is 1 after the very first
failure, so 1 > -1 evaluates to true and the snapshot is immediately marked as
failed with no retry. This is the exact opposite of the documented behavior.
Negative values, including the sentinel -1, are not handled specially, so the
documented "unlimited retries" contract is never honored.
Steps to reproduce:
1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting
backoffLimit (defaults to -1).
2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient
error).
3. Observe the snapshot is marked failed with "won't be retried as failure
count exceeded the backoff limit" instead of retrying.
Expected behavior: With backoffLimit = -1, the snapshot should be retried
indefinitely (with the existing exponential backoff). backoffLimit = 0 should
mean no retries; backoffLimit = N should allow up to N retries.
Actual behavior: Snapshot fails immediately after the first error and is never
retried.
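
The difference between the reported comparison and the expected semantics can
be sketched as below. This is only an illustration, not the operator's actual
code or the eventual fix: the class BackoffLimitSketch and both method names
are hypothetical, and treating every negative value (not just -1) as
"unlimited" is an assumption made here for simplicity.

```java
// Hypothetical, self-contained sketch; not taken from the operator code base.
final class BackoffLimitSketch {

    /** The comparison quoted in the report: gives up when failures > backoffLimit. */
    public static boolean givesUpAsReported(int failures, int backoffLimit) {
        return failures > backoffLimit;
    }

    /**
     * One possible corrected check. Assumption: any negative limit means
     * unlimited retries, so the snapshot is never given up on.
     */
    public static boolean givesUpWithSentinel(int failures, int backoffLimit) {
        if (backoffLimit < 0) {
            return false; // sentinel: keep retrying
        }
        // 0 gives up on the first failure; N allows up to N retries.
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        int[] limits = {-1, 0, 2};
        for (int limit : limits) {
            for (int failures = 1; failures <= 3; failures++) {
                System.out.printf(
                        "backoffLimit=%d failures=%d reported=%b sentinel=%b%n",
                        limit,
                        failures,
                        givesUpAsReported(failures, limit),
                        givesUpWithSentinel(failures, limit));
            }
        }
    }
}
```

For non-negative limits the two checks agree; they differ only when the limit
is negative, which is the case described in this report.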
was:
`FlinkStateSnapshotSpec.backoffLimit` defaults to -1, which is documented as
meaning unlimited retries:
```
/**
 * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
 * unlimited or 0 for no retries.
 */
private int backoffLimit = -1;
```
However, the retry decision in FlinkStateSnapshotController does a plain
numeric comparison:
```
if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
    // give up, .withNoRetry()
}
```
With the default backoffLimit = -1, after the very first failure getFailures()
is 1, so 1 > -1 evaluates to true and the snapshot is immediately marked as
failed with no retry. This is the exact opposite of the documented behavior.
More generally, any negative backoffLimit and the sentinel -1 are not handled
specially, so the contract is never honored.
Steps to reproduce:
1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting
backoffLimit (defaults to -1).
2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient
error).
3. Observe the snapshot is marked failed with "won't be retried as failure
count exceeded the backoff limit" instead of retrying.
Expected behavior: With backoffLimit = -1, the snapshot should be retried
indefinitely (with the existing exponential backoff). backoffLimit = 0 should
mean no retries; backoffLimit = N should allow up to N retries.
Actual behavior: Snapshot fails immediately after the first error and is never
retried.
> FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited
> retries") never retries and fails immediately on first error
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39967
> URL: https://issues.apache.org/jira/browse/FLINK-39967
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Santwana Verma
> Priority: Major