[jira] [Updated] (FLINK-39967) FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited retries") never retries and fails immediately on first error

Santwana Verma (Jira) Mon, 22 Jun 2026 05:31:21 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Santwana Verma updated FLINK-39967:
-----------------------------------
    Description: 
FlinkStateSnapshotSpec.backoffLimit defaults to -1, which is documented as 
meaning unlimited retries:

    /**
     * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
     * unlimited or 0 for no retries.
     */
    private int backoffLimit = -1;

However, the retry decision in FlinkStateSnapshotController does a plain 
numeric comparison:

    if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
        // give up, .withNoRetry()
    }


With the default backoffLimit = -1, after the very first failure getFailures() 
is 1, so 1 > -1 evaluates to true and the snapshot is immediately marked as 
failed with no retry. This is the exact opposite of the documented behavior. 
More generally, neither the sentinel -1 nor any other negative backoffLimit is 
handled specially, so the documented "unlimited" contract is never honored.
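
To make the arithmetic concrete, the comparison can be isolated from the operator. The following is a minimal sketch, not the operator's actual code: the class BackoffCheckDemo and the method givesUp are hypothetical names that mirror only the comparison quoted above, with plain ints standing in for getFailures() and getBackoffLimit().

```java
/** Hypothetical stand-alone reproduction of the comparison quoted in this issue. */
public class BackoffCheckDemo {

    /** Mirrors the quoted check: give up once failures exceed backoffLimit. */
    static boolean givesUp(int failures, int backoffLimit) {
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        int defaultBackoffLimit = -1; // documented as "unlimited"

        // After the first failure, 1 > -1 is true, so the snapshot is abandoned.
        System.out.println(givesUp(1, defaultBackoffLimit)); // prints "true"

        // backoffLimit = 0 gives up on the first failure, which matches "no retries".
        System.out.println(givesUp(1, 0)); // prints "true"

        // backoffLimit = 2 tolerates two failures and gives up on the third.
        System.out.println(givesUp(2, 2)); // prints "false"
        System.out.println(givesUp(3, 2)); // prints "true"
    }
}
```

Under this sketch, non-negative limits behave as documented; only the negative sentinel produces the wrong result.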

Steps to reproduce:
  1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting 
backoffLimit (defaults to -1).
  2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient 
error).
  3. Observe the snapshot is marked failed with "won't be retried as failure 
count exceeded the backoff limit" instead of retrying.

Expected behavior: With backoffLimit = -1, the snapshot should be retried 
indefinitely (with the existing exponential backoff). backoffLimit = 0 should 
mean no retries; backoffLimit = N should allow up to N retries.
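
One way the expected behavior could be expressed is sketched below. This is an illustration under stated assumptions rather than a proposed patch: the class BackoffPolicySketch and the method shouldRetry are hypothetical, and treating every negative value (not only -1) as unlimited is an assumption that the issue does not settle.

```java
/** Hypothetical sketch of a retry check that honors the documented contract. */
public class BackoffPolicySketch {

    /**
     * Returns true when another retry should be scheduled.
     *
     * @param failures number of failures recorded so far (1 after the first error)
     * @param backoffLimit maximum number of retries; negative means unlimited
     */
    static boolean shouldRetry(int failures, int backoffLimit) {
        if (backoffLimit < 0) {
            // Assumption: any negative value is treated like the -1 sentinel.
            return true;
        }
        // Allow up to backoffLimit retries: give up once failures exceed the limit.
        return failures <= backoffLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(1, -1));    // prints "true"
        System.out.println(shouldRetry(1000, -1)); // prints "true"
        System.out.println(shouldRetry(1, 0));     // prints "false"
        System.out.println(shouldRetry(2, 2));     // prints "true"
        System.out.println(shouldRetry(3, 2));     // prints "false"
    }
}
```

The non-negative branch keeps the existing comparison, so only the sentinel case changes behavior.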

Actual behavior: Snapshot fails immediately after the first error and is never 
retried.



> FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited 
> retries") never retries and fails immediately on first error
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39967
>                 URL: https://issues.apache.org/jira/browse/FLINK-39967
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Santwana Verma
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
