[ 
https://issues.apache.org/jira/browse/FLINK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39967:
-----------------------------------
    Labels: pull-request-available  (was: )

> FlinkStateSnapshot with default backoffLimit=-1 (documented as "unlimited 
> retries") never retries and fails immediately on first error
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39967
>                 URL: https://issues.apache.org/jira/browse/FLINK-39967
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Santwana Verma
>            Assignee: Santwana Verma
>            Priority: Major
>              Labels: pull-request-available
>
> {{FlinkStateSnapshotSpec.backoffLimit}} defaults to -1, which is documented 
> as meaning unlimited retries:
>
>   /**
>    * Maximum number of retries before the snapshot is considered as failed. Set to -1 for
>    * unlimited or 0 for no retries.
>    */
>   private int backoffLimit = -1;
>
> However, the retry decision in FlinkStateSnapshotController does a plain 
> numeric comparison:
>
>   if (resource.getStatus().getFailures() > resource.getSpec().getBackoffLimit()) {
>     // give up, .withNoRetry()
>   }
>
> With the default backoffLimit = -1, after the very first failure 
> getFailures() is 1, so 1 > -1 evaluates to true and the snapshot is 
> immediately marked as failed with no retry. This is the exact opposite of the 
> documented behavior. More generally, the controller gives no special 
> treatment to the sentinel -1 (or to any other negative backoffLimit), so the 
> documented "unlimited retries" contract is never honored.
> Steps to reproduce:
>   1. Create a FlinkStateSnapshot (savepoint or checkpoint) without setting 
> backoffLimit (defaults to -1).
>   2. Cause the snapshot to fail once (e.g. unreachable JobManager / transient 
> error).
>   3. Observe the snapshot is marked failed with "won't be retried as failure 
> count exceeded the backoff limit" instead of retrying.
> Expected behavior: With backoffLimit = -1, the snapshot should be retried 
> indefinitely (with the existing exponential backoff). backoffLimit = 0 should 
> mean no retries; backoffLimit = N should allow up to N retries.
> Actual behavior: Snapshot fails immediately after the first error and is 
> never retried.
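
The difference between the reported comparison and the expected behavior can be illustrated with a minimal, standalone sketch. The class and method names below (BackoffLimitSketch, givesUpWithPlainComparison, givesUpWithSentinel) are hypothetical and are not taken from the operator's code base. Treating every negative limit as "unlimited" is an assumption; the documentation quoted above only mentions -1. This is one possible shape for the check, not the fix proposed in the linked pull request.

```java
/** Hypothetical helper for illustration only; not the operator's actual code. */
final class BackoffLimitSketch {
    private BackoffLimitSketch() {}

    /** Plain numeric comparison, as described in the issue. */
    static boolean givesUpWithPlainComparison(int failures, int backoffLimit) {
        // With backoffLimit = -1 and failures = 1 this is 1 > -1, i.e. true.
        return failures > backoffLimit;
    }

    /**
     * Sentinel-aware variant. Assumption: any negative limit means
     * "retry indefinitely"; the quoted Javadoc only documents -1.
     */
    static boolean givesUpWithSentinel(int failures, int backoffLimit) {
        if (backoffLimit < 0) {
            return false; // unlimited retries, never give up
        }
        return failures > backoffLimit;
    }

    public static void main(String[] args) {
        int[] limits = {-1, 0, 2};
        for (int limit : limits) {
            for (int failures = 1; failures <= 3; failures++) {
                System.out.printf(
                        "limit=%d failures=%d plain=%b sentinel=%b%n",
                        limit,
                        failures,
                        givesUpWithPlainComparison(failures, limit),
                        givesUpWithSentinel(failures, limit));
            }
        }
    }
}
```

Under these assumptions, backoffLimit = 0 gives up on the first failure, backoffLimit = 2 gives up on the third failure (that is, after two retries), and backoffLimit = -1 never gives up, which matches the expected behavior described in the issue.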



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
