[jira] [Commented] (FLINK-31077) Trigger checkpoint failed but it were shown as COMPLETED by rest API

2023-02-14 Thread Junrui Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688841#comment-17688841
 ] 

Junrui Li commented on FLINK-31077:
---

cc [~gaoyunhaii] 

> Trigger checkpoint failed but it were shown as COMPLETED by rest API
> 
>
> Key: FLINK-31077
> URL: https://issues.apache.org/jira/browse/FLINK-31077
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.17.0, 1.15.3, 1.16.1
> Reporter: Junrui Li
> Priority: Major
> Fix For: 1.17.0, 1.15.4, 1.16.2
>
>
> Currently, we can trigger a checkpoint through the REST API and poll its 
> status until it is finished, as described in FLINK-27101. However, even if the 
> checkpoint status returned by the REST API is COMPLETED, that does not mean the 
> checkpoint has really completed. If an exception occurs after the 
> pendingCheckpoint is marked as completed 
> ([here|https://github.com/apache/flink/blob/bf0ad52cbcb052961c54c94c7013f5ac0110ef8a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1309]),
> the checkpoint is not written to the HA service and we cannot fail over from 
> this checkpoint.
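
The ordering problem described above can be illustrated with a small, self-contained sketch. This is a simplified model, not Flink code: the names CheckpointOrderingSketch, HaStore, Pending, completeThenStore and storeThenComplete are all hypothetical, and the second method shows only one possible ordering that avoids the mismatch, not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the issue; none of these are Flink classes.
class CheckpointOrderingSketch {

    enum Status { IN_PROGRESS, COMPLETED, FAILED }

    /** Stand-in for the HA checkpoint store; can be told to fail on write. */
    static class HaStore {
        final List<Long> checkpoints = new ArrayList<>();
        final boolean failOnWrite;

        HaStore(boolean failOnWrite) {
            this.failOnWrite = failOnWrite;
        }

        void add(long checkpointId) {
            if (failOnWrite) {
                throw new IllegalStateException("HA store unavailable");
            }
            checkpoints.add(checkpointId);
        }
    }

    /** Stand-in for the state that a status-polling handler would report. */
    static class Pending {
        final long id;
        Status status = Status.IN_PROGRESS;

        Pending(long id) {
            this.id = id;
        }
    }

    // Problematic ordering: the status becomes COMPLETED before the HA write.
    static Status completeThenStore(Pending p, HaStore store) {
        p.status = Status.COMPLETED;
        try {
            store.add(p.id);
        } catch (IllegalStateException e) {
            // The write failed, but the reported status was already set.
        }
        return p.status;
    }

    // Alternative ordering: report COMPLETED only after the HA write succeeds.
    static Status storeThenComplete(Pending p, HaStore store) {
        try {
            store.add(p.id);
            p.status = Status.COMPLETED;
        } catch (IllegalStateException e) {
            p.status = Status.FAILED;
        }
        return p.status;
    }

    public static void main(String[] args) {
        HaStore failing = new HaStore(true);
        // Reported as COMPLETED although nothing was stored.
        System.out.println(completeThenStore(new Pending(1L), failing));
        System.out.println(failing.checkpoints.isEmpty());
        // With the alternative ordering the failure is visible to the caller.
        System.out.println(storeThenComplete(new Pending(2L), failing));
    }
}
```

With a failing store, the first ordering reports COMPLETED while the store stays empty, which matches the symptom in this issue; the second ordering reports FAILED instead.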



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31077) Trigger checkpoint failed but it were shown as COMPLETED by rest API

2023-02-14 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688858#comment-17688858
 ] 

Zhu Zhu commented on FLINK-31077:
-

Thanks for reporting this issue! [~JunRuiLi]
I think it is indeed a problem. Consider the case of stop-with-savepoint: 
the final savepoint can be lost if the savepoint is considered done and the 
job is terminated before the savepoint is recorded in HA.
Do you want to fix it?



[jira] [Commented] (FLINK-31077) Trigger checkpoint failed but it were shown as COMPLETED by rest API

2023-02-14 Thread Junrui Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688860#comment-17688860
 ] 

Junrui Li commented on FLINK-31077:
---

[~zhuzh] Sure, I'll fix it.
