Hi,

if for some reason there exists a checkpoint by same name.
>
Could you give more details about your scenarios here?
>From your description, I guess this problem occurred when a job restart,
does this restart is triggered personally?

In common restart processing, the job will retrieve the latest checkpoint
from a high-available service(zookeeper or kubernetes),
and then restore from it and make a new checkpoint with a new checkpoint-id.
In this case, the job does not recover from the old checkpoint, but the old
checkpoint path already exists.

Best,
Weihua


On Wed, May 10, 2023 at 11:07 AM Hang Ruan <ruanhang1...@gmail.com> wrote:

> Hi, amenreet,
>
> As Hangxiang said, we should use a new checkpoint dir if the new job has
> the same jobId as the old one . Or else you should not use a fixed jobId
> and the checkpoint dir will not conflict.
>
> Best,
> Hang
>
> Hangxiang Yu <master...@gmail.com> 于2023年5月10日周三 10:35写道:
>
>> Hi,
>> I guess you used a fixed JOB_ID, and configured the same checkpoint dir
>> as before ?
>> And you may also start the job without before state ?
>> The new job cannot know anything about before checkpoints, that's why the
>> new job will fail when it tries to generate a new checkpoint.
>> I'd like to suggest you to use different JOB_ID for different jobs, or
>> set a different checkpoint dir for a new job.
>>
>> On Tue, May 9, 2023 at 9:38 PM amenreet sodhi <amenso...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Is there any way to prevent restart of flink job, or override the
>>> checkpoint metadata, if for some reason there exists a checkpoint by same
>>> name. I get the following exception and my job restarts, have been trying
>>> to find solution for a very long time but havent found anything useful yet,
>>> other than manually cleaning.
>>>
>>> 2023-02-27 10:00:50,360 WARN  
>>> org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>>> [] - Failed to trigger or complete checkpoint 1 for job
>>> 000000006e6b13320000000000000000. (0 consecutive failed attempts so far)
>>>
>>> org.apache.flink.runtime.checkpoint.CheckpointException: Failure to
>>> finalize checkpoint.
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1375)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> [?:?]
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> [?:?]
>>>
>>> at java.lang.Thread.run(Thread.java:834) [?:?]
>>>
>>> Caused by: java.io.IOException: Target file
>>> file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata
>>> already exists.
>>>
>>> at
>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:64)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:109)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:332)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1361)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> ... 7 more
>>>
>>> 2023-02-27 10:00:50,374 WARN  org.apache.flink.runtime.jobmaster.JobMaster
>>>                 [] - Error while processing AcknowledgeCheckpoint
>>> message
>>>
>>> org.apache.flink.runtime.checkpoint.CheckpointException: Could not
>>> finalize the pending checkpoint 1. Failure reason: Failure to finalize
>>> checkpoint.
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1381)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> [?:?]
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> [?:?]
>>>
>>> at java.lang.Thread.run(Thread.java:834) [?:?]
>>>
>>> Caused by: java.io.IOException: Target file
>>> file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata
>>> already exists.
>>>
>>> at
>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168)
>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>
>>>
>>> Please let me know if anyone knows how to resolve this issue.
>>>
>>> Thanks and Regards
>>>
>>> Amenreet Singh Sodhi
>>>
>>>
>>>
>>
>> --
>> Best,
>> Hangxiang.
>>
>

Reply via email to