Hey Sweta,

Sorry I did not get back to you earlier.

Could you explain how you perform the upgrade? Are you trying to upgrade
the cluster through the HA services (e.g. ZooKeeper)? That is, do you
bring the 1.13.1 cluster down and start a 1.13.2/3 cluster that you
expect to pick up the job automatically, along with the latest
checkpoint? Am I guessing correctly? As far as I can tell, we do not
support upgrading that way.

The way we support upgrades is via a savepoint/checkpoint. I'd suggest
either taking a savepoint on 1.13.1 and restoring[1] the job on a 1.13.2
cluster, or using an externalized checkpoint created by 1.13.1.
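
A rough sketch of that savepoint-based upgrade with the Flink CLI is
below. The install paths, the savepoint directory, the job ID and the
jar name are all placeholders I made up for illustration, and the
run/stop_with_savepoint/restore_from helpers are hypothetical wrappers,
not part of Flink. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Placeholder locations (assumptions, adjust to your setup).
OLD_FLINK="${OLD_FLINK:-/opt/flink-1.13.1}"
NEW_FLINK="${NEW_FLINK:-/opt/flink-1.13.2}"
SAVEPOINT_DIR="${SAVEPOINT_DIR:-s3://my-bucket/savepoints}"

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1, against the old cluster: stop the job and write a savepoint.
# usage: stop_with_savepoint <jobId>
stop_with_savepoint() {
  run "$OLD_FLINK/bin/flink" stop --savepointPath "$SAVEPOINT_DIR" "$1"
}

# Step 2, against the new cluster: resubmit the job from that savepoint.
# usage: restore_from <savepointOrCheckpointPath> <jobJar>
restore_from() {
  run "$NEW_FLINK/bin/flink" run --detached --fromSavepoint "$1" "$2"
}
```

Call stop_with_savepoint with the job ID first, then pass the savepoint
path it reports to restore_from together with your job jar. If you go
the externalized-checkpoint route instead, the same restore_from call
should work with the path of the retained checkpoint in place of the
savepoint path.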

Best,

Dawid

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/#starting-a-job-from-a-savepoint

On 22/10/2021 16:39, Chesnay Schepler wrote:
> The only suggestion I can offer is to take a savepoint with 1.13.1 and
> try to restore from that.
>
> We will investigate the problem in
> https://issues.apache.org/jira/browse/FLINK-24621; currently we don't
> know why you are experiencing this issue.
>
> On 22/10/2021 16:02, Sweta Kalakuntla wrote:
>> Hi,
>>
>> We are seeing an error while upgrading minor versions from 1.13.1 to
>> 1.13.2. The JobManager is unable to recover the checkpoint state. What
>> would be the solution to this issue?
>>
>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>> checkpoint 2844 from state handle under
>> checkpointID-0000000000000002844. This indicates that the retrieved
>> state handle is broken. Try cleaning the state handle store.
>> at
>> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:309)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:151)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1513)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1476)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:134)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at
>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
>> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
>> at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown
>> Source) ~[?:?]
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>> Source) ~[?:?]
>> at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>> Source) ~[?:?]
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> ~[?:?]
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> ~[?:?]
>> at java.lang.Thread.run(Unknown Source) ~[?:?]
>> Caused by: java.io.InvalidClassException:
>> org.apache.flink.runtime.checkpoint.InflightDataRescalingDescriptor$NoRescalingDescriptor;
>> local class incompatible: stream classdesc serialVersionUID =
>> -5544173933105855751, local class serialVersionUID = 1
>> at java.io.ObjectStreamClass.initNonProxy(Unknown Source) ~[?:?]
>> at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source) ~[?:?]
>>
>>
>>
>> Thank you,
>>
>> Sweta K
>>
>>      
>>
>>
>
