Hi,
I’m getting an exception on stop-with-savepoint. The savepoint is still created,
but the job fails. I’d like to know what the implications and consequences of
this failure are (the job is configured for exactly-once) and how it can be
avoided. Starting the job from that savepoint appears to work as expected.
Here is the exception:
2024-10-09 17:23:48 org.apache.flink.runtime.JobException: The failure is not recoverable
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:155)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:126)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:328)
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.terminateExceptionallyWithGlobalFailover(StopWithSavepointTerminationHandlerImpl.java:178)
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.access$500(StopWithSavepointTerminationHandlerImpl.java:53)
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl$SavepointCreated.onAnyExecutionNotFinished(StopWithSavepointTerminationHandlerImpl.java:235)
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.handleAnyExecutionNotFinished(StopWithSavepointTerminationHandlerImpl.java:150)
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.handleExecutionsTermination(StopWithSavepointTerminationHandlerImpl.java:111)
    at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451)
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218)
    at org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85)
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
    at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
    at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
    at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
    at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
    at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointStoppingException: A savepoint has been created at: s3p://bucket/path/to/savepoints/savepoint-f0a4f8-0301aa307ec4, but the corresponding job f0a4f8fdfa0038f7818cdbac1212b681 failed during stopping. The savepoint is consistent, but might have uncommitted transactions. If you want to commit the transaction please restart a job from this savepoint.
    at org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointTerminationHandlerImpl.terminateExceptionallyWithGlobalFailover(StopWithSavepointTerminationHandlerImpl.java:169)
    ... 33 more
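
Since the message asks for a restart from the reported savepoint, the location may need to be pulled out of the exception text first. A minimal sketch in plain Java, assuming the message always has the "A savepoint has been created at: <path>, but ..." shape shown above. The class and method names are hypothetical and not part of Flink:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Flink class: extracts the savepoint location
// from the text of a StopWithSavepointStoppingException message.
public class SavepointPathExtractor {

    // Assumption: the path follows "A savepoint has been created at:" and
    // ends at the first comma or whitespace, as in the message quoted above.
    private static final Pattern CREATED_AT =
            Pattern.compile("A savepoint has been created at:\\s*([^,\\s]+)");

    public static Optional<String> extract(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = CREATED_AT.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String message =
                "A savepoint has been created at: "
                        + "s3p://bucket/path/to/savepoints/savepoint-f0a4f8-0301aa307ec4, but the "
                        + "corresponding job f0a4f8fdfa0038f7818cdbac1212b681 failed during stopping.";
        // The extracted path can then be passed to the restart,
        // e.g. via the CLI's -s / --fromSavepoint option.
        extract(message).ifPresent(System.out::println);
    }
}
```

Matching on the message text is fragile if the wording changes between Flink versions, so this is only a stopgap for scripting the restart.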