Hello Roman,

Thanks for the answer.
I have already had that high-availability.storageDir configured to an S3
location. Our service is not critical enough, so to save the cost, we are
using the single-master EMR setup. I understand that we'll not get YARN HA
in that case, but what I expect here is the ability to quickly restore the
service in the case of failure. Without Zookeeper, when such failure
happens, I'll need to find the last checkpoint of each job and restore from
there. With the help of HA feature, I can just start a new
flink-yarn-session, then all jobs will be restored.

I tried to change zookeeper dataDir config to an EFS location which both the
old and new EMR clusters could access, and that worked.

However, now I have a new question: is it expectable to restore to a new
version of Flink (e.g. saved with Flink1.10 and restored to Flink1.11)? I
tried and got some error messages attached below. Not sure that's a bug or
expected behaviour.

Thanks and best regards,
Averell

============
/07:39:33.906 [main-EventThread] ERROR
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState -
Authentication failed
07:40:11.585 [flink-akka.actor.default-dispatcher-2] ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred
in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Could not start
recovered job 6e5c12f1c352dd4e6200c40aebb65745.
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$handleRecoveredJobStartError$0(Dispatcher.java:222)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_265]
        at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_265]
        at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_265]
        at
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
~[?:1.8.0_265]
        at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:753)
~[?:1.8.0_265]
        at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
~[?:1.8.0_265]
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.0.jar:1.11.0]
Caused by: java.util.concurrent.CompletionException:
org.apache.flink.runtime.client.JobExecutionException: Could not instantiate
JobManager.
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:398)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_265]
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        ... 4 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not
instantiate JobManager.
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:398)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_265]
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        ... 4 more
Caused by: java.lang.NullPointerException
        at
java.util.Collections$UnmodifiableCollection.<init>(Collections.java:1028)
~[?:1.8.0_265]
        at
java.util.Collections$UnmodifiableList.<init>(Collections.java:1304)
~[?:1.8.0_265]
        at java.util.Collections.unmodifiableList(Collections.java:1289)
~[?:1.8.0_265]
        at
org.apache.flink.runtime.jobgraph.JobVertex.getOperatorCoordinators(JobVertex.java:352)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:232)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:814)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:228)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:269)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:242)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:229)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:119)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:103)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:284)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:272)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:140)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:388)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_265]
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.11-1.11.0.jar:1.11.0]
/



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to