Re:Re: Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "
Hi,Till: Thank you very much for answering my question! In the attachment I store the logs from my last attempt to run (flink started debug-level logs), and my flink cluster runs in a kubernate-based hadoop cluster with 3 nodes and kerberos security authentication enabled. I made a flink1.7.2 image based on the docker image given in wiki, and made flink's kubernates tag yaml configuration based on other kubernates tag setting files in hadoop cluster, in which all pods were set to use the host network port directly. So I adjusted some port parameters in my flink cluster Settings. I currently run only this one flink cluster, and just started the jobmanager pod in the test, according to your email reply I added "high - the availability. The zookeeper. Cluster - id" after the relevant parameters, start after 2 nodes of pod, kubernate shows a node normal boot (tdh3), while another node startup anomaly (tdh2), continue to show before the mail within the same error. I have attached flink configuration file flink-yaml, masters, slaves configuration file, hadoop cluster zookeeper configuration file, and jobmanager pod label configuration file jobmanager-yaml. Thank you very much. Could you please help me check what mistakes I have made? Thank you, Alex.hu At 2019-07-02 21:46:35, "Till Rohrmann" wrote: Hi, how did you start the job masters? Could you maybe share the logs of all components? It looks as if the leader election is not working properly. One thing to make sure is that you specify for every new HA cluster a different cluster ID via `high-availability.cluster-id: cluster_xy`. That way you separate the ZNodes in ZooKeeper so that every cluster uses their own nodes and does not interfere with other clusters. Usually this happens via the JobID but in the case of the `StandaloneJobClusterEntrypoint` we set it to 0. More recently, this was slightly changed. See https://issues.apache.org/jira/browse/FLINK-12617 for more information. Cheers, Till On Mon, Jul 1, 2019 at 11:36 AM Alex.Hu wrote: Hi,All: I found some problems about on kubernates flink of 1.6.0 mentioned by Till in "HA for 1.6.0 job cluster with docker-compose" in the email list, but I found that Jira of flink-10291 in the email has been shut down in 1.7.0, and I also found similar errors in on kubernates flink of 1.7.2 at present. Could you please help me check the Settings where I have problems? Here are my Settings: web.log.path: /var/log/flink/flinkweb.log taskmanager.log.pth: /var/log/flink/taskmanager/task.log jobmanager.rpc.address: tdh2 jobmanager.rpc.port: 16223 jobstore.cache-size: 5368709120 jobstore.expiration-time: 864000 jobmanager.heap.size: 4096m taskmanager.heap.size: 6000m taskmanager.numberOfTaskSlots: 6 parallelism.default: 2 high-availability: zookeeper high-availability.storageDir: hdfs:///flink1/ha/ high-availability.zookeeper.quorum: tdh2:2181,tdh4:2181,tdh3:2181 high-availability.zookeeper.path.root: /flink high-availability.zookeeper.client.acl: open high-availability.jobmanager.port: 62236-62239 rest.port: 18801 io.tmp.dirs: /data/disk1:/data/disk2:/data/disk3:/data/disk4:/data/disk5 security.kerberos.login.use-ticket-cache: true security.kerberos.login.contexts: Client security.kerberos.login.keytab: /etc/flink/conf/hdfs.keytab security.kerberos.login.principal: hdfs blob.server.port: 16224 query.server.port: 16225 And the following is the new error report, the earliest error report in the forwarded email message: apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null. at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59) ... 14 common frames omitted 2019-07-01 17:10:40.159 [flink-rest-server-netty-worker-thread-39] ERROR o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null. at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
Re: Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "
Hi, how did you start the job masters? Could you maybe share the logs of all components? It looks as if the leader election is not working properly. One thing to make sure is that you specify for every new HA cluster a different cluster ID via `high-availability.cluster-id: cluster_xy`. That way you separate the ZNodes in ZooKeeper so that every cluster uses their own nodes and does not interfere with other clusters. Usually this happens via the JobID but in the case of the `StandaloneJobClusterEntrypoint` we set it to 0. More recently, this was slightly changed. See https://issues.apache.org/jira/browse/FLINK-12617 for more information. Cheers, Till On Mon, Jul 1, 2019 at 11:36 AM Alex.Hu wrote: > *Hi,All:* > > * I found some problems about on kubernates flink of 1.6.0 mentioned by > Till in "HA for 1.6.0 job cluster with docker-compose" in the email list, > but I found that Jira of flink-10291 in the email has been shut down in > 1.7.0, and I also found similar errors in on kubernates flink of 1.7.2 at > present. Could you please help me check the Settings where I have problems? > Here are my Settings:* > web.log.path: /var/log/flink/flinkweb.log > taskmanager.log.pth: /var/log/flink/taskmanager/task.log > > jobmanager.rpc.address: tdh2 > jobmanager.rpc.port: 16223 > jobstore.cache-size: 5368709120 > jobstore.expiration-time: 864000 > jobmanager.heap.size: 4096m > > taskmanager.heap.size: 6000m > taskmanager.numberOfTaskSlots: 6 > parallelism.default: 2 > > high-availability: zookeeper > high-availability.storageDir: hdfs:///flink1/ha/ > high-availability.zookeeper.quorum: tdh2:2181,tdh4:2181,tdh3:2181 > high-availability.zookeeper.path.root: /flink > high-availability.zookeeper.client.acl: open > high-availability.jobmanager.port: 62236-62239 > > rest.port: 18801 > io.tmp.dirs: /data/disk1:/data/disk2:/data/disk3:/data/disk4:/data/disk5 > > security.kerberos.login.use-ticket-cache: true > security.kerberos.login.contexts: Client > security.kerberos.login.keytab: /etc/flink/conf/hdfs.keytab > security.kerberos.login.principal: hdfs > > blob.server.port: 16224 > query.server.port: 16225 > > >*And the following is the new error report, the earliest error report > in the forwarded email message:* > apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, > LocalRpcInvocation(requestRestAddress(Time))) sent to > akka.tcp://flink@tdh2:62236/user/dispatcher > because the fencing token is null. > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59) > ... 14 common frames omitted > 2019-07-01 17:10:40.159 [flink-rest-server-netty-worker-thread-39] ERROR > o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler - Could not > retrieve the redirect address. > java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing > token not set: Ignoring message > LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, > LocalRpcInvocation(requestRestAddress(Time))) sent to > akka.tcp://flink@tdh2:62236/user/dispatcher > because the fencing token is null. > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772) > at akka.dispatch.OnComplete.internal(Future.scala:258) > at akka.dispatch.OnComplete.internal(Future.scala:256) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) > at akka.actor.ActorRef.tell(ActorRef.scala:130) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.sendErrorIfSender(AkkaRpcActor.java:371) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:57) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at
Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "
Hi,All: I found some problems about on kubernates flink of 1.6.0 mentioned by Till in "HA for 1.6.0 job cluster with docker-compose" in the email list, but I found that Jira of flink-10291 in the email has been shut down in 1.7.0, and I also found similar errors in on kubernates flink of 1.7.2 at present. Could you please help me check the Settings where I have problems? Here are my Settings: web.log.path: /var/log/flink/flinkweb.log taskmanager.log.pth: /var/log/flink/taskmanager/task.log jobmanager.rpc.address: tdh2 jobmanager.rpc.port: 16223 jobstore.cache-size: 5368709120 jobstore.expiration-time: 864000 jobmanager.heap.size: 4096m taskmanager.heap.size: 6000m taskmanager.numberOfTaskSlots: 6 parallelism.default: 2 high-availability: zookeeper high-availability.storageDir: hdfs:///flink1/ha/ high-availability.zookeeper.quorum: tdh2:2181,tdh4:2181,tdh3:2181 high-availability.zookeeper.path.root: /flink high-availability.zookeeper.client.acl: open high-availability.jobmanager.port: 62236-62239 rest.port: 18801 io.tmp.dirs: /data/disk1:/data/disk2:/data/disk3:/data/disk4:/data/disk5 security.kerberos.login.use-ticket-cache: true security.kerberos.login.contexts: Client security.kerberos.login.keytab: /etc/flink/conf/hdfs.keytab security.kerberos.login.principal: hdfs blob.server.port: 16224 query.server.port: 16225 And the following is the new error report, the earliest error report in the forwarded email message: apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null. at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59) ... 14 common frames omitted 2019-07-01 17:10:40.159 [flink-rest-server-netty-worker-thread-39] ERROR o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null. at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772) at akka.dispatch.OnComplete.internal(Future.scala:258) at akka.dispatch.OnComplete.internal(Future.scala:256) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) at akka.actor.ActorRef.tell(ActorRef.scala:130) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.sendErrorIfSender(AkkaRpcActor.java:371) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:57) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message