Re:Re: Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "

2019-07-02 Thread Alex.Hu
Hi Till,
Thank you very much for answering my question!
The attachment contains the logs from my most recent attempt (Flink was
started with debug-level logging). My Flink cluster runs in a Kubernetes-based
Hadoop cluster with 3 nodes and Kerberos authentication enabled. I built a
Flink 1.7.2 image based on the Docker image given in the wiki, and wrote
Flink's Kubernetes YAML configuration based on the Kubernetes configuration
files of the other services in the Hadoop cluster, in which all pods are set
to use the host network ports directly. I therefore adjusted some port
parameters in my Flink cluster settings. I currently run only this one Flink
cluster, and in the test I started only the jobmanager pods. Following your
reply, I added "high-availability.zookeeper.cluster-id" after the related
parameters. After starting the pods on 2 nodes, Kubernetes shows that one node
(tdh3) started normally, while the other node (tdh2) failed to start and keeps
showing the same error as in my previous email. I have attached the Flink
configuration file flink-yaml, the masters and slaves configuration files, the
Hadoop cluster's ZooKeeper configuration file, and the jobmanager pod
configuration file jobmanager-yaml. Thank you very much. Could you please help
me check what mistakes I have made?




Thank you,
Alex.hu







Re: Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "

2019-07-02 Thread Till Rohrmann
Hi,

How did you start the job masters? Could you maybe share the logs of all
components? It looks as if the leader election is not working properly. One
thing to make sure of is that you specify a different cluster ID for every
new HA cluster via `high-availability.cluster-id: cluster_xy`. That way you
separate the ZNodes in ZooKeeper so that every cluster uses its own nodes
and does not interfere with other clusters. Usually this happens via the
JobID, but in the case of the `StandaloneJobClusterEntrypoint` we set it to
0. More recently, this was slightly changed. See
https://issues.apache.org/jira/browse/FLINK-12617 for more information.
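
As a minimal sketch, the HA part of `flink-conf.yaml` could then look like
this. The quorum and root path are taken from the settings you posted;
`cluster_xy` is only a placeholder, and as far as I know the ZNodes of this
cluster would then end up under `/flink/cluster_xy`:

```yaml
# flink-conf.yaml (sketch): HA settings for one cluster.
high-availability: zookeeper
high-availability.zookeeper.quorum: tdh2:2181,tdh4:2181,tdh3:2181
high-availability.zookeeper.path.root: /flink
# Placeholder ID: pick a different value for every HA cluster that
# shares this ZooKeeper quorum and root path.
high-availability.cluster-id: cluster_xy
```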

Cheers,
Till


Fw:Re:Re: About "Flink 1.7.0 HA based on zookeepers "

2019-07-01 Thread Alex.Hu
Hi all,

   I found the problems with Flink 1.6.0 on Kubernetes that Till mentioned in
the thread "HA for 1.6.0 job cluster with docker-compose" on the mailing list.
The Jira issue FLINK-10291 referenced in that thread was closed in 1.7.0, but
I currently see similar errors with Flink 1.7.2 on Kubernetes. Could you
please help me check which of my settings are wrong? Here are my settings:

web.log.path: /var/log/flink/flinkweb.log 
taskmanager.log.pth: /var/log/flink/taskmanager/task.log 


jobmanager.rpc.address: tdh2
jobmanager.rpc.port: 16223
jobstore.cache-size: 5368709120
jobstore.expiration-time: 864000
jobmanager.heap.size: 4096m


taskmanager.heap.size:  6000m
taskmanager.numberOfTaskSlots: 6
parallelism.default: 2


high-availability: zookeeper
high-availability.storageDir: hdfs:///flink1/ha/
high-availability.zookeeper.quorum: tdh2:2181,tdh4:2181,tdh3:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.client.acl: open
high-availability.jobmanager.port: 62236-62239


rest.port: 18801
io.tmp.dirs: /data/disk1:/data/disk2:/data/disk3:/data/disk4:/data/disk5


security.kerberos.login.use-ticket-cache: true
security.kerberos.login.contexts: Client
security.kerberos.login.keytab: /etc/flink/conf/hdfs.keytab
security.kerberos.login.principal: hdfs


blob.server.port: 16224
query.server.port: 16225




   The following is the new error report; the earliest error report is in the
forwarded email message:


apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null.
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59)
... 14 common frames omitted
2019-07-01 17:10:40.159 [flink-rest-server-netty-worker-thread-39] ERROR o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(98b7a69a48c04a9ca01b1eca2b714146, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@tdh2:62236/user/dispatcher because the fencing token is null.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
at akka.actor.ActorRef.tell(ActorRef.scala:130)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.sendErrorIfSender(AkkaRpcActor.java:371)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:57)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message