Flink v1.12.2 Kubernetes Session Mode cannot mount log4j.properties in configMap

2021-06-19 Thread Chenyu Zheng
  items:
  - key: flink-conf.yaml
    path: flink-conf.yaml
  name: flink-config-session-cluster-test
name: flink-config-volume

And there is no log config file in the jobmanager container:
root@session-cluster-test-689b595f8f-dg4h6:/opt/flink# ls -l $FLINK_HOME/conf/
total 0
lrwxrwxrwx 1 root root 22 Jun 19 09:23 flink-conf.yaml -> ..data/flink-conf.yaml

After a deep dive into the Flink source code, I found that the root cause could 
be here:
https://github.com/apache/flink/blob/release-1.13.1/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/FlinkConfMountDecorator.java#L104
It only adds flink-conf.yaml to the container volume mount.
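
For comparison, here is a minimal Python sketch of the shape the volume definition would need if the log config were exposed as well: one items entry per file. The helper name config_volume is hypothetical, and using log4j.properties as the key is an assumption taken from the subject line. It only models the Kubernetes volume spec and says nothing about what the decorator actually generates.

```python
import json


def config_volume(config_map_name, volume_name, file_names):
    """Build a ConfigMap volume spec that exposes each file under its own name.

    Hypothetical helper for illustration; not part of Flink.
    """
    return {
        "name": volume_name,
        "configMap": {
            "name": config_map_name,
            # One key/path pair per file that should appear in the mount.
            "items": [{"key": name, "path": name} for name in file_names],
        },
    }


volume = config_volume(
    "flink-config-session-cluster-test",
    "flink-config-volume",
    ["flink-conf.yaml", "log4j.properties"],  # second entry is an assumption
)
print(json.dumps(volume, indent=2))
```

With only flink-conf.yaml in the list, the result matches the fragment above; each additional file name adds one more entry under items.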

Could you please give me some guidance or support? Thanks so much!

BRs.
Chenyu Zheng



Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

2021-08-10 Thread Chenyu Zheng
Hi there,

I’m trying to run my Flink job on a Kubernetes cluster, but when I give my 
job a larger parallelism (128) I get an error saying 
“java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 
56ad1a5ded99f9f16ec1c786ad299159 timed out.” The job is then cancelled.

We confirmed it cannot be a network issue, since:

  *   We encounter this error every time we run this job with the larger 
parallelism (128), but it works with smaller parallelism (32/64).
  *   We are using the k8s cluster in the production environment, and no other 
containers have network problems.
  *   When we set “heartbeat.timeout” to a larger value such as 300s, the error 
no longer occurs.
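
To make the workaround in the last point concrete, here is a small Python sketch that turns a duration such as 300s into flink-conf.yaml lines. It assumes that Flink 1.12 expects heartbeat.timeout and heartbeat.interval as plain millisecond values, and that 10s is a reasonable interval; the helper names to_millis and heartbeat_conf are hypothetical.

```python
def to_millis(duration: str) -> int:
    """Convert '300s', '5min', '50000ms' or a bare number to milliseconds."""
    units = {"ms": 1, "s": 1000, "min": 60_000}
    # Try longer suffixes first so that 'ms' is not mistaken for 's'.
    for suffix in sorted(units, key=len, reverse=True):
        if duration.endswith(suffix):
            return int(duration[: -len(suffix)]) * units[suffix]
    return int(duration)  # no unit: treated as milliseconds already


def heartbeat_conf(timeout: str, interval: str = "10s") -> str:
    """Render the two heartbeat options as flink-conf.yaml lines."""
    timeout_ms, interval_ms = to_millis(timeout), to_millis(interval)
    if interval_ms >= timeout_ms:
        raise ValueError("heartbeat interval must be smaller than the timeout")
    return (
        f"heartbeat.interval: {interval_ms}\n"
        f"heartbeat.timeout: {timeout_ms}\n"
    )


print(heartbeat_conf("300s"))
```

The interval check is there because a timeout shorter than the interval would expire before a single heartbeat could arrive.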

My settings and environment:

  *   Flink 1.12.5 with Java 8, Scala 2.11
  *   JobManager start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH 
-Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/jobmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint 
-D jobmanager.memory.off-heap.size=134217728b -D 
jobmanager.memory.jvm-overhead.min=1073741824b -D 
jobmanager.memory.jvm-metaspace.size=268435456b -D 
jobmanager.memory.heap.size=15703474176b -D 
jobmanager.memory.jvm-overhead.max=1073741824b
  *   TaskManager start command: $JAVA_HOME/bin/java -classpath 
$FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 
-XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/taskmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.network.max=359703515b -D 
taskmanager.memory.network.min=359703515b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D 
taskmanager.memory.task.heap.size=1530082070b -D 
taskmanager.memory.task.off-heap.size=0b -D 
taskmanager.memory.jvm-metaspace.size=268435456b -D 
taskmanager.memory.jvm-overhead.max=429496736b -D 
taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf 
-Djobmanager.rpc.address='10.50.132.154' 
-Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' 
-Djobmanager.memory.off-heap.size='134217728b' 
-Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' 
-Drest.address='10.50.132.154' 
-Djobmanager.memory.jvm-overhead.max='1073741824b' 
-Djobmanager.memory.jvm-overhead.min='1073741824b' 
-Dtaskmanager.resource-id='stream-367f634e41349f7195961cdb0c6c-taskmanager-1-17'
 -Dexecution.target='embedded' 
-Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar'
 -Djobmanager.memory.jvm-metaspace.size='268435456b' 
-Djobmanager.memory.heap.size='15703474176b'
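
As a sanity check on the settings above, the following Python sketch adds up the memory components from the -D options and compares them with the JVM flags. It assumes the usual Flink memory model, in which the TaskManager heap is framework heap plus task heap, and the direct memory limit is framework off-heap plus task off-heap plus network memory. The dictionary keys are shortened option names chosen for this example.

```python
GIB = 1024 ** 3

# Byte values copied from the -D options in the start commands above.
tm = {
    "framework.heap": 134217728,
    "task.heap": 1530082070,
    "framework.off-heap": 134217728,
    "task.off-heap": 0,
    "network": 359703515,
    "managed": 1438814063,
    "metaspace": 268435456,
    "overhead": 429496736,
}
jm = {
    "heap": 15703474176,
    "off-heap": 134217728,
    "metaspace": 268435456,
    "overhead": 1073741824,
}


def tm_breakdown(m):
    """Return (JVM heap, direct memory limit, total process memory) in bytes."""
    heap = m["framework.heap"] + m["task.heap"]
    direct = m["framework.off-heap"] + m["task.off-heap"] + m["network"]
    total = heap + direct + m["managed"] + m["metaspace"] + m["overhead"]
    return heap, direct, total


def jm_total(m):
    """Return the total JobManager process memory in bytes."""
    return m["heap"] + m["off-heap"] + m["metaspace"] + m["overhead"]


tm_heap, tm_direct, tm_total = tm_breakdown(tm)
print("TM heap (compare -Xmx):", tm_heap)
print("TM direct (compare -XX:MaxDirectMemorySize):", tm_direct)
print("TM total GiB:", tm_total / GIB)
print("JM total GiB:", jm_total(jm) / GIB)
```

The computed heap and direct values agree with -Xmx1664299798 and -XX:MaxDirectMemorySize=493921243, and the totals come to 4 GiB per TaskManager and 16 GiB for the JobManager, so the memory options themselves are internally consistent.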

Is this expected behavior? Could you give me some guidelines on how to 
troubleshoot this issue?

BRs

Chenyu


Re: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

2021-08-10 Thread Chenyu Zheng
$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 1da1bb0693814dd8cc2549e4f5cd368a timed out.
... 27 more



From: Chenyu Zheng 
Date: Tuesday, August 10, 2021 at 7:02 PM
To: "user@flink.apache.org" 
Subject: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx 
timed out


How can I build the flink docker image from source code?

2021-08-19 Thread Chenyu Zheng
Hi contributors,

I’ve changed a bit of code in Flink and want to build a Docker image to 
test it. Could you tell me how I can build the image from source code?

Thx!