Re: job doesn't start via cli after migrating Flink from 1.8 to 1.10

Yang Wang Fri, 10 Apr 2020 00:37:45 -0700

I will try to answer your questions inline.

> The server has twice more than that, and on flink 1.8 this configuration
> works, why when switching to 1.10 it is not enough resources?



Since 1.10, the taskmanager resource-related configuration has changed and
the default values are bigger than before. So you may find that the same
application consumes more resources. You could check out the migration
guide[1] for more information.


> Why ClusterEnterpoint reports -xmx424m ?


The jobmanager container is requested with 1024m (masterMemoryMB=1024 in
your log), and the default minimum cut-off is 600m (configured via
“containerized.heap-cutoff-min”). So only 424m is left for the jobmanager
heap.
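
The arithmetic behind the 424m figure can be sketched as follows. This is
only an illustration, not Flink's actual code: the function name is
hypothetical, and it assumes the cut-off is the larger of the configured
minimum (600m) and a fraction of the container memory (assumed here to be
0.25, which I believe is the default of “containerized.heap-cutoff-ratio”).

```python
def jobmanager_heap_mb(container_mb, cutoff_min_mb=600, cutoff_ratio=0.25):
    """Estimate the jobmanager heap left after the heap cut-off.

    Hypothetical helper for illustration; the defaults are assumptions
    based on the Flink 1.10 default configuration.
    """
    # The cut-off is assumed to be the larger of the minimum and the ratio share.
    cutoff_mb = max(cutoff_min_mb, int(container_mb * cutoff_ratio))
    return container_mb - cutoff_mb


if __name__ == "__main__":
    # masterMemoryMB=1024 from the cluster specification in the original mail.
    print(jobmanager_heap_mb(1024))
```

With a 1024m container the ratio share (256m) is below the 600m minimum, so
the minimum applies and the rest goes to the heap.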

> When I start the job YarnClusterDescriptor.deployJobCluster it reports
> the amount of memory assigned to the task manager, ClusterEnterpoint
> reports -xmx424m is responsible for?


The “-Xmx424m” is only the jobmanager heap size. You need to check the
taskmanager logs to see whether the memory settings there are as expected.

> What leads to this exception and how am I supposed to configure JAAS
> section named Client?


I am not an expert in security. However, it seems that Flink creates a
default empty JAAS file and the zookeeper client tries to load it, which
causes this warning log. But I think it is unrelated to your problem. I
have tried on my YARN cluster: the same logs show up and the Flink job runs
pretty well. If you really want to connect to zookeeper with JAAS, I
think you need to specify your own valid JAAS file.[2]

> What can be the reason of failures  to connect ResourceManager (if with
> flink 1.8 the job didn't have such issues, it's not a firewall issue or
> lack of resources)?


This is quite strange: the JobMaster and the Flink ResourceManager run in
the same process, yet the JobMaster could not connect to the address
“ip-172-31-65-130.eu-central-1.compute.internal:39331”. When you find such
logs, could you log in to the YARN nodemanager host to check whether the
JobManager process is listening on the specified port, and then use
`telnet` to check the network connectivity?
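
If `telnet` is not installed on the node, the same connectivity check can
be approximated with a few lines of Python. This is a minimal sketch: the
helper name is hypothetical, and the host and port are simply the ones from
your log, so replace them with the values from your own run.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Address taken from the JobMaster log in the original mail.
    host = "ip-172-31-65-130.eu-central-1.compute.internal"
    port = 39331
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

A False result only says that the TCP connection could not be opened. It
does not distinguish between a process that is not listening and a network
problem, so checking the listening port on the node itself is still needed.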

Also, I think you could try configuring the jobmanager RPC port as a fixed
port or a port range (configured via “yarn.application-master.port”).

Hope this helps.

[1].
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_migration.html
[2].
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/security-kerberos.html#jaas-security-module


Best,
Yang

On Thu, Apr 9, 2020 at 2:02 AM, Vitaliy Semochkin <vitaliy...@gmail.com> wrote:

> Hi,
>
> I've recently migrated from Flink 1.8 to Flink 1.10
> And when I start the job using YarnClusterDescriptor.deployJobCluster
> method everything works fine.
>
> However when I start the job from shell script, the job fails with
> messages:
> *Shell script reports:*
> Cluster specification: ClusterSpecification{masterMemoryMB=1024,
> taskManagerMemoryMB=12000, slotsPerTaskManager=3}
> YarnClusterDescriptor  Deployment took more than 60 seconds. Please check
> if the requested resources are available in the YARN cluster
>
> *1. The server has twice more than that, and on flink 1.8 this
> configuration works, why when switching to 1.10 it is not enough resources?*
> *yarn log content of the job reports:*
> 2020-04-08 14:31:02,558 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting
> YarnJobClusterEntrypoint (Version: 1.10.0, Rev:aa4eb8f,
> Date:07.02.2020 @ 19:18:19 CET)
> 2020-04-08 14:31:02,558 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS
> current user: yarn
> 2020-04-08 14:31:03,092 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current
> Hadoop/Kerberos user: erm_user
> 2020-04-08 14:31:03,092 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: Java
> HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.112-b15
> 2020-04-08 14:31:03,092 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum
> heap size: 406 MiBytes
> 2020-04-08 14:31:03,092 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>  JAVA_HOME: /usr/jdk64/jdk1.8.0_112
> 2020-04-08 14:31:03,093 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop
> version: 2.7.5
> 2020-04-08 14:31:03,093 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM
> Options:
>
> *2020-04-08 14:31:03,093 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Xms424m*
> *2020-04-08 14:31:03,094 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Xmx424m*
> 2020-04-08 14:31:03,094 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Dlog.file=/hadoop/yarn/log/application_1586286375485_0025/container_e82_1586286375485_0025_01_000001/jobmanager.log
> 2020-04-08 14:31:03,094 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> -Dlog4j.configuration=file:log4j.properties
> 2020-04-08 14:31:03,094 INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program
> Arguments: (none)
>
> *2. Why ClusterEnterpoint reports -xmx424m ?*
>
> *3. When I start the job YarnClusterDescriptor.deployJobCluster it reports
> the amount of memory assigned to the task manager, *
> * ClusterEnterpoint reports -xmx424m is responsible for?*
>
> Second suspicious message in log is:
> 2020-04-08 16:28:50,840 WARN  org.apache.zookeeper.ClientCnxn
>                   - *SASL configuration failed:
> javax.security.auth.login.LoginException: No JAAS configuration section
> named 'Client' was found in specified JAAS configuration file*:
> '/hadoop/yarn/local/usercache/erm_user/appcache/application_1586286375485_0026/jaas-1348005722200054084.conf'.
> Will continue connection to Zookeeper server without SASL authentication,
> if Zookeeper server allows it.
> *4. What leads to this exception and how am I supposed to configure JAAS
> section named Client?*
>
> Third suspicious message, though most likely it is an outcome of something
> being incorrectly configured:
> 2020-04-08 16:28:52,115 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request [SlotRequestId{7a4da93f1f0bed92ccdbd707dfb47b7f}]
> 2020-04-08 16:28:52,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{3642e93314a78205854b7dfee80ea1a7}]
> 2020-04-08 16:28:52,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{c632edc6775f52762cb5a981a4109b89}]
> 2020-04-08 16:28:52,125 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                  - Connecting to ResourceManager
> akka.tcp://fl...@ip-172-31-65-130.eu-central-1.compute.internal:39331/user/resourcemanager(00000000000000000000000000000000)
> 2020-04-08 16:28:52,130 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                  - Could not resolve ResourceManager address
> akka.tcp://fl...@ip-172-31-65-130.eu-central-1.compute.internal:39331/user/resourcemanager,
> retrying in 10000 ms: Could not connect to rpc endpoint under address
> akka.tcp://fl...@ip-172-31-65-130.eu-central-1.compute.internal:39331/user/resourcemanager..
>
> 5. *What can be the reason of failures  to connect ResourceManager (if
> with flink 1.8 the job didn't have such issues, it's not a firewall issue
> or lack of resources)*?
>
> PS
> The whole yarn log is attached.
>
>
