Re: TaskManager unable to register with JobManager

Stephan Ewen Wed, 03 Feb 2016 12:36:46 -0800

There still seems to be something wrong with your network config.

This does not look like a Flink problem. It needs work on your end, and we
cannot debug it for you.


Please go through your network setup and check, for example:

  - whether the hostnames are right (is "master-IP" really the name of the
network interface on the JobManager?)
  - whether the machines can actually communicate with each other (firewall,
etc.)
  - whether the "master-IP" interface is externally visible (such that other
machines can connect to it)

These things are prerequisites for any distributed system installation.
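
As a rough way to check the first two points from the worker machine, a small
script along the following lines could help. This is only a sketch and not
part of Flink: check_endpoint is a made-up helper name, "master-IP" is the
placeholder used in this thread, and 6123 is the JobManager port that appears
in the quoted logs. It tests plain hostname resolution and a TCP connect,
nothing Flink-specific.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return (resolved_ip, reachable) for a host/port pair.

    resolved_ip is None if the name cannot be resolved; reachable is True
    only if a TCP connection to (resolved_ip, port) could be opened.
    """
    # Step 1: can this machine resolve the configured hostname at all?
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return None, False

    # Step 2: can this machine open a TCP connection to that address/port?
    # A firewall that silently drops packets typically shows up as a timeout.
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


# Hypothetical usage on the worker, with the placeholders from this thread:
#   print(check_endpoint("master-IP", 6123))
```

If the first element of the result is None, the name does not resolve on that
machine. If an address comes back but the second element is False, the likely
suspects are a firewall or the JobManager listening on an interface that is
not visible from outside.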


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
>
> the TaskManager is starting up, but it's not able to register at the
> JobManager. Did you check the JobManager log? Do you see anything suspicious
> there? Are the ports matching?
>
>
> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <neetu0...@gmail.com> wrote:
>
>> Hello,
>>
>> Thank you for pointing it out. I had made a small typo when editing the
>> hostname in flink-conf.yaml. I've fixed it and the TaskManager started up.
>> But I still can't run the WordCount example; it throws the same
>> NoResourceAvailableException.
>>
>> Caused by:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Not enough free slots available to run the job. You can decrease the
>> operator parallelism or increase the number of slots per TaskManager in the
>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>> getDefaultTextLineDataSet(WordCountData.java:70)
>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>> Number of instances=0, total number of slots=0, available slots=0
>>         at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>         at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>         at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>         at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>         at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>         ... 8 more
>>
>> The TaskManager log again has the same errors as before:
>>
>> 20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>        - Failed to connect from address '/slave-IP': connect timed out
>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>        - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is
>> unreachable
>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>        - Failed to connect from address '/127.0.0.1': Invalid argument
>> 20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
>>        - Could not connect to /master-IP:6123. Selecting a local address
>> using heuristics.
>> 20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - TaskManager will use hostname/address 'hostname-of-slave'
>> (slave-IP) for communication.
>> 20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Starting TaskManager in streaming mode BATCH_ONLY
>> 20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Starting TaskManager actor system at slave_IP:0
>> 20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
>>        - Slf4jLogger started
>> 20:58:59,842 INFO  Remoting
>>        - Starting remoting
>> 20:59:00,094 INFO  Remoting
>>        - Remoting started; listening on addresses
>> :[akka.tcp://flink@slave-IP:33813]
>> 20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Starting TaskManager actor
>> 20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
>>         - NettyConfig [server address: hostname-of-master/master-IP, server
>> port: 49030, memory segment size (bytes): 32768, transport type: NIO,
>> number of server threads: 0 (use Netty's default), number of client
>> threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's
>> default), client connect timeout (sec): 120, send/receive buffer size
>> (bytes): 0 (use Netty's default)]
>> 20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Messages between TaskManager and JobManager have a max timeout of
>> 100000 milliseconds
>> 20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00%
>> usable)
>> 20:59:00,210 INFO
>>  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
>> 64 MB for network buffer pool (number of memory segments: 2048, bytes per
>> segment: 32768).
>> 20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Using 0.7 of the currently free heap space for Flink managed heap
>> memory (293 MB).
>> 20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>>        - I/O manager uses directory
>> /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
>> 20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
>>        - User file cache uses directory
>> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Starting TaskManager actor at
>> akka://flink/user/taskmanager#-157676733.
>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - TaskManager data connection information: hostname-of-master
>> (dataPort=49030)
>> 20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - TaskManager has 1 task slot(s).
>> 20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB
>> (used/committed/max)]
>> 20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Trying to register at JobManager 
>> akka.tcp://flink@master-IP:6123/user/jobmanager
>> (attempt 1, timeout: 500 milliseconds)
>> 20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Trying to register at JobManager 
>> akka.tcp://flink@master-IP:6123/user/jobmanager
>> (attempt 2, timeout: 1000 milliseconds)
>> 20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Trying to register at JobManager 
>> akka.tcp://flink@master-IP:6123/user/jobmanager
>> (attempt 3, timeout: 2000 milliseconds)
>> 20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Trying to register at JobManager 
>> akka.tcp://flink@master-IP:6123/user/jobmanager
>> (attempt 4, timeout: 4000 milliseconds)
>> 20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Trying to register at JobManager 
>> akka.tcp://flink@master-IP:6123/user/jobmanager
>> (attempt 5, timeout: 8000 milliseconds)
>>
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> This looks like the reason:
>>>
>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>>> 'hostname-of-master' specified in the configuration
>>>
>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <neetu0...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> The log file of the TaskManager now shows the following:
>>>>
>>>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>>>           - Unable to load native-hadoop library for your platform... using
>>>> builtin-java classes where applicable
>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -
>>>> --------------------------------------------------------------------------------
>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>>>> Date:22.11.2015 @ 12:41:12 CET)
>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Current user: flink
>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation -
>>>> 1.7/24.91-b01
>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Maximum heap size: 491 MiBytes
>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Hadoop version: 2.7.0
>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  JVM Options:
>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     -Xms512M
>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     -Xmx512M
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     -XX:MaxDirectMemorySize=8388607T
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     -XX:MaxPermSize=256m
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -
>>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -
>>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -
>>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Program Arguments:
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     --configDir
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     /home/flink/flink-0.10.1/conf
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     --streamingMode
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -     batch
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -  Classpath:
>>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          -
>>>> --------------------------------------------------------------------------------
>>>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          - Maximum number of open file descriptors is 4096
>>>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          - Loading configuration from /home/flink/flink-0.10.1/conf
>>>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>          - Security is not enabled. Starting non-authenticated TaskManager.
>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>>          - Failed to run TaskManager.
>>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>>>> 'hostname-of-master' specified in the configuration
>>>>         at
>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>         at
>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>         at
>>>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>         at
>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>         at
>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>         at
>>>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>         at
>>>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>
>>>> Kind Regards,
>>>> Ravinder Kaur
>>>>
>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> What do the TaskManager logs say?
>>>>>
>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thanks for the quick reply. I tried setting jobmanager.rpc.address in
>>>>>> flink-conf.yaml to the hostname of the master node on both nodes.
>>>>>>
>>>>>> Now the TaskManager does not start on the worker node at all. When
>>>>>> I start the cluster using ./bin/start-cluster.sh on the master, it shows
>>>>>> the normal output of starting the JobManager and TaskManager, but when I
>>>>>> run jps on the nodes, the slave does not have the TaskManager running.
>>>>>>
>>>>>> Running the WordCount example again fails with the same error.
>>>>>> Stopping the cluster says there is no TaskManager to stop.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Ravinder Kaur
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <se...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Looks like the network configuration is not correct.
>>>>>>>
>>>>>>> I would try setting the full host name (like "master.abc.xyz.com")
>>>>>>> as jobmanager.rpc.address.
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hello Community,
>>>>>>>>
>>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and have
>>>>>>>> set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>>>>>>> worker). I'm facing the following issue.
>>>>>>>>
>>>>>>>> Cluster: consists of 2 VMs (one master and one worker)
>>>>>>>>
>>>>>>>> The configurations are done as per
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>
>>>>>>>> When I start the cluster both the JobManager and the TaskManager
>>>>>>>> are started on the master and worker respectively.
>>>>>>>>
>>>>>>>> Command to start the cluster: bin/start-cluster.sh
>>>>>>>>
>>>>>>>> jps shows all the processes running.
>>>>>>>>
>>>>>>>> Then I run the following command to run a WordCount example job:
>>>>>>>> ./bin/flink run ./examples/WordCount.jar
>>>>>>>>
>>>>>>>> The result is attached to this mail.
>>>>>>>>
>>>>>>>> The error is
>>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>>>> Not enough free slots available to run the job
>>>>>>>> ....................... Resources available to scheduler: Number of
>>>>>>>> instances=0, total number of slots=0, available slots=0
>>>>>>>>
>>>>>>>> Therefore I suppose that the JobManager does not find the
>>>>>>>> TaskManager. I checked the logs of the TaskManager, which indeed show
>>>>>>>> that the TaskManager is unable to register at the JobManager for quite
>>>>>>>> a long time. The TaskManager log contains these messages:
>>>>>>>>
>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>> Failed to connect from localhost: Connect timed out
>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>> Failed to connect from address localhost: Network is Unreachable
>>>>>>>>
>>>>>>>> Later, after a number of attempts, the TaskManager starts up and tries
>>>>>>>> to register at the JobManager. That also fails after a lot of
>>>>>>>> attempts, showing the following messages:
>>>>>>>>
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager:
>>>>>>>> Trying to register at JobManager
>>>>>>>> akka.tcp://flink@master:6123/user/jobmanager
>>>>>>>> (attempt:92, timeout:30seconds)
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager:
>>>>>>>> Tried to associate with unreachable remote host
>>>>>>>> [akka.tcp://flink@master:6123/user/jobmanager].
>>>>>>>> Address is now gated for 5000ms, all messages to this address will be
>>>>>>>> delivered to dead letters. Reason: Connection timed out: /master:6123
>>>>>>>>
>>>>>>>> I browsed the internet for these messages and found the following
>>>>>>>> links helpful:
>>>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-1119
>>>>>>>> Stephan Ewen, who provided the solution in both links, gives a good
>>>>>>>> explanation that the TaskManagers take quite some time to register at
>>>>>>>> the JobManager. I therefore waited as long as 20 minutes after
>>>>>>>> starting the cluster before running the job, but even after waiting
>>>>>>>> that long I get the same error.
>>>>>>>>
>>>>>>>> Another suggestion was to run the cluster in streaming mode. So I
>>>>>>>> tried it with the command bin/start-cluster-streaming.sh and ran
>>>>>>>> the job, but I get the same error.
>>>>>>>>
>>>>>>>> I re-checked all the configurations but could not find anything
>>>>>>>> wrong. I also checked port 6123 using the netstat -na and lsof -i
>>>>>>>> commands: on the master it is in LISTEN state, and the TCP request
>>>>>>>> from the worker to the master shows SYN_SENT state.
>>>>>>>>
>>>>>>>> I opened the webpage on master http://localhost:8081 but it shows
>>>>>>>> nothing and localhost:8080 says connection refused.
>>>>>>>>
>>>>>>>> Kindly help me out as it is very important for me. Let me know if
>>>>>>>> you have any questions.
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Ravinder Kaur
>>>>>>>>
