What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0...@gmail.com> wrote:
> Hello,
>
> Thanks for the quick reply. I tried to set jobmanager.rpc.address in
> flink-conf.yaml to the hostname of the master node on both nodes.
>
> Now it does not start the TaskManager on the worker node at all. When I
> start the cluster using ./bin/start-cluster.sh on the master, it shows the
> normal output of starting the JobManager and TaskManager, but when I run jps
> on the nodes, the slave does not have the TaskManager running.
>
> Running the WordCount example again fails with the same error. Stopping
> the cluster says there is no TaskManager to stop.
>
> Kind Regards,
> Ravinder Kaur
>
> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Looks like the network configuration is not correct.
>>
>> I would try setting the full host name (like "master.abc.xyz.com") as
>> jobmanager.rpc.address.
>>
>> Greetings,
>> Stephan
>>
>>
>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0...@gmail.com>
>> wrote:
>>
>>>
>>> Hello Community,
>>>
>>> I'm a student and new to Apache Flink. I'm trying to learn and have
>>> set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>> worker). I'm facing the following issue.
>>>
>>> Cluster: consists of 2 VMs (one master and one worker)
>>>
>>> The configuration was done as per
>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>
>>> When I start the cluster, the JobManager and the TaskManager are
>>> started on the master and worker respectively.
>>>
>>> Command to start the cluster: bin/start-cluster.sh
>>>
>>> jps shows all the processes running.
>>>
>>> Then I run the following command to run a WordCount example job:
>>> ./bin/flink run ./examples/WordCount.jar
>>>
>>> The result is attached to this mail.
>>>
>>> The error is
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Not enough free slots available to run the job
>>> .......................
>>> Resources available to scheduler: Number of instances=0, total number
>>> of slots=0, available slots=0
>>>
>>> I therefore suppose that the JobManager does not find the TaskManager.
>>> I checked the logs of the TaskManager, which indeed show that the
>>> TaskManager is unable to register at the JobManager for quite a long time.
>>> The TaskManager log contains these messages:
>>>
>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect
>>> from localhost: Connect timed out
>>>
>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect
>>> from address localhost: Network is Unreachable
>>>
>>> Later, when it starts up after a number of attempts, it tries to register
>>> at the JobManager, which also fails after many attempts with the
>>> following messages:
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at
>>> JobManager akka.tcp://flink@master:6123/user/jobmanager
>>> (attempt:92, timeout:30seconds)
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate
>>> with unreachable remote host
>>> [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated
>>> for 5000ms, all messages to this address will be delivered to dead
>>> letters. Reason: Connection timed out: /master:6123
>>>
>>> I browsed the internet for these and found these links helpful:
>>>
>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>> and https://issues.apache.org/jira/browse/FLINK-1119
>>>
>>> Stephan Ewen, who provided the solution in both links, gives a good
>>> explanation that the TaskManagers take quite some time to register at
>>> the JobManager, so I waited for as long as 20 minutes after starting
>>> the cluster before running the job. But even after waiting that long
>>> I get the same error.
>>>
>>> Another suggestion was to run the cluster in streaming mode. So I tried
>>> it with the command bin/start-cluster-streaming.sh and ran the job,
>>> but I get the same error.
>>>
>>> I re-checked all the configurations but could not find anything wrong.
>>> I also checked port 6123 on the master, which is in LISTEN state, while
>>> the TCP request from the worker to the master shows SYN_SENT state
>>> (using the netstat -na and lsof -i commands).
>>>
>>> I opened the web page on the master at http://localhost:8081 but it
>>> shows nothing, and localhost:8080 says connection refused.
>>>
>>> Kindly help me out as it is very important for me. Let me know if you
>>> have any questions.
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>>
>>
>
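The quoted report (port 6123 in LISTEN state on the master, SYN_SENT on the worker, "Connection timed out: /master:6123") may point to a name resolution or firewall problem between the two VMs rather than to Flink itself. As a rough way to check this independently of Flink, here is a minimal Python sketch that could be run on the worker. The helper names resolve and can_connect are made up for illustration, and the host "master" and port 6123 are simply the values from the log lines above; adjust them to whatever jobmanager.rpc.address is set to.

```python
import socket


def resolve(host):
    """Return the sorted list of IP addresses the local resolver gives for host.

    An empty list means the name could not be resolved at all.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (True, "connected") on success, otherwise (False, reason).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers timeouts, refused connections, unreachable networks,
        # and name resolution failures.
        return False, f"{type(exc).__name__}: {exc}"
```

On the worker, calling resolve("master") shows which addresses the name maps to there, and can_connect("master", 6123) shows whether a plain TCP connection to the JobManager port succeeds. If the name resolves to a loopback address such as 127.0.0.1, or if the connection attempt times out while the port is listening on the master, the /etc/hosts entries and any firewall between the VMs would be the first things I would look at.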