Yes, you are right: I cannot run any YARN application. However, I have no 
localhost in my yarn-site.xml:

<configuration>
 <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8050</value>
    </property>
</configuration>


Could someone provide me with a correct yarn-site.xml to make this work? Should 
the yarn-site.xml be the same on both the namenode and the datanodes? Sorry for 
this question, but different tutorials on Google give different configurations 
and I am confused.
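
For reference, the kind of minimal configuration I have seen suggested in some 
tutorials (assuming the ResourceManager runs on the host named master, and the 
same file is copied to every node) sets yarn.resourcemanager.hostname once, so 
that the individual RPC addresses are derived from it:

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

As far as I understand, yarn.resourcemanager.address, 
yarn.resourcemanager.scheduler.address and 
yarn.resourcemanager.resource-tracker.address then default to master with 
their standard ports, but I am not sure whether this is the right approach.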

Thanks,
Stefanos

> On 19 Nov 2015, at 12:47, Robert Metzger <rmetz...@apache.org> wrote:
> 
> The exception is thrown even before Flink code is executed, so I assume that 
> your YARN setup is not properly working.
> Did you try running any other YARN application on the setup? I suspect that 
> other systems like MapReduce or Spark will also fail to run in that environment.
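> 
> For example, a quick smoke test with the bundled MapReduce examples jar (the 
> exact path may differ depending on your Hadoop layout):
> 
>     yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10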
> 
> Maybe the yarn-site.xml on the NodeManager hosts is not correct (pointing to 
> localhost instead of the master).
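> 
> A quick way to check on each NodeManager host (assuming HADOOP_CONF_DIR points 
> to your Hadoop configuration directory):
> 
>     grep -n localhost "$HADOOP_CONF_DIR"/yarn-site.xml   # any match here is suspicious
>     getent hosts master flink-master    # should resolve to real IPs, not 127.0.0.1
>     yarn node -list                     # every NodeManager should report a real hostname:port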
> 
> On Thu, Nov 19, 2015 at 11:41 AM, Stefanos Antaris 
> <antaris.stefa...@gmail.com> wrote:
> Hi to all,
> 
> I am trying to use Flink with Hadoop YARN, but I am facing an exception while 
> trying to create a yarn-session.
> 
> First of all, I have a Hadoop cluster with 20 VMs that uses YARN. I can start 
> the Hadoop cluster and run Hadoop jobs without any problem. Furthermore, I am 
> trying to deploy a Flink cluster on the same VMs and use the Flink YARN 
> client. I have the HADOOP_HOME environment variable set and the Hadoop 
> cluster up and running. When I execute the ./bin/yarn-session.sh -n 10 -tm 
> 8192 -s 32 command, I get the following exception. Can someone explain how to 
> solve this?
> 
> 10:20:56,105 INFO  org.apache.hadoop.yarn.client.RMProxy                      
>    - Connecting to ResourceManager at master/192.168.0.194:8032
> 10:20:56,353 WARN  org.apache.hadoop.util.NativeCodeLoader                    
>    - Unable to load native-hadoop library for your platform... using 
> builtin-java classes where applicable
> 10:20:57,095 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Using values:
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    -    TaskManager count = 10
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    -    JobManager memory = 1024
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    -    TaskManager memory = 2048
> 10:20:57,365 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - This YARN session requires 21504MB of memory in the cluster. There are 
> currently only 8192MB available.
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,365 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (3/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,365 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (4/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,366 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (5/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,366 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (6/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,366 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (7/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,366 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (8/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:57,366 WARN  org.apache.flink.yarn.FlinkYarnClient                      
>    - There is not enough memory available in the YARN cluster. The 
> TaskManager(s) require 2048MB each. NodeManagers available: [8192]
> After allocating the JobManager (1024MB) and (9/10) TaskManagers, the 
> following NodeManagers are available: [1024]
> The Flink YARN client will try to allocate the YARN session, but maybe not 
> all TaskManagers are connecting from the beginning because the resources are 
> currently not available in the cluster. The allocation might take more time 
> than usual because the Flink YARN client needs to wait until the resources 
> become available.
> 10:20:58,204 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/lib/flink-dist-0.10.0.jar to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-dist-0.10.0.jar
> 10:21:00,235 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from /home/hduser/flink-0.10.0/conf/flink-conf.yaml to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-conf.yaml
> 10:21:00,277 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/lib/log4j-1.2.17.jar to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/log4j-1.2.17.jar
> 10:21:00,349 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/lib/slf4j-log4j12-1.7.7.jar 
> to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/slf4j-log4j12-1.7.7.jar
> 10:21:00,400 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/lib/flink-python-0.10.0.jar 
> to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-python-0.10.0.jar
> 10:21:00,441 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/conf/logback.xml to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/logback.xml
> 10:21:00,486 INFO  org.apache.flink.yarn.Utils                                
>    - Copying from file:/home/hduser/flink-0.10.0/conf/log4j.properties to 
> hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/log4j.properties
> 10:21:00,553 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Submitting application master application_1447928096470_0002
> 10:21:00,963 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl      
>    - Submitted application application_1447928096470_0002
> 10:21:00,964 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Waiting for the cluster to be allocated
> 10:21:00,969 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:01,973 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:02,977 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:03,982 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:04,986 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:05,990 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:06,994 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:07,996 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:09,003 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:10,007 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> 10:21:11,011 INFO  org.apache.flink.yarn.FlinkYarnClient                      
>    - Deploying cluster, current state ACCEPTED
> Error while deploying YARN cluster: The YARN application unexpectedly 
> switched to state FAILED during deployment. 
> Diagnostics from YARN: Application application_1447928096470_0002 failed 1 
> times due to Error launching appattempt_1447928096470_0002_000001. Got 
> exception: java.net.ConnectException: Call From flink-master/127.0.0.1 to 
> localhost:38425 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>       at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>       at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>       at com.sun.proxy.$Proxy31.startContainers(Unknown Source)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>       at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>       ... 9 more
> . Failing the application.
> If log aggregation is enabled on your cluster, use this command to further 
> investigate the issue:
> yarn logs -applicationId application_1447928096470_0002
> org.apache.flink.yarn.FlinkYarnClientBase$YarnDeploymentException: The YARN 
> application unexpectedly switched to state FAILED during deployment. 
> Diagnostics from YARN: Application application_1447928096470_0002 failed 1 
> times due to Error launching appattempt_1447928096470_0002_000001. Got 
> exception: java.net.ConnectException: Call From flink-master/127.0.0.1 to 
> localhost:38425 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>       at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>       at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>       at com.sun.proxy.$Proxy31.startContainers(Unknown Source)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>       at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>       ... 9 more
> . Failing the application.
> If log aggregation is enabled on your cluster, use this command to further 
> investigate the issue:
> yarn logs -applicationId application_1447928096470_0002
>       at 
> org.apache.flink.yarn.FlinkYarnClientBase.deployInternal(FlinkYarnClientBase.java:646)
>       at 
> org.apache.flink.yarn.FlinkYarnClientBase.deploy(FlinkYarnClientBase.java:338)
>       at 
> org.apache.flink.client.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:409)
>       at 
> org.apache.flink.client.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:351)
> 
> 
> 
> Just to mention, my flink-conf.yaml is the following:
> #==============================================================================
> # Common
> #==============================================================================
> 
> # The host on which the JobManager runs. Only used in non-high-availability mode.
> # The JobManager process will use this hostname to bind the listening servers to.
> # The TaskManagers will try to connect to the JobManager on that host.
> 
> jobmanager.rpc.address: master
> 
> 
> # The port where the JobManager's main actor system listens for messages.
> 
> jobmanager.rpc.port: 6123
> 
> 
> # The heap size for the JobManager JVM
> 
> jobmanager.heap.mb: 256
> 
> 
> # The heap size for the TaskManager JVM
> 
> taskmanager.heap.mb: 512
> 
> 
> # The number of task slots that each TaskManager offers. Each slot runs one
> # parallel pipeline.
> 
> taskmanager.numberOfTaskSlots: 10
> 
> 
> # The parallelism used for programs that did not specify any other
> # parallelism.
> 
> parallelism.default: 5
> 
> 
> #==============================================================================
> # Web Frontend
> #==============================================================================
> 
> # The port under which the web-based runtime monitor listens.
> # A value of -1 deactivates the web server.
> 
> jobmanager.web.port: 8081
> 
> 
> # The port under which the standalone web client
> # (for job upload and submit) listens.
> 
> webclient.port: 8080
> 
> 
> #==============================================================================
> # Streaming state checkpointing
> #==============================================================================
> 
> # The backend that will be used to store operator state checkpoints if 
> # checkpointing is enabled. 
> #
> # Supported backends: jobmanager, filesystem, <class-name-of-factory> 
> #
> #state.backend: filesystem
> 
> 
> # Directory for storing checkpoints in a Flink-supported filesystem
> # Note: State backend must be accessible from the JobManager and all TaskManagers.
> # Use "hdfs://" for HDFS setups, "file://" for UNIX/POSIX-compliant file systems,
> # (or any local file system under Windows), or "S3://" for S3 file system.
> #
> # state.backend.fs.checkpointdir: hdfs://namenode-host:port/flink-checkpoints
> 
> 
> #==============================================================================
> # Advanced
> #==============================================================================
> 
> # The number of buffers for the network stack.
> #
> # taskmanager.network.numberOfBuffers: 2048
> 
> 
> # Directories for temporary files.
> #
> # Add a delimited list for multiple directories, using the system directory
> # delimiter (colon ':' on unix) or a comma, e.g.:
> #     /data1/tmp:/data2/tmp:/data3/tmp
> #
> # Note: Each directory entry is read from and written to by a different I/O
> # thread. You can include the same directory multiple times in order to create
> # multiple I/O threads against that directory. This is for example relevant
> # for high-throughput RAIDs.
> #
> # If not specified, the system-specific Java temporary directory
> # (java.io.tmpdir property) is taken.
> #
> # taskmanager.tmp.dirs: /tmp
> 
> 
> # Path to the Hadoop configuration directory.
> #
> # This configuration is used when writing into HDFS. Unless specified otherwise,
> # HDFS file creation will use HDFS default settings with respect to block-size,
> # replication factor, etc.
> #
> # You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
> #
> fs.hdfs.hadoopconf: /usr/local/hadoop/etc/hadoop/
> 
> 
> #==============================================================================
> # Master High Availability (required configuration)
> #==============================================================================
> 
> # The list of ZooKeeper quorum peers that coordinate the high-availability
> # setup. This must be a list of the form:
> # "host1:clientPort,host2[:clientPort],..." (default clientPort: 2181)
> #
> # recovery.mode: zookeeper
> #
> # recovery.zookeeper.quorum: localhost:2181,...
> #
> # Note: You need to set the state backend to 'filesystem' and the checkpoint
> # directory (see above) before configuring the storageDir.
> #
> # recovery.zookeeper.storageDir: hdfs:///recovery
> 
> Thanks in advance,
> Stefanos Antaris
> 
> 
