Hi Abhi, I'm also looking to deploy Flink jobs remotely to YARN, and eventually automate it - just wondering if you found a way to do it?
Thanks, Josh On Wed, May 25, 2016 at 12:36 AM, Bajaj, Abhinav <abhinav.ba...@here.com> wrote: > Hi, > > Has anyone tried to submit a Flink Job remotely to Yarn running in AWS ? > The case I am stuck with is where the Flink client is on my laptop and > YARN is running on AWS. > > @Robert, Did you get a chance to try this out? > > Regards, > Abhi > > From: "Bajaj, Abhinav" <abhinav.ba...@here.com> > Date: Friday, April 29, 2016 at 3:50 PM > > To: "user@flink.apache.org" <user@flink.apache.org> > Subject: Re: Submit Flink Jobs to YARN running on AWS > > Hi Robert, > > Thanks for your reply. > > I am using the Public DNS for the EC2 machines in the yarn and hdfs > configuration files. It looks like " > ec2-203-0-113-25.compute-1.amazonaws.com” > You should be able to connect then. > > I have hadoop installed locally and the YARN_CONF_DIR is pointing to it. > The yarn-site.xml and core-site.xml files use the resource manager > address(Public DNS) running in AWS. > > So, whenever I submit the job using the client on my laptop, it connects > to RM. > The RM starts the YARN application and starts the Job manager. > The job manager starts the actor system using the internal IP of the > nodemanager. In my understanding, this is where the problem lies. > > The local client tries to connect to the Job manager actor system but the > messages are dropped by the actor system as the IP address(EC2 internal IP) > that actor system started with does not match the external IP > address(Public IP) that was used by Flink client to send the message. > Please see my first mail below for detailed logs. > > Please keep me posted with your progress. > > I plan to move the cluster to VPC for other reasons. > I have limited knowledge of VPC but I guess the difference in internal and > external IP address will not be resolved. > Please correct if your views are different. > > It will be great if you are able to reproduce the issue. > > Thanks again. > Abhi > > > > *[image: cid:DACBF116-FD8C-48DB-B91D-D54510B306E8]* > > *Abhinav Bajaj* > > Senior Engineer > > HERE Predictive Analytics > > Office: +12062092767 > > Mobile: +17083299516 > > *HERE Seattle* > > 701 Pike Street, #2000, Seattle, WA 98101, USA > > *47° 36' 41" N. 122° 19' 57" W* > > *HERE Maps* > > > > > From: Robert Metzger <rmetz...@apache.org> > Reply-To: "user@flink.apache.org" <user@flink.apache.org> > Date: Tuesday, April 26, 2016 at 3:16 AM > To: "user@flink.apache.org" <user@flink.apache.org> > Subject: Re: Submit Flink Jobs to YARN running on AWS > > I've started my own EMR cluster and tried to launch a Flink job from my > local machine on it. > I have to admin that configuring the EMR launched Hadoop for external > access is quite a hassle. > > I'm not even able to submit Flink to the YARN cluster because the client > can not connect to the ResourceManager. I've change the resource manager > hostname to the public one in the yarn-site.xml on the cluster and > restarted it, but the client still can not connect. > It seems that the RM address is being overwritten by the Hadoop code? > [image: Inline image 1] > > How did you manage to get this working? > > In the VM settings, I disabled the "Source/Dest checks", but I don't think > this is related. > > Have you considered using Amazon's VPN service, I guess then you would > have "local" access to the cluster? > > On YARN, Flink is not using the flink-conf.yaml setting for the > jobmanager's hostname. Its using YARN's "yarn.nodemanager.hostname" from > the yarn-site.xml. > I haven't tried it, but it could work if you set the public hostname of > each NodeManager in the yarn-site.xml. > > Also, maybe the product forum / customer support of Amazon can help you > here. Other systems like Spark or Storm have very similar architectures and > will face the same issues. I guess they have some recipes for such > situations. > > Regards, > Robert > > > > > On Tue, Apr 26, 2016 at 10:47 AM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi Abhi, >> >> I'll try to reproduce the issue and come up with a solution. >> >> On Tue, Apr 26, 2016 at 1:13 AM, Bajaj, Abhinav <abhinav.ba...@here.com> >> wrote: >> >>> Hi Fabian, >>> >>> Thanks for your reply and the pointers to documentation. >>> >>> In these steps, I think the Flink client is installed on the master >>> node, referring to steps mentioned in Flink docs here >>> <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html> >>> . >>> However, the scenario I have is to run the client on my local machine >>> and submit jobs remotely to the YARN Cluster (running on EMR or >>> independently). >>> >>> Let me describe in more detail here. >>> I am trying to submit a single Flink Job to YARN using the client, >>> running on my dev machine - >>> >>> ./bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 >>> ./examples/batch/WordCount.jar >>> >>> In my understanding, YARN (running in AWS) allocates a container for the >>> Jobmanager. >>> Jobmanager discovers the IP and started the Actor system. At this step >>> the IP it uses is the internal IP address, of the EC2 instance. >>> >>> The client, running on my dev machine, is not able to connect to the >>> Jobmanager for reasons explained in my mail below. >>> >>> Is there a way, where I can set Jobmanager to use the hostname and not >>> the IP address? >>> >>> Or any other suggestions? >>> >>> Thanks, >>> Abhi >>> >>> *[image: cid:DACBF116-FD8C-48DB-B91D-D54510B306E8]* >>> >>> *Abhinav Bajaj* >>> >>> Senior Engineer >>> >>> HERE Predictive Analytics >>> >>> Office: +12062092767 >>> >>> Mobile: +17083299516 >>> >>> *HERE Seattle* >>> >>> 701 Pike Street, #2000, Seattle, WA 98101, USA >>> >>> *47° 36' 41" N. 122° 19' 57" W* >>> >>> *HERE Maps* >>> >>> >>> >>> >>> From: Fabian Hueske <fhue...@gmail.com> >>> Reply-To: "user@flink.apache.org" <user@flink.apache.org> >>> Date: Wednesday, March 9, 2016 at 12:51 AM >>> To: "user@flink.apache.org" <user@flink.apache.org> >>> Subject: Re: Submit Flink Jobs to YARN running on AWS >>> >>> Hi Abhi, >>> >>> I have used Flink on EMR via YARN a couple of times without problems. >>> I started a Flink YARN session like this: >>> >>> ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096 >>> >>> This will start five YARN containers (1 JobManager with 1024MB, 4 >>> Taskmanagers with 4096MB). See more config options in the documentation [1]. >>> In one of the last lines of the std-out output you should find a line >>> that tells you the IP and port of the JobManager. >>> >>> With the IP and port, you can submit a job as follows: >>> >>> ./bin/flink run -m jmIP:jmPort -p 4 jobJarFile.jar <arguments> >>> >>> This will send the job to the JobManager specified by IP and port and >>> execute the program with a parallelism of 4. See more config options in the >>> documentation [2]. >>> >>> If this does not help, could you share the exact command that you use to >>> start the YARN session and submit the job? >>> >>> Best, Fabian >>> >>> [1] >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/yarn_setup.html >>> [2] >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/cli.html >>> >>> 2016-03-08 0:25 GMT+01:00 Bajaj, Abhinav <abhinav.ba...@here.com>: >>> >>>> Hi, >>>> >>>> I am a newbie to Flink and trying to use it in AWS. >>>> I have created a YARN cluster on AWS EC2 machines. >>>> Trying to submit Flink job to the remote YARN cluster using the Flink >>>> Client running on my local machine. >>>> >>>> The Jobmanager start successfully on the YARN container but the client >>>> is not able to connect to the Jobmanager. >>>> >>>> Flink Client Logs - >>>> >>>> 13:57:34,877 INFO org.apache.flink.yarn.FlinkYarnClient >>>> - Deploying cluster, current state ACCEPTED >>>> 13:57:35,951 INFO org.apache.flink.yarn.FlinkYarnClient >>>> - Deploying cluster, current state ACCEPTED >>>> 13:57:37,027 INFO org.apache.flink.yarn.FlinkYarnClient >>>> - YARN application has been deployed successfully. >>>> 13:57:37,100 INFO org.apache.flink.yarn.FlinkYarnCluster >>>> - Start actor system. >>>> 13:57:37,532 INFO org.apache.flink.yarn.FlinkYarnCluster >>>> - Start application client. >>>> YARN cluster started >>>> JobManager web interface address >>>> http://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:8088/proxy/application_1456184947990_0003/ >>>> Waiting until all TaskManagers have connected >>>> 13:57:37,540 INFO org.apache.flink.yarn.ApplicationClient >>>> - Notification about new leader address akka.tcp: >>>> //flink@54.35.41.12:41292/user/jobmanager with session ID null. >>>> No status updates from the YARN cluster received so far. Waiting ... >>>> 13:57:37,543 INFO org.apache.flink.yarn.ApplicationClient >>>> - Received address of new leader >>>> akka.tcp://flink@54.35.41.12:41292/user/jobmanager >>>> with session ID null. >>>> 13:57:37,543 INFO org.apache.flink.yarn.ApplicationClient >>>> - Disconnect from JobManager null. >>>> 13:57:37,545 INFO org.apache.flink.yarn.ApplicationClient >>>> - Trying to register at JobManager akka.tcp: >>>> //flink@54.35.41.12:41292/user/jobmanager. >>>> No status updates from the YARN cluster received so far. Waiting ... >>>> >>>> The logs of the Jobmanager contains the following - >>>> >>>> 21:57:39,142 ERROR akka.remote.EndpointWriter >>>> - dropping message [class akka.actor.ActorSelectionMessage] for >>>> non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving >>>> at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are >>>> [akka.tcp://flink@172.31.23.18:41292] >>>> 21:57:40,782 INFO org.apache.flink.runtime.instance.InstanceManager >>>> - Registered TaskManager at ec2-54-35-41-12 >>>> (akka.tcp://flink@172.31.23.18:60565/user/taskmanager) as >>>> 72101dd2ee94caa7a5ec5a75488359aa. Current number of registered hosts is 1. >>>> Current number of alive task slots is 1. >>>> 21:57:41,162 ERROR akka.remote.EndpointWriter >>>> - dropping message [class akka.actor.ActorSelectionMessage] for >>>> non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving >>>> at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are >>>> [akka.tcp://flink@172.31.23.18:41292] >>>> >>>> It seems the problem is in the mismatch of the Jobmanager Akka actors >>>> system running address and the one user by the Client. >>>> 172.31.23.18 – is the internal private IP of the EC2 machine where the >>>> Jobmanager container is running. >>>> 54.35.41.12 – is the external IP of the EC2 machine, used by Flink >>>> client to submit the Job. >>>> Because of this mismatch the messages are ignored by the Akka actor >>>> System. >>>> >>>> Can someone please help me with this issue. >>>> I can share the detailed logs, if required. >>>> >>>> Thanks, >>>> Abhi >>>> >>>> >>> >> >