I'm no expert here, but are 4 yarn containers/task managers (-yn 4) not too many for 3 data nodes (=3 dn?)?
also, isn't the YARN UI reflecting its own jobs, i.e. running flink, as opposed to running the actual flink job? or did you mean that the flink web ui (through yarn) showed the submitted job as succeeded? Nico On Thursday, 8 June 2017 16:49:37 CEST Biplob Biswas wrote: > I am trying to run a flink job on our cluster with 3 dn and 1 nn. I am usng > the following command line argument to run this job, but I get an exception > saying "Could not connect to the leading JobManager. Please check that the > JobManager is running" ... what could I be doing wrong? > Surprisingly, on yarn UI, i shows that the job finished and it succeeded, > which is completely false. > > Can anyone shed some light on this? > > ./bin/flink run -m yarn-cluster -yn 4 -yt ./lib -c CEPForBAMOther > ~/flinkCEPJob/FlinkCEP-1.0-SNAPSHOT.jar > 2017-06-08 14:58:45,247 INFO > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline > service address: > http://airpluspoc-hdp-mn1.germanycentral.cloudapp.microsoftazure.de:8188/ws/ > v1/timeline/ 2017-06-08 14:58:45,372 INFO > org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at > airpluspoc-hdp-mn1.germanycentral.cloudapp.microsoftazure.de/172.16.6.31:805 > 0 2017-06-08 14:58:45,883 WARN > org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The > short-circuit local reads feature cannot be used because libhadoop cannot be > loaded. > 2017-06-08 14:58:51,296 INFO > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted > application application_1496901009273_0048 > Cluster started: Yarn cluster with application id > application_1496901009273_0048 > Using address > airpluspoc-hdp-dn2.germanycentral.cloudapp.microsoftazure.de:36661 to > connect to JobManager. > JobManager web interface address > http://airpluspoc-hdp-mn1.germanycentral.cloudapp.microsoftazure.de:8088/pro > xy/application_1496901009273_0048/ Using the parallelism provided by the > remote cluster (4). To use another parallelism, set it at the ./bin/flink > client. > Starting execution of program > RecordReadEventType > Test String > Waiting until all TaskManagers have connected > > ------------------------------------------------------------ > The program finished with the following exception: > > org.apache.flink.client.program.ProgramInvocationException: The main method > caused an error. > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgr > am.java:545) at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExec > ution(PackagedProgram.java:419) at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381) > at > org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259) > at > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086) > at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133) > at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130) at > org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurity > Context.java:43) at java.security.AccessController.doPrivileged(Native > Method) at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja > va:1657) at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSec > urityContext.java:40) at > org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129) Caused by: > java.lang.RuntimeException: Unable to get ClusterClient status from > Application Client > at > org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.j > ava:243) at > org.apache.flink.yarn.YarnClusterClient.waitForClusterToBeReady(YarnClusterC > lient.java:507) at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:454) > at > org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:205 > ) at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute( > StreamContextEnvironment.java:71) at > CEPForBAMOther.main(CEPForBAMOther.java:134) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62 > ) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl > .java:43) at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgr > am.java:528) ... 13 more > Caused by: org.apache.flink.util.FlinkException: Could not connect to the > leading JobManager. Please check that the JobManager is running. > at > org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterCl > ient.java:789) at > org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.j > ava:237) ... 24 more > Caused by: > org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not > retrieve the leader gateway. > at > org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(Lea > derRetrievalUtils.java:79) at > org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterCl > ient.java:784) ... 25 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [10000 milliseconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scal > a:53) at scala.concurrent.Await$.result(package.scala:190) > at scala.concurrent.Await.result(package.scala) > at > org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(Lea > derRetrievalUtils.java:77) ... 26 more > > > > > Regards, > Biplob > > > > -- > View this message in context: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-r > unning-Flink-job-in-Yarn-cluster-mode-tp13593.html Sent from the Apache > Flink User Mailing List archive. mailing list archive at Nabble.com.
signature.asc
Description: This is a digitally signed message part.