Maybe this is a basic question, but does your cluster have enough resources to run your application? It is requesting 208G of RAM.

Thanks,
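For reference, the 208G figure appears to match the four 52g heaps (three executors plus the driver) from the spark-submit command quoted further down in this thread. A minimal sketch of that arithmetic follows, assuming the memoryOverhead values are in MiB (Spark's default unit for that setting) and that the overhead is requested on top of each heap. The function names are made up for illustration, and YARN may round each container request up depending on scheduler settings, so treat the result as an estimate.

```python
# Values copied from the spark-submit command quoted in this thread.
EXECUTOR_MEMORY_GIB = 52        # spark.executor.memory=52g
DRIVER_MEMORY_GIB = 52          # spark.driver.memory=52g
EXECUTOR_OVERHEAD_MIB = 6144    # spark.executor.memoryOverhead=6144
DRIVER_OVERHEAD_MIB = 6144      # spark.driver.memoryOverhead=6144
EXECUTOR_INSTANCES = 3          # spark.executor.instances=3


def heap_total_gib():
    """Sum of the JVM heaps only: three executors plus one driver."""
    return EXECUTOR_INSTANCES * EXECUTOR_MEMORY_GIB + DRIVER_MEMORY_GIB


def container_total_gib():
    """Heaps plus memory overhead (assumption: overhead is given in MiB)."""
    executors = EXECUTOR_INSTANCES * (
        EXECUTOR_MEMORY_GIB + EXECUTOR_OVERHEAD_MIB / 1024
    )
    driver = DRIVER_MEMORY_GIB + DRIVER_OVERHEAD_MIB / 1024
    return executors + driver


print(heap_total_gib())       # 208
print(container_total_gib())  # 232.0
```

Under these assumptions the cluster would need roughly 232 GiB available to YARN, with each of the four containers asking for about 58 GiB on a single node.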
Sent from Yahoo Mail for iPhone

On Friday, October 4, 2019, 2:31 PM, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:

Hi Igor,

We are deploying by submitting a batch job to a Livy server (from our local PC or a Jenkins node). The Livy server then deploys the Spark job on the cluster itself.

For example:

---
Running '/usr/lib/spark/bin/spark-submit' '--class' '##MY_MAIN_CLASS##'
    '--conf' 'spark.driver.userClassPathFirst=true'
    '--conf' 'spark.default.parallelism=180'
    '--conf' 'spark.executor.memory=52g'
    '--conf' 'spark.driver.memory=52g'
    '--conf' 'spark.yarn.tags=livy-batch-0-owjPBdmC'
    '--conf' 'spark.executor.instances=3'
    '--conf' 'spark.executor.memoryOverhead=6144'
    '--conf' 'spark.driver.cores=6'
    '--conf' 'spark.driver.memoryOverhead=6144'
    '--conf' 'spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=\'kill -9 %p\''
    '--conf' 'spark.executor.userClassPathFirst=true'
    '--conf' 'spark.submit.deployMode=cluster'
    '--conf' 'spark.yarn.submit.waitAppCompletion=false'
    '--conf' 'spark.executor.extraClassPath=true'
    '-- ...
---

Jochen

On Fri, Oct 4, 2019 at 17:42, igor cabral uchoa <igorucho...@yahoo.com.br> wrote:

Hi Roland!

What deploy mode are you using when you submit your applications? Is it client or cluster mode?

Regards,

Sent from Yahoo Mail for iPhone

On Friday, October 4, 2019, 12:37 PM, Roland Johann <roland.joh...@phenetic.io.INVALID> wrote:

These are dynamic port ranges and depend on the configuration of your cluster. Each job has a separate application master, so there can't be just one port.

If I remember correctly, the default EMR setup creates worker security groups with unrestricted traffic within the group, e.g. between the worker nodes. Depending on your security requirements, I suggest that you start with a default-like setup and determine the ports and port ranges from the docs afterwards to further restrict traffic between the nodes.

Kind regards

On Fri, Oct 4, 2019 at 17:16, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:

Hi Roland,

We do indeed have custom security groups. Can you tell me exactly which instance needs to be able to reach which other instance? For example, is it from the master instance to the driver instance? And which port should be open?

Jochen

On Fri, Oct 4, 2019 at 17:14, Roland Johann <roland.joh...@phenetic.io> wrote:

Hi Jochen,

did you set up the EMR cluster with custom security groups? Can you confirm that the relevant EC2 instances can connect through the relevant ports?

Best regards

On Fri, Oct 4, 2019 at 17:09, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:

Hi Jeff,

Thanks! Just tried that, but the same timeout occurs :-( ...

Jochen

On Fri, Oct 4, 2019 at 16:37, Jeff Zhang <zjf...@gmail.com> wrote:

You can try increasing the property spark.yarn.am.waitTime (the default is 100s). Maybe you are doing some very time-consuming operation when initializing the SparkContext, which causes the timeout.

See this property here: http://spark.apache.org/docs/latest/running-on-yarn.html

On Fri, Oct 4, 2019 at 10:08 PM, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:

Hi,

I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job to the cluster.
The job gets accepted, but the YARN application fails with:

{code}
19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
{code}

It actually goes wrong at this line: https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468

Now, I'm 100% sure Spark is OK and there's no bug, but there must be something wrong with my setup. I don't understand the code of the ApplicationMaster, so could somebody explain to me what it is trying to reach? Where exactly does the connection time out? Then at least I can debug it further, because I don't have a clue what it is doing :-)

Thanks for any help!
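A rough picture of what the runDriver and awaitResult frames in the stack trace suggest, going by Jeff's reply about spark.yarn.am.waitTime: this is a sketch under the assumption that, in cluster mode, the application master starts the user's main class in its own thread and waits a bounded time for it to create the SparkContext. If that is right, the wait is on a local result rather than on a network connection. The sketch below is Python, not Spark code, and every name in it (`run_driver`, `fast_main`, `slow_main`) is made up for illustration.

```python
import threading
import time
from concurrent.futures import Future, TimeoutError as FutureTimeoutError


def run_driver(user_main, wait_seconds):
    """Start user code in a thread and wait for it to hand back a context."""
    context_ready = Future()  # stands in for a promise of the context
    thread = threading.Thread(
        target=user_main, args=(context_ready,), daemon=True
    )
    thread.start()
    try:
        # Bounded wait, analogous to a wait limited by spark.yarn.am.waitTime.
        return context_ready.result(timeout=wait_seconds)
    except FutureTimeoutError:
        return "FAILED: timed out after %s seconds" % wait_seconds


def fast_main(context_ready):
    # The context is created right away, so the waiter is released quickly.
    context_ready.set_result("context created")


def slow_main(context_ready):
    # Stands in for slow work done before the context exists.
    time.sleep(1.0)
    context_ready.set_result("context created")


print(run_driver(fast_main, wait_seconds=0.2))
print(run_driver(slow_main, wait_seconds=0.2))
```

In this toy version the second call fails because the user code takes longer than the allowed wait. Whether that is what happens in the job above cannot be told from the stack trace alone; the driver's own log output up to the timeout would show how far the user code got.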
Jochen

--
Best Regards
Jeff Zhang

--
Roland Johann
Software Developer/Data Engineer
phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobile: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io
Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann
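Roland's suggestion to confirm that the relevant EC2 instances can connect through the relevant ports could be checked with something like the sketch below, run from one cluster node against another. It only tests whether a TCP connection can be opened. The function names are made up for illustration, and the host names and port numbers to pass in are not given anywhere in this thread; as noted above, they depend on the cluster configuration.

```python
import socket


def can_connect(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(host, port): can_connect(host, port) for host, port in targets}
```

Calling `check_targets` with a list of `(host, port)` pairs returns a dictionary that shows at a glance which combinations are blocked. A result of False only says that the connection attempt failed; it does not distinguish a security group rule from a service that is simply not listening.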