Is your laptop behind a NAT? I got bitten by a similar issue, and (I think) it was because I was behind a NAT that did not forward the public IP back to my private IP unless the connection originated from my private IP.
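If that is the case here, one workaround is to advertise an address the executors can actually reach and pin the driver-side ports so they can be forwarded through the NAT. A sketch only, untested against this cluster: PUBLIC_IP and the port numbers are placeholders, and the property names are the ones documented for Spark 1.x standalone mode — check them against your version.

```shell
# Sketch (untested). Assumes PUBLIC_IP is the NAT's public address and
# that ports 7078-7080 are forwarded from the NAT to the laptop.
PUBLIC_IP=203.0.113.10   # placeholder

./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --class SparkPi \
  --conf spark.driver.host=$PUBLIC_IP \
  --conf spark.driver.port=7078 \
  --conf spark.fileserver.port=7079 \
  --conf spark.blockManager.port=7080 \
  ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
```

The executors dial back to spark.driver.host:spark.driver.port to fetch the driver properties (the driverPropsFetcher step that times out in the logs below), so those ports must be reachable from inside EC2, not just outbound from the laptop.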
Cheers On Tue, Feb 24, 2015 at 5:20 AM, Oleg Shirokikh <o...@solver.com> wrote: > Dear Patrick, > > Thanks a lot again for your help. > > > What happens if you submit from the master node itself on ec2 (in client > mode), does that work? What about in cluster mode? > > If I SSH to the machine with Spark master, then everything works - shell, > and regular submit in both client and cluster mode (after rsyncing the same > jar I'm using for remote submission). Below is the output when I deploy in > cluster mode from master machine itself: > > //******************// > [root@ip-172-31-34-83 spark]$ ./bin/spark-submit --class SparkPi --master > spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 > --deploy-mode=cluster > /root/spark/sparktest/target/scala-2.10/ec2test_2.10-0.0.1.jar 100 > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Sending launch command to > spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 > Driver successfully submitted as driver-20150223174819-0008 > ... waiting before polling master for driver state > ... polling master for driver state > State of driver-20150223174819-0008 is RUNNING > Driver running on ip-172-31-33-194.us-west-2.compute.internal:56183 > (worker-20150223171519-ip-172-31-33-194.us-west-2.compute.internal-56183) > //******************// > > Observation: when I submit the job from the remote host (and all these > warnings [..initial job has not accepted any resources...] and errors > [..asked to remove non-existent executor..] start appearing) and leave it > running, I simultaneously try to submit a job (or run a shell) from an EC2 > node with the master itself. In this scenario it starts to produce similar > warnings (not errors) [..initial job has not accepted any resources...] and > doesn't execute the job. Probably there are not enough cores devoted to 2 > apps running simultaneously.
> > > >It would be helpful if you could print the full command that the executor > is failing. That might show that spark.driver.host is being set strangely. > IIRC we print the launch command before starting the executor. > > I'd be very happy to provide this command but I'm not sure where to find > it... When I launch the submit script, I immediately see [WARN > TaskSchedulerImpl:...]s and [ERROR SparkDeploySchedulerBackend]s in the > terminal output. > > In Master Web UI, I have this application running indefinitely (listed in > "Running Applications" with State=RUNNING). When I go into this app UI I > see tons of Executors listed in "Executor Summary" - at each moment two of > them are RUNNING (I have two workers) and all others EXITED. > > Here is stderr from one of the RUNNING ones: > > /***************/ > 15/02/23 18:11:49 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:11:49 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:11:49 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:11:49 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:11:49 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:11:50 INFO Remoting: Starting remoting > 15/02/23 18:11:50 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal > :57681] > 15/02/23 18:11:50 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 57681. 
> /*****************/ > > Here is stderr from one of the EXITED ones: > > /***************/ > 15/02/23 18:10:09 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:10:10 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:10:10 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:10:10 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:10:10 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:10:10 INFO Remoting: Starting remoting > 15/02/23 18:10:10 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal > :42607] > 15/02/23 18:10:10 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 42607. > 15/02/23 18:10:40 ERROR security.UserGroupInformation: > PriviledgedActionException as:oleg > cause:java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: > Unknown exception in doAs > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > Caused by: java.security.PrivilegedActionException: > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) 
> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > ... 4 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [30 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > ... 7 more > /***************/ > > > When I go into worker UI from Master page, I can see the RUNNING Executor > - it's in LOADING state. Here is its stderr: > > /***************/ > 15/02/23 18:15:05 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:15:06 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:15:06 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:15:06 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:15:06 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:15:06 INFO Remoting: Starting remoting > 15/02/23 18:15:06 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal > :34609] > 15/02/23 18:15:06 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 34609. 
> /***************/ > > > So it seems that there is a problem with starting executors... > > > Hopefully this clarifies the environment and workflow. I'd be happy to > provide any additional information. > > Again, thanks a lot for help and time looking into this. Although I know > the perfectly legit way how to work with Spark EC2 cluster (run the driver > within the cluster), it's extremely interesting to understand how remoting > works with Spark. And in general it would be very useful to have the > ability to submit jobs remotely. > > Thanks, > Oleg > > > -----Original Message----- > From: Patrick Wendell [mailto:pwend...@gmail.com] > Sent: Monday, February 23, 2015 1:22 AM > To: Oleg Shirokikh > Cc: user@spark.apache.org > Subject: Re: FW: Submitting jobs to Spark EC2 cluster remotely > > What happens if you submit from the master node itself on ec2 (in client > mode), does that work? What about in cluster mode? > > It would be helpful if you could print the full command that the executor > is failing. That might show that spark.driver.host is being set strangely. > IIRC we print the launch command before starting the executor. > > Overall the standalone cluster mode is not as well tested across > environments with asymmetric connectivity. I didn't actually realize that > akka (which the submission uses) can handle this scenario. But it does seem > like the job is submitted, it's just not starting correctly. > > - Patrick > > On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh <o...@solver.com> wrote: > > Patrick, > > > > I haven't changed the configs much. I just executed ec2-script to create > 1 master, 2 slaves cluster. Then I try to submit the jobs from remote > machine leaving all defaults configured by Spark scripts as default. I've > tried to change configs as suggested in other mailing-list and stack > overflow threads (such as setting spark.driver.host, etc...), removed > (hopefully) all security/firewall restrictions from AWS, etc. but it didn't > help. 
> > > > I think that what you are saying is exactly the issue: on my master node > UI at the bottom I can see the list of "Completed Drivers" all with ERROR > state... > > > > Thanks, > > Oleg > > > > -----Original Message----- > > From: Patrick Wendell [mailto:pwend...@gmail.com] > > Sent: Monday, February 23, 2015 12:59 AM > > To: Oleg Shirokikh > > Cc: user@spark.apache.org > > Subject: Re: Submitting jobs to Spark EC2 cluster remotely > > > > Can you list other configs that you are setting? It looks like the > executor can't communicate back to the driver. I'm actually not sure it's a > good idea to set spark.driver.host here, you want to let spark set that > automatically. > > > > - Patrick > > > > On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh <o...@solver.com> > wrote: > >> Dear Patrick, > >> > >> Thanks a lot for your quick response. Indeed, following your advice > I've uploaded the jar onto S3 and FileNotFoundException is gone now and job > is submitted in "cluster" deploy mode. 
> >> > >> However, now both (client and cluster) fail with the following errors > in executors (they keep exiting/killing executors as I see in UI): > >> > >> 15/02/23 08:42:46 ERROR security.UserGroupInformation: > >> PriviledgedActionException as:oleg > >> cause:java.util.concurrent.TimeoutException: Futures timed out after > >> [30 seconds] > >> > >> > >> Full log is: > >> > >> 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: > >> Registered signal handlers for [TERM, HUP, INT] > >> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: > >> root,oleg > >> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: > >> root,oleg > >> 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: > >> authentication disabled; ui acls disabled; users with view > >> permissions: Set(root, oleg); users with modify permissions: > >> Set(root, oleg) > >> 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started > >> 15/02/23 01:59:12 INFO Remoting: Starting remoting > >> 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on > >> addresses > >> :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379] > >> 15/02/23 01:59:13 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 39379. 
> >> 15/02/23 01:59:43 ERROR security.UserGroupInformation: > >> PriviledgedActionException as:oleg > cause:java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] Exception in thread "main" > java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs > >> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) > >> at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > >> at > >> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: > >> java.security.PrivilegedActionException: > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > >> at java.security.AccessController.doPrivileged(Native Method) > >> at javax.security.auth.Subject.doAs(Subject.java:415) > >> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > >> ... 4 more > >> Caused by: java.util.concurrent.TimeoutException: Futures timed out > after [30 seconds] > >> at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > >> at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >> at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >> at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >> at scala.concurrent.Await$.result(package.scala:107) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > >> at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > >> at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > >> ... 
7 more > >> > >> > >> > >> > >> -----Original Message----- > >> From: Patrick Wendell [mailto:pwend...@gmail.com] > >> Sent: Monday, February 23, 2015 12:17 AM > >> To: Oleg Shirokikh > >> Subject: Re: Submitting jobs to Spark EC2 cluster remotely > >> > >> The reason is that the file needs to be in a globally visible > >> filesystem where the master node can download. So it needs to be on > >> s3, for instance, rather than on your local filesystem. > >> > >> - Patrick > >> > >> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh <o...@solver.com> > wrote: > >>> I've set up the EC2 cluster with Spark. Everything works, all > >>> master/slaves are up and running. > >>> > >>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster > >>> and submit it from there - everything works fine. However when > >>> driver is created on a remote host (my laptop), it doesn't work. > >>> I've tried both modes for > >>> `--deploy-mode`: > >>> > >>> **`--deploy-mode=client`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> Results in the following indefinite warnings/errors: > >>> > >>>> WARN TaskSchedulerImpl: Initial job has not accepted any > >>>> resources; check your cluster UI to ensure that workers are > >>>> registered and have sufficient memory 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 0 > >>>> 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 1 > >>> > >>> ...and failed drivers - in Spark Web UI "Completed Drivers" with > >>> "State=ERROR" appear. > >>> > >>> I've tried to pass limits for cores and memory to submit script but > >>> it didn't help... 
> >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --deploy-mode cluster --class SparkPi > >>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> The result is: > >>> > >>>> .... Driver successfully submitted as driver-20150223023734-0007 ... > >>>> waiting before polling master for driver state ... polling master > >>>> for driver state State of driver-20150223023734-0007 is ERROR > >>>> Exception from cluster was: java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist. java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>>> does not exist. at > >>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > >>>> at > >>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > >>>> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > >>>> org.apache.spark.deploy.worker.DriverRunner.org $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) > >>>> at > >>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) > >>> > >>> So, I'd appreciate any pointers on what is going wrong and some > >>> guidance on how to deploy jobs from a remote client. Thanks. > >>> > >>> > >>> > >>> -- > >>> View this message in context: > >>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html > >>> Sent from the Apache Spark User List mailing list archive at > Nabble.com. 
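As Patrick explains above, in standalone cluster mode the jar is downloaded by a worker inside the cluster, so a file: path that only exists on the laptop cannot work. A sketch of the S3 route (untested; the bucket name is a placeholder, and the s3n:// scheme plus AWS credentials being configured on the workers are assumptions appropriate to Hadoop builds of that era):

```shell
# Sketch (untested). Upload the application jar somewhere every worker
# can read, then submit with that URL instead of a local file: path.
# "my-spark-jars" is a placeholder bucket.
aws s3 cp ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar \
    s3://my-spark-jars/ec2test_2.10-0.0.1.jar

./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --deploy-mode cluster \
  --class SparkPi \
  s3n://my-spark-jars/ec2test_2.10-0.0.1.jar
```

An HDFS path on the cluster, or copying the jar to the same local path on every node, would satisfy the same "globally visible filesystem" requirement.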
> >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For > >>> additional commands, e-mail: user-h...@spark.apache.org > >>> > -- *Franc Carter* | Systems Architect | RoZetta Technology | www.rozettatechnology.com