Is your laptop behind a NAT? I got bitten by a similar issue, and (I think) it was because I was behind a NAT that did not forward the public IP back to my private IP unless the connection originated from my private IP.
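If that is the case here, one workaround is to advertise an address the executors can actually reach and pin the driver-side ports so they can be forwarded through the NAT. A sketch only, untested against this cluster: PUBLIC_IP and the port numbers are placeholders, and the property names are the ones documented for Spark 1.x standalone mode — check them against your version.

```shell
# Sketch (untested). Assumes PUBLIC_IP is the NAT's public address and
# that ports 7078-7080 are forwarded from the NAT to the laptop.
PUBLIC_IP=203.0.113.10   # placeholder

./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --class SparkPi \
  --conf spark.driver.host=$PUBLIC_IP \
  --conf spark.driver.port=7078 \
  --conf spark.fileserver.port=7079 \
  --conf spark.blockManager.port=7080 \
  ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
```

The executors dial back to spark.driver.host:spark.driver.port to fetch the driver properties (the driverPropsFetcher step that times out in the logs below), so those ports must be reachable from inside EC2, not just outbound from the laptop.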
Cheers On Tue, Feb 24, 2015 at 5:20 AM, Oleg Shirokikh <o...@solver.com> wrote: > Dear Patrick, > > Thanks a lot again for your help. > > > What happens if you submit from the master node itself on ec2 (in client > mode), does that work? What about in cluster mode? > > If I SSH to the machine with Spark master, then everything works - shell, > and regular submit in both client and cluster mode (after rsyncing the same > jar I'm using for remote submission). Below is the output when I deploy in > cluster mode from master machine itself: > > //******************// > [root@ip-172-31-34-83 spark]$ ./bin/spark-submit --class SparkPi --master > spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 > --deploy-mode=cluster > /root/spark/sparktest/target/scala-2.10/ec2test_2.10-0.0.1.jar 100 > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Sending launch command to > spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 > Driver successfully submitted as driver-20150223174819-0008 > ... waiting before polling master for driver state > ... polling master for driver state > State of driver-20150223174819-0008 is RUNNING > Driver running on ip-172-31-33-194.us-west-2.compute.internal:56183 > (worker-20150223171519-ip-172-31-33-194.us-west-2.compute.internal-56183) > //******************// > > Observation: when I submit the job from the remote host (and all these > warnings [..initial job has not accepted any resources...] and errors > [..asked to remove non-existent executor..] start appearing) and leave it > running, I simultaneously try to submit a job (or run a shell) from an EC2 > node with the master itself. In this scenario it starts to produce similar > warnings (not errors) [..initial job has not accepted any resources...] and > doesn't execute the job. Probably there are not enough cores devoted to 2 > apps running simultaneously.
> > > >It would be helpful if you could print the full command that the executor > is failing. That might show that spark.driver.host is being set strangely. > IIRC we print the launch command before starting the executor. > > I'd be very happy to provide this command but I'm not sure where to find > it... When I launch the submit script, I immediately see [WARN > TaskSchedulerImpl:...]s and [ERROR SparkDeploySchedulerBackend]s in the > terminal output. > > In Master Web UI, I have this application running indefinitely (listed in > "Running Applications" with State=RUNNING). When I go into this app UI I > see tons of Executors listed in "Executor Summary" - at each moment two of > them are RUNNING (I have two workers) and all others EXITED. > > Here is stderr from one of the RUNNING ones: > > /***************/ > 15/02/23 18:11:49 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:11:49 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:11:49 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:11:49 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:11:49 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:11:50 INFO Remoting: Starting remoting > 15/02/23 18:11:50 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal > :57681] > 15/02/23 18:11:50 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 57681. 
> /*****************/ > > Here is stderr from one of the EXITED ones: > > /***************/ > 15/02/23 18:10:09 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:10:10 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:10:10 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:10:10 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:10:10 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:10:10 INFO Remoting: Starting remoting > 15/02/23 18:10:10 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal > :42607] > 15/02/23 18:10:10 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 42607. > 15/02/23 18:10:40 ERROR security.UserGroupInformation: > PriviledgedActionException as:oleg > cause:java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: > Unknown exception in doAs > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > Caused by: java.security.PrivilegedActionException: > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) 
> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > ... 4 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [30 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > ... 7 more > /***************/ > > > When I go into worker UI from Master page, I can see the RUNNING Executor > - it's in LOADING state. Here is its stderr: > > /***************/ > 15/02/23 18:15:05 INFO executor.CoarseGrainedExecutorBackend: Registered > signal handlers for [TERM, HUP, INT] > 15/02/23 18:15:06 INFO spark.SecurityManager: Changing view acls to: > root,oleg > 15/02/23 18:15:06 INFO spark.SecurityManager: Changing modify acls to: > root,oleg > 15/02/23 18:15:06 INFO spark.SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(root, oleg); users with modify permissions: Set(root, oleg) > 15/02/23 18:15:06 INFO slf4j.Slf4jLogger: Slf4jLogger started > 15/02/23 18:15:06 INFO Remoting: Starting remoting > 15/02/23 18:15:06 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal > :34609] > 15/02/23 18:15:06 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 34609. 
> /***************/ > > > So it seems that there is a problem with starting executors... > > > Hopefully this clarifies the environment and workflow. I'd be happy to > provide any additional information. > > Again, thanks a lot for help and time looking into this. Although I know > the perfectly legit way how to work with Spark EC2 cluster (run the driver > within the cluster), it's extremely interesting to understand how remoting > works with Spark. And in general it would be very useful to have the > ability to submit jobs remotely. > > Thanks, > Oleg > > > -----Original Message----- > From: Patrick Wendell [mailto:pwend...@gmail.com] > Sent: Monday, February 23, 2015 1:22 AM > To: Oleg Shirokikh > Cc: user@spark.apache.org > Subject: Re: FW: Submitting jobs to Spark EC2 cluster remotely > > What happens if you submit from the master node itself on ec2 (in client > mode), does that work? What about in cluster mode? > > It would be helpful if you could print the full command that the executor > is failing. That might show that spark.driver.host is being set strangely. > IIRC we print the launch command before starting the executor. > > Overall the standalone cluster mode is not as well tested across > environments with asymmetric connectivity. I didn't actually realize that > akka (which the submission uses) can handle this scenario. But it does seem > like the job is submitted, it's just not starting correctly. > > - Patrick > > On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh <o...@solver.com> wrote: > > Patrick, > > > > I haven't changed the configs much. I just executed ec2-script to create > 1 master, 2 slaves cluster. Then I try to submit the jobs from remote > machine leaving all defaults configured by Spark scripts as default. I've > tried to change configs as suggested in other mailing-list and stack > overflow threads (such as setting spark.driver.host, etc...), removed > (hopefully) all security/firewall restrictions from AWS, etc. but it didn't > help. 
> > > > I think that what you are saying is exactly the issue: on my master node > UI at the bottom I can see the list of "Completed Drivers" all with ERROR > state... > > > > Thanks, > > Oleg > > > > -----Original Message----- > > From: Patrick Wendell [mailto:pwend...@gmail.com] > > Sent: Monday, February 23, 2015 12:59 AM > > To: Oleg Shirokikh > > Cc: user@spark.apache.org > > Subject: Re: Submitting jobs to Spark EC2 cluster remotely > > > > Can you list other configs that you are setting? It looks like the > executor can't communicate back to the driver. I'm actually not sure it's a > good idea to set spark.driver.host here, you want to let spark set that > automatically. > > > > - Patrick > > > > On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh <o...@solver.com> > wrote: > >> Dear Patrick, > >> > >> Thanks a lot for your quick response. Indeed, following your advice > I've uploaded the jar onto S3 and FileNotFoundException is gone now and job > is submitted in "cluster" deploy mode. 
> >> > >> However, now both (client and cluster) fail with the following errors > in executors (they keep exiting/killing executors as I see in UI): > >> > >> 15/02/23 08:42:46 ERROR security.UserGroupInformation: > >> PriviledgedActionException as:oleg > >> cause:java.util.concurrent.TimeoutException: Futures timed out after > >> [30 seconds] > >> > >> > >> Full log is: > >> > >> 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: > >> Registered signal handlers for [TERM, HUP, INT] > >> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: > >> root,oleg > >> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: > >> root,oleg > >> 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: > >> authentication disabled; ui acls disabled; users with view > >> permissions: Set(root, oleg); users with modify permissions: > >> Set(root, oleg) > >> 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started > >> 15/02/23 01:59:12 INFO Remoting: Starting remoting > >> 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on > >> addresses > >> :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379] > >> 15/02/23 01:59:13 INFO util.Utils: Successfully started service > 'driverPropsFetcher' on port 39379. 
> >> 15/02/23 01:59:43 ERROR security.UserGroupInformation: > >> PriviledgedActionException as:oleg > cause:java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] Exception in thread "main" > java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs > >> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) > >> at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > >> at > >> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: > >> java.security.PrivilegedActionException: > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > >> at java.security.AccessController.doPrivileged(Native Method) > >> at javax.security.auth.Subject.doAs(Subject.java:415) > >> at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > >> ... 4 more > >> Caused by: java.util.concurrent.TimeoutException: Futures timed out > after [30 seconds] > >> at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > >> at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >> at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >> at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >> at scala.concurrent.Await$.result(package.scala:107) > >> at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > >> at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > >> at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > >> ... 
7 more > >> > >> > >> > >> > >> -----Original Message----- > >> From: Patrick Wendell [mailto:pwend...@gmail.com] > >> Sent: Monday, February 23, 2015 12:17 AM > >> To: Oleg Shirokikh > >> Subject: Re: Submitting jobs to Spark EC2 cluster remotely > >> > >> The reason is that the file needs to be in a globally visible > >> filesystem where the master node can download. So it needs to be on > >> s3, for instance, rather than on your local filesystem. > >> > >> - Patrick > >> > >> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh <o...@solver.com> > wrote: > >>> I've set up the EC2 cluster with Spark. Everything works, all > >>> master/slaves are up and running. > >>> > >>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster > >>> and submit it from there - everything works fine. However when > >>> driver is created on a remote host (my laptop), it doesn't work. > >>> I've tried both modes for > >>> `--deploy-mode`: > >>> > >>> **`--deploy-mode=client`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> Results in the following indefinite warnings/errors: > >>> > >>>> WARN TaskSchedulerImpl: Initial job has not accepted any > >>>> resources; check your cluster UI to ensure that workers are > >>>> registered and have sufficient memory 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 0 > >>>> 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 1 > >>> > >>> ...and failed drivers - in Spark Web UI "Completed Drivers" with > >>> "State=ERROR" appear. > >>> > >>> I've tried to pass limits for cores and memory to submit script but > >>> it didn't help... 
> >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --deploy-mode cluster --class SparkPi > >>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> The result is: > >>> > >>>> .... Driver successfully submitted as driver-20150223023734-0007 ... > >>>> waiting before polling master for driver state ... polling master > >>>> for driver state State of driver-20150223023734-0007 is ERROR > >>>> Exception from cluster was: java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist. java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>>> does not exist. at > >>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > >>>> at > >>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > >>>> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > >>>> org.apache.spark.deploy.worker.DriverRunner.org $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) > >>>> at > >>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) > >>> > >>> So, I'd appreciate any pointers on what is going wrong and some > >>> guidance on how to deploy jobs from a remote client. Thanks. > >>> > >>> > >>> > >>> -- > >>> View this message in context: > >>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html > >>> Sent from the Apache Spark User List mailing list archive at > Nabble.com. 
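As Patrick explains above, in standalone cluster mode the jar is downloaded by a worker inside the cluster, so a file: path that only exists on the laptop cannot work. A sketch of the S3 route (untested; the bucket name is a placeholder, and the s3n:// scheme plus AWS credentials being configured on the workers are assumptions appropriate to Hadoop builds of that era):

```shell
# Sketch (untested). Upload the application jar somewhere every worker
# can read, then submit with that URL instead of a local file: path.
# "my-spark-jars" is a placeholder bucket.
aws s3 cp ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar \
    s3://my-spark-jars/ec2test_2.10-0.0.1.jar

./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --deploy-mode cluster \
  --class SparkPi \
  s3n://my-spark-jars/ec2test_2.10-0.0.1.jar
```

An HDFS path on the cluster, or copying the jar to the same local path on every node, would satisfy the same "globally visible filesystem" requirement.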
> >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For > >>> additional commands, e-mail: user-h...@spark.apache.org > >>> > -- *Franc Carter* | Systems Architect | RoZetta Technology | www.rozettatechnology.com