Dear Patrick,

Thanks a lot again for your help.

> What happens if you submit from the master node itself on ec2 (in client 
> mode), does that work? What about in cluster mode?

If I SSH to the machine running the Spark master, then everything works - the 
shell, and a regular submit in both client and cluster mode (after rsyncing the 
same jar I use for remote submission). Below is the output when I deploy in 
cluster mode from the master machine itself:

//******************//
[root@ip-172-31-34-83 spark]$ ./bin/spark-submit --class SparkPi --master 
spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 
--deploy-mode=cluster 
/root/spark/sparktest/target/scala-2.10/ec2test_2.10-0.0.1.jar 100
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Sending launch command to 
spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077
Driver successfully submitted as driver-20150223174819-0008
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150223174819-0008 is RUNNING
Driver running on ip-172-31-33-194.us-west-2.compute.internal:56183 
(worker-20150223171519-ip-172-31-33-194.us-west-2.compute.internal-56183)
//******************//

One observation: when I submit the job from the remote host (at which point the 
warnings [...Initial job has not accepted any resources...] and errors [...Asked 
to remove non-existent executor...] start appearing) and leave it running, and 
then simultaneously submit a job (or start a shell) from the EC2 master node 
itself, the second submission produces similar warnings (but no errors) 
[...Initial job has not accepted any resources...] and never executes the job. 
Probably there are not enough cores available for two apps running 
simultaneously.
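
If that's the cause, I suppose I could cap what each application grabs so that 
two can coexist - roughly like this (the numbers are just an illustration, and 
I'm assuming the standalone scheduler honors --total-executor-cores and 
--executor-memory here):

    ./bin/spark-submit --class SparkPi \
      --master spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 \
      --total-executor-cores 2 \
      --executor-memory 1g \
      /root/spark/sparktest/target/scala-2.10/ec2test_2.10-0.0.1.jar 100

That said, the remote submission fails even when it is the only application on 
the cluster, so the core issue seems to be elsewhere.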


>It would be helpful if you could print the full command that the executor is 
>failing. That might show that spark.driver.host is being set strangely. IIRC 
>we print the launch command before starting the executor.

I'd be very happy to provide this command, but I'm not sure where to find it. 
When I launch the submit script, I immediately see the [WARN 
TaskSchedulerImpl: ...] and [ERROR SparkDeploySchedulerBackend: ...] messages in 
the terminal output.
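
If the worker daemon writes that launch command into its log (you mention it is 
printed before the executor starts), something like this should pull it out on 
each slave - the log directory and the exact "Launch command" string are guesses 
on my part about the spark-ec2 layout:

    # run on each slave; assuming spark-ec2 keeps the daemon logs under /root/spark/logs
    grep -i "launch command" /root/spark/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out

If that's the right place to look, I'll paste whatever it turns up.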

In the Master web UI, this application runs indefinitely (listed under "Running 
Applications" with State=RUNNING). When I go into the application's UI, I see 
tons of executors listed under "Executor Summary" - at any given moment two of 
them are RUNNING (I have two workers) and all the others are EXITED.

Here is stderr from one of the RUNNING ones:

/***************/
15/02/23 18:11:49 INFO executor.CoarseGrainedExecutorBackend: Registered signal 
handlers for [TERM, HUP, INT]
15/02/23 18:11:49 INFO spark.SecurityManager: Changing view acls to: root,oleg
15/02/23 18:11:49 INFO spark.SecurityManager: Changing modify acls to: root,oleg
15/02/23 18:11:49 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root, oleg); users 
with modify permissions: Set(root, oleg)
15/02/23 18:11:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/23 18:11:50 INFO Remoting: Starting remoting
15/02/23 18:11:50 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal:57681]
15/02/23 18:11:50 INFO util.Utils: Successfully started service 
'driverPropsFetcher' on port 57681.
/*****************/

Here is stderr from one of the EXITED ones:

/***************/
15/02/23 18:10:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal 
handlers for [TERM, HUP, INT]
15/02/23 18:10:10 INFO spark.SecurityManager: Changing view acls to: root,oleg
15/02/23 18:10:10 INFO spark.SecurityManager: Changing modify acls to: root,oleg
15/02/23 18:10:10 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root, oleg); users 
with modify permissions: Set(root, oleg)
15/02/23 18:10:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/23 18:10:10 INFO Remoting: Starting remoting
15/02/23 18:10:10 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:42607]
15/02/23 18:10:10 INFO util.Utils: Successfully started service 
'driverPropsFetcher' on port 42607.
15/02/23 18:10:40 ERROR security.UserGroupInformation: 
PriviledgedActionException as:oleg cause:java.util.concurrent.TimeoutException: 
Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: 
Unknown exception in doAs
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
        at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: 
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        ... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
        ... 7 more
/***************/


When I go into the worker UI from the Master page, I can see the RUNNING 
executor - it's in LOADING state. Here is its stderr:

/***************/
15/02/23 18:15:05 INFO executor.CoarseGrainedExecutorBackend: Registered signal 
handlers for [TERM, HUP, INT]
15/02/23 18:15:06 INFO spark.SecurityManager: Changing view acls to: root,oleg
15/02/23 18:15:06 INFO spark.SecurityManager: Changing modify acls to: root,oleg
15/02/23 18:15:06 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root, oleg); users 
with modify permissions: Set(root, oleg)
15/02/23 18:15:06 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/23 18:15:06 INFO Remoting: Starting remoting
15/02/23 18:15:06 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal:34609]
15/02/23 18:15:06 INFO util.Utils: Successfully started service 
'driverPropsFetcher' on port 34609.
/***************/


So it seems that there is a problem with starting executors...
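
One thing I plan to try next: since the executors apparently cannot reach back 
to the driver on my laptop within the 30-second timeout, I could pin the 
driver-side ports to fixed values and open exactly those ports in the AWS 
security group and on my laptop's firewall. A rough sketch - the port numbers 
are arbitrary, and I'm assuming the spark.*.port properties from the 1.x 
"Networking" configuration section are the right knobs:

    ./bin/spark-submit --class SparkPi \
      --master spark://ec2-52-10-138-75.us-west-2.compute.amazonaws.com:7077 \
      --conf spark.driver.port=7001 \
      --conf spark.fileserver.port=7002 \
      --conf spark.broadcast.port=7003 \
      --conf spark.blockManager.port=7004 \
      ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar 100

If connections from the workers to those fixed ports still never arrive, that 
would at least confirm the asymmetric-connectivity issue you mention below.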


Hopefully this clarifies the environment and workflow. I'd be happy to provide 
any additional information.

Again, thanks a lot for your help and the time spent looking into this. Although 
I know the perfectly legitimate way to work with a Spark EC2 cluster (run the 
driver within the cluster), it's extremely interesting to understand how remote 
submission works with Spark. And in general it would be very useful to be able 
to submit jobs remotely.

Thanks,
Oleg


-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Monday, February 23, 2015 1:22 AM
To: Oleg Shirokikh
Cc: user@spark.apache.org
Subject: Re: FW: Submitting jobs to Spark EC2 cluster remotely

What happens if you submit from the master node itself on ec2 (in client mode), 
does that work? What about in cluster mode?

It would be helpful if you could print the full command that the executor is 
failing. That might show that spark.driver.host is being set strangely. IIRC we 
print the launch command before starting the executor.

Overall the standalone cluster mode is not as well tested across environments 
with asymmetric connectivity. I didn't actually realize that akka (which the 
submission uses) can handle this scenario. But it does seem like the job is 
submitted, it's just not starting correctly.

- Patrick

On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh <o...@solver.com> wrote:
> Patrick,
>
> I haven't changed the configs much. I just executed ec2-script to create 1 
> master, 2 slaves cluster. Then I try to submit the jobs from remote machine 
> leaving all defaults configured by Spark scripts as default. I've tried to 
> change configs as suggested in other mailing-list and stack overflow threads 
> (such as setting spark.driver.host, etc...), removed (hopefully) all 
> security/firewall restrictions from AWS, etc. but it didn't help.
>
> I think that what you are saying is exactly the issue: on my master node UI 
> at the bottom I can see the list of "Completed Drivers" all with ERROR 
> state...
>
> Thanks,
> Oleg
>
> -----Original Message-----
> From: Patrick Wendell [mailto:pwend...@gmail.com]
> Sent: Monday, February 23, 2015 12:59 AM
> To: Oleg Shirokikh
> Cc: user@spark.apache.org
> Subject: Re: Submitting jobs to Spark EC2 cluster remotely
>
> Can you list other configs that you are setting? It looks like the executor 
> can't communicate back to the driver. I'm actually not sure it's a good idea 
> to set spark.driver.host here, you want to let spark set that automatically.
>
> - Patrick
>
> On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh <o...@solver.com> wrote:
>> Dear Patrick,
>>
>> Thanks a lot for your quick response. Indeed, following your advice I've 
>> uploaded the jar onto S3 and FileNotFoundException is gone now and job is 
>> submitted in "cluster" deploy mode.
>>
>> However, now both (client and cluster) fail with the following errors in 
>> executors (they keep exiting/killing executors as I see in UI):
>>
>> 15/02/23 08:42:46 ERROR security.UserGroupInformation:
>> PriviledgedActionException as:oleg
>> cause:java.util.concurrent.TimeoutException: Futures timed out after
>> [30 seconds]
>>
>>
>> Full log is:
>>
>> 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend:
>> Registered signal handlers for [TERM, HUP, INT]
>> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to:
>> root,oleg
>> 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to:
>> root,oleg
>> 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view
>> permissions: Set(root, oleg); users with modify permissions: 
>> Set(root,
>> oleg)
>> 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> 15/02/23 01:59:12 INFO Remoting: Starting remoting
>> 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on 
>> addresses 
>> :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379]
>> 15/02/23 01:59:13 INFO util.Utils: Successfully started service 
>> 'driverPropsFetcher' on port 39379.
>> 15/02/23 01:59:43 ERROR security.UserGroupInformation:
>> PriviledgedActionException as:oleg 
>> cause:java.util.concurrent.TimeoutException: Futures timed out after [30 
>> seconds] Exception in thread "main" 
>> java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>>         at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>>         at 
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>>         at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>>         at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>>         at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by:
>> java.security.PrivilegedActionException: 
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>         ... 4 more
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
>> [30 seconds]
>>         at 
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>         at 
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>         at 
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>         at scala.concurrent.Await$.result(package.scala:107)
>>         at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
>>         at 
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>>         at 
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>>         ... 7 more
>>
>>
>>
>>
>> -----Original Message-----
>> From: Patrick Wendell [mailto:pwend...@gmail.com]
>> Sent: Monday, February 23, 2015 12:17 AM
>> To: Oleg Shirokikh
>> Subject: Re: Submitting jobs to Spark EC2 cluster remotely
>>
>> The reason is that the file needs to be in a globally visible 
>> filesystem where the master node can download. So it needs to be on 
>> s3, for instance, rather than on your local filesystem.
>>
>> - Patrick
>>
>> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh <o...@solver.com> wrote:
>>> I've set up the EC2 cluster with Spark. Everything works, all 
>>> master/slaves are up and running.
>>>
>>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster 
>>> and submit it from there - everything works fine. However when 
>>> driver is created on a remote host (my laptop), it doesn't work. 
>>> I've tried both modes for
>>> `--deploy-mode`:
>>>
>>> **`--deploy-mode=client`:**
>>>
>>> From my laptop:
>>>
>>>     ./bin/spark-submit --master
>>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 
>>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
>>>
>>> Results in the following indefinite warnings/errors:
>>>
>>>>  WARN TaskSchedulerImpl: Initial job has not accepted any 
>>>> resources; check your cluster UI to ensure that workers are 
>>>> registered and have sufficient memory 15/02/22 18:30:45
>>>
>>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent 
>>>> executor 0
>>>> 15/02/22 18:30:45
>>>
>>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent 
>>>> executor 1
>>>
>>> ...and failed drivers - in Spark Web UI "Completed Drivers" with 
>>> "State=ERROR" appear.
>>>
>>> I've tried to pass limits for cores and memory to submit script but 
>>> it didn't help...
>>>
>>> **`--deploy-mode=cluster`:**
>>>
>>> From my laptop:
>>>
>>>     ./bin/spark-submit --master
>>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077
>>> --deploy-mode cluster --class SparkPi 
>>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
>>>
>>> The result is:
>>>
>>>> .... Driver successfully submitted as driver-20150223023734-0007 ...
>>>> waiting before polling master for driver state ... polling master 
>>>> for driver state State of driver-20150223023734-0007 is ERROR 
>>>> Exception from cluster was: java.io.FileNotFoundException: File
>>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist. java.io.FileNotFoundException: File 
>>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
>>>> does not exist.       at
>>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
>>>>       at
>>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>>>>       at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)        at
>>>> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
>>>>       at
>>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
>>>
>>>  So, I'd appreciate any pointers on what is going wrong and some 
>>> guidance how to deploy jobs from remote client. Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
