Thanks Marcelo! This is a huge help! Looking at the executor logs (in a vanilla Spark install, I'm finding them in $SPARK_HOME/work/*)...
It launches the executor, but it looks like the CoarseGrainedExecutorBackend is having trouble talking to the driver (exactly what you said!). Do you know the range of random ports used for executor-to-driver communication? Is that range adjustable via a config setting or environment variable?

I manually set up my EC2 security group to include all the ports that the spark-ec2 script ($SPARK_HOME/ec2/spark_ec2.py) opens in its security groups. Above 10000, those are: 19999, 50060, 50070, 50075, 60060, 60070, 60075. Obviously I'll need to make some adjustments to my EC2 security group; I just need to figure out exactly what should be in there. To keep things simple, I have a single security group shared by the master, the slaves, and the driver machine. In choosing the port ranges for that group I looked at the ports spark_ec2.py opens as well as the ports listed in the "Spark Standalone Mode" documentation page under "Configuring Ports for Network Security": http://spark.apache.org/docs/latest/spark-standalone.html

Here are the relevant fragments from the executor log:

Spark Executor Command: "/cask/jdk/bin/java" "-cp" "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar" "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler" "0" "ip-10-202-8-45.ec2.internal" "8" "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker" "app-20140717195146-0000"
========================================
...
14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
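(One thing I'm going to try, based on skimming the configuration docs: spark.driver.port looks like it pins the driver-side port that executors connect back to, instead of a random one. A sketch of what I'd put in spark-defaults.conf -- 7078 is just an arbitrary pick, and I'm not certain which of the other ports are configurable in 1.0.0:)

```
# $SPARK_HOME/conf/spark-defaults.conf on the driver machine -- untested sketch
# Fix the port the CoarseGrainedScheduler listens on (default: random)
spark.driver.port   7078
```

(With that pinned, the security group would only need that one extra inbound rule for executor-to-driver traffic, if I understand the docs right.)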
14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
14/07/17 19:51:47 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back to shell based
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
14/07/17 19:51:48 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000
14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
...
14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] -> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated! Shutting down.

Thanks a bunch!
Matt

On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> When I meant the executor log, I meant the log of the process launched
> by the worker, not the worker. In my CDH-based Spark install, those
> end up in /var/run/spark/work.
>
> If you look at your worker log, you'll see it's launching the executor
> process. So there should be something there.
>
> Since you say it works when both are run in the same node, that
> probably points to some communication issue, since the executor needs
> to connect back to the driver. Check to see if you don't have any
> firewalls blocking the ports Spark tries to use. (That's one of the
> non-resource-related cases that will cause that message.)
>
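(Marcelo's suggestion to check for blocked ports can be scripted. Here's a small sketch I'm using from a worker node -- the hostname and port list are just the values from my executor log above, so substitute your own; it simply attempts a TCP connect to each port:)

```python
import socket

def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port on host; return {port: reachable?}."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal/timeout
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Example (placeholders from this thread -- the driver's scheduler port
# from the executor log):
# check_ports("ip-10-202-11-191.ec2.internal", [46787])
```

(A port that shows up as unreachable here but open in netstat on the driver would point at the security group / firewall.)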