Hi Matt,

I'm not very familiar with setup on EC2; the closest I can point you
to is the "launch_cluster" function in ec2/spark_ec2.py, where the
ports seem to be configured.
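
For what it's worth, the port setup there is plain boto calls against
the EC2 security groups. A minimal sketch of the pattern (the group
name and the port are illustrative, not the full list the script
opens):

    import boto.ec2

    # Sketch only: mirrors the style of launch_cluster in
    # ec2/spark_ec2.py.
    conn = boto.ec2.connect_to_region("us-east-1")
    group = conn.create_security_group("my-spark-group",
                                       "Spark cluster ports")

    # Let the cluster nodes talk to each other on any port.
    group.authorize(src_group=group)

    # Open one TCP port to a CIDR (illustrative value: the
    # standalone master port).
    group.authorize(ip_protocol="tcp", from_port=7077, to_port=7077,
                    cidr_ip="0.0.0.0/0")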


On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
<mattcoarr.w...@gmail.com> wrote:
> Thanks Marcelo!  This is a huge help!!
>
> Looking at the executor logs (in a vanilla spark install, I'm finding them
> in $SPARK_HOME/work/*)...
>
> It launches the executor, but it looks like the CoarseGrainedExecutorBackend
> is having trouble talking to the driver (exactly what you said!!!).
>
> Do you know the range of random ports used for the
> executor-to-driver connection?  Is that range adjustable?  Is there a
> config setting or environment variable for it?
>
> I manually set up my EC2 security group to include all the ports that the
> spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
> groups.  They included (for those listed above 10000):
> 19999
> 50060
> 50070
> 50075
> 60060
> 60070
> 60075
>
> Obviously I'll need to make some adjustments to my EC2 security group!  Just
> need to figure out exactly what should be in there.  To keep things simple,
> I just have one security group for the master, slaves, and the driver
> machine.
>
> When picking the port ranges for my current security group, I looked at
> the ports that spark_ec2.py sets up as well as the ports listed on the
> "spark standalone mode" documentation page under "configuring ports for
> network security":
>
> http://spark.apache.org/docs/latest/spark-standalone.html
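>
> In case it helps anyone else reading this: per that docs page, it
> looks like the driver's port doesn't have to stay random and can be
> pinned via a config property. A minimal sketch in PySpark, assuming
> spark.driver.port is the right knob and 50001 is an arbitrary port
> I've opened in the security group:
>
>     from pyspark import SparkConf, SparkContext
>
>     # Pin the driver's listening port so the security group can open
>     # it explicitly instead of allowing a random ephemeral port.
>     conf = (SparkConf()
>             .setMaster("spark://ip-10-202-11-191.ec2.internal:7077")
>             .setAppName("port-test")
>             .set("spark.driver.port", "50001"))
>     sc = SparkContext(conf=conf)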
>
>
> Here are the relevant fragments from the executor log:
>
> Spark Executor Command: "/cask/jdk/bin/java" "-cp"
> "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
> "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
> "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
> "0" "ip-10-202-8-45.ec2.internal" "8"
> "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
> "app-20140717195146-0000"
>
> ========================================
>
> ...
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
> native-hadoop library...
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
> error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader:
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>
> 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
> to shell based
>
> 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
>
> 14/07/17 19:51:48 DEBUG Groups: Group mapping
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
> cacheTimeout=300000
>
> 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
>
> ...
>
>
> 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
>
> 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
> akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>
> 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
> akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>
> 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
> Shutting down.
>
>
> Thanks a bunch!
> Matt
>
>
> On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> By the executor log, I meant the log of the process launched
>> by the worker, not the worker's own log. In my CDH-based Spark install,
>> those end up in /var/run/spark/work.
>>
>> If you look at your worker log, you'll see it's launching the executor
>> process. So there should be something there.
>>
>> Since you say it works when both are run on the same node, that
>> probably points to a communication issue, since the executor needs
>> to connect back to the driver. Check that no firewalls are
>> blocking the ports Spark tries to use. (That's one of the
>> non-resource-related cases that will cause that message.)
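>>
>> A quick way to check that from the machine where the executor runs
>> is a bare TCP connect against the driver's address. A sketch, with
>> the host/port to be substituted from your executor log:
>>
>>     import socket
>>
>>     # Try to reach the driver's Akka endpoint seen in the executor
>>     # log (host and port here are placeholders).
>>     host, port = "ip-10-202-11-191.ec2.internal", 46787
>>     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>     s.settimeout(5)
>>     try:
>>         s.connect((host, port))
>>         print("reachable")
>>     except socket.error as e:
>>         print("blocked or unreachable: %s" % e)
>>     finally:
>>         s.close()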



-- 
Marcelo
