Thanks Marcelo! This is a huge help! Looking at the executor logs (in a vanilla Spark install, I'm finding them in $SPARK_HOME/work/*)...
It launches the executor, but it looks like the CoarseGrainedExecutorBackend is having trouble talking to the driver (exactly what you said!). Do you know the range of random ports used for executor-to-driver communication? Is that range adjustable via a config setting or environment variable?

I manually set up my EC2 security group to include all the ports that the spark-ec2 script ($SPARK_HOME/ec2/spark_ec2.py) opens in its security groups. Above 10000, those are: 19999, 50060, 50070, 50075, 60060, 60070, 60075. Obviously I'll need to make some adjustments to my EC2 security group; I just need to figure out exactly what should be in there. To keep things simple, I have a single security group shared by the master, the slaves, and the driver machine. In choosing the port ranges for that group I looked at the ports spark_ec2.py opens as well as the ports listed in the "Spark Standalone Mode" documentation page under "Configuring Ports for Network Security": http://spark.apache.org/docs/latest/spark-standalone.html

Here are the relevant fragments from the executor log:

Spark Executor Command: "/cask/jdk/bin/java" "-cp" "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar" "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler" "0" "ip-10-202-8-45.ec2.internal" "8" "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker" "app-20140717195146-0000"
========================================
...
14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
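(One thing I'm going to try, based on skimming the configuration docs: spark.driver.port looks like it pins the driver-side port that executors connect back to, instead of a random one. A sketch of what I'd put in spark-defaults.conf -- 7078 is just an arbitrary pick, and I'm not certain which of the other ports are configurable in 1.0.0:)

```
# $SPARK_HOME/conf/spark-defaults.conf on the driver machine -- untested sketch
# Fix the port the CoarseGrainedScheduler listens on (default: random)
spark.driver.port   7078
```

(With that pinned, the security group would only need that one extra inbound rule for executor-to-driver traffic, if I understand the docs right.)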
14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
14/07/17 19:51:47 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back to shell based
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
14/07/17 19:51:48 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000
14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
...
14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] -> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated! Shutting down.

Thanks a bunch!
Matt

On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> When I meant the executor log, I meant the log of the process launched
> by the worker, not the worker. In my CDH-based Spark install, those
> end up in /var/run/spark/work.
>
> If you look at your worker log, you'll see it's launching the executor
> process. So there should be something there.
>
> Since you say it works when both are run in the same node, that
> probably points to some communication issue, since the executor needs
> to connect back to the driver. Check to see if you don't have any
> firewalls blocking the ports Spark tries to use. (That's one of the
> non-resource-related cases that will cause that message.)
>
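(Marcelo's suggestion to check for blocked ports can be scripted. Here's a small sketch I'm using from a worker node -- the hostname and port list are just the values from my executor log above, so substitute your own; it simply attempts a TCP connect to each port:)

```python
import socket

def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port on host; return {port: reachable?}."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal/timeout
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Example (placeholders from this thread -- the driver's scheduler port
# from the executor log):
# check_ports("ip-10-202-11-191.ec2.internal", [46787])
```

(A port that shows up as unreachable here but open in netstat on the driver would point at the security group / firewall.)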