Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)
I got this working by having our sysadmin update our security group to allow incoming traffic from the local subnet on ports 1-65535. I'm not sure if there's a more specific range I could have used, but so far, everything is running! Thanks for all the responses Marcelo and Andrew!!

Matt

On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or wrote:
> Hi Matt,
>
> The security group shouldn't be an issue; the ports listed in
> `spark_ec2.py` are only for communication with the outside world.
>
> How did you launch your application? I notice you did not launch your
> driver from your Master node. What happens if you did? Another thing is
> that there seems to be some inconsistency or missing pieces in the logs you
> posted. After an executor says "driver disassociated," what happens in the
> driver logs? Is an exception thrown or something?
>
> It would be useful if you could also post your conf/spark-env.sh.
>
> Andrew
>
> 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin :
>
>> Hi Matt,
>>
>> I'm not very familiar with setup on ec2; the closest I can point you
>> at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the
>> ports seem to be configured.
>>
>> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr wrote:
>> > Thanks Marcelo! This is a huge help!!
>> >
>> > Looking at the executor logs (in a vanilla spark install, I'm finding
>> > them in $SPARK_HOME/work/*)...
>> >
>> > It launches the executor, but it looks like the
>> > CoarseGrainedExecutorBackend is having trouble talking to the driver
>> > (exactly what you said!!!).
>> >
>> > Do you know what the range of random ports is that is used for
>> > executor-to-driver communication? Is that range adjustable? Any config
>> > setting or environment variable?
>> >
>> > I manually set up my EC2 security group to include all the ports that the
>> > spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
>> > groups.
>> > They included (for those listed above 1):
>> > 1
>> > 50060
>> > 50070
>> > 50075
>> > 60060
>> > 60070
>> > 60075
>> >
>> > Obviously I'll need to make some adjustments to my EC2 security group!
>> > Just need to figure out exactly what should be in there. To keep things
>> > simple, I just have one security group for the master, slaves, and the
>> > driver machine.
>> >
>> > In listing the port ranges in my current security group I looked at the
>> > ports that spark_ec2.py sets up as well as the ports listed in the "spark
>> > standalone mode" documentation page under "configuring ports for network
>> > security":
>> >
>> > http://spark.apache.org/docs/latest/spark-standalone.html
>> >
>> > Here are the relevant fragments from the executor log:
>> >
>> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
>> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
>> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100"
>> > "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
>> > "0" "ip-10-202-8-45.ec2.internal" "8"
>> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
>> > "app-20140717195146-"
>> >
>> > ...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
>> > native-hadoop library...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop
>> > with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
>> > java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> >
>> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
>> > library for your platform... using builtin-java classes where applicable
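The fix described at the top of this message (opening ports 1-65535 to the local subnet only) can be sketched with boto, the library spark_ec2.py itself bundles. The group name and CIDR below are assumptions for illustration, not values from the thread:

```python
# Sketch of the security-group change that resolved the thread: open all
# TCP/UDP ports to the local subnet only. Group name and CIDR are assumptions.

def intra_subnet_rules(cidr="10.202.0.0/16"):
    """Build ingress rules opening ports 1-65535 to the given subnet."""
    return [
        {"ip_protocol": proto, "from_port": 1, "to_port": 65535, "cidr_ip": cidr}
        for proto in ("tcp", "udp")
    ]

# Applying them would look roughly like this (requires AWS credentials):
#   import boto.ec2
#   conn = boto.ec2.connect_to_region("us-east-1")
#   group = conn.get_all_security_groups(["my-spark-group"])[0]
#   for rule in intra_subnet_rules():
#       group.authorize(**rule)
```

Scoping the rule to the subnet's CIDR rather than 0.0.0.0/0 keeps the wide port range from being exposed to the outside world.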
Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)
Thanks Marcelo! This is a huge help!!

Looking at the executor logs (in a vanilla spark install, I'm finding them in $SPARK_HOME/work/*)...

It launches the executor, but it looks like the CoarseGrainedExecutorBackend is having trouble talking to the driver (exactly what you said!!!).

Do you know what the range of random ports is that is used for executor-to-driver communication? Is that range adjustable? Any config setting or environment variable?

I manually set up my EC2 security group to include all the ports that the spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security groups. They included (for those listed above 1):

1
50060
50070
50075
60060
60070
60075

Obviously I'll need to make some adjustments to my EC2 security group! Just need to figure out exactly what should be in there. To keep things simple, I just have one security group for the master, slaves, and the driver machine.

In listing the port ranges in my current security group I looked at the ports that spark_ec2.py sets up as well as the ports listed in the "spark standalone mode" documentation page under "configuring ports for network security":

http://spark.apache.org/docs/latest/spark-standalone.html

Here are the relevant fragments from the executor log:

Spark Executor Command: "/cask/jdk/bin/java" "-cp" "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar" "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler" "0" "ip-10-202-8-45.ec2.internal" "8" "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker" "app-20140717195146-"

...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
14/07/17 19:51:47 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back to shell based
14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
14/07/17 19:51:48 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=30
14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user

...

14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] -> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated! Shutting down.

Thanks a bunch!
Matt

On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin wrote:
> When I mentioned the executor log, I meant the log of the process launched
> by the worker, not the worker. In my CDH-based Spark install, those
> end up in /var/run/spark/work.
>
> If you look at your worker log, you'll see it's launching the executor
> process. So there should be something there.
>
> Since you say it works when both are run in the same node, that
> probably points to some communication issue, since the executor needs
> to connect back to the driver. Check to see if you don't have any
> firewalls blocking the ports Spark tries to use. (That's one of the
> non-resource-related cases that will cause that message.)
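A narrower security-group rule than 1-65535 becomes possible if the randomly chosen ports are pinned. A minimal sketch, assuming the port properties documented for Spark 1.x standalone mode (some appeared only partway through the 1.x line) and arbitrary example port numbers:

```python
# Sketch: pin Spark's otherwise-random ports so a security group can allow
# just a small range. Property names are from the Spark 1.x standalone docs;
# the port numbers are arbitrary examples.
DRIVER_PORT_CONF = {
    "spark.driver.port": "51000",        # executors connect back to the driver here
    "spark.fileserver.port": "51001",    # driver's HTTP file server
    "spark.broadcast.port": "51002",     # HTTP broadcast server
    "spark.blockManager.port": "51004",  # block manager on driver and executors
}

def to_submit_args(conf):
    """Render the properties as spark-submit --conf flags."""
    return " ".join("--conf %s=%s" % kv for kv in sorted(conf.items()))
```

With the ports fixed, the security group only needs 51000-51004 open between the driver and worker machines instead of the full ephemeral range.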
Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)
0-202-8-45.ec2.internal:46848]: Error [Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848
]
14/07/16 19:34:09 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] -> [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error [Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848
]
14/07/16 19:34:09 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] -> [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error [Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848
]

Spark assembly has been built with Hive, including Datanucleus jars on classpath

14/07/16 19:34:10 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java" "-cp" "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar" "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler" "1" "ip-10-202-8-45.ec2.internal" "8" "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker" "app-20140716193227-"

Matt

On Tue, Jul 15, 2014 at 5:47 PM, Marcelo Vanzin wrote:
> Have you looked at the slave machine to see if the process has
> actually launched? If it has, have you tried peeking into its log
> file?
>
> (That error is printed whenever the executors fail to report back to
> the driver. Insufficient resources to launch the executor is the most
> common cause of that, but not the only one.)
>
> On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr wrote:
> > Hello spark folks,
> >
> > I have a simple spark cluster setup but I can't get jobs to run on it.
> > I am using the standalone mode.
> >
> > One master, one slave. Both machines have 32GB ram and 8 cores.
> >
> > The slave is set up with one worker that has 8 cores and 24GB memory
> > allocated.
> >
> > My application requires 2 cores and 5GB of memory.
> >
> > However, I'm getting the following error:
> >
> > WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> > your cluster UI to ensure that workers are registered and have sufficient
> > memory
> >
> > What else should I check for?
> >
> > This is a simplified setup (the real cluster has 20 nodes). In this
> > simplified setup I am running the master and the slave manually. The
> > master's web page shows the worker and it shows the application and the
> > memory/core requirements match what I mentioned above.
> >
> > I also tried running the SparkPi example via bin/run-example and get the
> > same result. It requires 8 cores and 512MB of memory, which is also
> > clearly within the limits of the available worker.
> >
> > Any ideas would be greatly appreciated!!
> >
> > Matt
>
> --
> Marcelo
can't get jobs to run on cluster (enough memory and cpus are available on worker)
Hello spark folks,

I have a simple spark cluster setup but I can't get jobs to run on it. I am using the standalone mode.

One master, one slave. Both machines have 32GB ram and 8 cores.

The slave is set up with one worker that has 8 cores and 24GB memory allocated.

My application requires 2 cores and 5GB of memory.

However, I'm getting the following error:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

What else should I check for?

This is a simplified setup (the real cluster has 20 nodes). In this simplified setup I am running the master and the slave manually. The master's web page shows the worker and it shows the application and the memory/core requirements match what I mentioned above.

I also tried running the SparkPi example via bin/run-example and get the same result. It requires 8 cores and 512MB of memory, which is also clearly within the limits of the available worker.

Any ideas would be greatly appreciated!!

Matt
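As Marcelo points out elsewhere in the thread, that warning means executors never reported back to the driver; insufficient resources is only the most common cause. The resource precondition itself is trivial to state, which is why Matt's numbers rule it out. A sketch (my own illustration, not Spark's scheduler code):

```python
def fits(worker_cores, worker_mem_gb, app_cores, app_mem_gb):
    """True when the worker could, in principle, host the application's executor."""
    return app_cores <= worker_cores and app_mem_gb <= worker_mem_gb

# Matt's numbers: the request fits, so the warning must have a
# non-resource cause (here, executors unable to reach the driver).
resource_ok = fits(worker_cores=8, worker_mem_gb=24, app_cores=2, app_mem_gb=5)
```

When this check passes but the warning still fires, the next place to look is connectivity between worker and driver, which is exactly where this thread ends up.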
Re: creating new ami image for spark ec2 commands
Thanks Akhil! I'll give that a try!
Re: creating new ami image for spark ec2 commands
Thanks for the response Akhil. My email may not have been clear, but my question is about what should be inside the AMI image, not how to pass an AMI id in to the spark_ec2 script. Should certain packages be installed? Do certain directories need to exist? etc...

On Fri, Jun 6, 2014 at 4:40 AM, Akhil Das wrote:
> You can comment out this function and create a new one which will return
> your ami-id, and the rest of the script will run fine.
>
>     def get_spark_ami(opts):
>         instance_types = {
>             "m1.small": "pvm",
>             "m1.medium": "pvm",
>             "m1.large": "pvm",
>             "m1.xlarge": "pvm",
>             "t1.micro": "pvm",
>             "c1.medium": "pvm",
>             "c1.xlarge": "pvm",
>             "m2.xlarge": "pvm",
>             "m2.2xlarge": "pvm",
>             "m2.4xlarge": "pvm",
>             "cc1.4xlarge": "hvm",
>             "cc2.8xlarge": "hvm",
>             "cg1.4xlarge": "hvm",
>             "hs1.8xlarge": "hvm",
>             "hi1.4xlarge": "hvm",
>             "m3.xlarge": "hvm",
>             "m3.2xlarge": "hvm",
>             "cr1.8xlarge": "hvm",
>             "i2.xlarge": "hvm",
>             "i2.2xlarge": "hvm",
>             "i2.4xlarge": "hvm",
>             "i2.8xlarge": "hvm",
>             "c3.large": "pvm",
>             "c3.xlarge": "pvm",
>             "c3.2xlarge": "pvm",
>             "c3.4xlarge": "pvm",
>             "c3.8xlarge": "pvm"
>         }
>         if opts.instance_type in instance_types:
>             instance_type = instance_types[opts.instance_type]
>         else:
>             instance_type = "pvm"
>             print >> stderr, \
>                 "Don't recognize %s, assuming type is pvm" % opts.instance_type
>
>         ami_path = "%s/%s/%s" % (AMI_PREFIX, opts.region, instance_type)
>         try:
>             ami = urllib2.urlopen(ami_path).read().strip()
>             print "Spark AMI: " + ami
>         except:
>             print >> stderr, "Could not resolve AMI at: " + ami_path
>             sys.exit(1)
>
>         return ami
>
> Thanks
> Best Regards
>
> On Fri, Jun 6, 2014 at 2:14 AM, Matt Work Coarr wrote:
>> How would I go about creating a new AMI image that I can use with the
>> spark ec2 commands? I can't seem to find any documentation. I'm looking
>> for a list of steps that I'd need to perform to make an Amazon Linux image
>> ready to be used by the spark ec2 tools.
>>
>> I've been reading through the spark 1.0.0 documentation, looking at the
>> script itself (spark_ec2.py), and looking at the github project
>> mesos/spark-ec2.
>>
>> From what I can tell, the spark_ec2.py script looks up the id of the AMI
>> based on the region and machine type (hvm or pvm) using static content
>> derived from the github repo mesos/spark-ec2.
>>
>> The spark ec2 script loads the AMI id from this base url:
>> https://raw.github.com/mesos/spark-ec2/v2/ami-list
>> (Which presumably comes from https://github.com/mesos/spark-ec2 )
>>
>> For instance, I'm working with us-east-1 and pvm, I'd end up with AMI id:
>> ami-5bb18832
>>
>> Is there a list of instructions for how this AMI was created? Assuming
>> I'm starting with my own Amazon Linux image, what would I need to do to
>> make it usable where I could pass that AMI id to spark_ec2.py rather than
>> using the default spark-provided AMI?
>>
>> Thanks,
>> Matt
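Akhil's suggestion, made concrete: a hypothetical drop-in replacement for get_spark_ami that skips the remote lookup entirely. The AMI ids below are placeholders, not real images, and the per-region mapping is my own optional refinement:

```python
# Hypothetical replacement for spark_ec2.py's get_spark_ami, per Akhil's
# suggestion: skip the remote lookup and return a custom AMI id directly.
# The AMI ids below are placeholders, not real images.
CUSTOM_AMIS = {
    "us-east-1": "ami-00000001",
    "us-west-2": "ami-00000002",
}

def get_spark_ami(opts):
    try:
        return CUSTOM_AMIS[opts.region]
    except KeyError:
        raise SystemExit("No custom AMI registered for region %s" % opts.region)
```

Because the rest of spark_ec2.py only consumes the returned AMI id string, this swap leaves the launch path untouched.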
creating new ami image for spark ec2 commands
How would I go about creating a new AMI image that I can use with the spark ec2 commands? I can't seem to find any documentation. I'm looking for a list of steps that I'd need to perform to make an Amazon Linux image ready to be used by the spark ec2 tools.

I've been reading through the spark 1.0.0 documentation, looking at the script itself (spark_ec2.py), and looking at the github project mesos/spark-ec2.

From what I can tell, the spark_ec2.py script looks up the id of the AMI based on the region and machine type (hvm or pvm) using static content derived from the github repo mesos/spark-ec2.

The spark ec2 script loads the AMI id from this base url:
https://raw.github.com/mesos/spark-ec2/v2/ami-list
(Which presumably comes from https://github.com/mesos/spark-ec2 )

For instance, I'm working with us-east-1 and pvm, I'd end up with AMI id: ami-5bb18832

Is there a list of instructions for how this AMI was created? Assuming I'm starting with my own Amazon Linux image, what would I need to do to make it usable where I could pass that AMI id to spark_ec2.py rather than using the default spark-provided AMI?

Thanks,
Matt
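The lookup described above reduces to fetching a one-line file at <base-url>/<region>/<virtualization-type>. A sketch of that resolution; the path construction follows the message, while actually fetching the id needs network access:

```python
# Sketch of spark_ec2.py's AMI resolution as described above: the AMI id
# is the content of a small static file keyed by region and virt type.
AMI_PREFIX = "https://raw.github.com/mesos/spark-ec2/v2/ami-list"

def ami_path(region, virt_type):
    """Build the URL that holds the AMI id for a region/type pair."""
    return "%s/%s/%s" % (AMI_PREFIX, region, virt_type)

# Fetching it (needs network; urllib2 in the Python 2 era of spark_ec2.py):
#   import urllib2
#   ami = urllib2.urlopen(ami_path("us-east-1", "pvm")).read().strip()
```

This is why a custom AMI only has to exist somewhere the script can resolve it: the id itself is just a string looked up per region and virtualization type.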
spark ec2 commandline tool error "VPC security groups may not be used for a non-VPC launch"
Hi, I'm attempting to run "spark-ec2 launch" on AWS. My AWS instances would be in our EC2 VPC (which seems to be causing a problem).

The two security groups MyClusterName-master and MyClusterName-slaves have already been set up with the same ports open as the security group that spark-ec2 tries to create. (My company has security rules where I don't have permissions to create security groups, so they have to be created by someone else ahead of time.)

I'm getting the error "VPC security groups may not be used for a non-VPC launch" when I try to run "spark-ec2 launch". Is there something I need to do to make spark-ec2 launch the master and slave instances within the VPC?

Here's the command-line and the error that I get...

command-line (I've changed the cluster name to something generic):

$SPARK_HOME/ec2/spark-ec2 --key-pair=MyKeyPair '--identity-file=~/.ssh/id_mysshkey' --slaves=2 --instance-type=m3.large --region=us-east-1 --zone=us-east-1a --ami=myami --spark-version=0.9.1 launch MyClusterName

error:

ERROR:boto:400 Bad Request
ERROR:boto: InvalidParameterCombination: VPC security groups may not be used for a non-VPC launch (RequestID: 8374cac5-5869-4f38-a141-2fdaf3b18326)
Setting up security groups...
Searching for existing cluster MyClusterName...
Launching instances...
Traceback (most recent call last):
  File "./spark_ec2.py", line 806, in <module>
    main()
  File "./spark_ec2.py", line 799, in main
    real_main()
  File "./spark_ec2.py", line 682, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 344, in launch_cluster
    block_device_map = block_map)
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/image.py", line 255, in run
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 678, in run_instances
  File "/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
InvalidParameterCombination: VPC security groups may not be used for a non-VPC launch (RequestID: 8374cac5-5869-4f38-a141-2fdaf3b18326)

Thanks for your help!!
Matt
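The error's root cause: launch_cluster passes security group *names*, which EC2 only accepts for a non-VPC (EC2-Classic) launch; a VPC launch needs security group *ids* plus a subnet id. A hedged sketch of the keyword-argument change, using parameter names from boto's run_instances; the sg-/subnet- values are placeholders:

```python
# Sketch: adapt run_instances keyword arguments for a VPC launch.
# In a VPC, boto's run_instances needs security_group_ids (sg-...) and a
# subnet_id; security group *names* only work for EC2-Classic launches.
def vpc_run_kwargs(subnet_id, group_ids, **kwargs):
    kwargs.pop("security_groups", None)  # drop Classic-style group names
    kwargs["security_group_ids"] = list(group_ids)
    kwargs["subnet_id"] = subnet_id
    return kwargs

# Usage would look roughly like:
#   conn.run_instances(image_id, min_count=1, max_count=1,
#                      **vpc_run_kwargs("subnet-00000000", ["sg-00000000"]))
```

Since the pre-created MyClusterName-* groups live in the VPC, resolving their sg- ids and passing them this way (along with the target subnet) is the shape of the fix.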