I have a write-up for the Cloudera VM: http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM. Which Spark version are you trying to set up on Cloudera, and which Cloudera version are you using?
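Also, regarding your second question: start-master.sh only prints the path of the log file, and the spark://HOST:PORT URL normally ends up inside that log rather than on the screen. A quick check (a sketch, using the log path from your output below):

    # look for the master URL in the log start-master.sh pointed at
    grep 'spark://' logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
    # if nothing shows up, the tail of the log will usually show the exception
    tail -50 logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out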
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang <binwang...@gmail.com> wrote:
> Hi Ognen/Mayur,
>
> Thanks for the reply; it is good to know how easy it is to set up Spark
> on an AWS cluster.
>
> My situation is a bit different from yours: our company already has a
> cluster, and it really doesn't make much sense not to use it. That is why
> I have been "going through" this. I really wish there were some tutorials
> teaching you how to set up a Spark cluster on a bare-metal CDH cluster,
> or some way to tweak the CDH Spark distribution so it is up to date.
>
> Ognen, of course it would be very helpful if you could 'history | grep
> spark...' and document the work you have done, since you've already made
> it work!
>
> Bin
>
>
> On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski
> <og...@plainvanillagames.com> wrote:
>> I should add that in this setup you really do not need to look for the
>> printout of the master node's IP - you set it yourself a priori. If
>> anyone is interested, let me know; I can write it all up so that people
>> can follow some set of instructions. Who knows, maybe I can come up with
>> a set of scripts to automate it all...
>>
>> Ognen
>>
>>
>> On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
>>
>> I have a standalone Spark cluster running in an Amazon VPC that I set up
>> by hand. All I did was provision the machines from a common AMI image
>> (my underlying distribution is Ubuntu), create a "sparkuser" on each
>> machine, and download Spark into /home/sparkuser/spark. I did this on
>> the master only: I ran sbt/sbt assembly and set up conf/spark-env.sh to
>> point to the master, which is an IP address (in my case 10.10.0.200; the
>> port is the default 7077). I also set up the slaves file in the same
>> subdirectory with all 16 IP addresses of the worker nodes (in my case
>> 10.10.0.201-216). Both files are sketched at the end of this message.
>> After sbt/sbt assembly was done on the master, I did cd ~/; tar -czf
>> spark.tgz spark/, copied the resulting .tgz file to each worker using
>> the same "sparkuser" account, and unpacked the .tgz on each slave. This
>> effectively replicates everything from the master to all slaves - you
>> can script it so you don't do it by hand (see below).
>>
>> Your AMI should have the distribution's version of Java and git
>> installed, by the way.
>>
>> All you have to do then is sparkuser@spark-master> spark/sbin/start-all.sh
>> (for 0.9; in 0.8.1 it is spark/bin/start-all.sh) and it will all
>> automagically start :)
>>
>> All my Amazon nodes come with 4x400 GB of ephemeral space, which I have
>> set up as a 1.6 TB RAID0 array on each node. I am pooling this into an
>> HDFS filesystem operated by a namenode outside the Spark cluster, while
>> all the datanodes are the same nodes as the Spark workers. This enables
>> replication and extremely fast access, since ephemeral storage is much
>> faster than EBS or anything else on Amazon (you can do even better with
>> SSD drives in this setup, but it will cost you).
>>
>> If anyone is interested, I can document our pipeline setup - I came up
>> with it myself and do not have a clue what the industry standards are,
>> since I could not find any written instructions anywhere online about
>> how to set up a whole data analytics pipeline from the point of
>> ingestion to the point of analytics (people don't want to share their
>> secrets? or am I just in the dark and incapable of using Google
>> properly?).
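>>
>> A minimal sketch of the two files mentioned above (assuming a 0.9-era
>> standalone setup; the IPs are the ones from my cluster):
>>
>>   # conf/spark-env.sh
>>   export SPARK_MASTER_IP=10.10.0.200
>>   export SPARK_MASTER_PORT=7077
>>
>>   # conf/slaves - one worker IP per line
>>   10.10.0.201
>>   10.10.0.202
>>   # ...and so on, through 10.10.0.216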
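>>
>> And the copy step scripted (a sketch, assuming passwordless ssh between
>> the nodes for sparkuser):
>>
>>   # run on the master as sparkuser, after sbt/sbt assembly
>>   cd ~ && tar -czf spark.tgz spark/
>>   for i in $(seq 201 216); do
>>       scp spark.tgz sparkuser@10.10.0.$i:~/
>>       ssh sparkuser@10.10.0.$i 'tar -xzf spark.tgz'
>>   done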
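>>
>> The RAID0 array itself is plain mdadm (a sketch; the device names are an
>> assumption, since ephemeral disks show up under different names
>> depending on the instance type):
>>
>>   # stripe the four ephemeral disks into one 1.6 TB array
>>   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
>>       /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>>   mkfs.ext4 /dev/md0
>>   mount /dev/md0 /mnt/raid0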
>> My requirement was that I wanted this to run within a VPC for added
>> security and simplicity; the Amazon security groups get really old
>> quickly. An added bonus is that you can use a VPN as an entry point into
>> the whole system, and your cluster instantly becomes "local" to you in
>> terms of IPs etc. I use OpenVPN, since I don't like Cisco or Juniper
>> (the only two options Amazon provides for their VPN gateways).
>>
>> Ognen
>>
>>
>> On 3/3/14, 1:00 PM, Bin Wang wrote:
>>
>> Hi there,
>>
>> I have a CDH cluster set up, and I tried using the Spark parcel that
>> comes with Cloudera Manager, but it turned out it doesn't even have the
>> run-example shell command in the bin folder. Then I removed it from the
>> cluster, cloned incubator-spark onto the name node of my cluster, and
>> built from source there successfully with everything at defaults.
>>
>> I ran a few examples and everything seems to work fine in local mode.
>> Now I am thinking about scaling it out to my cluster, which is what the
>> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add
>> all the datanodes as slaves, and I think I should run Spark in
>> standalone mode.
>>
>> Say I am trying to set up Spark in standalone mode following this
>> instruction:
>> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
>> However, it says "Once started, the master will print out a
>> spark://HOST:PORT URL for itself, which you can use to connect workers
>> to it, or pass as the “master” argument to SparkContext. You can also
>> find this URL on the master’s web UI, which is http://localhost:8080 by
>> default."
>>
>> After I started the master, there is no URL printed on the screen, and
>> the web UI is not running either. Here is the output:
>>
>>   [root@box incubator-spark]# ./sbin/start-master.sh
>>   starting org.apache.spark.deploy.master.Master, logging to
>>   /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>>
>> First question: am I even in the ballpark running Spark in standalone
>> mode if I want to fully utilize my cluster? I saw there are four ways to
>> launch Spark on a cluster - AWS EC2, Spark standalone, Apache Mesos,
>> Hadoop YARN - of which I guess standalone mode is the way to go?
>>
>> Second question: how do I get the Spark URL of the cluster, and why is
>> the output not what the instructions say?
>>
>> Best regards,
>>
>> Bin
>>
>> --
>> Some people, when confronted with a problem, think "I know, I'll use
>> regular expressions." Now they have two problems.
>> -- Jamie Zawinski