Hi Mayur,

I am using CDH4.6.0p0.26, and the latest Cloudera Spark parcel is Spark 0.9.0 CDH4.6.0p0.50. As I mentioned, the Cloudera Spark version somehow doesn't contain the run-example shell script. However, it is configured automatically and is pretty easy to set up across the cluster.
Thanks,
Bin

On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> I have it on a Cloudera VM:
> http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
> Which version are you trying to set up on Cloudera? Also, which Cloudera
> version are you using?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
> On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang <binwang...@gmail.com> wrote:
>
>> Hi Ognen/Mayur,
>>
>> Thanks for the reply, and it is good to know how easy it is to set up
>> Spark on an AWS cluster.
>>
>> My situation is a bit different from yours: our company already has a
>> cluster and it really doesn't make much sense not to use it. That is
>> why I have been "going through" this. I really wish there were some
>> tutorials on how to set up a Spark cluster on a bare-metal CDH cluster,
>> or some way to tweak the CDH Spark distribution so it is up to date.
>>
>> Ognen, of course it would be very helpful if you could
>> 'history | grep spark... ' and document the work that you have done,
>> since you've already made it!
>>
>> Bin
>>
>> On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:
>>
>>> I should add that in this setup you really do not need to look for the
>>> printout of the master node's IP - you set it yourself a priori. If anyone
>>> is interested, let me know, I can write it all up so that people can follow
>>> some set of instructions. Who knows, maybe I can come up with a set of
>>> scripts to automate it all...
>>>
>>> Ognen
>>>
>>> On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
>>>
>>> I have a Standalone spark cluster running in an Amazon VPC that I set up
>>> by hand.
>>> All I did was provision the machines from a common AMI image (my
>>> underlying distribution is Ubuntu). I created a "sparkuser" on each machine
>>> and I have a /home/sparkuser/spark folder where I downloaded spark. I did
>>> this on the master only: I did sbt/sbt assembly and I set up
>>> conf/spark-env.sh to point to the master, which is an IP address (in my case
>>> 10.10.0.200, the port is the default 7077). I also set up the slaves file
>>> in the same subdirectory to have all 16 IP addresses of the worker nodes
>>> (in my case 10.10.0.201-216). After sbt/sbt assembly was done on master, I
>>> then did cd ~/; tar -czf spark.tgz spark/ and I copied the resulting tgz
>>> file to each worker using the same "sparkuser" account and unpacked the
>>> .tgz on each slave (this will effectively replicate everything from master
>>> to all slaves - you can script this so you don't do it by hand).
>>>
>>> Your AMI should have the distribution's version of Java and git
>>> installed, by the way.
>>>
>>> All you have to do then is sparkuser@spark-master>
>>> spark/sbin/start-all.sh (for 0.9, in 0.8.1 it is spark/bin/start-all.sh)
>>> and it will all automagically start :)
>>>
>>> All my Amazon nodes come with 4x400 GB of ephemeral space which I have
>>> set up into a 1.6TB RAID0 array on each node and I am pooling this into an
>>> HDFS filesystem which is operated by a namenode outside the spark cluster,
>>> while all the datanodes are the same nodes as the spark workers. This
>>> enables replication and extremely fast access since ephemeral is much
>>> faster than EBS or anything else on Amazon (you can do even better with SSD
>>> drives on this setup but it will cost ya).
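The manual steps described above (master address in conf/spark-env.sh, worker IPs in conf/slaves, tarball copied to every worker) could be scripted roughly as follows. This is only a sketch, not Ognen's actual script: the function names, the DRY_RUN switch, and the choice of scp/ssh for the copy are assumptions made for illustration. The addresses, the "sparkuser" account, and the tar command come from the message, and SPARK_MASTER_IP / SPARK_MASTER_PORT are the variable names read by conf/spark-env.sh in Spark 0.9.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for the manual standalone setup described above.
set -eu

SPARK_USER="${SPARK_USER:-sparkuser}"   # account used on every node (from the message)
MASTER_IP="10.10.0.200"                 # master address (from the message)
SUBNET="10.10.0"                        # workers are 10.10.0.201-216 (from the message)
FIRST_WORKER=201
LAST_WORKER=216
DRY_RUN="${DRY_RUN:-1}"                 # hypothetical switch: 1 = only print commands

# Print one worker IP per line; this is the content of conf/slaves.
worker_ips() {
  for i in $(seq "$FIRST_WORKER" "$LAST_WORKER"); do
    echo "${SUBNET}.${i}"
  done
}

# Print the lines to add to conf/spark-env.sh so the cluster points at the master.
spark_env_lines() {
  echo "export SPARK_MASTER_IP=${MASTER_IP}"
  echo "export SPARK_MASTER_PORT=7077"
}

# Write the worker list to the given file (normally conf/slaves).
write_slaves_file() {
  worker_ips > "$1"
}

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pack the built spark/ directory and unpack it in the home directory of each worker.
distribute() {
  run tar -czf "${HOME}/spark.tgz" -C "${HOME}" spark
  for ip in $(worker_ips); do
    run scp "${HOME}/spark.tgz" "${SPARK_USER}@${ip}:"
    run ssh "${SPARK_USER}@${ip}" "tar -xzf spark.tgz"
  done
}
```

On the master, one would call write_slaves_file ~/spark/conf/slaves, append the output of spark_env_lines to ~/spark/conf/spark-env.sh, review the commands printed by distribute, and then rerun it with DRY_RUN=0. Passwordless SSH from the master to the workers is assumed.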
>>>
>>> If anyone is interested I can document our pipeline setup - I came up
>>> with it myself and do not have a clue as to what the industry standards are,
>>> since I could not find any written instructions anywhere online about how
>>> to set up a whole data analytics pipeline from the point of ingestion to
>>> the point of analytics (people don't want to share their secrets? or am I
>>> just in the dark and incapable of using Google properly?). My requirement
>>> was that I wanted this to run within a VPC for added security and
>>> simplicity; the Amazon security groups get really old quickly. An added
>>> bonus is that you can use a VPN as an entry into the whole system and your
>>> cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN
>>> since I like neither Cisco nor Juniper (the only two options Amazon provides
>>> for their VPN gateways).
>>>
>>> Ognen
>>>
>>> On 3/3/14, 1:00 PM, Bin Wang wrote:
>>>
>>> Hi there,
>>>
>>> I have a CDH cluster set up, and I tried using the Spark parcel that
>>> comes with Cloudera Manager, but it turned out it doesn't even have the
>>> run-example shell command in the bin folder. Then I removed it from the
>>> cluster and cloned incubator-spark onto the name node of my cluster,
>>> and built from source there successfully with everything as default.
>>>
>>> I ran a few examples and everything seems to work fine in local mode.
>>> Now I am thinking about scaling it to my cluster, which is what the
>>> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all
>>> the datanodes to the slaves and think I should run Spark in standalone
>>> mode.
>>>
>>> Say I am trying to set up Spark in standalone mode following these
>>> instructions:
>>> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
>>> However, it says "Once started, the master will print out a
>>> spark://HOST:PORT URL for itself, which you can use to connect workers
>>> to it, or pass as the "master" argument to SparkContext. You can also
>>> find this URL on the master's web UI, which is http://localhost:8080 by
>>> default."
>>>
>>> After I started the master, no URL was printed on the screen, and
>>> the web UI is not running either.
>>> Here is the output:
>>> [root@box incubator-spark]# ./sbin/start-master.sh
>>> starting org.apache.spark.deploy.master.Master, logging to
>>> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>>>
>>> First question: am I even in the ballpark in running Spark in standalone
>>> mode if I want to fully utilize my cluster? I saw there are four ways to
>>> launch Spark on a cluster: AWS EC2, Spark standalone, Apache Mesos, and
>>> Hadoop YARN. I guess standalone mode is the way to go?
>>>
>>> Second question: how do I get the Spark URL of the cluster, and why is
>>> the output not like what the instructions say?
>>>
>>> Best regards,
>>>
>>> Bin
>>>
>>> --
>>> Some people, when confronted with a problem, think "I know, I'll use
>>> regular expressions." Now they have two problems.
>>> -- Jamie Zawinski
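On the second question above: start-master.sh only reports which file it is logging to, so one place to look for the spark://HOST:PORT URL is that log file. The helper below is a minimal sketch under that assumption; the function name master_url_from_log is made up for illustration, and it simply returns the first spark:// string it finds in the file, whatever the surrounding log line looks like.

```shell
#!/usr/bin/env bash
# Sketch only: extract a spark://HOST:PORT URL from a standalone master log.

# Print the first spark:// URL found in the given log file.
# Returns non-zero (and prints nothing) if the log contains no such URL.
master_url_from_log() {
  local log_file="$1"
  local url
  # grep -o prints only the matching part of each line; sed keeps the first match.
  url="$(grep -o 'spark://[^[:space:]]*' "$log_file" | sed -n '1p')" || true
  [ -n "$url" ] && echo "$url"
}
```

For the run shown in the thread, this would be called as master_url_from_log /root/bwang_spark_new/incubator-spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out. If it prints nothing, the master most likely did not start cleanly, and the same log file is the place to look for the error.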