Hi Mayur,

I am using CDH4.6.0p0.26, and the latest Cloudera Spark parcel is Spark
0.9.0 CDH4.6.0p0.50.
As I mentioned, the Cloudera Spark parcel somehow doesn't contain the
run-example shell scripts. However, it is automatically configured and it
is pretty easy to set up across the cluster...

Thanks,
Bin


On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> I have set this up on the Cloudera VM:
> http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
> Which Spark version are you trying to set up on Cloudera, and which Cloudera
> version are you using?
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang <binwang...@gmail.com> wrote:
>
>> Hi Ognen/Mayur,
>>
>> Thanks for the reply, and it is good to know how easy it is to set up Spark
>> on an AWS cluster.
>>
>> My situation is a bit different from yours: our company already has a
>> cluster, and it really doesn't make much sense not to use it. That is
>> why I have been "going through" this. I really wish there were some
>> tutorials on how to set up a Spark cluster on a bare-metal CDH cluster,
>> or some way to tweak the CDH Spark distribution so it is up to date.
>>
>> Ognen, of course it would be very helpful if you could 'history | grep
>> spark' and document the work that you have done, since you've already
>> got it working!
>>
>> Bin
>>
>>
>>
>> On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski <
>> og...@plainvanillagames.com> wrote:
>>
>>>  I should add that in this setup you really do not need to look for the
>>> printout of the master node's IP - you set it yourself a priori. If anyone
>>> is interested, let me know, I can write it all up so that people can follow
>>> some set of instructions. Who knows, maybe I can come up with a set of
>>> scripts to automate it all...
>>>
>>> Ognen
>>>
>>>
>>>
>>> On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
>>>
>>> I have a standalone Spark cluster running in an Amazon VPC that I set up
>>> by hand. All I did was provision the machines from a common AMI image (my
>>> underlying distribution is Ubuntu), create a "sparkuser" on each machine,
>>> and make a /home/sparkuser/spark folder where I downloaded Spark. I did
>>> this on the master only: I ran sbt/sbt assembly and set up
>>> conf/spark-env.sh to point to the master, which is an IP address (in my case
>>> 10.10.0.200; the port is the default 7077). I also set up the slaves file
>>> in the same conf directory with all 16 IP addresses of the worker nodes
>>> (in my case 10.10.0.201-216). After sbt/sbt assembly was done on the master,
>>> I did cd ~/; tar -czf spark.tgz spark/ and copied the resulting tgz
>>> file to each worker using the same "sparkuser" account and unpacked the
>>> .tgz on each slave (this effectively replicates everything from the master
>>> to all slaves - you can script this so you don't have to do it by hand).
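>>>
>>> For reference, a rough sketch of those config and copy steps (the IPs, the
>>> "sparkuser" account and the paths are just my setup from above - adjust for
>>> yours):
>>>
>>>   # on the master, inside /home/sparkuser/spark
>>>   echo 'export SPARK_MASTER_IP=10.10.0.200' >> conf/spark-env.sh
>>>   echo 'export SPARK_MASTER_PORT=7077' >> conf/spark-env.sh
>>>   printf '10.10.0.%d\n' $(seq 201 216) > conf/slaves   # one worker IP per line
>>>
>>>   # package the assembled tree and push it to every worker
>>>   cd ~ && tar -czf spark.tgz spark/
>>>   for ip in $(cat spark/conf/slaves); do
>>>     scp spark.tgz sparkuser@$ip:~/
>>>     ssh sparkuser@$ip 'tar -xzf spark.tgz'
>>>   done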
>>>
>>> Your AMI should have the distribution's version of Java and git
>>> installed by the way.
>>>
>>> All you have to do then is: sparkuser@spark-master> spark/sbin/start-all.sh
>>> (for 0.9; in 0.8.1 it is spark/bin/start-all.sh) and it will all
>>> automagically start :)
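>>>
>>> A quick sanity check I'd suggest (nothing official, just how I eyeball it):
>>>
>>>   jps    # should list a Master process on this box and a Worker on each slave
>>>
>>> and the master's web UI at http://10.10.0.200:8080 should show all 16
>>> workers registered.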
>>>
>>> All my Amazon nodes come with 4x400 GB of ephemeral space, which I have
>>> set up as a 1.6 TB RAID0 array on each node, and I am pooling this into an
>>> HDFS filesystem operated by a namenode outside the Spark cluster,
>>> while all the datanodes are the same nodes as the Spark workers. This
>>> enables replication and extremely fast access, since ephemeral storage is
>>> much faster than EBS or anything else on Amazon (you can do even better with
>>> SSD drives on this setup, but it will cost ya).
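>>>
>>> In case it helps, assembling each node's array looks roughly like this (the
>>> device names are just an example - use whatever ephemeral devices your
>>> instance type exposes, and point the HDFS datanode data directory at the
>>> mount):
>>>
>>>   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
>>>     /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>>>   mkfs.ext4 /dev/md0
>>>   mkdir -p /data && mount /dev/md0 /data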
>>>
>>> If anyone is interested I can document our pipeline setup - I came up
>>> with it myself and do not have a clue as to what the industry standards are,
>>> since I could not find any written instructions anywhere online about how
>>> to set up a whole data analytics pipeline from the point of ingestion to
>>> the point of analytics (people don't want to share their secrets? or am I
>>> just in the dark and incapable of using Google properly?). My requirement
>>> was that I wanted this to run within a VPC for added security and
>>> simplicity; the Amazon security groups get really old quickly. An added
>>> bonus is that you can use a VPN as an entry into the whole system, and your
>>> cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN
>>> since I don't like Cisco or Juniper (the only two options Amazon provides
>>> for their VPN gateways).
>>>
>>> Ognen
>>>
>>>
>>> On 3/3/14, 1:00 PM, Bin Wang wrote:
>>>
>>> Hi there,
>>>
>>>  I have a CDH cluster set up, and I tried using the Spark parcel that comes
>>> with Cloudera Manager, but it turned out it doesn't even have the
>>> run-example shell command in the bin folder. So I removed it from the
>>> cluster, cloned incubator-spark onto the name node of my cluster, and built
>>> it from source there successfully with everything at the defaults.
>>>
>>>  I ran a few examples and everything seems to work fine in local mode.
>>> Now I am thinking about scaling it out to my cluster, which is what the
>>> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all
>>> the datanodes as slaves, and I think I should run Spark in standalone
>>> mode.
>>>
>>>  Say I am trying to set up Spark in standalone mode following these
>>> instructions:
>>> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
>>> However, it says "Once started, the master will print out a
>>> spark://HOST:PORT URL for itself, which you can use to connect workers
>>> to it, or pass as the "master" argument to SparkContext. You can also
>>> find this URL on the master's web UI, which is http://localhost:8080 by
>>> default."
>>>
>>>  After I started the master, no URL was printed on the screen, and the
>>> web UI isn't running either.
>>> Here is the output:
>>>  [root@box incubator-spark]# ./sbin/start-master.sh
>>> starting org.apache.spark.deploy.master.Master, logging to
>>> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
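>>>
>>>  (I am guessing the spark://HOST:PORT line ends up in that log file rather
>>> than on the console, so something like
>>>
>>>   grep "spark://" logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>>>
>>> run from the incubator-spark directory should show it if the master actually
>>> started, but I am not sure that is the right place to look, hence the
>>> questions below.)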
>>>
>>>  First Question: am I even in the ballpark by running Spark in standalone
>>> mode if I want to fully utilize my cluster? I saw there are four ways to
>>> launch Spark on a cluster (Amazon EC2, standalone mode, Apache Mesos, and
>>> Hadoop YARN), and I guess standalone mode is the way to go?
>>>
>>>  Second Question: how do I get the Spark URL of the cluster, and why is the
>>> output not like what the instructions say?
>>>
>>>  Best regards,
>>>
>>>  Bin
>>>
>>>
>>> --
>>> Some people, when confronted with a problem, think "I know, I'll use 
>>> regular expressions." Now they have two problems.
>>> -- Jamie Zawinski
>>>
>>>
>>>
>>
>
