About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
run some batch analytics on the data.

Now that I have a little more experience, I wonder if this was a good way to
set up the cluster, given the following issues:
1. I have not been able to find explicit directions for upgrading the Spark
version
> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters
2. I am not sure where the data is physically stored. I am afraid I may
accidentally lose all my data.
3. spark-ec2 makes it easy to launch a cluster with as many machines as you
like; however, it's not clear how I would add slaves to an existing
installation.

In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");
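For reference, a minimal sketch of that kind of job (Spark 1.6+ Java streaming API; the socket source, host/port, and output prefix below are placeholders, not our actual setup):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamToHdfs {

    // Hypothetical helper: each batch gets its own output directory.
    static String batchPath(String prefix, long batchTimeMs) {
        return prefix + batchTimeMs;
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamToHdfs");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source; our real app reads from a different input stream.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // saveAsTextFile creates one directory per batch under the HDFS prefix.
        lines.foreachRDD((rdd, time) ->
                rdd.saveAsTextFile(batchPath("hdfs:///data/batch-", time.milliseconds())));

        ssc.start();
        ssc.awaitTermination();
    }
}
```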

ephemeral-hdfs/conf/hdfs-site.xml:

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
  </property>



persistent-hdfs/conf/hdfs-site.xml:



$ mount
/dev/xvdb on /mnt type ext3 (rw,nodiratime)
/dev/xvdf on /mnt2 type ext3 (rw,nodiratime)



http://spark.apache.org/docs/latest/ec2-scripts.html


"The spark-ec2 script also supports pausing a cluster. In this case, the VMs
are stopped but not terminated, so they lose all data on ephemeral disks but
keep the data in their root partitions and their persistent-hdfs."
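If I read that right, the write target decides durability: spark-ec2 stands up two namenodes, and as far as I can tell the ephemeral one listens on port 9000 and the persistent one on 9010 (the ports and the host name below are assumptions on my part; they should be verified against each core-site.xml):

```java
import java.net.URI;

public class HdfsTargets {

    // Assumed spark-ec2 defaults; check ephemeral-hdfs/conf/core-site.xml
    // and persistent-hdfs/conf/core-site.xml for the real values.
    static final String EPHEMERAL_BASE  = "hdfs://ec2-master:9000"; // instance-store disks (/mnt, /mnt2)
    static final String PERSISTENT_BASE = "hdfs://ec2-master:9010"; // EBS-backed, survives stop/start

    // Pick the namenode whose storage matches the durability you need.
    static URI target(boolean mustSurviveStop, String path) {
        return URI.create((mustSurviveStop ? PERSISTENT_BASE : EPHEMERAL_BASE) + path);
    }

    public static void main(String[] args) {
        System.out.println(target(true, "/data"));     // hdfs://ec2-master:9010/data
        System.out.println(target(false, "/scratch")); // hdfs://ec2-master:9000/scratch
    }
}
```

So a streaming app that must keep its output across a stop/start would write to the persistent namenode's URI rather than the default ephemeral one.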


Initially I thought using HDFS was a good idea; spark-ec2 makes HDFS easy to
use. I incorrectly assumed Spark somehow knew how HDFS partitioned my data.

I think many people are using Amazon S3. I do not have any direct experience
with S3. My concern would be that the data is not physically stored close
to my slaves, i.e. high communication costs.
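In case it helps the discussion, here is roughly what pointing the batch reads at S3 might look like. The bucket, path, and credential values are placeholders, and s3n:// is the scheme I believe the Hadoop versions shipped by spark-ec2 support (newer stacks use s3a://):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromS3 {

    // Hypothetical helper to build an s3n:// input path.
    static String s3Path(String bucket, String key) {
        return "s3n://" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadFromS3");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder credentials; on EC2 an IAM instance role avoids
        // putting keys in code at all.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "PLACEHOLDER");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "PLACEHOLDER");

        // Data streams from S3 over the network instead of from local disks,
        // so there is no HDFS-style data locality for these reads.
        JavaRDD<String> lines = sc.textFile(s3Path("my-bucket", "path/"));
        System.out.println(lines.count());
        sc.stop();
    }
}
```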

Any suggestions would be greatly appreciated

Andy

