About 2 months ago I used spark-ec2 to set up a small cluster. The cluster runs a Spark Streaming app 7x24 and stores the data to HDFS. I also need to run some batch analytics on the data.
Now that I have a little more experience, I wonder whether this was a good way to set up the cluster. I have run into the following issues:

1. I have not been able to find explicit directions for upgrading the Spark version.
   http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters

2. I am not sure where the data is physically stored. I am worried I may accidentally lose all my data.

3. spark-ec2 makes it easy to launch a cluster with as many machines as you like; however, it is not clear how I would add slaves to an existing installation.

In our Java streaming app we call rdd.saveAsTextFile("hdfs://path");

ephemeral-hdfs/conf/hdfs-site.xml:

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
</property>

persistent-hdfs/conf/hdfs-site.xml

$ mount
/dev/xvdb on /mnt type ext3 (rw,nodiratime)
/dev/xvdf on /mnt2 type ext3 (rw,nodiratime)

From http://spark.apache.org/docs/latest/ec2-scripts.html:

"The spark-ec2 script also supports pausing a cluster. In this case, the VMs are stopped but not terminated, so they lose all data on ephemeral disks but keep the data in their root partitions and their persistent-hdfs."

Initially I thought using HDFS was a good idea; spark-ec2 makes HDFS easy to use. I incorrectly thought Spark somehow knew how HDFS partitioned my data.

I think many people are using Amazon S3. I do not have any direct experience with S3. My concern would be that the data is not physically stored close to my slaves, i.e. high communication costs.

Any suggestions would be greatly appreciated.

Andy
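P.S. To make question 2 concrete: as I understand it, the scheme/host/port of the URI we pass to saveAsTextFile() decides which filesystem actually receives the data, and an unqualified "hdfs://..." falls through to whatever fs.default.name points at (the ephemeral HDFS, in a stock spark-ec2 cluster). The port numbers below (9000 ephemeral, 9010 persistent) are assumptions based on spark-ec2 defaults that I have not verified against my own core-site.xml, so please correct me if this is wrong:

```java
import java.net.URI;

public class HdfsUriCheck {
    // Given a path like the one we pass to rdd.saveAsTextFile(...),
    // report which filesystem it would target.
    // ASSUMPTION: spark-ec2 runs ephemeral-hdfs on port 9000 and
    // persistent-hdfs on port 9010; check fs.default.name in each
    // */conf/core-site.xml to confirm on a real cluster.
    static String target(String path) {
        URI u = URI.create(path);
        if (!"hdfs".equals(u.getScheme())) {
            return "not HDFS: " + u.getScheme();
        }
        // No explicit port (-1) means the default filesystem, which in a
        // stock spark-ec2 setup is the ephemeral HDFS.
        return u.getPort() == 9010 ? "persistent-hdfs" : "ephemeral-hdfs";
    }

    public static void main(String[] args) {
        System.out.println(target("hdfs://namenode:9000/path")); // ephemeral-hdfs
        System.out.println(target("hdfs://namenode:9010/path")); // persistent-hdfs
        System.out.println(target("hdfs://path"));               // ephemeral-hdfs (default fs)
        System.out.println(target("s3n://bucket/path"));         // not HDFS: s3n
    }
}
```

If that reading is right, then our current rdd.saveAsTextFile("hdfs://path") call is writing to the ephemeral HDFS, i.e. to the instance-store disks that are wiped when the cluster is stopped, which would explain my worry about losing data.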