#2: If you are using HDFS, the data is on the cluster's local disks. You can
use the HDFS command line to browse your data, and then use s3distcp (or
plain distcp) to copy data from HDFS to S3. Or even use "hdfs dfs -get" to
copy to local disk and then use the S3 CLI to upload to S3.
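As a rough sketch of those two options (paths and the bucket name
"my-spark-backup" are placeholders, not anything from your cluster; depending
on your Hadoop version the S3 scheme may be s3n:// instead of s3a://):

```shell
# Browse what is currently stored in HDFS
hdfs dfs -ls -R /user/spark/streaming-output

# Option 1: copy straight from HDFS to S3 with distcp
hadoop distcp hdfs:///user/spark/streaming-output \
    s3a://my-spark-backup/streaming-output

# Option 2: pull to local disk first, then upload with the AWS CLI
hdfs dfs -get /user/spark/streaming-output /tmp/streaming-output
aws s3 cp /tmp/streaming-output \
    s3://my-spark-backup/streaming-output --recursive
```

For option 1 the S3 credentials need to be visible to Hadoop (e.g. in
core-site.xml); for option 2 you need enough local disk to stage the data.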

#3: The cost of accessing data in S3 from EC2 nodes, though not as fast as
local disks, is still fast enough. You can use HDFS for intermediate steps
and use S3 for final storage. Make sure your S3 bucket is in the same
region as your EC2 cluster.
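The same-region advice can be sketched with the AWS CLI (the bucket name and
region below are placeholders; pick the region your cluster runs in):

```shell
# Create the output bucket in the same region as the EC2 cluster
aws s3 mb s3://my-spark-final-output --region us-east-1

# Verify where the bucket actually lives before pointing Spark at it
aws s3api get-bucket-location --bucket my-spark-final-output

# The streaming app can then write final results directly to S3, e.g.
#   rdd.saveAsTextFile("s3a://my-spark-final-output/results");
```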

Regards
Sab
On 04-Dec-2015 3:35 am, "Andy Davidson" <a...@santacruzintegration.com>
wrote:

> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
> runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
> run some batch analytics on the data.
>
> Now that I have a little more experience, I wonder whether this was a good
> way to set up the cluster, given the following issues:
>
>    1. I have not been able to find explicit directions for upgrading the
>    spark version
>       1.
>       
> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2&subj=Re+Upgrading+Spark+in+EC2+clusters
>    2. I am not sure where the data is physically stored. I think I may
>    accidentally lose all my data
>    3. spark-ec2 makes it easy to launch a cluster with as many machines
>    as you like; however, it's not clear how I would add slaves to an
>    existing installation
>
>
> Our Java streaming app calls rdd.saveAsTextFile(“hdfs://path”);
>
> ephemeral-hdfs/conf/hdfs-site.xml:
>
>   <property>
>
>     <name>dfs.data.dir</name>
>
>     <value>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
>
>   </property>
>
>
> persistent-hdfs/conf/hdfs-site.xml
>
>
> $ mount
>
> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>
> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>
>
> http://spark.apache.org/docs/latest/ec2-scripts.html
>
> "The spark-ec2 script also supports pausing a cluster. In this case,
> the VMs are stopped but not terminated, so they *lose all data on
> ephemeral disks* but keep the data in their root partitions and their
> persistent-hdfs.”
>
>
> Initially I thought using HDFS was a good idea. spark-ec2 makes HDFS easy
> to use. I incorrectly thought Spark somehow knew how HDFS partitioned my
> data.
>
> I think many people are using Amazon S3. I do not have any direct
> experience with S3. My concern would be that the data is not physically
> stored close to my slaves, i.e. high communication costs.
>
> Any suggestions would be greatly appreciated
>
> Andy
>
