Hey Joe,

With the ephemeral HDFS, you get the instance store of your worker nodes.
For m3.xlarge that will be two 40 GB SSDs local to each instance, which are
very fast.
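
As a rough sanity check on capacity, here is a minimal sketch of the
arithmetic using the figures from your original mail (5 nodes, 73 GB
reported per node, 800 GB of input). The replication factor is an
assumption on my part, so check dfs.replication in your hdfs-site.xml.
The function names are made up for illustration.

```python
import math

def usable_hdfs_gb(nodes, gb_per_node, replication):
    """Logical HDFS capacity: raw capacity divided by the replication factor."""
    return nodes * gb_per_node / replication

def nodes_needed(data_gb, gb_per_node, replication):
    """Smallest node count whose raw capacity holds every replica of the data."""
    return math.ceil(data_gb * replication / gb_per_node)

# Figures from the thread: 5 workers, 73 GB reported per HDFS node, 800 GB input.
# The replication values below are assumptions, not measured settings.
print(usable_hdfs_gb(5, 73, 1))   # 365.0 GB with no replication
print(nodes_needed(800, 73, 1))   # 11 nodes with no replication
print(nodes_needed(800, 73, 3))   # 33 nodes with 3 replicas
```

This ignores space for shuffle files and persisted RDDs, so treat the
node counts as lower bounds.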

For the persistent HDFS, you get whatever EBS volumes the launch script
configured. EBS volumes are always network drives, so the usual limitations
apply. To optimize throughput, you can use EBS volumes with provisioned
IOPS and you can use EBS optimized instances. I don't have hard numbers at
hand, but I'd expect this to be noticeably slower than using local SSDs.

As far as only using S3 goes, it depends on your use case (i.e. what you
plan on doing with the data while it is there). If you store it there in
between running different applications, you can likely work around
consistency issues.

Also, if you use Amazon's EMRFS to access data in S3, you can use their new
consistency feature (

Hope this helps!

On Tue, Feb 3, 2015 at 9:32 AM, Joe Wass <jw...@crossref.org> wrote:

> The data is coming from S3 in the first place, and the results will be
> uploaded back there. But even in the same availability zone, fetching 170
> GB (that's gzipped) is slow. From what I understand of the pipelines,
> multiple transforms on the same RDD might involve re-reading the input,
> which very quickly adds up in comparison to having the data locally. Unless
> I persisted the data (which I am in fact doing) but that would involve
> storing approximately the same amount of data in HDFS, which wouldn't fit.
> Also, I understood that S3 was unsuitable for practical use? See "Why you
> cannot use S3 as a replacement for HDFS"[0]. I'd love to be proved wrong,
> though, that would make things a lot easier.
> [0] http://wiki.apache.org/hadoop/AmazonS3
> On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net> wrote:
>> You could also just push the data to Amazon S3, which would un-link the
>> size of the cluster needed to process the data from the size of the data.
>> DR
>> On 02/03/2015 11:43 AM, Joe Wass wrote:
>>> I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
>>> to store the input in HDFS somehow.
>>> I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk.
>>> Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
>>> If I want to process 800 GB of data (assuming I can't split the jobs up),
>>> I'm guessing I need to get persistent-hdfs involved.
>>> 1 - Does persistent-hdfs have noticeably different performance than
>>> ephemeral-hdfs?
>>> 2 - If so, is there a recommended configuration (like storing input and
>>> output on persistent, but persisted RDDs on ephemeral?)
>>> This seems like a common use-case, so sorry if this has already been
>>> covered.
>>> Joe

