I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
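
If it helps, here is a rough sketch of how such a comparison could be
timed. It is written in Python for the pyspark shell rather than the Scala
spark-shell, and it is only an illustration, not what I actually ran. The
time_read helper is a made-up name, and the s3n:// scheme and both paths in
the comments are placeholders you would need to adapt to your own setup.

```python
import time

def time_read(read_fn, runs=3):
    """Call read_fn `runs` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = read_fn()  # e.g. a full count() over a text file
        best = min(best, time.perf_counter() - start)
    return result, best

# In the pyspark shell, `sc` is the predefined SparkContext.
# The bucket and paths below are placeholders (assumptions), not real locations:
#
#   n, s3_secs = time_read(lambda: sc.textFile("s3n://<bucket>/<path>").count())
#   n, hdfs_secs = time_read(lambda: sc.textFile("hdfs://<master IP>:9000/<path>").count())
#   print(s3_secs / hdfs_secs)
```

Taking the best of several runs reduces noise from one-off delays, but
repeated reads may also benefit from OS or service-side caching, so treat
the numbers as a rough guide only.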

I tested our application this way yesterday and found the SSD-based HDFS
to outperform S3 by a factor of 2. I don't know the cause. It may be
locality, as Akhil suggests, or SSD vs. HDD (assuming S3 is HDD-backed). Or
the HDFS client library and protocol may simply be better than their S3
counterparts (the S3 path is HTTP-based and uses some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:

> latency is much bigger for S3 (if that matters)
> And with HDFS you'd get data-locality that will boost your app performance.
>
> I did some light experimenting on this.
> see my presentation here for some benchmark numbers ..etc
> http://www.slideshare.net/sujee/hadoop-to-sparkv2
> from slide# 34
>
> cheers
> Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam
> )
> teaching Spark
> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>
>
> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> Are there any significant performance differences between reading text
>> files from S3 and hdfs?
>>
>
>
