seek() is very, very expensive on S3, even short forward seeks. If your code 
does a lot of them, it will kill performance. Forward seeks are better in s3a 
(which, as of Hadoop 2.7, is now something safe to use) and in the S3 client that 
Amazon includes in EMR, but it's still sluggish.
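
One way to cut down on seeks is to sort the reads you need and merge the ones that sit close together, so small forward gaps are read through instead of seeked over. A minimal sketch of that idea follows; the function name, the sample read plan and the 64-byte gap threshold are made up for illustration and are not taken from any Hadoop or S3 client.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) reads whose gaps are at most max_gap bytes.

    Returns a list of (offset, length) tuples sorted by offset.
    """
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous read: extend it instead of seeking.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical read plan: four small reads, three of them close together.
plan = [(0, 100), (150, 50), (220, 30), (1_000_000, 100)]
print(coalesce_ranges(plan, max_gap=64))  # [(0, 250), (1000000, 100)]
```

Here four reads collapse into two, at the price of reading 70 bytes that were not asked for. The right threshold depends on how expensive a seek is relative to reading the gap, which you would have to measure.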

The other killers are:
 - anything involving renaming files or directories
 - copy operations
 - listing lots of files
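
To see why renames hurt: S3 has no rename operation, so a client has to emulate a directory rename by listing the keys under a prefix, then copying and deleting each object. The sketch below counts the simulated requests for that pattern against an in-memory stand-in. FakeObjectStore and rename_prefix are hypothetical names for illustration, not a real client API, and real clients may page listings or batch deletes.

```python
class FakeObjectStore:
    """In-memory stand-in for a flat key/value object store (hypothetical)."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # counts simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory rename: list, then copy + delete every object."""
    for key in store.list_prefix(src_prefix):
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)


store = FakeObjectStore()
for i in range(100):
    store.put(f"tmp/part-{i:05d}", b"data")
store.requests = 0  # only count the rename itself
rename_prefix(store, "tmp/", "final/")
print(store.requests)  # 201: 1 list + 100 copies + 100 deletes
```

On HDFS the same rename is a single metadata operation, which is why job output committers that rename a temporary directory into place behave so differently on the two stores.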

Finally, S3 is HDD-backed, and 1 file == 1 block. In HDFS you can have 3 or more 
processes reading different replicas of the same block of a file, giving 3x the 
bandwidth, whereas disk bandwidth from an S3 object is shared by all readers. The 
more readers, the worse the performance.
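
Taking that claim at face value, the arithmetic can be sketched as a toy model. The 100 MB/s disk figure is an arbitrary assumption, and the model ignores network, caching and everything else; it is not a measurement of either system.

```python
def per_reader_bandwidth(disk_mb_s, copies, readers):
    """Idealised share of disk bandwidth per reader, in MB/s.

    Each copy of the data can serve readers at disk_mb_s in total, and
    readers spread evenly over the available copies.
    """
    return disk_mb_s * min(copies, readers) / readers


DISK_MB_S = 100  # assumed figure, for illustration only
for readers in (1, 3, 6):
    hdfs = per_reader_bandwidth(DISK_MB_S, copies=3, readers=readers)
    s3 = per_reader_bandwidth(DISK_MB_S, copies=1, readers=readers)
    print(f"{readers} readers: HDFS {hdfs:.1f} MB/s each, S3 {s3:.1f} MB/s each")
```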


On 9 Jul 2015, at 14:31, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:

I recommend testing it for yourself. Even if you have no application, you can 
just run the spark-ec2 script, log in, run spark-shell and try reading files 
from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral 
HDFS cluster, which uses SSD.)
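
A rough sketch of such a test, assuming the pyspark shell instead of spark-shell (so `sc` is already defined). The helper names and both paths are placeholders; substitute a bucket and file you actually have.

```python
import time


def time_line_count(sc, path):
    """Return (line_count, seconds) for reading `path` as a text file."""
    start = time.perf_counter()
    count = sc.textFile(path).count()  # count() forces the full read
    return count, time.perf_counter() - start


def compare_reads(sc, paths):
    """Time each path in turn and return {path: (line_count, seconds)}."""
    return {path: time_line_count(sc, path) for path in paths}


# In the pyspark shell, `sc` already exists. Both paths are placeholders:
# results = compare_reads(sc, [
#     "s3n://my-bucket/some/file.txt",
#     "hdfs://<master IP>:9000/some/file.txt",
# ])
# for path, (lines, secs) in results.items():
#     print(f"{path}: {lines} lines in {secs:.1f}s")
```

Use the same file in both places and repeat each read a few times, since a single run can be skewed by caching or a cold connection.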

I just tested our application this way yesterday and found the SSD-based HDFS 
to outperform S3 by a factor of 2. I don't know the cause. It may be locality 
like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or the HDFS 
client library and protocol are just better than the S3 versions (which are 
HTTP-based and use some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
Latency is much higher for S3 (if that matters).
And with HDFS you'd get data locality, which will boost your app's performance.

I did some light experimenting on this.
See my presentation for some benchmark numbers, starting at slide 34:
http://www.slideshare.net/sujee/hadoop-to-sparkv2

cheers
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam)
teaching Spark <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>

On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com> wrote:
Are there any significant performance differences between reading text files 
from S3 and HDFS?


