Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use a fixed
part size that matches Spark's default HDFS block size (64 MB, I think)
so that reads are aligned with part boundaries.
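
As a rough sketch of what "aligned" means here: the snippet below just does
the byte-range arithmetic for fixed-size parts. It assumes the 64 MB figure
above and that each part is stored as its own block. The names part_ranges
and parts_touched are made up for illustration and are not part of any S3
or Spark API.

```python
# Sketch only: byte-range arithmetic for fixed-size multipart-upload parts.
# PART_SIZE is an assumption (the 64 MB guess above), not a verified default.
PART_SIZE = 64 * 1024 * 1024


def part_ranges(object_size, part_size=PART_SIZE):
    """Return (part_number, start, end_exclusive) for every part.

    Part numbers are 1-based; the last part may be shorter than part_size.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size > 0")
    return [
        (index + 1, start, min(start + part_size, object_size))
        for index, start in enumerate(range(0, object_size, part_size))
    ]


def parts_touched(read_start, read_end, part_size=PART_SIZE):
    """Return the 1-based part numbers overlapped by [read_start, read_end)."""
    if read_start < 0 or read_end <= read_start:
        raise ValueError("need 0 <= read_start < read_end")
    first = read_start // part_size + 1
    last = (read_end - 1) // part_size + 1
    return list(range(first, last + 1))


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 150 MB object splits into two full parts and one short tail part.
    for number, start, end in part_ranges(150 * mb):
        print(number, start // mb, end // mb)
    # An aligned 64 MB read stays inside one part; a misaligned one spans two.
    print(parts_touched(64 * mb, 128 * mb))
    print(parts_touched(32 * mb, 96 * mb))
```

A reader whose split starts on a multiple of the part size touches exactly
one part, while a split that starts mid-part straddles two, which is the
misalignment the fixed part size is meant to avoid.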

On Sat, Jul 11, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>  seek() is very, very expensive on s3, even short forward seeks. If your
> code does a lot of them, it will kill performance. (Forward seeks are better
> in s3a, which with Hadoop 2.3 is now something safe to use, and in the s3
> client that Amazon include in EMR, but it's still sluggish.)
>
>  The other killers are
>  -anything involving renaming files or directories
>  -copy operations
>  -listing lots of files.
>
>  Finally, S3 is HDD-backed, and 1 file == 1 block. In HDFS you can have >3
> processes reading different replicas of the same block of a file, giving
> 3x the bandwidth, whereas disk bandwidth from an s3 object will be shared
> by all readers. The more readers, the worse the performance.
>
>
>  On 9 Jul 2015, at 14:31, Daniel Darabos <daniel.dara...@lynxanalytics.com>
> wrote:
>
>  I recommend testing it for yourself. Even if you have no application,
> you can just run the spark-ec2 script, log in, run spark-shell and try
> reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is
> the ephemeral HDFS cluster, which uses SSD.)
>
>  I just tested our application this way yesterday and found the SSD-based
> HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
> locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
> the HDFS client library and protocol are just better than the S3 versions
> (which are HTTP-based and use some 6-year-old libraries).
>
> On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
>
>> Latency is much higher for S3 (if that matters).
>> And with HDFS you'd get data locality, which will boost your app
>> performance.
>>
>>  I did some light experimenting on this.
>> See my presentation here for some benchmark numbers, etc.:
>> http://www.slideshare.net/sujee/hadoop-to-sparkv2
>>  (from slide #34)
>>
>>  cheers
>>  Sujee Maniyam (http://sujee.net |
>> http://www.linkedin.com/in/sujeemaniyam )
>>  teaching Spark
>> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>>
>>
>> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com>
>> wrote:
>>
>>> Are there any significant performance differences between reading text
>>> files from S3 and hdfs?
>>>
>>
>>
>
>
