A VPC endpoint can also make a major difference in cost. Without one, access
to S3 incurs data transfer and NAT gateway charges, and these can be large.
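
As a rough illustration (the per-GB NAT rate below is an assumption for the
sake of the example; check current AWS pricing for your region):

    // Back-of-envelope NAT cost for S3 reads made without a gateway VPC
    // endpoint. The $0.045/GB processing rate is an ASSUMED illustrative
    // figure; actual rates vary by region.
    object NatCostSketch extends App {
      val gbReadPerRun = 1024.0 // e.g. ~1 TB of parquet input per run
      val natPerGbUsd  = 0.045  // assumed NAT gateway processing rate, USD/GB
      val runsPerDay   = 24     // hypothetical hourly schedule
      val dailyUsd     = gbReadPerRun * natPerGbUsd * runsPerDay
      println(f"~$$$dailyUsd%.0f/day on NAT processing alone")
    }

A gateway endpoint for S3 has no charge and eliminates the NAT hop for
same-region S3 traffic entirely.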

On Wed, 7 Apr 2021 at 14:13, Hariharan <hariharan...@gmail.com> wrote:

> Hi Tzahi,
>
> Comparing the first two cases:
>
>    - > reads the parquet files from S3 and also writes to S3, it takes 22
>    min
>    - > reads the parquet files from S3 and writes to its local hdfs, it
>    takes the same amount of time (±22 min)
>
>
> It looks like most of the time is being spent in reading, and the time
> spent in writing is likely negligible (presumably you're not writing much
> output?).
>
> Can you clarify the difference between these two?
>
> > reads the parquet files from S3 and writes to its local hdfs, it takes
> the same amount of time (±22 min)?
> > reads the parquet files from S3 (they were copied into the hdfs before)
> and writes to its local hdfs, the job took 7 min
>
> In the second case, was the data read from HDFS or S3?
>
> Regarding the points from the post you linked to:
> 1. Enhanced networking does make a difference
> <https://laptrinhx.com/hadoop-with-enhanced-networking-on-aws-1893465489/>,
> but it should be enabled automatically if you're using a compatible
> instance type and an AWS AMI. If you're using a custom AMI, however, you
> might want to check whether it's enabled for you (a verification sketch
> follows after this list).
> 2. VPC endpoints can also make a difference in performance - at least that
> used to be the case a few years ago. Maybe that has changed now.
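>
> If you want to verify point 1 programmatically rather than from the EC2
> console, here's a minimal sketch using the AWS SDK for Java v2 from Scala.
> Note the instance id is a placeholder, and I'm assuming the ec2 module of
> the SDK is on the classpath:
>
>     import software.amazon.awssdk.services.ec2.Ec2Client
>     import software.amazon.awssdk.services.ec2.model.{
>       DescribeInstanceAttributeRequest, InstanceAttributeName}
>
>     object EnaCheck extends App {
>       // Uses the default credentials/region provider chain.
>       val ec2 = Ec2Client.create()
>       val resp = ec2.describeInstanceAttribute(
>         DescribeInstanceAttributeRequest.builder()
>           .instanceId("i-0123456789abcdef0") // placeholder instance id
>           .attribute(InstanceAttributeName.ENA_SUPPORT)
>           .build())
>       println(s"ENA enabled: ${resp.enaSupport().value()}")
>     }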
>
> A couple of other things you might want to check:
> 1. If your bucket is versioned, check that you're using the ListObjectsV2
> API in S3A <https://issues.apache.org/jira/browse/HADOOP-13421> (a config
> sketch follows below).
> 2. Also check these recommendations from Cloudera
> <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
> for optimal use of S3A.
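>
> For point 1, a minimal config sketch (assuming Spark with the hadoop-aws /
> S3A connector; fs.s3a.list.version comes from HADOOP-13421, and the
> fadvise setting is among the Cloudera recommendations):
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("s3a-tuning-sketch")
>       // Use the ListObjectsV2 API for object listings.
>       .config("spark.hadoop.fs.s3a.list.version", "2")
>       // Random-access read policy; suits columnar formats like Parquet.
>       .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
>       .getOrCreate()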
>
> Thanks,
> Hariharan
>
>
>
> On Wed, Apr 7, 2021 at 12:15 AM Tzahi File <tzahi.f...@ironsrc.com> wrote:
>
>> Hi All,
>>
>> We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.
>>
>> The Spark job running on that cluster reads from an S3 bucket and writes
>> back to the same bucket.
>>
>> The bucket and the EC2 instances are in the same region.
>>
>> As part of our efforts to reduce the runtime of our Spark jobs, we found
>> there is serious latency when reading from S3.
>>
>> When the job:
>>
>>    - reads the parquet files from S3 and also writes to S3, it takes 22
>>    min
>>    - reads the parquet files from S3 and writes to its local hdfs, it
>>    takes the same amount of time (±22 min)
>>    - reads the parquet files from S3 (they were copied into the hdfs
>>    before) and writes to its local hdfs, the job took 7 min
>>
>> The Spark job has the following S3-related configuration:
>>
>>    - spark.hadoop.fs.s3a.connection.establish.timeout=5000
>>    - spark.hadoop.fs.s3a.connection.maximum=200
>>
>> When reading from S3, we tried increasing the
>> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 and
>> then to 900, but it didn't reduce the S3 latency.
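>>
>> For reference, a sketch of how these settings are typically applied when
>> building the session (illustrative, not our exact job code):
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val spark = SparkSession.builder()
>>       // TCP connect timeout to S3, in milliseconds.
>>       .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
>>       // Maximum size of the S3A HTTP connection pool.
>>       .config("spark.hadoop.fs.s3a.connection.maximum", "200")
>>       .getOrCreate()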
>>
>> Do you have any idea for the cause of the read latency from S3?
>>
>> I saw this post
>> <https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
>> about improving transfer speed - is anything there relevant?
>>
>>
>> Thanks,
>> Tzahi
>>
> --
Vladimir Prus
http://vladimirprus.com
