A VPC endpoint can also make a major difference in costs. Without one,
traffic to S3 incurs data transfer and NAT gateway charges, and these can
be large.
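Setting one up is cheap - a gateway endpoint for S3 is a single API call.
A minimal boto3 sketch (the region, VPC ID, and route table ID below are
placeholders, not values from this thread):

    # Create an S3 gateway endpoint so S3 traffic stays inside the VPC
    # instead of going through a NAT gateway. All IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # your region
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",             # your VPC
        ServiceName="com.amazonaws.us-east-1.s3",  # match your region
        RouteTableIds=["rtb-0123456789abcdef0"],   # your route table(s)
    )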
On Wed, 7 Apr 2021 at 14:13, Hariharan <hariharan...@gmail.com> wrote:

> Hi Tzahi,
>
> Comparing the first two cases:
>
>    - reads the parquet files from S3 and also writes to S3, it takes
>    22 min
>    - reads the parquet files from S3 and writes to its local HDFS, it
>    takes the same amount of time (±22 min)
>
> It looks like most of the time is being spent in reading, and the time
> spent in writing is likely negligible (probably you're not writing much
> output?).
>
> Can you clarify the difference between these two?
>
> reads the parquet files from S3 and writes to its local HDFS, it takes
> the same amount of time (±22 min)?
>
> reads the parquet files from S3 (they were copied into HDFS beforehand)
> and writes to its local HDFS, the job took 7 min
>
> In the second case, was the data read from HDFS or S3?
>
> Regarding the points from the post you linked to:
> 1. Enhanced networking does make a difference
> <https://laptrinhx.com/hadoop-with-enhanced-networking-on-aws-1893465489/>,
> but it should be automatically enabled if you're using a compatible
> instance type and an AWS AMI. However, if you're using a custom AMI,
> you might want to check whether it's enabled for you.
> 2. VPC endpoints can also make a difference in performance - at least
> that used to be the case a few years ago. Maybe that has changed now.
>
> A couple of other things you might want to check:
> 1. If your bucket is versioned, check whether you're using the
> ListObjectsV2 API in S3A
> <https://issues.apache.org/jira/browse/HADOOP-13421>.
> 2. Also check these recommendations from Cloudera
> <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
> for optimal use of S3A.
>
> Thanks,
> Hariharan
>
> On Wed, Apr 7, 2021 at 12:15 AM Tzahi File <tzahi.f...@ironsrc.com> wrote:
>
>> Hi All,
>>
>> We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.
>>
>> The Spark job running on that cluster reads from an S3 bucket and
>> writes to that same bucket.
>>
>> The bucket and the EC2 instances are in the same region.
>>
>> As part of our efforts to reduce the runtime of our Spark jobs, we
>> found there's serious latency when reading from S3.
>>
>> When the job:
>>
>>    - reads the parquet files from S3 and also writes to S3, it takes
>>    22 min
>>    - reads the parquet files from S3 and writes to its local HDFS, it
>>    takes the same amount of time (±22 min)
>>    - reads the parquet files from S3 (they were copied into HDFS
>>    beforehand) and writes to its local HDFS, the job takes 7 min
>>
>> The Spark job has the following S3-related configuration:
>>
>>    - spark.hadoop.fs.s3a.connection.establish.timeout=5000
>>    - spark.hadoop.fs.s3a.connection.maximum=200
>>
>> When reading from S3 we tried increasing the
>> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400
>> or 900, but it didn't reduce the S3 latency.
>>
>> Do you have any idea of the cause of the read latency from S3?
>>
>> I saw this post
>> <https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
>> on improving the transfer speed - is anything there relevant?
>>
>> Thanks,
>> Tzahi

--
Vladimir Prus
http://vladimirprus.com
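As a footnote to the thread: the S3A settings discussed above could be
applied from PySpark roughly as follows. This is a sketch, not a tested
configuration - the connection settings mirror Tzahi's job, while
fs.s3a.experimental.input.fadvise and fs.s3a.list.version follow the
Cloudera guide and HADOOP-13421 that Hariharan linked. The app name and
bucket path are placeholders.

    # Sketch: apply the S3A settings discussed in this thread.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-read-tuning")  # placeholder name
        # Settings from Tzahi's job.
        .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        # Random-access fadvise suits columnar formats such as Parquet,
        # which seek around inside files rather than reading sequentially
        # (per the Cloudera S3A performance guide).
        .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
        # Use the ListObjectsV2 API (HADOOP-13421); matters mostly for
        # versioned buckets with many delete markers.
        .config("spark.hadoop.fs.s3a.list.version", "2")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://your-bucket/your-prefix/")  # placeholder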