[ https://issues.apache.org/jira/browse/HADOOP-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated HADOOP-17789:
------------------------------------
    Summary: S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop  (was: S3 read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop)

> S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-17789
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17789
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Arghya Saha
>            Priority: Minor
>         Attachments: storediag.log
>
>
> This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755
> With exactly the same resources and configuration, the input size reported by Spark (Hadoop 3.3.1) was almost double that reported by Spark (Hadoop 3.2.0), and the read runtime increased by around 20%. This also affects jobs that were not hit by the read-fully error described in HADOOP-17755.
> *I saw exactly the same issue on Hadoop 3.2.0 when using the workaround fs.s3a.readahead.range = 1G.*
> Further details:
>
> |Hadoop Version|Actual size of the files (SQL tab)|Reported size of the files (Stages tab)|Time to complete the stage|fs.s3a.readahead.range|
> |Hadoop 3.2.0|29.3 GiB|29.3 GiB|23 min|64K|
> |Hadoop 3.3.1|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}27 min{color}*|64K|
> |Hadoop 3.2.0|29.3 GiB|*{color:#ff0000}58.7 GiB{color}*|*{color:#ff0000}~27 min{color}*|1G|
> * *Shuffle Write* is the same (95.9 GiB) in all three cases.
> I was expecting read performance with Hadoop 3.3.1 to improve, or at least match 3.2.0. Please suggest how to approach and resolve this.
> I used the default s3a configuration along with the settings below, on an EKS cluster:
> {code:java}
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
> spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"{code}
> * I did not set
> {code:java}
> spark.hadoop.fs.s3a.experimental.input.fadvise=random{code}
> As already mentioned, I used the same Spark, the same resources and the same configuration; the only change is Hadoop 3.2.0 to Hadoop 3.3.1 (built into Spark with ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").
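> To make the next retest unambiguous, here is a minimal sketch of the extra read-side settings I would add, in the same spark-defaults style as above. The "sequential" value is my assumption of the right fadvise policy for whole-file CSV scans, not something verified here:
> {code:java}
> # Hypothetical retest settings (assumption, not a verified fix):
> # pin the S3A input policy to sequential for full-file CSV scans, and
> # set readahead back to its documented 64K default so no 1G workaround
> # from HADOOP-17755 is left over.
> spark.hadoop.fs.s3a.experimental.input.fadvise: sequential
> spark.hadoop.fs.s3a.readahead.range: 64K{code}
> Both keys are existing S3A options; with them pinned, any remaining doubling of the reported input size should come from the 3.3.1 input stream itself rather than from a leftover readahead override.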