[ https://issues.apache.org/jira/browse/SPARK-48571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854030#comment-17854030 ]
Steve Loughran commented on SPARK-48571:
----------------------------------------

The Hadoop openFile() code came with HADOOP-15229; Spark master can depend on it. I've pretty much given up trying to get patches into Spark myself; maybe you can have more luck.

> Reduce the number of accesses to S3 object storage
> --------------------------------------------------
>
>                 Key: SPARK-48571
>                 URL: https://issues.apache.org/jira/browse/SPARK-48571
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Oliver Caballero Alvarez
>            Priority: Major
>        Attachments: Spark 3.2 Hadoop-aws 3.1.PNG, Spark 3.2 Hadoop-aws 3.4.PNG, Spark 3.5 Hadoop-aws 3.1.PNG
>
> When Spark reads a table of Parquet files on an object store, the store receives many requests that appear to be unnecessary. To illustrate, I created a simple table with three files:
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
> *business/t_filter/country=ES/data_date_part=2023-06-01/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-f10096c1-53bc-4e2f-bc56-eba65acfa44a.c000*
> With a table registered over business/t_filter, partitioned by country and data_date_part, the following requests are issued.
> On versions prior to Spark 3.5 / Hadoop 3.4 (in my case, Spark 3.2 and Hadoop 3.1), the requests are those shown in the attachment "Spark 3.2 Hadoop 3.1". In that image we can see the following problems:
> * The S3 implementation issues two HEAD and two LIST requests for the directories containing the files, which could be resolved with a single LIST.
> This bug has already been resolved in https://issues.apache.org/jira/browse/HADOOP-18073; see the attachment "Spark 3.2 Hadoop 3.4".
> * For each file, the Parquet footer is read twice. This bug is resolved in https://issues.apache.org/jira/browse/SPARK-42388; see the attachment "Spark 3.5 Hadoop 3.1".
> * A HEAD Object is issued twice each time a file is read. This could be reduced by extending the FileSystem interface so that the read path can receive the FileStatus that was already computed during listing:
> ** https://issues.apache.org/jira/browse/HADOOP-19199
> ** https://issues.apache.org/jira/browse/PARQUET-2493
> ** https://issues.apache.org/jira/browse/HADOOP-19200
> * The requests made while reading the Parquet footer could also be reduced: currently the reader first fetches the footer size and then the footer itself, which means two HTTP/HTTPS requests to S3. It would be nice to have a minimum threshold, for example 100 KB, below which the whole file is fetched in a single request instead of two, since one request for 100 KB takes less time than one request for 8 B followed by another request for x KB. Even so, I don't know if this task makes sense.
> ** This would mean changing the implementation to honour a configuration property: with -1 it behaves as today, but with a positive threshold, files below it would not need the second seek, which repeats a GET Object.
> [https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java]
>
> With all these improvements, updating to the latest versions of Spark and Hadoop would reduce the proposed example from more than 30 requests to 11.
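The two remaining proposals above (reusing the FileStatus length from the directory listing instead of a fresh HEAD, and fetching small files whole instead of tail-then-footer) can be sketched as a request-count model. This is a minimal, hypothetical Python simulation, not the Hadoop or parquet-java implementation; `S3Sim`, `read_footer`, and `whole_file_threshold` are illustrative names. The 8-byte tail layout (4-byte little-endian footer length followed by the `PAR1` magic) matches the Parquet file format.

```python
FOOTER_TAIL = 8  # 4-byte footer length + 4-byte "PAR1" magic at end of file

class S3Sim:
    """Counts simulated requests against a single object."""
    def __init__(self, data: bytes):
        self.data = data
        self.head_requests = 0
        self.get_requests = 0

    def head(self) -> int:
        # HEAD Object: returns the content length.
        self.head_requests += 1
        return len(self.data)

    def get_range(self, start: int, end: int) -> bytes:
        # Ranged GET Object.
        self.get_requests += 1
        return self.data[start:end]

def read_footer(s3, known_length=None, whole_file_threshold=-1) -> bytes:
    # If the caller already holds a FileStatus from the listing, reuse its
    # length instead of issuing another HEAD (the HADOOP-19199 idea).
    length = known_length if known_length is not None else s3.head()

    # Proposed threshold: for small files, one GET fetches everything,
    # instead of one GET for the 8-byte tail plus one GET for the footer.
    if 0 <= length <= whole_file_threshold:
        blob = s3.get_range(0, length)
        footer_len = int.from_bytes(blob[-8:-4], "little")
        return blob[-(FOOTER_TAIL + footer_len):-FOOTER_TAIL]

    tail = s3.get_range(length - FOOTER_TAIL, length)   # GET 1: footer length
    footer_len = int.from_bytes(tail[:4], "little")
    start = length - FOOTER_TAIL - footer_len
    return s3.get_range(start, start + footer_len)      # GET 2: footer bytes

# Build a fake 1 KB "parquet" object: payload + footer + length + magic.
footer = b"schema-metadata"
obj = b"x" * 1000 + footer + len(footer).to_bytes(4, "little") + b"PAR1"

# Current behaviour: one HEAD plus two GETs per footer read.
s3 = S3Sim(obj)
assert read_footer(s3) == footer
print(s3.head_requests, s3.get_requests)   # 1 HEAD, 2 GETs

# With a known length and a 100 KB threshold: no HEAD, a single GET.
s3 = S3Sim(obj)
assert read_footer(s3, known_length=len(obj), whole_file_threshold=100_000) == footer
print(s3.head_requests, s3.get_requests)   # 0 HEADs, 1 GET
```

For this three-request-per-file pattern, the combined changes drop the footer path from three object-store round trips to one, which is where most of the per-file savings in the description come from.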
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org