[ 
https://issues.apache.org/jira/browse/SPARK-48571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854030#comment-17854030
 ] 

Steve Loughran commented on SPARK-48571:
----------------------------------------

The hadoop openFile() code came with HADOOP-15229; spark master can depend on 
it. I've pretty much given up trying to get patches into spark myself - maybe 
you can have more luck.
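
A rough sketch of what calling it looks like (untested; the read-policy option 
key is the one from the newer Hadoop releases, from memory, and the bucket name 
is just a placeholder around the example file from this issue):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("s3a://some-bucket/business/t_filter/country=ES/"
        + "data_date_part=2023-09-27/"
        + "part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000");
    FileSystem fs = path.getFileSystem(conf);

    // The FileStatus would normally come from the earlier directory listing;
    // it is fetched explicitly here only to keep the sketch self-contained.
    FileStatus status = fs.getFileStatus(path);

    // Passing the status lets S3A skip its own HEAD request, and the
    // "random" read policy suits columnar formats such as Parquet.
    FSDataInputStream in = fs.openFile(path)
        .withFileStatus(status)
        .opt("fs.option.openfile.read.policy", "random")
        .build()
        .get();
    try {
      byte[] tail = new byte[8];
      in.readFully(status.getLen() - 8, tail);   // e.g. read the Parquet tail
    } finally {
      in.close();
    }
  }
}
{code}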

> Reduce the number of accesses to S3 object storage
> --------------------------------------------------
>
>                 Key: SPARK-48571
>                 URL: https://issues.apache.org/jira/browse/SPARK-48571
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Oliver Caballero Alvarez
>            Priority: Major
>         Attachments: Spark 3.2 Hadoop-aws 3.1.PNG, Spark 3.2 Hadoop-aws 
> 3.4.PNG, Spark 3.5 Hadoop-aws 3.1.PNG
>
>
> If we access a Spark table backed by Parquet files on an object storage 
> file system, the object store receives many requests that seem to be 
> unnecessary. I will explain this with an example:
> I have created a simple table with three files:
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
> *business/t_filter/country=ES/data_date_part=2023-06-01/part-00000-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
>     
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-00000-f10096c1-53bc-4e2f-bc56-eba65acfa44a.c000*
>     
> and I have defined a table over business/t_filter with country and 
> data_date_part partitions; reading it produces the following requests.
> If you use versions prior to Spark 3.5 or Hadoop 3.4 (in my case, exactly 
> Spark 3.2 and Hadoop 3.1), the number of requests you get is the following 
> -> IMAGE Spark 3.2 Hadoop 3.1
> In this image we can see all the requests, among which the following 
> problems can be found:
>  * Two HEAD and two LIST requests are made by the S3A implementation for 
> the folders where the files are located, which could be resolved with a 
> single LIST. This has already been fixed in -> 
> https://issues.apache.org/jira/browse/HADOOP-18073 -> Result: IMAGE Spark 
> 3.2 Hadoop 3.4
>  * For each file, the Parquet footer is read twice. This has already been 
> fixed in -> https://issues.apache.org/jira/browse/SPARK-42388 -> Result: 
> IMAGE Spark 3.5 Hadoop 3.1
>  * A HEAD Object request is issued twice each time a file is read; this 
> could be reduced if the FileSystem interface allowed the open call to 
> receive the FileStatus that has already been computed earlier.
>  ** https://issues.apache.org/jira/browse/HADOOP-19199
>  ** https://issues.apache.org/jira/browse/PARQUET-2493
>  ** https://issues.apache.org/jira/browse/HADOOP-19200
>  * The requests made while reading the Parquet footer could also be 
> reduced: first the footer length has to be read and then the footer itself, 
> which implies two HTTP/HTTPS requests to S3. It would be nice if there were 
> a minimum threshold, for example 100 KB, below which the whole file is 
> fetched in a single request, since bringing 100 KB in one request takes 
> less time than bringing 8 B in one request and then another x KB in a 
> second request. Even so, I don't know if this task makes sense.
>  ** This would mean changing the implementation so that it is controlled by 
> a configuration option: with -1 it behaves as today, but if a threshold is 
> set, files up to that size do not need to seek twice, which repeats a GET 
> Object (see the sketch after this list).
> [https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java]
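> To make the idea concrete, a rough sketch of the threshold check (this is 
> not the actual ParquetFileReader code; the class, method and threshold 
> handling are only illustrative):
> {code:java}
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import java.util.Arrays;
>
> import org.apache.hadoop.fs.FSDataInputStream;
>
> class FooterReadSketch {
>   private static final int TAIL = 8; // 4-byte footer length + "PAR1" magic
>
>   // With a positive threshold, small files are fetched in a single ranged
>   // read; otherwise the two reads are issued as today.
>   static byte[] readFooterBytes(FSDataInputStream in, long fileLen,
>                                 long smallFileThreshold) throws IOException {
>     if (smallFileThreshold > 0 && fileLen <= smallFileThreshold) {
>       byte[] whole = new byte[(int) fileLen];
>       in.readFully(0, whole);                              // single GET
>       int footerLen = ByteBuffer.wrap(whole, (int) fileLen - TAIL, 4)
>           .order(ByteOrder.LITTLE_ENDIAN).getInt();
>       int footerStart = (int) fileLen - TAIL - footerLen;
>       return Arrays.copyOfRange(whole, footerStart, footerStart + footerLen);
>     }
>     // current behaviour: one read for the footer length, one for the footer
>     byte[] tail = new byte[TAIL];
>     in.readFully(fileLen - TAIL, tail);                    // GET #1
>     int footerLen = ByteBuffer.wrap(tail, 0, 4)
>         .order(ByteOrder.LITTLE_ENDIAN).getInt();
>     byte[] footer = new byte[footerLen];
>     in.readFully(fileLen - TAIL - footerLen, footer);      // GET #2
>     return footer;
>   }
> }
> {code}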
>  
> With all these improvements, updating to the latest version of Spark and 
> Hadoop would go from more than 30 requests to 11 in the proposed example.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
